# Baglole Lab Tutorial Setup

This file contains the instructions to set up the tutorial.

If you haven't already done so, use `wget` to download the single-cell data we'll be working with
(the 10x Genomics `pbmc3k` sample), which is the data set from the `scanpy` tutorial.

In [1]:
import scanpy as sc
import pandas as pd
from pathlib import Path

In [2]:
sc.settings.verbosity = 3  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.logging.print_header()
sc.settings.set_figure_params(dpi=80, facecolor="white")

scanpy==1.10.0rc2 anndata==0.10.6 umap==0.5.5 numpy==1.26.4 scipy==1.12.0 pandas==2.2.1 scikit-learn==1.4.1.post1 statsmodels==0.14.1 igraph==0.11.4 pynndescent==0.5.11


In [3]:
# each `!` line runs in its own subshell, so a `cd` does not carry over to the next line
!mkdir -p data
!wget http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz -O data/pbmc3k_filtered_gene_bc_matrices.tar.gz
!tar -xzf data/pbmc3k_filtered_gene_bc_matrices.tar.gz -C data
# write a .gz copy of each file, keeping the uncompressed original
!gzip --keep --force data/filtered_gene_bc_matrices/hg19/barcodes.tsv
!gzip --keep --force data/filtered_gene_bc_matrices/hg19/genes.tsv
!gzip --keep --force data/filtered_gene_bc_matrices/hg19/matrix.mtx

--2024-03-24 10:15:55--  http://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
Resolving cf.10xgenomics.com (cf.10xgenomics.com)... 104.18.0.173, 104.18.1.173, 2606:4700::6812:ad, ...
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.0.173|:80... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz [following]
--2024-03-24 10:15:55--  https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/pbmc3k_filtered_gene_bc_matrices.tar.gz
Connecting to cf.10xgenomics.com (cf.10xgenomics.com)|104.18.0.173|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 7621991 (7.3M) [application/x-tar]
Saving to: ‘data/pbmc3k_filtered_gene_bc_matrices.tar.gz’


2024-03-24 10:15:56 (21.0 MB/s) - ‘data/pbmc3k_filtered_gene_bc_matrices.tar.gz’ saved [7621991/7621991]
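
If `wget` or GNU `gzip` is not available, the same steps can be approximated with the Python standard library. The following is only a sketch under stated assumptions: `download` and `extract_and_gzip` are hypothetical helper names, not part of the tutorial, and the function compresses every regular file in the archive rather than the three files named above.

```python
import gzip
import shutil
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget log above (the https location the server redirects to).
PBMC_URL = (
    "https://cf.10xgenomics.com/samples/cell-exp/1.1.0/pbmc3k/"
    "pbmc3k_filtered_gene_bc_matrices.tar.gz"
)


def download(url, dest):
    """Download url to dest, skipping the download if dest already exists."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    if not dest.exists():
        urllib.request.urlretrieve(url, dest)
    return dest


def extract_and_gzip(archive, out_dir):
    """Unpack a .tar.gz archive into out_dir, then write a .gz copy next to
    each extracted file, keeping the original (similar to `gzip --keep`)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # Only use this on archives you trust; members are extracted as-is.
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
        names = [m.name for m in tar.getmembers() if m.isfile()]
    gz_paths = []
    for name in names:
        src = out_dir / name
        gz_path = src.with_name(src.name + ".gz")
        with open(src, "rb") as f_in, gzip.open(gz_path, "wb") as f_out:
            shutil.copyfileobj(f_in, f_out)
        gz_paths.append(gz_path)
    return gz_paths


# Example usage (requires network access, so it is left commented out):
# archive = download(PBMC_URL, "data/pbmc3k_filtered_gene_bc_matrices.tar.gz")
# extract_and_gzip(archive, "data")
```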



In [4]:
adata = sc.read_10x_mtx("data/filtered_gene_bc_matrices/hg19")
# adata = sc.read_10x_mtx('data')
adata

--> This might be very slow. Consider passing `cache=True`, which enables much faster reading from a cache file.


AnnData object with n_obs × n_vars = 2700 × 32738
    var: 'gene_ids'
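
The message above suggests passing `cache=True`. A minimal sketch of that follows, combined with an early check that the extracted files are present. The helper names `missing_10x_files` and `load_pbmc` are assumptions made for illustration; only `sc.read_10x_mtx` and its `cache` argument come from `scanpy`.

```python
from pathlib import Path

# The three files that the setup cell above extracts into the hg19 directory.
EXPECTED_FILES = ("barcodes.tsv", "genes.tsv", "matrix.mtx")


def missing_10x_files(matrix_dir):
    """Return the expected file names that are absent from matrix_dir."""
    matrix_dir = Path(matrix_dir)
    return [name for name in EXPECTED_FILES if not (matrix_dir / name).is_file()]


def load_pbmc(matrix_dir="data/filtered_gene_bc_matrices/hg19"):
    """Read the matrix with caching enabled, failing early if files are missing."""
    import scanpy as sc  # imported here so the file check works without scanpy

    missing = missing_10x_files(matrix_dir)
    if missing:
        raise FileNotFoundError(f"{matrix_dir} is missing: {', '.join(missing)}")
    # cache=True lets later reads come from a cache file instead of the .mtx text.
    return sc.read_10x_mtx(matrix_dir, cache=True)
```

Calling `load_pbmc()` should return the same `AnnData` object as the cell above, assuming the download step has been run.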

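Now we need to produce the files we will work with during the tutorial. We will do this with a short series of Python commands that read `data/folks.csv`, save it in tab-separated format, and check that the Excel and TSV versions load.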

In [5]:
# first read in the data in comma separated value format
folks = pd.read_csv("data/folks.csv")
folks.head()

Unnamed: 0,Name,Occupation,Sex,Age,Weight,Income
0,Ted,Tinker,Male,33,70,50000
1,Carol,Tailor,Female,27,50,60000
2,Alice,Soldier,Female,48,60,70000
3,Bob,Spy,Male,61,80,55000


In [6]:
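# now save the data in tab-separated value format (skipped if the file already exists)
# note: to_csv also writes the row index as an unnamed first column unless index=False is passed
folks_tsv_path = Path("data/folks.tsv")
if not folks_tsv_path.exists():
    folks.to_csv(folks_tsv_path, sep='\t')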

In [7]:
# read the data in Excel format
# make sure to install openpyxl first using `conda install -c conda-forge openpyxl -y`
# This may require you to restart the kernel.
#
folks_xl = pd.read_excel("data/folks.xlsx")
folks_xl.head()

Unnamed: 0,Name,Occupation,Sex,Age,Weight,Income
0,Ted,Tinker,Male,33,70,50000
1,Carol,Tailor,Female,27,50,60000
2,Alice,Soldier,Female,48,60,70000
3,Bob,Spy,Male,61,80,55000
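
Before calling `pd.read_excel` on an `.xlsx` file, it can help to confirm that `openpyxl` is importable in the current kernel. This is a small sketch using only the standard library; `has_module` is a hypothetical helper name.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module called name can be found for import."""
    return importlib.util.find_spec(name) is not None


if not has_module("openpyxl"):
    # pandas relies on openpyxl to read .xlsx files.
    print("openpyxl not found; install it and restart the kernel if needed")
```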


In [8]:
folks_tsv = pd.read_csv("data/folks.tsv", sep='\t')
folks_tsv.head()

Unnamed: 0.1,Unnamed: 0,Name,Occupation,Sex,Age,Weight,Income
0,0,Ted,Tinker,Male,33,70,50000
1,1,Carol,Tailor,Female,27,50,60000
2,2,Alice,Soldier,Female,48,60,70000
3,3,Bob,Spy,Male,61,80,55000
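
The extra `Unnamed: 0` column in the table above appears because `to_csv` writes the row index by default, and `read_csv` then treats it as an ordinary, unnamed column. The sketch below illustrates two common ways to avoid this. It uses a small in-memory frame, `folks_demo`, as a hypothetical stand-in for `folks`, so it does not touch the files in `data/`.

```python
import io

import pandas as pd

# Hypothetical two-row stand-in for the folks table.
folks_demo = pd.DataFrame({"Name": ["Ted", "Carol"], "Age": [33, 27]})

# With no path argument, to_csv returns the text instead of writing a file.
default_text = folks_demo.to_csv(sep="\t")                # index is written
no_index_text = folks_demo.to_csv(sep="\t", index=False)  # index is omitted

# Reading the default output back yields an extra unnamed column ...
with_extra = pd.read_csv(io.StringIO(default_text), sep="\t")
# ... unless the first column is declared to be the index,
restored = pd.read_csv(io.StringIO(default_text), sep="\t", index_col=0)
# or the index was never written in the first place.
clean = pd.read_csv(io.StringIO(no_index_text), sep="\t")

print(list(with_extra.columns))  # ['Unnamed: 0', 'Name', 'Age']
print(list(restored.columns))    # ['Name', 'Age']
print(list(clean.columns))       # ['Name', 'Age']
```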


In [9]:
folks.columns

Index(['Name', 'Occupation', 'Sex', 'Age', 'Weight', 'Income'], dtype='object')

In [10]:
len(folks.columns)

6

In [11]:
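# select the second column (Occupation) by position; passing a list
# of positions returns a one-column DataFrame rather than a Series
df = folks.iloc[:, [1]]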
df

Unnamed: 0,Occupation
0,Tinker
1,Tailor
2,Soldier
3,Spy
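
Whether a column selection returns a `Series` or a `DataFrame` depends on how the selection is written. The following sketch shows the difference on a small hypothetical frame, `folks_demo`, which stands in for `folks`.

```python
import pandas as pd

# Hypothetical two-row stand-in for the folks table.
folks_demo = pd.DataFrame(
    {"Name": ["Ted", "Carol"], "Occupation": ["Tinker", "Tailor"]}
)

as_series = folks_demo.iloc[:, 1]     # single position -> Series
as_frame = folks_demo.iloc[:, [1]]    # list of positions -> DataFrame
by_name = folks_demo[["Occupation"]]  # list of labels -> DataFrame

print(type(as_series).__name__)  # Series
print(type(as_frame).__name__)   # DataFrame
print(as_frame.equals(by_name))  # True
```

Selecting by label is usually easier to read and keeps working if the column order changes.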
