<a href="https://colab.research.google.com/github/davemcg/scEiaD/blob/master/colab/cell_type_ML_labelling.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Auto Label Retinal Cell Types

## tldr 

You can take your (retina) scRNA-seq data and fairly quickly use the scEiaD ML model
to auto-label your cell types. I say "fairly quickly" because it is *best* to re-quantify your data with the same reference and counter (kallisto) that we use. You *could* try using your counts from cellranger or another tool, but stuff might get weird.



# Install scvi and kallisto-bustools

In [1]:
import sys
# if True, install scvi-tools from PyPI (this notebook has no from-source branch)
stable = True
IN_COLAB = "google.colab" in sys.modules

if IN_COLAB and stable:
    !pip install --quiet scvi-tools[tutorials]==0.9.0

#!pip install --quiet python==3.8 pandas numpy scikit-learn xgboost==1.3

!pip install --quiet kb-python




In [3]:
!pip install --quiet pandas numpy scikit-learn xgboost==1.3.1


# Download our kallisto index
As our example set is mouse, we use the GENCODE vM25 transcript reference.

The script that makes the idx and t2g file is [here](https://github.com/davemcg/scEiaD/raw/c3a9dd09a1a159b1f489065a3f23a753f35b83c9/src/build_idx_and_t2g_for_colab.sh). This is precomputed as it takes about 30 minutes and 32GB of memory.

There's one more wrinkle worth noting: because scEiaD was built across human, mouse, and macaque, unified gene names are required. We chose the *human* Ensembl ID (e.g. CRX is ENSG00000105392) as the base gene naming system.


(Download links):
```
# Mouse
https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/gencode.vM25.transcripts.idx
https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/vM25.tr2gX.humanized.tsv
# Human
https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/gencode.v35.transcripts.idx
https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/v35.tr2gX.tsv
```
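
If you want to sanity check a transcript-to-gene (t2g) file before quantifying, a minimal sketch like the one below can help. It assumes the file is tab-separated with the transcript ID, gene ID, and gene name in the first three columns (the usual kb-python layout); that layout is an assumption here and has not been checked against the files above. The function name `load_t2g` and the toy rows are made up for illustration.

```python
import csv
import io


def load_t2g(handle):
    """Map transcript ID -> (gene ID, gene name) from a tab-separated t2g file."""
    mapping = {}
    for row in csv.reader(handle, delimiter="\t"):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        transcript_id, gene_id, gene_name = row[0], row[1], row[2]
        mapping[transcript_id] = (gene_id, gene_name)
    return mapping


# Made-up rows for illustration only; these are not real identifiers.
toy_t2g = io.StringIO(
    "TX_A.1\tGENE_A.1\tHUMAN_ID_A\n"
    "TX_A.2\tGENE_A.1\tHUMAN_ID_A\n"
    "TX_B.1\tGENE_B.1\tHUMAN_ID_B\n"
)
t2g = load_t2g(toy_t2g)
n_genes = len({gene_id for gene_id, _ in t2g.values()})
print(f"{len(t2g)} transcripts map to {n_genes} genes")
```

On a real file you would pass an open file handle instead of the toy `StringIO`, e.g. `with open("t2g.txt") as fh: t2g = load_t2g(fh)`.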


In [4]:
%%time
!wget -O idx.idx https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/gencode.vM25.transcripts.idx
!wget -O t2g.txt https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/vM25.tr2gX.humanized.tsv

--2021-04-29 12:05:21--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/gencode.vM25.transcripts.idx
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 2662625893 (2.5G) [application/octet-stream]
Saving to: ‘idx.idx’


2021-04-29 12:05:58 (72.8 MB/s) - ‘idx.idx’ saved [2662625893/2662625893]

--2021-04-29 12:05:58--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/vM25.tr2gX.humanized.tsv
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22502749 (21M) [application/octet-stream]
Saving to: ‘t2g.txt’


2021-04-29 12:06:00 (90.2 MB/s) - ‘t2g.txt’ saved [22502749/22502749]

CPU times: user 377 ms, sys: 71.2 ms, total: 448 ms
Wall time: 39.2 s


# Quantify with kb-python (kallisto | bustools wrapper) in one easy step

Going into the vagaries of turning an SRA deposit into a non-borked pair of fastq files is beyond the scope of this document. Plus I would swear a lot. So we just give an example set from a Human organoid retina 10x (version 2) experiment.

The Pachter Lab has a discussion of how/where to get public data here: https://colab.research.google.com/github/pachterlab/kallistobustools/blob/master/notebooks/data_download.ipynb

If you have your own 10X bam file, then 10X provides a very nice and simple tool to turn it into fastq files here: https://github.com/10XGenomics/bamtofastq

To reduce run-time we have taken the first five million reads from this fastq pair.

This will take ~3 minutes, depending on the internet speed between Google and our server.

You can also stream the files directly to improve wall-time, but I was getting periodic errors, so we are doing the simpler thing and downloading each fastq file first.
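
If you want to subsample your own fastq pair the same way, here is a rough sketch of taking the first N reads from a gzipped fastq. It relies only on the fact that each fastq record spans four lines; the function name `head_fastq` and the file names are hypothetical, and this is not the script we used to build the example files.

```python
import gzip
import itertools
import os
import tempfile


def head_fastq(in_path, out_path, n_reads):
    """Copy the first n_reads records of a gzipped fastq into a new gzipped file."""
    n_lines = 4 * n_reads  # each fastq record is exactly 4 lines
    written = 0
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in itertools.islice(src, n_lines):
            dst.write(line)
            written += 1
    return written // 4  # number of complete records copied


# Tiny self-contained demo on a made-up 10-read file.
with tempfile.TemporaryDirectory() as tmp:
    full = os.path.join(tmp, "full.fastq.gz")
    head = os.path.join(tmp, "head.fastq.gz")
    with gzip.open(full, "wt") as fh:
        for i in range(10):
            fh.write(f"@read{i}\nACGT\n+\nIIII\n")
    n_kept = head_fastq(full, head, n_reads=3)
    with gzip.open(head, "rt") as fh:
        head_lines = fh.read().splitlines()

print(n_kept, len(head_lines))
```

For paired files, run it on both mates with the same `n_reads` so the read pairs stay in sync.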

 

In [5]:
%%time
!wget -O sample_1.fastq.gz https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/SRR11799731_1.head.fastq.gz
!wget -O sample_2.fastq.gz https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/SRR11799731_2.head.fastq.gz
!kb count --overwrite --h5ad -i idx.idx -g t2g.txt -x DropSeq -o output --filter bustools -t 2 \
  sample_1.fastq.gz \
  sample_2.fastq.gz

--2021-04-29 12:06:31--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/SRR11799731_1.head.fastq.gz
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 103529059 (99M) [application/octet-stream]
Saving to: ‘sample_1.fastq.gz’


2021-04-29 12:06:33 (90.3 MB/s) - ‘sample_1.fastq.gz’ saved [103529059/103529059]

--2021-04-29 12:06:33--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/colab/SRR11799731_2.head.fastq.gz
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 245302496 (234M) [application/octet-stream]
Saving to: ‘sample_2.fastq.gz’


2021-04-29 12:06:35 (97.8 MB/s) - ‘sample_2.fastq.gz’ saved [245302496/245302496]

[2021-04-29 12:06:36,630]    INFO Using index idx.idx to generate BUS f


# Download models
(and our xgboost functions for cell type labelling)

The scVI model is the same one that we use to create the data for plae.nei.nih.gov.

The xgboost model is a simplified version that *only* uses the scVI latent dims. It also does not distinguish the Early/Late/RPC cell types and instead collapses them all into "RPC".

In [8]:
!wget -O scVI_scEiaD.tgz https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_03_17/2021_03_17__scVI_scEiaD.tgz
!tar -xzf scVI_scEiaD.tgz

!wget -O celltype_ML_model.tar https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_03_17/2021_cell_type_ML_all.tar
!tar -xf celltype_ML_model.tar

!wget -O celltype_predictor.py https://raw.githubusercontent.com/davemcg/scEiaD/master/src/cell_type_predictor.py



--2021-04-29 12:12:38--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_03_17/2021_03_17__scVI_scEiaD.tgz
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12851811 (12M) [application/octet-stream]
Saving to: ‘scVI_scEiaD.tgz’


2021-04-29 12:12:40 (36.9 MB/s) - ‘scVI_scEiaD.tgz’ saved [12851811/12851811]

--2021-04-29 12:12:40--  https://hpc.nih.gov/~mcgaugheyd/scEiaD/2021_03_17/2021_cell_type_ML_all.tar
Resolving hpc.nih.gov (hpc.nih.gov)... 128.231.2.150, 2607:f220:418:4801::2:96
Connecting to hpc.nih.gov (hpc.nih.gov)|128.231.2.150|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12359680 (12M) [application/octet-stream]
Saving to: ‘celltype_ML_model.tar’


2021-04-29 12:12:40 (39.5 MB/s) - ‘celltype_ML_model.tar’ saved [12359680/12359680]

--2021-04-29 12:12:40--  https://raw.githubusercontent.com/davemcg

# Python time

In [38]:
import anndata
import sys
import os
import numpy as np
import pandas as pd
import random
import scanpy as sc
from scipy import sparse
import scvi
import torch
# 2 cores
sc.settings.n_jobs = 2
# set seeds
random.seed(234)
scvi.settings.seed = 234

# set some args
org = 'mouse'
n_epochs = 15
confidence = 0.5

# Load adata
And process (mouse processing requires a bit more jiggling that can be skipped if you have human data)

In [39]:
# load query data
adata_query = sc.read_h5ad('output/counts_filtered/adata.h5ad')
# keep a sparse copy of the raw counts in their own layer
adata_query.layers["counts"] = sparse.csr_matrix(adata_query.X.copy())


# Set scVI model path
scVI_model_dir_path = 'scVIprojectionSO_scEiaD_model/n_features-5000__transform-counts__partition-universe__covariate-batch__method-scVIprojectionSO__dims-8/' 
# Read in HVG genes used in scVI model
var_names = pd.read_csv(os.path.join(scVI_model_dir_path, 'var_names.csv'), header = None)

# cut down query adata object to use just the var_names used in the scVI model training
if org.lower() == 'mouse':
    # use the gene_name column as var_names so they line up with the model's var_names
    adata_query.var_names = adata_query.var['gene_name']
    # model genes absent from the query get padded with all-zero columns
    missing = ~var_names[0].isin(adata_query.var_names)
    n_missing_genes = sum(missing)
    dummy_adata = anndata.AnnData(X=sparse.csr_matrix((adata_query.shape[0], n_missing_genes)))
    dummy_adata.obs_names = adata_query.obs_names
    dummy_adata.var_names = var_names[0][missing]
    adata_fixed = anndata.concat([adata_query, dummy_adata], axis=1)
    adata_query_HVG = adata_fixed[:, var_names[0]]
else:
    # ASSUMPTION (not tested in this notebook): for human data the var_names already
    # match the model's var_names and every model gene is present in the query
    adata_query_HVG = adata_query[:, var_names[0]]
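
Because missing model genes are silently padded with zeros, it can be worth checking how many of the model's genes your data actually contains before trusting the projection. The helper below is a minimal sketch in plain Python; `summarize_gene_overlap` is a made-up name and the gene lists are placeholders, not real IDs.

```python
def summarize_gene_overlap(model_genes, query_genes):
    """Return (n_present, n_missing, fraction_present) for the model's gene list."""
    query_set = set(query_genes)
    n_present = sum(1 for gene in model_genes if gene in query_set)
    n_missing = len(model_genes) - n_present
    fraction = n_present / len(model_genes) if model_genes else 0.0
    return n_present, n_missing, fraction


# Toy gene lists (placeholders only).
model_genes = ["G1", "G2", "G3", "G4"]
query_genes = ["G2", "G4", "G9"]
n_present, n_missing, fraction = summarize_gene_overlap(model_genes, query_genes)
print(f"{n_present} present, {n_missing} missing ({fraction:.0%} of model genes found)")
```

In this notebook the call would look something like `summarize_gene_overlap(list(var_names[0]), list(adata_query.var_names))`, run after the var_names have been swapped to the gene_name column. A low fraction probably means the gene IDs do not match the model's naming system.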


# Run scVI (trained on scEiaD data) 
Goal: get scEiaD batch corrected latent space for *your* data

In [40]:
adata_query_HVG.obs['batch'] = 'New Data'

scvi.data.setup_anndata(adata_query_HVG, batch_key="batch")
vae_query = scvi.model.SCVI.load_query_data(
    adata_query_HVG, 
    scVI_model_dir_path
)
# project scVI latent dims from scEiaD onto query data
vae_query.train(max_epochs=n_epochs,  plan_kwargs=dict(weight_decay=0.0))
# get the latent dims into the adata
adata_query_HVG.obsm["X_scVI"] = vae_query.get_latent_representation()


Trying to set attribute `.obs` of view, copying.


INFO     Using batches from adata.obs["batch"]
INFO     No label_key inputted, assuming all cells have same label
INFO     Using data from adata.X
INFO     Computing library size prior per batch
INFO     Successfully registered anndata object containing 1285 cells, 5000 vars, 1 batches,
         1 labels, and 0 proteins. Also registered 0 extra categorical covariates and 0 extra
         continuous covariates.
INFO     Please do not further modify adata until model is trained.
INFO     Using data from adata.X

GPU available: True, used: True
TPU available: False, using: 0 TPU cores
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


Epoch 15/15: 100%|██████████| 15/15 [00:01<00:00,  8.50it/s, loss=245, v_num=1]


# Get Cell Type predictions
(this xgboost model does NOT use the organism or Age information, but as those fields were often used by us, they got hard-coded in. So we will put dummy values in).

In [41]:
# extract latent dimensions
obs=pd.DataFrame(adata_query_HVG.obs)
obsm=pd.DataFrame(adata_query_HVG.obsm["X_scVI"])
features = list(obsm.columns)
obsm.index = obs.index.values
obsm['Barcode'] = obsm.index
obsm['Age'] = 1000
obsm['organism'] = 'x'
# xgboost ML time
from celltype_predictor import *


CT_predictions = scEiaD_classifier_predict(inputMatrix=obsm, 
                               labelIdCol='ID', 
                               labelNameCol='CellType',  
                               trainedModelFile= os.getcwd() + '/2021_cell_type_ML_all',
                               featureCols=features,  
                               predProbThresh=confidence)


Loading Data...

Predicting Data...

19 samples Failed to meet classification threshold of 0.5
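
To give a feel for what the confidence threshold does, here is a toy sketch of thresholded labelling: each cell gets its most probable class, unless that probability falls below the threshold, in which case it is left unlabelled. This is an illustration of the general idea only, not the code in `cell_type_predictor.py`; the function name, the `>=` comparison, and the probabilities are all assumptions.

```python
def label_with_threshold(probabilities, class_names, threshold=0.5):
    """Pick the top class per cell, or None when its probability is below threshold."""
    labels = []
    for row in probabilities:
        best = max(range(len(row)), key=row.__getitem__)  # index of the top class
        labels.append(class_names[best] if row[best] >= threshold else None)
    return labels


class_names = ["Rods", "Cones", "Bipolar Cells"]
toy_probs = [
    [0.90, 0.05, 0.05],  # confident call
    [0.40, 0.35, 0.25],  # top class is below 0.5, so the cell stays unlabelled
    [0.10, 0.20, 0.70],
]
toy_labels = label_with_threshold(toy_probs, class_names, threshold=0.5)
print(toy_labels)
```

Raising `confidence` trades more unlabelled cells for fewer shaky calls; lowering it does the reverse.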


# What do we have?

In [42]:
CT_predictions['CellType'].value_counts()

Rods                        707
Bipolar Cells               224
Amacrine Cells              115
Muller Glia                 101
Cones                        50
Retinal Ganglion Cells       36
None                         19
Endothelial                  14
Rod Bipolar Cells             9
Red Blood Cells               3
Fibroblasts                   2
RPE                           1
Photoreceptor Precursors      1
Macrophage                    1
Horizontal Cells              1
Vein                          1
Name: CellType, dtype: int64
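
Raw counts are easier to read as proportions. The sketch below just re-enters the counts printed above by hand and works out what fraction of cells received a label; it assumes the unlabelled cells are stored under the string key "None", as the output suggests.

```python
# Counts copied from the value_counts() output above.
counts = {
    "Rods": 707, "Bipolar Cells": 224, "Amacrine Cells": 115, "Muller Glia": 101,
    "Cones": 50, "Retinal Ganglion Cells": 36, "None": 19, "Endothelial": 14,
    "Rod Bipolar Cells": 9, "Red Blood Cells": 3, "Fibroblasts": 2, "RPE": 1,
    "Photoreceptor Precursors": 1, "Macrophage": 1, "Horizontal Cells": 1, "Vein": 1,
}

total = sum(counts.values())
labelled_fraction = (total - counts["None"]) / total
top_type = max(counts, key=counts.get)
top_fraction = counts[top_type] / total

print(f"{total} cells, {labelled_fraction:.1%} labelled")
print(f"most common: {top_type} ({top_fraction:.1%})")
```

In the notebook itself, `CT_predictions['CellType'].value_counts(normalize=True)` returns the same proportions directly.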