## Prototype LABELATOR

### Overview
This notebook prototypes a "labelator". The purpose of a "labelator" is to easily assign _cell types_ to "Probe" data.

Currently we are prototyping with `scarches` to enable SCVI / SCANVI models. Its anndata loader is especially useful, and, to state our confirmation bias, it implements the SCVI models, which we like.

We will validate candidate models and calibrate them against simple expectations using a typical "Train", "Test" (here called "Validate"), and "Probe" approach.

Definitions:
- "Train": data samples on which the model being tested is trained.
- "Validate":  held-out samples, to test out-of-sample prediction fidelity.
- "Probe": data generated elsewhere, which is _probing_ the fidelity of the model.
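
As a minimal sketch of the "Train" / "Validate" idea, the split can be made at the level of whole samples, so that no sample contributes cells to both sets. The function name `split_samples`, the `validate_frac` parameter, and the toy sample IDs are illustrative assumptions; the notebook itself reads its split from prepared CSV files.

```python
import numpy as np
import pandas as pd


def split_samples(sample_ids, validate_frac=0.3, seed=0):
    """Assign each unique sample ID to 'train' or 'validate' (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    unique_ids = sorted(set(sample_ids))
    # Shuffle positions rather than the IDs so the keys stay plain Python strings.
    order = rng.permutation(len(unique_ids))
    n_validate = int(round(validate_frac * len(unique_ids)))
    validate = {unique_ids[i] for i in order[:n_validate]}
    return {s: ("validate" if s in validate else "train") for s in unique_ids}


# Toy per-cell table: every cell carries the ID of the sample it came from.
cells = pd.DataFrame({"sample": ["s1", "s1", "s2", "s3", "s3", "s4"]})
assignment = split_samples(cells["sample"], validate_frac=0.25)
cells["split"] = cells["sample"].map(assignment)
print(cells)
```

Splitting by sample rather than by cell keeps the held-out set genuinely out-of-sample, which is what the "Validate" definition asks for.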

### Models:
- SCVI
- logistic regression
- boosted trees (e.g. xgboost)

### Modules:

- `data`: gene x barcode count matrices. These will be packaged in anndata objects and loaded with the scvi-tools / scarches framework.
- `train`
- `validate`
- `probe`

### Imports and scvi-tools installation (colab)

In [1]:
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    !pip uninstall -y typing_extensions
    !pip install --quiet scvi-colab
    from scvi_colab import install
    install()
    !pip install --quiet scrublet

In [2]:
import sys
import warnings

import anndata as ad
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scanpy as sc
import scrublet as scr
import scvi
from pathlib import Path




In [3]:
if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')

In [4]:
warnings.simplefilter(action="ignore", category=FutureWarning)


sc.set_figure_params(figsize=(4, 4))
scvi.settings.seed = 94705

%config InlineBackend.print_figure_kwargs={'facecolor' : "w"}
%config InlineBackend.figure_format='retina'

Global seed set to 94705


## Data

### Load

In [45]:
XYLENA_ANNDATA = "brain_atlas_anndata.h5ad"
XYLENA_METADATA = "final_metadata.csv"
XYLENA_ANNDATA2 = "brain_atlas_counts.h5ad"


In [46]:
if IN_COLAB:
    root_path = Path("/content/drive/MyDrive/")
    data_path = root_path / "SingleCellModel/data"
else:
    root_path = Path("../")
    data_path = root_path / "data/scdata/xylena_raw"



data_file = data_path / XYLENA_ANNDATA


In [47]:

raw_ad = ad.read_h5ad(data_file)


In [48]:
ogfeatures = raw_ad.var_names.to_list()
raw_ad.var_names_make_unique()

In [49]:

features = raw_ad.var_names.tolist()


In [50]:
shared_feats = list(set(features) & set(ogfeatures))
len(shared_feats), len(features), len(ogfeatures)

(3000, 3000, 3000)
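
The three equal counts suggest that `var_names_make_unique()` had nothing to rename, since a renamed duplicate would drop out of the intersection with the original names. A more direct check, sketched here with pandas only, is to look for duplicated names before making them unique. The helper `duplicated_names` and the toy gene list are assumptions for illustration.

```python
import pandas as pd


def duplicated_names(names):
    """Return the names that occur more than once (hypothetical helper)."""
    idx = pd.Index(names)
    # duplicated(keep="first") flags the second and later occurrences.
    return idx[idx.duplicated(keep="first")].unique().tolist()


# Toy feature list with one repeated gene symbol.
gene_names = ["SNAP25", "GFAP", "SNAP25", "MBP"]
dupes = duplicated_names(gene_names)
print(dupes)  # ['SNAP25']
```

On the real object this would be called as `duplicated_names(raw_ad.var_names)`; an empty list means the names were already unique.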

In [51]:
data_path / XYLENA_METADATA, Path.cwd()


(PosixPath('../data/scdata/xylena_raw/final_metadata.csv'),
 PosixPath('/home/ergonyc/Projects/SingleCell/labelator/nbs'))

In [68]:
metadat = pd.read_csv(data_path / XYLENA_METADATA)
og_metadat = raw_ad.obs.copy()

In [71]:
metadat.head()


Unnamed: 0,cells,doublet_score,nCount_RNA,nFeature_RNA,percent.mt,percent.rb,batch,sample,S.Score,G2M.Score,...,ExN1,InN2,MG3,Astro4,Oligo5,OPC6,VC7,type,UMAP_1,UMAP_2
0,GGCCTAATCGATTTAG-1_1,0.163312,21670,6217,0.687587,0.56299,batch1,KEN-1070-ARC,0.003546,-0.010272,...,0.74697,0.02923,-0.121564,-0.421587,-0.665052,-0.169264,-0.119527,Mature neurons,1.518145,-11.242935
1,TAGTAACGTAGTCAAT-1_1,0.143924,20190,5488,0.029718,0.307083,batch1,KEN-1070-ARC,0.034954,-0.022838,...,0.761065,0.0385,-0.098816,-0.45502,-0.384784,-0.388421,0.015812,Mature neurons,1.569603,-1.677851
2,GAAAGCCAGCAGCTCA-1_1,0.168777,17677,5687,0.797647,0.543079,batch1,KEN-1070-ARC,-0.021208,-0.012252,...,0.879119,0.083963,-0.122479,-0.364199,-0.294441,-0.305501,-0.124843,Mature neurons,6.405315,4.732371
3,ACTCACCTCCTCCCTC-1_1,0.097057,17612,4954,0.062457,0.255508,batch1,KEN-1070-ARC,-0.045867,0.005147,...,0.893122,0.067002,-0.10179,-0.407095,-0.665777,-0.354619,-0.102641,Mature neurons,1.445644,-1.882242
4,CTTCATCCAATCGCAC-1_1,0.120637,17250,4837,0.011594,0.202899,batch1,KEN-1070-ARC,-0.056202,-0.019759,...,0.867374,0.120805,-0.09397,-0.422024,-0.742585,0.131618,-0.095371,Mature neurons,0.464842,-10.888965


In [110]:
obs = raw_ad.obs.copy()
obs.head()

cells,seurat_clusters,cell_type,sample
GGCCTAATCGATTTAG-1_1,8,ExN,KEN-1070-ARC
TAGTAACGTAGTCAAT-1_1,2,ExN,KEN-1070-ARC
GAAAGCCAGCAGCTCA-1_1,2,ExN,KEN-1070-ARC
ACTCACCTCCTCCCTC-1_1,2,ExN,KEN-1070-ARC
CTTCATCCAATCGCAC-1_1,8,ExN,KEN-1070-ARC


In [111]:


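# Left-join the external metadata onto obs by cell barcode; overlapping columns from metadat get the "_other" suffix
newmeta = obs.join(metadat.set_index("cells"), lsuffix="", rsuffix="_other")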
newmeta.head()

cells,seurat_clusters,cell_type,sample,doublet_score,nCount_RNA,nFeature_RNA,percent.mt,percent.rb,batch,sample_other,...,ExN1,InN2,MG3,Astro4,Oligo5,OPC6,VC7,type,UMAP_1,UMAP_2
GGCCTAATCGATTTAG-1_1,8,ExN,KEN-1070-ARC,0.163312,21670,6217,0.687587,0.56299,batch1,KEN-1070-ARC,...,0.74697,0.02923,-0.121564,-0.421587,-0.665052,-0.169264,-0.119527,Mature neurons,1.518145,-11.242935
TAGTAACGTAGTCAAT-1_1,2,ExN,KEN-1070-ARC,0.143924,20190,5488,0.029718,0.307083,batch1,KEN-1070-ARC,...,0.761065,0.0385,-0.098816,-0.45502,-0.384784,-0.388421,0.015812,Mature neurons,1.569603,-1.677851
GAAAGCCAGCAGCTCA-1_1,2,ExN,KEN-1070-ARC,0.168777,17677,5687,0.797647,0.543079,batch1,KEN-1070-ARC,...,0.879119,0.083963,-0.122479,-0.364199,-0.294441,-0.305501,-0.124843,Mature neurons,6.405315,4.732371
ACTCACCTCCTCCCTC-1_1,2,ExN,KEN-1070-ARC,0.097057,17612,4954,0.062457,0.255508,batch1,KEN-1070-ARC,...,0.893122,0.067002,-0.10179,-0.407095,-0.665777,-0.354619,-0.102641,Mature neurons,1.445644,-1.882242
CTTCATCCAATCGCAC-1_1,8,ExN,KEN-1070-ARC,0.120637,17250,4837,0.011594,0.202899,batch1,KEN-1070-ARC,...,0.867374,0.120805,-0.09397,-0.422024,-0.742585,0.131618,-0.095371,Mature neurons,0.464842,-10.888965
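
Because the join is a left join on the cell barcode, it is worth confirming that every cell found a metadata row and that columns present in both tables agree. The sketch below shows this on toy data; the barcodes `c1`/`c2` and their values are made up, while the column names and the `_other` suffix follow the cell above.

```python
import pandas as pd

# Toy stand-ins for raw_ad.obs and the metadata CSV.
obs = pd.DataFrame(
    {"cell_type": ["ExN", "MG"], "sample": ["A", "B"]},
    index=pd.Index(["c1", "c2"], name="cells"),
)
meta = pd.DataFrame(
    {"cells": ["c2", "c1"], "sample": ["B", "A"], "doublet_score": [0.1, 0.2]}
)

# Left join on the barcode index; the overlapping 'sample' column from
# meta is renamed to 'sample_other'.
joined = obs.join(meta.set_index("cells"), rsuffix="_other")

# Every cell in obs should have found a metadata row.
all_matched = bool(joined["doublet_score"].notna().all())
# The duplicated column should agree between the two sources.
samples_agree = bool((joined["sample"] == joined["sample_other"]).all())
print(all_matched, samples_agree)
```

If both checks pass, dropping the `_other` columns loses no information.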


In [112]:
newmeta['sample'].value_counts()


KEN-1159-ARC      11317
UMARY-5088-ARC    11247
KEN-1066-ARC      10280
SH-92-05-ARC       9325
SH-03-15-ARC       9055
                  ...  
UMARY-4727-ARC     1554
UMARY-1465-ARC     1173
UMARY-4263-ARC     1103
UMARY-5028-ARC      828
UMARY-1789-ARC      255
Name: sample, Length: 138, dtype: int64

In [113]:
clean_samples_path = data_path / "Model Combinations - clean_samples_138.csv"
clean_samples = pd.read_csv(clean_samples_path)

# all_samples_path = "/content/drive/MyDrive/SingleCellModel/Model Combinations - all_samples_199.csv"
# all_samples = pd.read_csv(all_samples_path)

# dirty_samples_path = "/content/drive/MyDrive/SingleCellModel/Model Combinations - dirty_samples_61.csv"
# dirty_samples = pd.read_csv(dirty_samples_path)

test_samples_path = data_path / "Model Combinations - testing_set_41.csv"
test_samples = pd.read_csv(test_samples_path)

train_samples_path = data_path / "Model Combinations - training_set_98.csv"
train_samples = pd.read_csv(train_samples_path)
clean_samples.head()



Unnamed: 0,sample,batch
0,KEN-1070-ARC,batch1
1,KEN-1092-ARC,batch1
2,KEN-1095-ARC,batch1
3,KEN-1127-ARC,batch1
4,KEN-1132-ARC,batch1


In [114]:
# Flag each cell by whether its sample appears in the clean / test / train sample lists
newmeta['clean'] = newmeta['sample'].isin(clean_samples['sample'])
newmeta['test'] = newmeta['sample'].isin(test_samples['sample'])
newmeta['train'] = newmeta['sample'].isin(train_samples['sample'])



In [115]:
newmeta = newmeta.drop(columns=['seurat_clusters_other', 'sample_other'])
newmeta.head()


cells,seurat_clusters,cell_type,sample,doublet_score,nCount_RNA,nFeature_RNA,percent.mt,percent.rb,batch,S.Score,...,Astro4,Oligo5,OPC6,VC7,type,UMAP_1,UMAP_2,clean,test,train
GGCCTAATCGATTTAG-1_1,8,ExN,KEN-1070-ARC,0.163312,21670,6217,0.687587,0.56299,batch1,0.003546,...,-0.421587,-0.665052,-0.169264,-0.119527,Mature neurons,1.518145,-11.242935,True,False,True
TAGTAACGTAGTCAAT-1_1,2,ExN,KEN-1070-ARC,0.143924,20190,5488,0.029718,0.307083,batch1,0.034954,...,-0.45502,-0.384784,-0.388421,0.015812,Mature neurons,1.569603,-1.677851,True,False,True
GAAAGCCAGCAGCTCA-1_1,2,ExN,KEN-1070-ARC,0.168777,17677,5687,0.797647,0.543079,batch1,-0.021208,...,-0.364199,-0.294441,-0.305501,-0.124843,Mature neurons,6.405315,4.732371,True,False,True
ACTCACCTCCTCCCTC-1_1,2,ExN,KEN-1070-ARC,0.097057,17612,4954,0.062457,0.255508,batch1,-0.045867,...,-0.407095,-0.665777,-0.354619,-0.102641,Mature neurons,1.445644,-1.882242,True,False,True
CTTCATCCAATCGCAC-1_1,8,ExN,KEN-1070-ARC,0.120637,17250,4837,0.011594,0.202899,batch1,-0.056202,...,-0.422024,-0.742585,0.131618,-0.095371,Mature neurons,0.464842,-10.888965,True,False,True


In [117]:
assert newmeta['clean'].sum() == raw_ad.shape[0]



True
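
The check above confirms that every cell belongs to a clean sample. A complementary check, sketched here under the assumption that the `train` and `test` flags are meant to be mutually exclusive, counts cells that are flagged in both sets or in neither. The helper name `check_split_flags` is hypothetical.

```python
import pandas as pd


def check_split_flags(obs, train_col="train", test_col="test"):
    """Count cells flagged in both splits or in neither (hypothetical helper)."""
    both = obs[train_col] & obs[test_col]
    neither = ~(obs[train_col] | obs[test_col])
    return {"in_both": int(both.sum()), "in_neither": int(neither.sum())}


# Toy flags: two training cells, one test cell, one unassigned cell.
toy = pd.DataFrame(
    {"train": [True, True, False, False], "test": [False, False, True, False]}
)
report = check_split_flags(toy)
print(report)  # {'in_both': 0, 'in_neither': 1}
```

On the notebook's table this would be `check_split_flags(newmeta)`; a non-zero `in_both` would mean the validation set leaks into training.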

In [118]:
# update anndata

adat = raw_ad.copy()
adat.obs = newmeta


In [101]:

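# Show obs for all cells from the first sample (.iloc avoids deprecated positional [] on a label index)
adat[adat.obs['sample'] == adat.obs['sample'].iloc[0]].obs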

cells,seurat_clusters,cell_type,sample,doublet_score,nCount_RNA,nFeature_RNA,percent.mt,percent.rb,batch,S.Score,...,ExN1,InN2,MG3,Astro4,Oligo5,OPC6,VC7,type,UMAP_1,UMAP_2
GGCCTAATCGATTTAG-1_1,8,ExN,KEN-1070-ARC,0.163312,21670,6217,0.687587,0.562990,batch1,0.003546,...,0.746970,0.029230,-0.121564,-0.421587,-0.665052,-0.169264,-0.119527,Mature neurons,1.518145,-11.242935
TAGTAACGTAGTCAAT-1_1,2,ExN,KEN-1070-ARC,0.143924,20190,5488,0.029718,0.307083,batch1,0.034954,...,0.761065,0.038500,-0.098816,-0.455020,-0.384784,-0.388421,0.015812,Mature neurons,1.569603,-1.677851
GAAAGCCAGCAGCTCA-1_1,2,ExN,KEN-1070-ARC,0.168777,17677,5687,0.797647,0.543079,batch1,-0.021208,...,0.879119,0.083963,-0.122479,-0.364199,-0.294441,-0.305501,-0.124843,Mature neurons,6.405315,4.732371
ACTCACCTCCTCCCTC-1_1,2,ExN,KEN-1070-ARC,0.097057,17612,4954,0.062457,0.255508,batch1,-0.045867,...,0.893122,0.067002,-0.101790,-0.407095,-0.665777,-0.354619,-0.102641,Mature neurons,1.445644,-1.882242
CTTCATCCAATCGCAC-1_1,8,ExN,KEN-1070-ARC,0.120637,17250,4837,0.011594,0.202899,batch1,-0.056202,...,0.867374,0.120805,-0.093970,-0.422024,-0.742585,0.131618,-0.095371,Mature neurons,0.464842,-10.888965
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
GCAGGACCATTAAACC-1_1,5,MG,KEN-1070-ARC,0.070043,1001,764,0.999001,1.598402,batch1,-0.041604,...,-0.287908,-0.118699,0.216093,0.128704,-0.088677,-0.164486,-0.087786,Microglial cells,2.066342,-15.089943
ACTAATCCAGGTATTT-1_1,0,Oligo,KEN-1070-ARC,0.163312,1008,692,0.000000,0.595238,batch1,0.033096,...,-0.369182,-0.112303,-0.026922,0.071956,1.165813,-0.149190,-0.043781,Oligodendrocytes,5.601439,9.173177
AGTAGGATCACGCGGT-1_1,12,Oligo,KEN-1070-ARC,0.168777,1001,685,0.000000,0.599401,batch1,-0.028036,...,-0.413633,-0.091690,-0.024294,-0.248098,0.812266,-0.167344,-0.012075,Oligodendrocytes,-5.065182,-2.298816
TGTCCTGGTTGGTTAG-1_1,5,MG,KEN-1070-ARC,0.174528,1001,771,0.000000,0.999001,batch1,-0.004867,...,-0.038414,-0.142348,0.231241,0.299692,0.465222,-0.153135,-0.036226,Microglial cells,0.696901,-15.475661


In [119]:
# do this to hedge against empty var issues

adat.var['feat'] = adat.var_names

In [120]:
outfilen = data_path / XYLENA_ANNDATA.replace(".h5ad", "_updated.h5ad")

adat.write_h5ad(outfilen)

### Make train and validation AnnData objects

In [122]:
train_ad = adat[adat.obs['train']].copy()
val_ad = adat[adat.obs['test']].copy()

outfilen = data_path / XYLENA_ANNDATA.replace(".h5ad", "_train.h5ad")
train_ad.write_h5ad(outfilen)

outfilen = data_path / XYLENA_ANNDATA.replace(".h5ad", "_val.h5ad")
val_ad.write_h5ad(outfilen)


## Preprocessing

We might need to make some auxiliary metadata for the dataset. We will use the cell type annotations from the original paper, and we will also add a cell type annotation for doublets.
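
One possible way to build such an auxiliary annotation is sketched below: cells whose `doublet_score` exceeds a cutoff are relabelled as `"doublet"`, and all other cells keep their original `cell_type`. The cutoff value, the helper `label_with_doublets`, and the output column name `cell_type_aux` are assumptions for illustration, not choices made in this notebook.

```python
import pandas as pd

DOUBLET_THRESHOLD = 0.25  # hypothetical cutoff, not taken from the notebook


def label_with_doublets(obs, threshold=DOUBLET_THRESHOLD):
    """Copy cell_type, replacing it with 'doublet' where the score is high."""
    labels = obs["cell_type"].astype(str)
    # mask() replaces values only where the condition is True.
    return labels.mask(obs["doublet_score"] > threshold, "doublet")


# Toy obs table using the column names that appear in the metadata above.
toy_obs = pd.DataFrame(
    {"cell_type": ["ExN", "MG", "Oligo"], "doublet_score": [0.16, 0.40, 0.10]}
)
toy_obs["cell_type_aux"] = label_with_doublets(toy_obs)
print(toy_obs)
```

Keeping the original `cell_type` column untouched and writing to a new column makes it easy to compare models trained with and without the doublet class.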
    

### Train, Validate

## Model

### Train

### Validate with 

### Reference mapping with SCVI