## LBL8R create adata embeddings

### Overview
This notebook is a simplified version of `lbl8r_scvi.ipynb`. It does not train any models; it loads a trained scVI model and preps the `anndata` ("annotated data") files used downstream by `LBL8R`.



### Models and Embeddings

We will use a variety of models to "embed" the scRNAseq counts into a lower-dimensional space.
- scVI latents
- PCA, which we interpret as a linear embedding
- In the future: non-variational autoencoders or other "compressions"
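
To make the "PCA as a linear embedding" reading concrete, here is a minimal NumPy sketch on toy data. It is not the code behind `make_pc_loading_adata` (whose internals are not shown in this notebook); the function `pca_embed` and the random "cells x genes" matrix are hypothetical and serve only to show that the embedding is a centered matrix product.

```python
import numpy as np


def pca_embed(X, n_components):
    """Project the rows of X onto the top principal axes (a linear map)."""
    mean = X.mean(axis=0)
    Xc = X - mean  # center each gene (column)
    # Rows of Vt are the principal axes, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T  # shape: (n_genes, n_components)
    return Xc @ loadings, loadings, mean


# Toy stand-in for a counts matrix: 100 cells x 30 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(100, 30)).astype(float)

Z, loadings, mean = pca_embed(counts, n_components=5)
print(Z.shape)
```

Because the embedding is just `(X - mean) @ loadings`, new cells can be embedded with the same `mean` and `loadings` without refitting.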

### Files
We will make 5 sets of files from Xylena's dataset for both the "test" and "train" subsets:
- raw counts (0)
    - PCA embedding (1)
    - scVI embeddings
        - mean latent only (2)
        - mean and var latents, concatenated (3)
- normalized expression (scVI)
    - normalized expression at `library_size` 1e4 (4)
    - PCA embeddings of the above (5)
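
As a rough sketch of how these sets map to files on disk, the snippet below collects the train-split file names that the cells in this notebook build with `str.replace`. The suffixes are copied from those cells; the helper `output_paths`, the dictionary keys, and the example directory are illustrative assumptions rather than part of `lbl8r`.

```python
from pathlib import Path

BASE = "brain_atlas_anndata.h5ad"

# Suffixes as used for the train split in this notebook.
TRAIN_SUFFIXES = {
    "raw counts": "_train.h5ad",
    "PCA of raw counts": "_train_pca.h5ad",
    "scVI mean latent": "_train_scVImu_lat.h5ad",
    "scVI mean+var latent": "_train_scVI_lat.h5ad",
    "scVI normalized expression": "_train_scvi_normalized.h5ad",
    "PCA of scVI normalized expression": "_train_scvi_normalized_pca.h5ad",
}


def output_paths(data_path, base=BASE, suffixes=TRAIN_SUFFIXES):
    """Return a {description: Path} mapping for one data split."""
    return {
        name: Path(data_path) / base.replace(".h5ad", suffix)
        for name, suffix in suffixes.items()
    }


paths = output_paths("../data/xylena_raw")  # example directory only
for name, path in paths.items():
    print(f"{name:35s} {path.name}")
```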

In [1]:
import sys

IN_COLAB = "google.colab" in sys.modules
if IN_COLAB:
    !pip uninstall -y typing_extensions
    !pip install --quiet scvi-colab
    from scvi_colab import install
    install()

In [2]:


from pathlib import Path
from scvi.model import SCVI

import scanpy as sc

import numpy as np
import anndata as ad


if IN_COLAB:
    from google.colab import drive
    drive.mount('/content/drive')
    
from lbl8r.utils import mde, make_latent_adata, make_scvi_normalized_adata, make_pc_loading_adata


%load_ext autoreload
%autoreload 2


### Load Train, Test Data

In [3]:
if IN_COLAB:
    root_path = Path("/content/drive/MyDrive/")
    data_path = root_path / "SingleCellModel/data"
else:
    root_path = Path("../")
    if sys.platform == "darwin":
        data_path = root_path / "data/xylena_raw"
    else:
        data_path = root_path / "data/scdata/xylena_raw"
        
XYLENA_ANNDATA = "brain_atlas_anndata.h5ad"
XYLENA_METADATA = "final_metadata.csv"
XYLENA_ANNDATA2 = "brain_atlas_anndata_updated.h5ad"

XYLENA_TRAIN = XYLENA_ANNDATA.replace(".h5ad", "_train.h5ad")
XYLENA_TEST = XYLENA_ANNDATA.replace(".h5ad", "_test.h5ad")


cell_type_key = 'cell_type'

## load scVI model 

In [4]:
model_path = root_path / "lbl8r_models"
scvi_path = model_path / "scvi_nobatch"

labels_key = 'cell_type'



### setup train data for scVI

In [5]:
outfilen = data_path / XYLENA_TRAIN
train_ad = ad.read_h5ad(outfilen)


In [16]:
train_ad.obs.cell_type.cat.categories

Index(['Astro', 'ExN', 'InN', 'MG', 'OPC', 'Oligo', 'VC', 'Unknown'], dtype='object')

In [15]:
train_ad.obs[['seurat_clusters','cell_type','type','tmp']][train_ad.obs.cell_type == "Unknown"]

| cells | seurat_clusters | cell_type | type | tmp |
|---|---|---|---|---|
| ATTGCTCGTTTGGGTA-1_3 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| AAGCGAATCCTGAGTG-1_8 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| CAGTATGGTCACCTAT-1_15 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| GTTCTTGTCACAGGAA-1_16 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| TGCAGGCTCCTCACTA-1_19 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| CTTAAGATCCTCCTAA-1_19 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| GATGCGACACCGGCTA-1_19 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| CCTTCGTAGGATGATG-1_39 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| GCTAACAGTCACACCC-1_41 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |
| CTTGCATGTTCGGGAT-1_66 | 1 | Unknown | Oligodendrocytes | nanOligodendrocytes |


In [12]:
train_ad.obs[['seurat_clusters','cell_type','type','tmp']].drop_duplicates()



| cells | seurat_clusters | cell_type | type | tmp |
|---|---|---|---|---|
| GGCCTAATCGATTTAG-1_1 | 8 | ExN | Mature neurons | ExNMature neurons |
| TAGTAACGTAGTCAAT-1_1 | 2 | ExN | Mature neurons | ExNMature neurons |
| GTTAATGTCAAGCTAC-1_1 | 11 | ExN | Mature neurons | ExNMature neurons |
| ATTTGCAAGGACCTTG-1_1 | 1 | Oligo | Oligodendrocytes | OligoOligodendrocytes |
| TCAGTAATCCCGCCTA-1_1 | 13 | ExN | Mature neurons | ExNMature neurons |
| ... | ... | ... | ... | ... |
| TGAGGTGCAAGCCACT-1_77 | 11 | OPC | Mature neurons | OPCMature neurons |
| CACTTAAAGTATGTGC-1_106 | 13 | InN | Mature neurons | InNMature neurons |
| ATTGCGCCATCGCTCC-1_115 | 13 | Oligo | Mature neurons | OligoMature neurons |
| TGTGAAACACCTCGCT-1_117 | 5 | ExN | Microglial cells | ExNMicroglial cells |


In [12]:
SCVI.setup_anndata(train_ad, labels_key=labels_key, batch_key=None)


### load trained scVI

In [14]:

vae = SCVI.load(scvi_path.as_posix(), train_ad.copy())


INFO     File ../lbl8r_models/scvi_nobatch/model.pt already downloaded


--------------
## Make scVI-normalized AnnData for further testing, i.e. `pcaLBL8R`

In [27]:
norm_train_ad = make_scvi_normalized_adata(vae, train_ad)
norm_train_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_train_scvi_normalized.h5ad"))


(502085, 3000)
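
The helper `make_scvi_normalized_adata` comes from `lbl8r.utils` and its internals are not shown here. As a toy illustration of what normalizing to a common `library_size` of 1e4 means, the sketch below assumes the model yields per-cell gene proportions that sum to one and simply rescales them. The function name and the random inputs are hypothetical, and no scVI call is made.

```python
import numpy as np


def scale_to_library_size(proportions, library_size=1e4):
    """Rescale per-cell gene proportions to a common library size."""
    proportions = np.asarray(proportions, dtype=float)
    return proportions * library_size


# Toy per-cell proportions: a row-wise softmax over random logits,
# so every row (cell) sums to one. Shape: 4 cells x 6 genes.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))
exp = np.exp(logits - logits.max(axis=1, keepdims=True))
proportions = exp / exp.sum(axis=1, keepdims=True)

normalized = scale_to_library_size(proportions)
print(normalized.sum(axis=1))  # every cell now sums to the same library size
```

After this scaling every cell has the same total, so differences between cells reflect relative expression rather than sequencing depth.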


## Now make the latent AnnData

In [28]:

scvi_train_ad = make_latent_adata(vae, train_ad, return_dist=True)
scvi_train_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_train_scVI_lat.h5ad"))
del scvi_train_ad


(502085, 20)


In [29]:

scvi_train_ad_mu = make_latent_adata(vae, train_ad, return_dist=False)
scvi_train_ad_mu.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_train_scVImu_lat.h5ad"))
del scvi_train_ad_mu


(502085, 10)
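
The two shapes printed above, `(502085, 20)` with `return_dist=True` and `(502085, 10)` with `return_dist=False`, are consistent with a 10-dimensional latent whose mean and variance are concatenated column-wise. The internals of `make_latent_adata` are not shown in this notebook, so the following is only a toy sketch of that concatenation; `concat_latents` and the random arrays are hypothetical.

```python
import numpy as np


def concat_latents(qz_mean, qz_var=None):
    """Return the latent mean alone, or mean and variance side by side."""
    qz_mean = np.asarray(qz_mean)
    if qz_var is None:
        return qz_mean
    qz_var = np.asarray(qz_var)
    if qz_var.shape != qz_mean.shape:
        raise ValueError("mean and variance must have the same shape")
    return np.concatenate([qz_mean, qz_var], axis=1)


# Toy posterior parameters for 8 cells and a 10-dimensional latent.
n_cells, n_latent = 8, 10
rng = np.random.default_rng(0)
qz_mean = rng.normal(size=(n_cells, n_latent))
qz_var = rng.gamma(2.0, size=(n_cells, n_latent))

mean_only = concat_latents(qz_mean)
mean_and_var = concat_latents(qz_mean, qz_var)
print(mean_only.shape, mean_and_var.shape)
```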


## PCA `AnnData` files

In [30]:
loadings_train_ad = make_pc_loading_adata(train_ad)
loadings_train_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_train_pca.h5ad"))




(502085, 50)


In [31]:
norm_loadings_train_ad = make_pc_loading_adata(norm_train_ad)
norm_loadings_train_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_train_scvi_normalized_pca.h5ad"))




(502085, 50)


------------------
## Now the test data

1. setup anndata
2. get scVI normalized expression
3. get scVI latents



In [17]:
filen = data_path / XYLENA_TEST
test_ad = ad.read_h5ad(filen)


In [18]:
test_ad

AnnData object with n_obs × n_vars = 502085 × 3000
    obs: 'seurat_clusters', 'cell_type', 'sample', 'doublet_score', 'nCount_RNA', 'nFeature_RNA', 'percent.mt', 'percent.rb', 'batch', 'S.Score', 'G2M.Score', 'Phase', 'RNA_snn_res.0.3', 'ExN1', 'InN2', 'MG3', 'Astro4', 'Oligo5', 'OPC6', 'VC7', 'type', 'UMAP_1', 'UMAP_2', 'clean', 'test', 'train', 'tmp'
    var: 'feat'

In [33]:
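# NOTE: setup runs on a throwaway copy, so `test_ad` itself is never registered;
# scvi-tools transfers the setup from the trained model when `test_ad` is first used.
SCVI.setup_anndata(test_ad.copy(), labels_key=labels_key, batch_key=None)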


In [34]:

norm_test_ad = make_scvi_normalized_adata(vae, test_ad)
norm_test_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_test_scvi_normalized.h5ad"))



INFO     Input AnnData not setup with scvi-tools. attempting to transfer AnnData setup
(502085, 3000)


In [35]:
scVIqzmd_test_ad = make_latent_adata(vae, test_ad, return_dist=True)
scVIqzmd_test_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_test_scVI_qzmv.h5ad"))

del scVIqzmd_test_ad



(502085, 20)


In [36]:
scVIz_test_ad = make_latent_adata(vae, test_ad, return_dist=False)
scVIz_test_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_test_scVI_z.h5ad"))

del scVIz_test_ad


(502085, 10)


## PCA `AnnData` files

In [37]:
loadings_test_ad = make_pc_loading_adata(test_ad)
loadings_test_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_test_pca.h5ad"))




(502085, 50)


In [38]:
norm_loadings_test_ad = make_pc_loading_adata(norm_test_ad)
norm_loadings_test_ad.write_h5ad(data_path / XYLENA_ANNDATA.replace(".h5ad", "_test_scvi_normalized_pca.h5ad"))




(502085, 50)
