Author: Erno Hänninen

Created: 16.02.2023

Title: run_scvi.ipynb

Description: 
- Batch correction using scVI
- scvi-tools recommends correcting batch effects with scVI and then fine-tuning the integration with scANVI
- Therefore, even though scANVI was the best-performing method in our benchmarking, it needs to be initialized from a pretrained scVI model

Procedure
- Read data to be integrated
- Train scVI model
- Annotate tanycyte and radial glia populations based on marker gene expression. These populations were misannotated in the Zhou dataset

List of non-standard modules:
- scanpy, scvi, scib, matplotlib, pandas

Conda environment used:
- PYenv

Usage:
- The notebook was executed using the Jupyter Notebook web interface. All dependencies required by Jupyter are installed in the PYenv Conda environment. See the README file for further details

In [1]:
import scanpy as sc
import scvi
import scib
import matplotlib.pyplot as plt
import pandas as pd

[rank: 0] Global seed set to 0


In [16]:
# Read the data
adata = sc.read("Data/adata_ready_for_scvi.h5ad")
adata

# scVI

In [20]:
# Setup data for scvi and initialize model 
scvi.model.SCVI.setup_anndata(adata, layer="counts", batch_key="sample")
vae = scvi.model.SCVI(adata, n_layers=2, n_latent=30, n_hidden=256, dispersion="gene-cell", gene_likelihood="zinb")
# Train the model
vae.train(early_stopping=True, max_epochs=120, train_size=0.75)

In [None]:
# Plot convergence to ensure the model doesn't overfit
train_elbo = vae.history['elbo_train'][1:]
test_elbo = vae.history['elbo_validation']
ax = train_elbo.plot()
test_elbo.plot(ax=ax)
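
The visual check above can be complemented with a single number. The following is a minimal sketch, assuming the logged ELBO values behave like a loss (lower is better), so a validation curve that sits well above the training curve suggests overfitting. The helper `elbo_gap` and its `window` parameter are hypothetical and not part of scvi-tools, and the numbers are toy values, not results from this analysis.

```python
from statistics import mean


def elbo_gap(train_elbo, validation_elbo, window=5):
    """Relative gap between the mean of the last `window` validation and
    training ELBO values. Hypothetical helper, not part of scvi-tools."""
    if not train_elbo or not validation_elbo:
        raise ValueError("both ELBO histories must be non-empty")
    train_tail = mean(train_elbo[-window:])
    val_tail = mean(validation_elbo[-window:])
    return (val_tail - train_tail) / abs(train_tail)


# With the notebook's model, the inputs could be obtained as
#   vae.history["elbo_train"].squeeze().tolist()
#   vae.history["elbo_validation"].squeeze().tolist()
# Toy numbers for illustration only:
train = [1500.0, 1300.0, 1200.0, 1150.0, 1120.0, 1100.0]
validation = [1520.0, 1330.0, 1240.0, 1200.0, 1180.0, 1170.0]
gap = elbo_gap(train, validation, window=3)
print(f"relative validation/train gap: {gap:.3f}")
```

What counts as an acceptable gap is a judgment call; the plot remains the primary diagnostic.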

In [24]:
# Use the embedding from scvi to compute neighbors and umap
adata.obsm["X_scVI"] = vae.get_latent_representation()
sc.pp.neighbors(adata, use_rep="X_scVI")
sc.tl.umap(adata)

In [None]:
# Plot integration results and tanycyte marker (CRYM)
sc.pl.umap(adata,color=["sample", "Cell_types_3", "CRYM"],
           frameon=False,ncols=1)

# Annotate tanycytes from Zhou data

In [None]:
# CRYM (tanycyte marker) expression reveals that tanycytes are misannotated as neural progenitors (NP) in the Zhou dataset
adata_subset = adata[adata.obs["Cell_types_3"].isin(["Astrocyte", "Tanycytes", "RadialGlia", "NP"])].copy()
sc.pl.umap(adata_subset, color=["Cell_types_3", "CRYM"])

In [69]:
# Reclustering subsetted data
sc.tl.pca(adata_subset)
sc.pp.neighbors(adata_subset)
sc.tl.leiden(adata_subset, resolution=2.5)


In [None]:
# Plot CRYM expression and Leiden clusters to identify the cluster that needs to be re-annotated
sc.pl.umap(adata_subset, color=["leiden", "CRYM"], wspace=0.45, legend_loc="on data", legend_fontsize="xx-small")
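
Reading the marker-positive cluster off a UMAP can be backed up with a per-cluster summary table. This is a minimal sketch, assuming a data frame with one row per cell, a cluster column and a marker expression column. The helper `summarize_marker_by_cluster` and its `threshold` cut-off are hypothetical, and the values are toy data rather than results from this notebook.

```python
import pandas as pd


def summarize_marker_by_cluster(df, cluster_col, marker_col, threshold=0.0):
    """Mean expression and fraction of expressing cells per cluster.

    Hypothetical helper; `threshold` is an assumed cut-off for calling a cell positive.
    """
    positive = df[marker_col] > threshold
    summary = pd.DataFrame({
        "mean_expression": df.groupby(cluster_col, observed=True)[marker_col].mean(),
        "fraction_positive": positive.groupby(df[cluster_col], observed=True).mean(),
    })
    # Clusters with the highest mean marker expression come first
    return summary.sort_values("mean_expression", ascending=False)


# In the notebook, the input table could be built with
#   sc.get.obs_df(adata_subset, keys=["leiden", "CRYM"])
# Toy values for illustration only:
toy = pd.DataFrame({
    "leiden": ["0", "0", "23", "23", "23", "7"],
    "CRYM": [0.0, 0.1, 2.0, 1.5, 0.0, 0.0],
})
summary = summarize_marker_by_cluster(toy, "leiden", "CRYM")
print(summary)
```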

In [None]:
# In the original dataset, cluster 23 is annotated as NP. However, these cells are CRYM+ and RAX+ (tanycyte markers),
# and NP should be APOE- (APOE is a tanycyte marker) -> we annotate this cluster as tanycytes
sc.pl.umap(adata_subset[adata_subset.obs["leiden"].isin(["23"])], color=["CRYM","RAX","APOE","Cell_types_3","leiden"], wspace=0.4)

In [137]:
# Rename cells from cluster 23 as tanycytes
adata.obs["Cell_types_4"] = adata.obs["Cell_types_3"].copy()
tanycyte_cells = adata_subset.obs.index[adata_subset.obs["leiden"] == "23"].tolist()
# Update the cell types in the Cell_types_4 and Cell_subpopulations columns.
# A single .loc call on adata.obs avoids chained assignment, which may silently leave adata.obs unchanged
adata.obs.loc[tanycyte_cells, "Cell_types_4"] = "Tanycytes"
adata.obs.loc[tanycyte_cells, "Cell_subpopulations"] = "Tanycytes"

#Write the tanycyte population to file, for reproducibility
with open('tanycytes.txt', 'w') as f:
    for cells in list(tanycyte_cells):
        f.write(f"{cells}\n")
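
When the annotation columns are categorical, assigning a label that is not yet among the categories raises an error. The sketch below shows one way to guard against that, assuming `adata.obs` behaves like an ordinary pandas DataFrame. The helper `relabel_cells` and the toy cell IDs are hypothetical.

```python
import pandas as pd


def relabel_cells(obs, cell_ids, column, new_label):
    """Set `column` to `new_label` for `cell_ids`, in place, via a single .loc call.

    Hypothetical helper. For categorical columns the label is added to the
    categories first, because assigning an unknown category raises an error.
    """
    if isinstance(obs[column].dtype, pd.CategoricalDtype):
        if new_label not in obs[column].cat.categories:
            obs[column] = obs[column].cat.add_categories([new_label])
    obs.loc[cell_ids, column] = new_label
    return obs


# Toy table standing in for adata.obs (illustration only)
toy_obs = pd.DataFrame(
    {"Cell_types_4": pd.Categorical(["NP", "NP", "Astrocyte"])},
    index=["cell_a", "cell_b", "cell_c"],
)
relabel_cells(toy_obs, ["cell_b"], "Cell_types_4", "Tanycytes")
print(toy_obs["Cell_types_4"].tolist())
```

In the notebook this could be called as `relabel_cells(adata.obs, list(tanycyte_cells), "Cell_types_4", "Tanycytes")`.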

# Annotate Radial glia from Zhou data

In [235]:
# Radial glia is missing in Zhou dataset -> annotating it
# According to Herb et al 2022, HOPX+ is radial glia and HOPX- is astrocyte, in Zhou dataset HOPX+ cells have 
# been annotated as Radial glia. 
# The dot plot in Herb et al. 2022, Fig. 1 indicates that, in addition to HOPX, the EGFR and OLIG1 genes are also expressed
# in radial glia but not in astrocytes

In [None]:
# plotting radial glia and astrocytes from Herb dataset and the markers listed above
adata_subset_herb = adata[adata.obs["source"].isin(["Herb"])].copy()
sc.pl.umap(adata_subset_herb[adata_subset_herb.obs["Cell_types_4"].isin([ "Astrocyte", "RadialGlia"])], color=["Cell_types_4","HOPX","EGFR","OLIG1"], legend_loc="on data", legend_fontsize="xx-small")

In [None]:
# Then plotting radial glia (missing) and astrocytes from Zhou dataset and the markers listed above
adata_subset_zhou = adata[adata.obs["source"].isin(["Zhou"])].copy()
sc.pl.umap(adata_subset_zhou[adata_subset_zhou.obs["Cell_types_4"].isin([ "Astrocyte", "RadialGlia"])], color=["Cell_types_4","HOPX", "EGFR","OLIG1"], wspace=0.4)

In [239]:
# The UMAP of the Herb data shows that these markers overlap the radial glia population
# Additionally, in the Zhou data the expression of these markers is not consistent with the
# astrocyte population
# Therefore, we can assume that some of the astrocyte cells are radial glia

# Extract the astrocyte population from Zhou data (copy, so that PCA and clustering do not operate on a view)
adata_zhou_astrocytes = adata_subset_zhou[adata_subset_zhou.obs["Cell_types_4"].isin(["Astrocyte"])].copy()
# Reclustering data
sc.tl.pca(adata_zhou_astrocytes)
sc.pp.neighbors(adata_zhou_astrocytes)
sc.tl.leiden(adata_zhou_astrocytes, resolution=0.4)

In [None]:
# Plot radial glia markers to see which Leiden clusters need to be re-annotated
sc.pl.umap(adata_zhou_astrocytes, color=["leiden","HOPX", "EGFR","OLIG1"], legend_loc="on data", legend_fontsize="xx-small")

In [241]:
# Based on the plots above, we annotate Zhou's astrocyte clusters 0, 2 and 5 as radial glia
astrocytes = adata_zhou_astrocytes.obs.index[adata_zhou_astrocytes.obs["leiden"].isin(["0", "2", "5"])].tolist()
# Update the cell types in the Cell_types_4 and Cell_subpopulations columns
adata.obs.loc[astrocytes, "Cell_types_4"] = "RadialGlia"
adata.obs.loc[astrocytes, "Cell_subpopulations"] = "RadialGlia"

#Write the radialglia population to file, for reproducibility
with open('radialglia.txt', 'w') as f:
    for cells in list(astrocytes):
        f.write(f"{cells}\n")
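
The text files written for reproducibility can be read back to re-apply the same labels without re-running the clustering. This is a minimal sketch, assuming the one-ID-per-line format used above. The helper `read_cell_ids` is hypothetical, and the demo uses a temporary file with made-up IDs instead of the real output files.

```python
import tempfile
from pathlib import Path


def read_cell_ids(path):
    """Read one cell ID per line, skipping blank lines. Hypothetical helper."""
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


# Demo with a temporary file and made-up IDs (illustration only)
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_ids.txt"
    demo_file.write_text("cell_a\ncell_b\n")
    demo_ids = read_cell_ids(demo_file)
print(demo_ids)

# In the notebook, the saved populations could be re-applied as
#   adata.obs.loc[read_cell_ids("tanycytes.txt"), "Cell_types_4"] = "Tanycytes"
#   adata.obs.loc[read_cell_ids("radialglia.txt"), "Cell_types_4"] = "RadialGlia"
```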

In [243]:
# Save the scVI model and the scVI-integrated data with the updated cell type annotations
vae.save("scvi_model", overwrite=True)
adata.write("Data/scvi_adata.h5ad")
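
As stated in the description, the trained scVI model is meant to initialize scANVI. The sketch below outlines how that step could look with `scvi.model.SCANVI.from_scvi_model`. The label column `Cell_types_4`, the unlabeled category name `"Unknown"` and the number of epochs are assumptions for illustration; this notebook does not state how scANVI was configured.

```python
def init_scanvi_from_scvi(scvi_model, labels_key="Cell_types_4",
                          unlabeled_category="Unknown", max_epochs=20):
    """Sketch: initialize scANVI from a trained scVI model and fine-tune it.

    The default argument values are assumptions, not settings taken from this analysis.
    """
    import scvi  # imported lazily so the function can be defined without scvi-tools installed

    scanvi_model = scvi.model.SCANVI.from_scvi_model(
        scvi_model,
        unlabeled_category=unlabeled_category,
        labels_key=labels_key,
    )
    scanvi_model.train(max_epochs=max_epochs)
    return scanvi_model


# Possible usage in this notebook (not executed here):
#   scanvi_model = init_scanvi_from_scvi(vae)
#   adata.obsm["X_scANVI"] = scanvi_model.get_latent_representation()
```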