Author: Erno Hänninen

Created: 11.02.2023

Title: prepare_data_scvi.ipynb

Description:

- The benchmarking study revealed inconsistent clustering of some cell types; here the data is prepared for the best-performing integration method
- Additionally, the script adjusts the cell type names so that the naming pattern is consistent across the datasets
- The hypothalamic nuclei annotations are added to the dataset (available only for the Herb dataset; they are annotated for the Zhou dataset later)

Procedure:
- Read unintegrated data
- Harmonize cell type naming between the two publications
- Remove the cell types that caused inconsistent clustering (NE, Blood, Dividing, Fibroblasts)
- The hypothalamic nuclei are available for the Herb dataset, so add them to the unintegrated dataset
- Save resulting adata object for further use

List of non-standard modules:
- scanpy, matplotlib, pandas, seaborn

Conda environment used:
- PYenv

Usage:
- The script was executed using the Jupyter Notebook web interface. All dependencies required by Jupyter are installed in the PYenv Conda environment. See the README file for further details

In [1]:
import scanpy as sc
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [None]:
# Read the unintegrated data and the hypothalamic nuclei annotations, which are available for the Herb dataset
adata = sc.read("../DataProcessing_pipeline/Data/Processed/merged_zhou_herb.h5ad")
EmbryoAdultNuclei = sc.read("Data/EmbryoAdultNuclei.h5ad")
EmbryoAdultNuclei.obs.head(5)

In [None]:
# adata.X contains log-normalized data and adata.layers["counts"] contains raw counts
adata

In [6]:
# The integration benchmarking showed that some cell type names were inconsistent between the publications, so harmonize them again
# Map the original labels to a shared vocabulary; labels without an entry are kept unchanged
cell_type_map = {
    "Oligodendrocyte Progenitors_1": "OPC", "Oligodendrocyte Progenitors_2": "OPC", "Oligodendrocytes [Immature]": "OPC", "Oligodendrocytes [Dividing]": "OPC",
    "Oligodendrocytes [Maturing]": "Oligo", "Oligodendrocytes [Mature]": "Oligo", "OL": "Oligo", "vSMC": "Mural", "Ependymal": "Ependy",
    "Neural Progenitors_1": "NP", "Neural Progenitors_2": "NP", "Neurons": "Neuron", "Dividing": "Dividing", "Astrocyte Progenitors": "Astrocyte", "Astrocytes": "Astrocyte",
    "Endothelial [Venous]": "Endoth", "Endothelial [Arterial_2]": "Endoth", "Endothelial [Arterial_1]": "Endoth", "Pericytes_1": "VLMC", "Pericytes_2": "Mural",
}
# Create the new cell type column
adata.obs["Cell_types_3"] = (
    adata.obs["Cell_types"]
    .map(lambda x: cell_type_map.get(x, x))
    .astype("category")
)
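
The harmonization above relies on `dict.get(x, x)`, which returns the original label whenever the dictionary has no entry for it. The following minimal sketch shows that behavior on a toy pandas Series and lists the labels that passed through unchanged. The label "Microglia" and the shortened dictionary are assumptions made for illustration only; they are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy labels; "Microglia" is an assumed label with no mapping entry
labels = pd.Series(["OL", "Neurons", "vSMC", "Microglia"])

# Small excerpt of the harmonization dictionary used in the notebook
cell_type_map = {"OL": "Oligo", "Neurons": "Neuron", "vSMC": "Mural"}

# dict.get(x, x) falls back to the original label when no mapping exists
harmonized = labels.map(lambda x: cell_type_map.get(x, x)).astype("category")

# Labels that are absent from the dictionary and therefore kept as-is
unmapped = sorted(set(labels) - set(cell_type_map))

print(harmonized.tolist())
print(unmapped)
```

Listing the unmapped labels is a cheap way to spot names that were meant to be harmonized but were misspelled in the dictionary.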

In [7]:
# Add a source column to the data; the batch (sample) name shows which publication each cell originates from
source = {'CS13': 'Herb','CS14': 'Herb','CS15': 'Herb','CS22_2_hypo': 'Herb','CS22_hypo': "Herb",'GW16_hypo': 'Herb','GW18_hypo': 'Herb','GW19_hypo': 'Herb','GW20_34_hypo': 'Herb','GW22T_hypo1': 'Herb','GW25_3V_hypo': 'Herb',
     'GW7-lane1': 'Zhou','GW7-lane2': 'Zhou','GW8-1': 'Zhou','GW8-2': 'Zhou','GW10': "Zhou",'GW12_01': 'Zhou','GW12_02': 'Zhou','GW15-A': 'Zhou','GW15-M': 'Zhou','GW15-P': 'Zhou','GW18-01-A': 'Zhou','GW18-01-M': "Zhou",
     'GW18-01-P': 'Zhou','GW18-02-lane1': 'Zhou','GW18-02-lane2': 'Zhou','GW18-02-lane3': 'Zhou','GW20-A': 'Zhou','GW20-M': 'Zhou','GW20-P': 'Zhou'}

adata.obs['source'] = adata.obs['sample'].map(source).astype('category')
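
Unlike the `dict.get(x, x)` pattern, `Series.map` with a plain dictionary yields NaN for any key that is missing, so a mistyped sample name would silently produce cells without a source. A minimal sketch of a sanity check follows, using a toy Series; the sample name "GW99" is hypothetical and chosen only to trigger the missing case.

```python
import pandas as pd

# Hypothetical sample names; "GW99" is deliberately absent from the lookup
samples = pd.Series(["CS13", "GW10", "GW99"])
source = {"CS13": "Herb", "GW10": "Zhou"}

mapped = samples.map(source)

# Samples whose name is missing from the dictionary end up as NaN
missing = samples[mapped.isna()].unique().tolist()
print(missing)
```

In the notebook, the same check could be run on `adata.obs['sample']` and `adata.obs['source']` before continuing; an empty list means every sample was assigned to a publication.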

In [8]:
# Remove the cell types that clustered inconsistently in the benchmarking (Fibroblasts, NE, Blood, Dividing)
excluded_cell_types = ["Fibroblasts", "NE", "Blood", "Dividing"]
adata = adata[~adata.obs["Cell_types_3"].isin(excluded_cell_types), :].copy()  # copy so later edits to .obs act on a real object, not a view

In [10]:
# Add the hypothalamic nuclei annotations to the data; these are available only for the Herb dataset

# Subset EmbryoAdultNuclei to the samples present in our reference dataset (we have only embryonic samples)
EmbryoAdultNuclei_subset = EmbryoAdultNuclei[EmbryoAdultNuclei.obs["sample"].isin(["GW16_hypo", "GW18_hypo", "CS22_hypo", "GW25_3V_hypo", "GW19_hypo", "GW20_34_hypo", "CS22_2_hypo", "GW22T_hypo1"])].copy()
# Use the same cell naming pattern in EmbryoAdultNuclei_subset as in our reference adata
new_cell_names = [name + "-1" if name != '' else name for name in EmbryoAdultNuclei_subset.obs_names]
EmbryoAdultNuclei_subset.obs_names = new_cell_names

# Join the column from EmbryoAdultNuclei_subset that contains the neuronal subtypes with our reference data
adata.obs = adata.obs.join(EmbryoAdultNuclei_subset.obs[["EmbryoAdultNuclei"]])

# Rename the EmbryoAdultNuclei column
adata.obs["Cell_subpopulations"] = adata.obs["EmbryoAdultNuclei"]
del adata.obs["EmbryoAdultNuclei"]

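# Use the values from Cell_types_3 to replace NA values (cells without a nucleus annotation)
# Cast to object first, since filling a categorical column with labels outside its categories raises an error
adata.obs['Cell_subpopulations'] = (
    adata.obs['Cell_subpopulations'].astype(object)
    .fillna(adata.obs['Cell_types_3'].astype(object))
    .astype('category')
)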
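
The cell above depends on two details: the join matches rows by index label only, which is why the "-1" suffix is appended first, and cells without a nucleus annotation fall back to their broad cell type. The sketch below reproduces both steps on toy DataFrames. The barcodes and the nucleus labels "ARC" and "PVH" are hypothetical. The cast to object before `fillna` is a precaution, because pandas refuses to fill a categorical column with labels outside its categories.

```python
import pandas as pd

# Hypothetical reference obs table, indexed by barcodes that carry a "-1" suffix
ref_obs = pd.DataFrame(
    {"Cell_types_3": pd.Categorical(["Neuron", "Neuron", "Astrocyte"])},
    index=["AAAC-1", "AAAG-1", "AAAT-1"],
)

# Hypothetical nuclei annotations, indexed by barcodes without the suffix
nuclei_obs = pd.DataFrame(
    {"EmbryoAdultNuclei": pd.Categorical(["ARC", "PVH"])},
    index=["AAAC", "AAAG"],
)

# Align the indices first; the join matches rows on index labels only
nuclei_obs.index = [f"{barcode}-1" for barcode in nuclei_obs.index]

# Left join: reference cells without an annotation get NaN
joined = ref_obs.join(nuclei_obs[["EmbryoAdultNuclei"]])

# Cast to object before filling so labels outside the nuclei categories are accepted
joined["Cell_subpopulations"] = (
    joined["EmbryoAdultNuclei"].astype(object)
    .fillna(joined["Cell_types_3"].astype(object))
    .astype("category")
)
joined = joined.drop(columns="EmbryoAdultNuclei")
print(joined["Cell_subpopulations"].tolist())
```

If the suffix step were skipped in this toy example, no index labels would match and every cell would simply receive its broad cell type.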


In [None]:
# Check that the nuclei labels are present in the data
adata.obs['Cell_subpopulations'].value_counts()

In [12]:
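# Prepare the data for scVI by running HVG selection and save the result to file

# Workaround: reset the stored log base; after reloading from h5ad, some scanpy versions otherwise raise KeyError: 'base'
adata.uns['log1p']["base"] = None
adata.raw = adata  # keep the full gene set in .raw before subsetting to HVGs

# Compute highly variable genes with the batch-aware function (seurat_v3 expects raw counts, hence layer="counts")
sc.pp.highly_variable_genes(adata, flavor="seurat_v3", n_top_genes=2000,
    layer="counts", batch_key="sample", subset=True)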

adata.write("Data/adata_ready_for_scvi.h5ad")