# Sample Integration, Batch Correction and Clustering

This notebook contains the code to perform sample integration and batch correction of 10x Visium spatial transcriptomics (ST) datasets from the same study, followed by clustering of the integrated data.

# Table of Contents
* [Parameters and Input Anndata](#1.-Parameters-and-Input-Anndata)
* [Normalization, data integration and Clustering](#2.-Normalization,-data-integration-and-Clustering)

In [None]:
import scanpy as sc
import pandas as pd
import os
import anndata as ad
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import squidpy as sq
# import harmonypy as hm
import scanorama
import besca as bc
from wrapper_functions import *
sns.set()

In [None]:
# Automatically re-load wrapper functions after an update
# Find details here: https://ipython.readthedocs.io/en/stable/config/extensions/autoreload.html
%load_ext autoreload
%autoreload 2

In [None]:
sc.logging.print_versions()
# sc.set_figure_params(facecolor="white", figsize=(6, 6))
sc.settings.verbosity = 3

# 1. Parameters and Input Anndata

Here, we set some parameters (for more information refer to the quality control notebook) and read the filtered anndata objects for further analysis. 

In [None]:
organism = Organism.mouse
analyze_params = Analyze(protocol=Protocol.FF, organism=organism)

In [None]:
analysis_name='Test01'

root_path = os.getcwd()
results_folder = os.path.join(root_path, 'results')
basepath=root_path+'/analyzed/'+analysis_name+"/"

In [None]:
## check if folder exists and create it otherwise
if not os.path.exists(basepath):
    os.makedirs(basepath)
    print(f"Folder '{basepath}' created.")
else:
    print(f"Folder '{basepath}' already exists.")

We read the AnnData files corresponding to our samples after QC filtering.

In [None]:
# List the QC-filtered files in a stable (sorted) order
qc_folder = os.path.join(results_folder, 'qc_filtered')
file_names = sorted(f for f in os.listdir(qc_folder) if os.path.isfile(os.path.join(qc_folder, f)))

# read_h5ad is the explicit reader for .h5ad files
adata_list = [ad.read_h5ad(os.path.join(qc_folder, file)) for file in file_names if file.endswith('.h5ad')]
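As a small illustration of the file-selection step, the sketch below lists the `.h5ad` files of a folder in sorted order using `pathlib`. The helper name `list_h5ad_files` and the temporary folder are assumptions made for this example; they are not part of the notebook's wrapper functions.

```python
import tempfile
from pathlib import Path


def list_h5ad_files(folder):
    """Return the .h5ad files found directly in `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir() if p.is_file() and p.suffix == ".h5ad"
    )


# Build a throw-away folder with two AnnData file names and one unrelated file
with tempfile.TemporaryDirectory() as tmp:
    for name in ["sample_B.h5ad", "sample_A.h5ad", "notes.txt"]:
        (Path(tmp) / name).touch()
    found = [p.name for p in list_h5ad_files(tmp)]

print(found)
```

Sorting the file names keeps the order of the samples reproducible across runs and machines.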

# 2. Normalization, data integration and Clustering

## 2.1. Normalization of the concatenated dataset

We now concatenate the datasets so that the normalized information can be uploaded into MongoDB. Note that we concatenate with the `uns_merge="unique"` strategy in order to keep each image from the original Visium datasets in the concatenated AnnData object. An important parameter here is `join`. With `join = 'outer'`, the concatenated `AnnData` contains the **union** of genes from all the datasets under analysis; with `join = 'inner'`, it contains only their **intersection**. As a general approach, we recommend `join = 'outer'` to avoid losing relevant information in the downstream analysis. We have seen studies where relevant genes are expressed in only one condition (or so lowly in the other condition that they do not pass the QC filtering) and are therefore dropped from the analysis when using `join = 'inner'`. As a small drawback, in our opinion the normalization results look slightly better with `join = 'inner'`.
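To make the difference between the two `join` options concrete, here is a minimal sketch with plain pandas data frames standing in for two samples. The gene names and values are invented, and missing values are set to 0 only for this illustration.

```python
import pandas as pd

# Two toy "samples" (rows = spots, columns = genes); gene names are made up
sample_a = pd.DataFrame({"GeneA": [1, 0], "GeneB": [2, 3]}, index=["a1", "a2"])
sample_b = pd.DataFrame({"GeneB": [5, 1], "GeneC": [0, 4]}, index=["b1", "b2"])

# Union of genes: genes missing from a sample are filled in (here with 0)
outer = pd.concat([sample_a, sample_b], join="outer").fillna(0)
# Intersection of genes: only genes present in every sample survive
inner = pd.concat([sample_a, sample_b], join="inner")

outer_genes = sorted(outer.columns)
inner_genes = sorted(inner.columns)
print(outer_genes)
print(inner_genes)
```

In this toy case `GeneA` and `GeneC` are each present in one sample only, so they are kept by the outer join and dropped by the inner join.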

In [None]:
adata_concat = sc.concat(
    adata_list,
    label="library_id",
    uns_merge="unique",
    keys=[
        k
        for d in [adata.uns["spatial"] for adata in adata_list]
        for k, v in d.items()
    ],
    index_unique="-",
    join='outer' 
)

We now normalize the concatenated dataset as a whole. Importantly, we also store the normalized, log-transformed values of all genes in `.raw`, which can be relevant for some downstream analyses.

In [None]:
adata_concat.obs['batch']=adata_concat.obs['batch'].astype('category')

In [None]:
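adata_concat_norm = adata_concat.copy()
#adata_concat_norm.raw = adata_concat.copy()
# Library-size normalization followed by a log(1 + x) transformation
sc.pp.normalize_total(adata_concat_norm)
sc.pp.log1p(adata_concat_norm)
# Keep the normalized, log-transformed values of all genes in .raw
adata_concat_norm.raw = adata_concat_norm.copy()
# Select highly variable genes per sample (batch_key) and combine the results
sc.pp.highly_variable_genes(adata_concat_norm, flavor="seurat", batch_key='readout_id')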
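The two normalization steps can be sketched in plain NumPy. This is a simplified illustration on a made-up count matrix, assuming the scaling target is the median of the total counts per spot (the behaviour of `sc.pp.normalize_total` when no `target_sum` is given); it is not the scanpy implementation.

```python
import numpy as np

# Toy count matrix: 3 spots (rows) x 2 genes (columns); values are made up
counts = np.array([[1.0, 3.0], [2.0, 6.0], [10.0, 10.0]])

totals = counts.sum(axis=1)            # total counts per spot: 4, 8, 20
target = np.median(totals)             # assumed scaling target: the median total
normalized = counts / totals[:, None] * target
log_normalized = np.log1p(normalized)  # log(1 + x) transformation

print(normalized.sum(axis=1))
```

After scaling, every spot has the same total, so differences in sequencing depth no longer dominate the comparison between spots.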

In [None]:
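# With join='outer' some genes may be missing from the first dataset,
# so collect the gene IDs from all datasets before aligning them to the concatenated object
gene_ids = pd.concat([adata.var['gene_ids'] for adata in adata_list])
gene_ids = gene_ids[~gene_ids.index.duplicated(keep='first')]
adata_concat_norm.var['gene_ids'] = gene_ids.reindex(adata_concat_norm.var_names).values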

In [None]:
export_cp10k(adata_concat_norm, basepath=basepath, geneannotation="SYMBOL", additional_geneannotation='gene_ids')

## 2.2. Normalization of the individual datasets

In the individual anndata objects, we perform normalization and compute the highly variable genes. 

In [None]:
adata_list_norm = norm_hvg(adata_list)

We now concatenate the individually normalized datasets so that the normalized information can be uploaded into MongoDB. Again, we concatenate with the `uns_merge="unique"` strategy in order to keep the images from all Visium datasets in the concatenated AnnData object.

In [None]:
adata_norm_concat = sc.concat(
    adata_list_norm,
    label="library_id",
    uns_merge="unique",
    keys=[
        k
        for d in [adata.uns["spatial"] for adata in adata_list_norm]
        for k, v in d.items()
    ],
    index_unique="-",
    join='outer' 
)

In [None]:
## To keep some of the information stored in adata.var, we need to do this a bit manually.
concatenated_var = pd.concat([adata.var for adata in adata_list_norm], axis=0).drop_duplicates(subset = 'gene_ids', keep='first')
# Align the rows to the gene order of the concatenated object before assigning
concatenated_var = concatenated_var.loc[adata_norm_concat.var_names, ['gene_ids']]
adata_norm_concat.var = concatenated_var
adata_norm_concat.raw = adata_norm_concat.copy()

In [None]:
export_cp10k(adata_norm_concat, basepath=basepath, geneannotation="SYMBOL", additional_geneannotation='gene_ids')

We then integrate the datasets using Scanorama. 

In [None]:
adatas_cor = scanorama.correct_scanpy(adata_list_norm, return_dimred=True)

In [None]:
adata_spatial = sc.concat(
    adatas_cor,
    label="library_id",
    uns_merge="unique",
    keys=[
        k
        for d in [adata.uns["spatial"] for adata in adatas_cor]
        for k, v in d.items()
    ],
    index_unique="-",
)

In [None]:

concatenated_var_cor = pd.concat([adata.var for adata in adatas_cor], axis=0).drop_duplicates(subset = 'gene_ids', keep='first')
# Align the rows to the gene order of the concatenated object before assigning
concatenated_var_cor = concatenated_var_cor.loc[adata_spatial.var_names, ['gene_ids']]
adata_spatial.var = concatenated_var_cor
adata_spatial.var['SYMBOL'] = adata_spatial.var.index
adata_spatial.var.rename(columns={"gene_ids": "ENSEMBL"}, inplace = True)
adata_spatial.raw = adata_spatial.copy()

In [None]:

sc.pp.highly_variable_genes(adata_spatial, batch_key='readout_id')
sc.pp.pca(adata_spatial)
sc.pp.neighbors(adata_spatial, use_rep="X_scanorama")
sc.tl.umap(adata_spatial)
sc.tl.leiden(adata_spatial, key_added="leiden",  resolution=0.6)

In [None]:
sc.set_figure_params()

In [None]:
sc.pl.umap(
    adata_spatial, color=["readout_id"], palette=sc.pl.palettes.default_20)

In [None]:
sc.pl.umap(
    adata_spatial, color=["treatment_id"], palette=sc.pl.palettes.default_20)

In [None]:
sc.pl.umap(
    adata_spatial, color=["leiden"], palette=sc.pl.palettes.default_20)

In [None]:
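# Map each Leiden cluster to its color (set by the UMAP plot above),
# without hard-coding the number of clusters
clusters_colors = dict(
    zip(adata_spatial.obs["leiden"].cat.categories, adata_spatial.uns["leiden_colors"])
)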
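The cluster-to-color mapping is built so that each cluster keeps the same color in every per-sample spatial plot, even when a sample contains only some of the clusters. A minimal sketch of that idea with pandas follows; the labels and hex colors are invented for the example.

```python
import pandas as pd

# Hypothetical cluster labels for all spots and one color per cluster category
leiden = pd.Series(
    pd.Categorical(["0", "1", "2", "0", "2"], categories=["0", "1", "2"])
)
leiden_colors = ["#1f77b4", "#ff7f0e", "#2ca02c"]

# Cluster -> color, following the order of the categories
cluster_to_color = dict(zip(leiden.cat.categories, leiden_colors))

# A single sample may contain only some clusters: keep just their colors
sample_labels = leiden.iloc[[0, 3, 4]]  # only clusters "0" and "2"
present = set(sample_labels.unique())
sample_palette = [
    color for cluster, color in cluster_to_color.items() if cluster in present
]

print(sample_palette)
```

Without this subsetting, a sample missing cluster "1" would receive the first two colors of the palette, and cluster "2" would be drawn in the color of cluster "1".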

In [None]:
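# Plot the clusters on the tissue image of each sample.
# We iterate over library_id because it is used both to subset the spots and to select the image.
for library in adata_spatial.obs["library_id"].unique().tolist():
    # Do not name the subset `ad`: that would shadow the anndata import
    adata_lib = adata_spatial[adata_spatial.obs.library_id == library, :].copy()
    print(library)
    sc.pl.spatial(
        adata_lib,
        img_key="hires",
        library_id=library,
        color="leiden",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in adata_lib.obs.leiden.unique().tolist()
        ])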

In [None]:
# Add the clustering computed on the Scanorama-corrected data to the normalized, concatenated (non-batch-corrected) object
adata_norm_concat.obs['leiden'] = adata_spatial.obs['leiden']
adata_norm_concat.var['SYMBOL'] = adata_norm_concat.var.index
adata_norm_concat.var.rename(columns={"gene_ids": "ENSEMBL"}, inplace = True)
adata_norm_concat.raw = adata_norm_concat.copy()

In [None]:
export_clustering(adata_norm_concat, basepath=basepath, method="leiden", use_raw= True)

In [None]:
sc.tl.rank_genes_groups(adata_norm_concat, 
            groupby="leiden", 
            method='wilcoxon',
            corr_method='benjamini-hochberg',
            use_raw = True)
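The `corr_method='benjamini-hochberg'` argument adjusts the Wilcoxon p-values for multiple testing. A minimal NumPy sketch of the Benjamini-Hochberg procedure follows; the function name `benjamini_hochberg` and the p-values are illustrative assumptions, not scanpy internals.

```python
import numpy as np


def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    p = np.asarray(pvals, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Scale each sorted p-value by n / rank
    ranked = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, starting from the largest p-value
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.clip(ranked, 0.0, 1.0)
    return adjusted


pvals = [0.01, 0.04, 0.03, 0.005]
padj = benjamini_hochberg(pvals)
print(padj)
```

The adjusted values control the expected fraction of false positives among the genes called significant, which matters here because thousands of genes are tested per cluster.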

In [None]:
sc.pl.rank_genes_groups_heatmap(adata_norm_concat, groupby="leiden", n_genes=3, show_gene_labels=True)

In [None]:
export_rank(adata_norm_concat, basepath=basepath, type="wilcox", labeling_name="leiden", geneannotation = 'SYMBOL', additional_geneannotation='ENSEMBL')

### Second version: use Harmony for integration rather than Scanorama

In [None]:
adata_concat_norm_saved=adata_concat_norm.copy()


In [None]:
import scanpy.external as sce


In [None]:

adata_concat_harmony=adata_concat_norm_saved.copy()

#sc.pp.highly_variable_genes(adata_concat_harmony, flavor="seurat", batch_key='Sample_ID')
sc.pp.pca(adata_concat_harmony)
sce.pp.harmony_integrate(adata_concat_harmony, 'batch')

sc.pp.neighbors(adata_concat_harmony, use_rep='X_pca_harmony')
sc.tl.umap(adata_concat_harmony)
sc.tl.leiden(adata_concat_harmony, key_added="leiden",  resolution=0.6)

sc.pl.umap(
    adata_concat_harmony, color=["Sample_ID"], palette=sc.pl.palettes.default_20)


In [None]:
sc.pl.umap(
    adata_concat_harmony, color=["leiden"], palette=sc.pl.palettes.default_20)

sc.pl.umap(
    adata_concat_harmony, color=["batch"], palette=sc.pl.palettes.default_20)

sc.pl.umap(
    adata_concat_harmony, color=["CONDITION"], palette=sc.pl.palettes.default_20)


In [None]:
sc.pl.umap(
    adata_concat_harmony, color=["leiden"], palette=sc.pl.palettes.default_20, legend_loc='on data')

In [None]:
export_metadata(adata_concat_harmony, basepath=basepath, n_pcs=3, umap=True, tsne=False)

In [None]:
#### No batch correction, just the normalized data
#sc.pp.highly_variable_genes(adata_concat_norm, flavor="seurat", batch_key='Sample_ID')
sc.pp.pca(adata_concat_norm)
sc.pp.neighbors(adata_concat_norm)
sc.tl.umap(adata_concat_norm)


In [None]:

sc.tl.leiden(adata_concat_norm, key_added="leiden",  resolution=0.6)



In [None]:
sc.pl.umap(
    adata_concat_norm, color=["readout_id"], palette=sc.pl.palettes.default_20)

sc.pl.umap(
    adata_concat_norm, color=["leiden"], palette=sc.pl.palettes.default_20)

sc.pl.umap(
    adata_concat_norm, color=["batch"], palette=sc.pl.palettes.default_20)


In [None]:
# Genes of interest used to compute the HEV score (full list) and a shorter subset for plotting
goi=['Glycam1','Selp','Sele','Ackr1','Enpp6','Madcam1',
     'Lipg','Enpp2','Cxcl1','Lifr','Serpina1b','Vwf','Syt15','Chst4','Fut7']
goii=['Glycam1','Selp','Sele','Madcam1','Lifr','Vwf','Fut7']


sc.tl.score_genes(adata_concat_harmony, gene_list=goi, score_name='HEV')

sc.pl.umap(
    adata_concat_harmony, color=["HEV"], palette=sc.pl.palettes.default_20, size = 20)
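The idea behind a gene-set score such as `HEV` is the average expression of the gene set minus the average expression of a reference set of genes. The sketch below is a deliberately simplified version with invented values: it uses all remaining genes as the reference, whereas `sc.tl.score_genes` samples expression-matched reference genes, which is not reproduced here.

```python
import pandas as pd

# Toy log-normalized expression: 3 spots x 4 genes (values are made up)
expr = pd.DataFrame(
    [[2.0, 4.0, 1.0, 1.0],
     [0.0, 0.0, 1.0, 3.0],
     [1.0, 1.0, 1.0, 1.0]],
    index=["spot1", "spot2", "spot3"],
    columns=["Glycam1", "Selp", "GeneX", "GeneY"],
)
gene_set = ["Glycam1", "Selp"]

# Simplification: every gene outside the set serves as reference
reference = [g for g in expr.columns if g not in gene_set]
score = expr[gene_set].mean(axis=1) - expr[reference].mean(axis=1)

print(score)
```

A positive score means the gene set is expressed above the background of the spot, a negative score below it.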


In [None]:

sc.pl.umap(
    adata_concat_harmony, color=goii, palette=sc.pl.palettes.default_20, size = 20)

sc.tl.dendrogram(adata_concat_harmony,groupby='leiden')

sc.pl.dotplot(adata_concat_harmony, var_names=['HEV']+goii, groupby='leiden', dendrogram=True)


In [None]:
# 'Cd3d' was listed twice in the original gene list; it is kept once here
sc.pl.matrixplot(adata_concat_harmony, var_names=goii+['Cd3d','Cxcl9','Cxcl10','Gzmb'],
                 groupby='CONDITION')


In [None]:

sc.tl.rank_genes_groups(adata_concat_harmony, 
            groupby="leiden", 
            method='wilcoxon',
            corr_method='benjamini-hochberg',
            use_raw = True)

sc.tl.dendrogram(adata_concat_harmony, groupby="leiden", use_raw= True)
sc.pl.rank_genes_groups_heatmap(adata_concat_harmony, groupby="leiden", n_genes=3, show_gene_labels=True)


In [None]:
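# Map each Leiden cluster to its color, without hard-coding the number of clusters
clusters_colors = dict(
    zip(adata_concat_harmony.obs["leiden"].cat.categories, adata_concat_harmony.uns["leiden_colors"])
)

# We iterate over library_id because it is used both to subset the spots and to select the image.
for library in adata_concat_harmony.obs["library_id"].unique().tolist():
    # Do not name the subset `ad`: that would shadow the anndata import
    adata_lib = adata_concat_harmony[adata_concat_harmony.obs.library_id == library, :].copy()
    print(library)
    sc.pl.spatial(
        adata_lib,
        img_key="hires",
        library_id=library,
        color="leiden",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors.items()
            if k in adata_lib.obs.leiden.unique().tolist()
        ])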


In [None]:
sc.tl.score_genes(adata_concat_norm, gene_list=goi, score_name='HEV')

sc.pl.umap(
    adata_concat_norm, color=["HEV"], palette=sc.pl.palettes.default_20, size = 20)

In [None]:
sample_cluster_counts = adata_concat_norm.obs.groupby(['readout_id', 'leiden']).size().unstack(fill_value=0)

# Sample order used in the per-sample plots
sorder = [
    'V43J24-078_A1_B06', 'V43J19-050_A1_B07', 'V42D20-002_A1_B08', 'V43J11-302_A1_B08',
    'V43J19-319_A1_B09', 'V42D20-025_A1_B16', 'V43A11-284_A1_B16', 'V43A13-374_A1_B19',
    'V42D20-002_D1_A02', 'V43J11-302_D1_A02', 'V43J19-319_D1_A03', 'V43A11-284_D1_A04',
    'V42D20-025_D1_A05', 'V43A13-374_D1_A05', 'V43J24-078_D1_A06', 'V43J19-050_D1_A08',
]
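The `groupby(...).size().unstack(...)` pattern used for `sample_cluster_counts` turns the per-spot annotations into a samples-by-clusters count table, which can then be reordered with `.loc`. A toy version with invented sample and cluster labels:

```python
import pandas as pd

# Hypothetical per-spot metadata
obs = pd.DataFrame({
    "readout_id": ["s2", "s1", "s1", "s2", "s1"],
    "leiden": ["0", "0", "1", "1", "1"],
})

# Rows = samples, columns = clusters, values = number of spots
counts = obs.groupby(["readout_id", "leiden"]).size().unstack(fill_value=0)

# Reorder the samples explicitly, as done with `sorder` in the notebook
order = ["s2", "s1"]
counts = counts.loc[order, :]

print(counts)
```

`fill_value=0` ensures that a cluster absent from a sample shows up as 0 rather than as a missing value.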


In [None]:
sc.pl.dotplot(adata_concat_harmony, var_names=['HEV']+goii, groupby='readout_id', dendrogram=False, categories_order=sorder)



In [None]:
global_clustering_folder = os.path.join(results_folder, 'global_clustering') 
    
## check if folder exists and create it otherwise
if not os.path.exists(global_clustering_folder):
    os.makedirs(global_clustering_folder)
    print(f"Folder '{global_clustering_folder}' created.")
else:
    print(f"Folder '{global_clustering_folder}' already exists.")


In [None]:
adata_concat_harmony=sc.read(os.path.join(global_clustering_folder , 'clustering_results_harmony.h5ad'))

In [None]:
adata_concat_harmony.uns['log1p'] = {'base' : None} # Fix for bug related to scanpy version scverse/scanpy#2239

In [None]:
#adata_concat_norm.write(os.path.join(global_clustering_folder , 'clustering_results_concat_norm.h5ad'))

In [None]:
adata_concat_norm=sc.read(os.path.join(global_clustering_folder , 'clustering_results.h5ad'))

In [None]:
labeloi="leiden"

### Perform DE of each cluster vs. all other clusters
DEgenes = bc.tl.dge.get_de(
    adata_concat_harmony, labeloi, demethod="wilcoxon", topnr=5000, logfc=1, padj=0.05
)
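The `logfc`, `padj` and `topnr` arguments act as filters on the differential expression table of each cluster. The sketch below shows one plausible way such a filter can be written with pandas; the helper `filter_markers`, the `padj` column name and the exact comparison operators are assumptions for illustration and have not been checked against the besca implementation.

```python
import pandas as pd

# Hypothetical DE table for one cluster
de_table = pd.DataFrame({
    "Name": ["GeneA", "GeneB", "GeneC", "GeneD"],
    "Log2FC": [2.5, 0.4, 1.2, 3.0],
    "padj": [0.001, 0.0001, 0.2, 0.01],  # hypothetical column name
})


def filter_markers(table, logfc=1.0, padj=0.05, topnr=5000):
    """Keep up-regulated, significant genes and return at most `topnr` of them."""
    keep = (table["Log2FC"] >= logfc) & (table["padj"] <= padj)
    return table[keep].sort_values("Log2FC", ascending=False).head(topnr)


markers = filter_markers(de_table)
marker_names = markers["Name"].tolist()
print(marker_names)
```

`GeneB` is dropped because its fold change is too small, and `GeneC` because its adjusted p-value is too large.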

In [None]:
sc.set_figure_params()

In [None]:
### Export top marker genes per cluster to tab-separated files to facilitate checking the annotation
figures_folder = os.path.join(results_folder, "figures")
os.makedirs(figures_folder, exist_ok=True)  # make sure the output folder exists
top5s = []
for myc in DEgenes.keys():
    ranked = DEgenes[myc].sort_values("Log2FC", ascending=False)
    top5s = top5s + list(ranked.iloc[0:5, :]['Name'])
    ranked.iloc[0:50, :].to_csv(os.path.join(figures_folder, "TopMarkergenes-" + labeloi + "_" + myc + "_top50_wilcox.csv"), sep="\t", index=False)

### Generate plots of the top marker genes 
sc.pl.dotplot(adata_concat_harmony, var_names=top5s, groupby=labeloi, dot_max=0.8,vmax=3, dendrogram=True, save="TopMarkerplots-"+labeloi+".svg")

sc.pl.matrixplot(adata_concat_harmony, var_names=top5s, groupby=labeloi, standard_scale='var',  dendrogram=True, save="TopMarkerplots-"+labeloi+".svg")


In [None]:
list(DEgenes['18'].sort_values("Log2FC", ascending=False).iloc[0:15,:]['Name'])

In [None]:
list(DEgenes['16'].sort_values("Log2FC", ascending=False).iloc[0:20,:]['Name'])

In [None]:
list(DEgenes['12'].sort_values("Log2FC", ascending=False).iloc[0:20,:]['Name'])

In [None]:
list(DEgenes['0'].sort_values("Log2FC", ascending=False).iloc[0:20,:]['Name'])

In [None]:
# define standardized filepaths based on above input
root_path = os.getcwd()
bescapath_full = os.path.dirname(bc.__file__)
bescapath = os.path.split(bescapath_full)[0]

species = "mouse"  ## or "human"
conversion = None
sigsuffix = ""
if species == "mouse":
    sigsuffix = ".mouse"

In [None]:
adata=adata_concat_harmony.copy()

In [None]:
umap_basis='umap'

In [None]:
## Optional filtering, thanks to Roland's suggestion
score_thr = 0.5
var_thrs = 0.005
# The variance threshold needs to be very low. In case of doubt it can even be 0: if a cluster is really small (in terms of number of cells) but has a strong signature, the variance across the whole dataset will be small too.
## Provided with besca; change this for own gmt file
gmt_file_anno = (
    bescapath + "/besca/datasets/genesets/CellNames_scseqCMs6_sigs" + sigsuffix + ".gmt"
)
bc.tl.sig.combined_signature_score(
    adata, gmt_file_anno
)  # optional conversion argument , conversion=conversion
### Keep the signatures containing "scanpy" in their name that pass both thresholds
scores0 = [x for x in adata.obs.columns if 'scanpy' in x]
scores = [x for x in scores0 if max(adata.obs[x]) > score_thr and np.var(adata.obs[x]) > var_thrs ]
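The two thresholds select signatures that reach a high score in at least some spots and that vary across the dataset. A toy illustration with pandas follows; the column names and values are invented for the example.

```python
import numpy as np
import pandas as pd

score_thr = 0.5
var_thr = 0.005

# Hypothetical per-spot metadata with three signature scores
obs = pd.DataFrame({
    "score_A_scanpy": [0.0, 0.1, 0.9, 1.2],  # high in some spots: kept
    "score_B_scanpy": [0.1, 0.1, 0.2, 0.1],  # never exceeds score_thr: dropped
    "score_C_scanpy": [0.6, 0.6, 0.6, 0.6],  # high but constant: dropped
    "n_counts": [100, 200, 150, 120],        # not a signature score
})

candidates = [c for c in obs.columns if "scanpy" in c]
kept = [
    c for c in candidates
    if obs[c].max() > score_thr and np.var(obs[c].to_numpy()) > var_thr
]

print(kept)
```

Only informative signatures are then passed to the plotting call, which keeps the resulting figure readable.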


In [None]:

sc.pl.embedding(adata, basis=umap_basis, color=scores, color_map="viridis")
## An extra set of signatures (less specific but informative) is also provided
gmt_file_anno_extra = (
    bescapath
    +  "/besca/datasets/genesets/CellNames_scseqCMs6_Extrasigs"
    + sigsuffix
    + ".gmt"
)
bc.tl.sig.combined_signature_score(
    adata, gmt_file_anno_extra
)  # optional conversion argument , conversion=conversion
### Plot all signatures containing "_scv" in name
scores0 = [x for x in adata.obs.columns if "_scv" in x]

scores = [x for x in scores0 if max(adata.obs[x]) > score_thr and np.var(adata.obs[x]) > var_thrs ]


In [None]:
### Plot only selected signatures
sc.pl.embedding(
    adata,
    basis=umap_basis,
    color=[
         "score_Fibroblast_scanpy",
        "score_Endothelial_scanpy",
        "score_HEVEndothelial_scanpy",
        "score_Myeloid_scanpy",
        "score_Bcell_scanpy",
        "score_Tcell_scanpy",
        "score_NKcell_scanpy",
    ],
    color_map="viridis",
)

In [None]:

umap_basis='umap'
sc.pl.embedding(adata, basis=umap_basis, color=scores, color_map="viridis")

In [None]:
set(adata.obs.columns)

In [None]:
labeloi='leiden'
# Marker genes and signature scores used to identify the clusters
ident = [
    "Lgals1", "Krt8", "Tpm1", "Mif", "score_Epithelial_scanpy", "Epcam",
    "Fxyd3", "score_ProlifEpithelialStem_scanpy",
    "Hbb-bs", "Alas2", "Acta1", "Myh2", "Pgam2",
    "Cxcl13", "C1qc",
    "score_Endothelial_scanpy", "score_Pericyte_scanpy", "Rgs5",
    "score_Fibroblast_scanpy", "Fstl1", "Col5a2",
    "score_Hematopoietic_scanpy",
    "score_Myeloid_scanpy", "score_Macrophage_CXCL9_scanpy", "Ctsk", "Acp5", "Mmp9",
    "score_Bcell_scanpy", "Cd19",
    "score_Tcell_scanpy", "Cd3e", "Cd8a", "Gzmb", "Gzme", "Prf1", "Ifng",
    "score_NKcell_scanpy", "Glycam1", "Selp", "Cxcl12",
]

In [None]:
sc.pl.matrixplot(adata, var_names=ident, groupby='leiden', standard_scale='var',  dendrogram=True, save="Sigplots-"+labeloi+".svg")


In [None]:
sc.pl.dotplot(adata, var_names=ident, groupby=labeloi, dot_max=0.8,
              dendrogram=True, save="Sigplots-"+labeloi+".svg")


In [None]:
DEgenes.keys()

In [None]:
sc.pl.matrixplot(adata_concat_harmony, var_names=['HEV']+goii, groupby='readout_id', 
                 dendrogram=False, standard_scale='var', categories_order=sorder)


In [None]:
indcat=['A2','A3','A4','A5','A8','B6','B7','B8','B16','B19']
adata_sub=adata_concat_harmony[adata_concat_harmony.obs.individual_id.isin(indcat)]

In [None]:
sc.pl.matrixplot(adata_sub,
                 var_names=['Glycam1','Selp','Sele','Madcam1','Cxcl13','Cd19','Ms4a1','Cxcl9',
                            'Cxcl10','Stat1','Ifng','Gzmb','Gzma','Cd3d','Cd8a'],
                 groupby='individual_id', standard_scale='var', vmax=0.6,
                 categories_order=indcat)


In [None]:
sample_cluster_counts=sample_cluster_counts.loc[sorder,:]

ax = sample_cluster_counts.plot(kind='bar', stacked=True, figsize=(10, 6), )



In [None]:
adata_concat_harmony.obs['anno'] = adata_concat_harmony.obs['leiden'].copy()


# export_rank(adata_concat_norm, basepath=basepath, type="wilcox", labeling_name="leiden", geneannotation = 'SYMBOL', additional_geneannotation='ENSEMBL')

# rename the names of the clustering. 

# Dictionary for renaming values
correspondence = {
    '0': 'Tumor',
    '1': 'Tumor',
    '2': 'Tumor',
    '3': 'Tumor_Fibro',
    '4': 'Tumor_Fibro',
    '5': 'Epi_HEV',
    '6': 'Tumor',
    '7': 'Tumor_Fibro',
    '8': 'Tumor',
    '9': 'Tumor',
    '10': 'Tumor_Fibro',
    '11': 'Tumor_Fibro',
    '12': 'Tumor_Cytotox',
    '13': 'Fibro_HEV',
    '14': 'Tumor',
    '15': 'Tumor_Fibro',
    '16': 'Muscle',
    '17': 'Tumor',
    '18': 'Macrophage_Mmp9',    
}
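# Replace on plain strings, then convert back to categorical
# (clusters missing from the dictionary keep their number)
adata_concat_harmony.obs['anno'] = adata_concat_harmony.obs['leiden'].astype(str).replace(correspondence).astype('category')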

# Tumor only/primarily - 14, 8, 2, 6, 0, 17, 1, 9
# Tumor/Cytotox - 12
# Tumor/Fibroblast - 7, 4, 15, 10, 11, 3
# Muscle - 16
# Epithelial/HEV - 5
# Fibroblast/HEV - 13
# Macrophage_Mmp9 - 18
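Renaming clusters with a dictionary merges several cluster numbers into one annotation. The sketch below shows the pattern on a toy label vector, including a cluster that is missing from the dictionary; the labels are invented and only a few entries of the mapping are used.

```python
import pandas as pd

# Hypothetical cluster labels; "99" is deliberately absent from the mapping
clusters = pd.Series(["0", "3", "16", "0", "99"], dtype="category")
mapping = {"0": "Tumor", "3": "Tumor_Fibro", "16": "Muscle"}

# Replace on plain strings, then convert back to categorical
anno = clusters.astype(str).replace(mapping).astype("category")

# Clusters without an entry keep their number; listing them helps spot gaps
unmapped = sorted(set(clusters.astype(str)) - set(mapping))

anno_values = anno.tolist()
print(anno_values)
print(unmapped)
```

Checking `unmapped` after re-running the clustering is a cheap way to notice that the number of clusters has changed and the dictionary needs updating.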




In [None]:
# Replace on plain strings, then convert back to categorical
adata.obs['anno'] = adata.obs['leiden'].astype(str).replace(correspondence).astype('category')


In [None]:
sc.pl.matrixplot(adata, var_names=ident, groupby='anno', standard_scale='var',  dendrogram=True, save="Sigplots_anno.svg")


In [None]:
sc.pl.umap(adata_concat_harmony, color='anno')

In [None]:
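# Key the colors by annotation name (not by a numeric index), so that the
# per-sample palette filter below can match the values in obs['anno']
clusters_colors_a = dict(
    zip(adata_concat_harmony.obs["anno"].cat.categories, adata_concat_harmony.uns["anno_colors"])
)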

In [None]:

# We iterate over library_id because it is used both to subset the spots and to select the image.
for library in adata_concat_harmony.obs["library_id"].unique().tolist():
    # Do not name the subset `ad`: that would shadow the anndata import
    adata_lib = adata_concat_harmony[adata_concat_harmony.obs.library_id == library, :].copy()
    print(library)
    sc.pl.spatial(
        adata_lib,
        img_key="hires",
        library_id=library,
        color="anno",
        size=1.5,
        palette=[
            v
            for k, v in clusters_colors_a.items()
            if k in adata_lib.obs.anno.unique().tolist()
        ])



In [None]:
# Spatial expression of selected genes in each sample
for library in adata_concat_harmony.obs["library_id"].unique().tolist():
    # Do not name the subset `ad`: that would shadow the anndata import
    adata_lib = adata_concat_harmony[adata_concat_harmony.obs.library_id == library, :].copy()
    print(library)
    sc.pl.spatial(
        adata_lib,
        img_key="hires",
        library_id=library,
        color=["Cxcl13","Gzmb","Cd3e","Ifng"],
        size=1.5, color_map='viridis')


In [None]:
### Breakdown of cell types per experiment (sample)
bc.pl.celllabel_quant_stackedbar(
    adata_concat_harmony, count_variable="readout_id", subset_variable="anno"
)

In [None]:
myocc=bc.tl.count_occurrence_subset(
    adata_concat_harmony, subset_variable='readout_id', count_variable="anno", return_percentage=True)

In [None]:
myocc=myocc.transpose()
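The table of annotation percentages per sample can be reproduced in spirit with `pd.crosstab`. This is a sketch on invented labels; it does not reproduce the exact output or orientation of the besca function, only a samples-by-annotations table like the one that is plotted.

```python
import pandas as pd

# Hypothetical per-spot metadata
obs = pd.DataFrame({
    "readout_id": ["s1", "s1", "s1", "s1", "s2", "s2"],
    "anno": ["Tumor", "Tumor", "Tumor", "Muscle", "Tumor", "Muscle"],
})

# Rows = samples, columns = annotations, values = percentage of spots per sample
percent = pd.crosstab(obs["readout_id"], obs["anno"], normalize="index") * 100

print(percent)
```

Normalizing per sample makes samples with different numbers of spots directly comparable in a stacked bar plot.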

In [None]:
list(myocc.index)
# Sample order used in the stacked bar plot
myorder = [
    'V42D20-002_A1_B08', 'V43J11-302_A1_B08', 'V43J24-078_A1_B06', 'V43J19-050_A1_B07',
    'V43A13-374_A1_B19', 'V42D20-025_A1_B16', 'V43A11-284_A1_B16', 'V43J19-319_A1_B09',
    'V42D20-002_D1_A02', 'V43J11-302_D1_A02', 'V43J19-319_D1_A03', 'V43A11-284_D1_A04',
    'V42D20-025_D1_A05', 'V43A13-374_D1_A05', 'V43J24-078_D1_A06', 'V43J19-050_D1_A08',
]


In [None]:
myocc.loc[myorder,:].plot.bar(stacked=True).legend(loc='center left',bbox_to_anchor=(1.0, 0.5))

In [None]:
adata_concat_harmony.write(os.path.join(global_clustering_folder , 'clustering_results_harmony.h5ad'))

In [None]:
! jupyter nbconvert --to html 03_SampleIntegration_Clustering.ipynb