# Comparison of fibroblast populations (review)

In this notebook we extract and replicate the main fibroblast populations described in different papers, and look for similarities and differences between them. The premise of this analysis is that many of the populations described in different papers appear not to match, or to be transcriptomically different, but are in fact quite similar; that is, the main population types are shared across papers, which should come as no surprise.

We will use the following references to extract fibroblast information from:
* Tabib et al. 2018
* Philippeos et al. 2018
* Popescu et al. 2019
* Solé-Boldo et al. 2020
* Vorstandlechner et al. 2020
* He et al. 2020

The data from He et al. were reanalyzed from the healthy-donor FASTQ files because of strong batch effects (the deposited samples were already normalized and log-transformed, which limits the scope of downstream processing) and because some important genes, such as WIF1, did not appear.

## imports

In [None]:
import scanpy as sc
import scanpy.external as sce
import pandas as pd
import numpy as np
import os
import triku as tk
import matplotlib.pyplot as plt
import matplotlib as mpl
from tqdm.notebook import tqdm
import ray
import subprocess
import time
import scvelo as scv
import gc

In [None]:
# To print versions of imports 

import types

def imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            yield val.__name__

excludes = ['builtins', 'types', 'sys']

imported_modules = [module for module in imports() if module not in excludes]

clean_modules = []

for module in imported_modules:

    sep = '.'  # to handle 'matplotlib.pyplot' cases
    rest = module.split(sep, 1)[0]
    clean_modules.append(rest)

changed_imported_modules = list(set(clean_modules))  # drop duplicates

pip_modules = !pip freeze  # you could also use `!conda list` with anaconda

for module in pip_modules:
    try:
        name, version = module.split('==')
        if name in changed_imported_modules:
            print(name + '\t' + version)
    except ValueError:  # line is not in 'name==version' form
        pass

In [None]:
seed = 0

In [None]:
# Palettes for UMAP gene expression

magma = [plt.get_cmap('magma')(i) for i in np.linspace(0,1, 80)]
magma[0] = (0.88, 0.88, 0.88, 1)
magma = mpl.colors.LinearSegmentedColormap.from_list("", magma[:65])

In [None]:
dict_rep = {'CCN5': 'WISP2', 'ECRG4': 'C2orf40'}

**IMPORTANT: I am running this analysis on a computer with ~500 GB of RAM. I load many datasets at once, which might be too much for some computers. I took this decision consciously, to have as much information available at any time as possible. If you cannot run the whole analysis at once, you can run it in parts.**

## data extraction and processing

In [None]:
data_dir = os.getcwd()

### Tabib et al. 2018

In [None]:
tabib_dir = data_dir + '/Tabib_2018'

In [None]:
adata_tabib = sc.read_csv(tabib_dir + '/Skin_6Control_rawUMI.csv')
adata_tabib = adata_tabib.transpose()

In [None]:
df_metadata_tabib = pd.read_csv(tabib_dir + '/Skin_6Control_Metadata.csv', index_col=0)

`df_metadata_tabib` has 8366 cells, although the paper states that 8522 cells were analyzed. The remaining cells are erythrocytes, which were filtered out of the analysis.
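
The gap between the two tables can be checked directly. The following is a minimal sketch on toy data, not the notebook's own code: the barcodes and the helper name `cells_without_metadata` are hypothetical, and it assumes that cell barcodes index both the count matrix and the metadata.

```python
import pandas as pd


def cells_without_metadata(matrix_cells, metadata_cells):
    """Return barcodes present in the count matrix but absent from the metadata."""
    return pd.Index(matrix_cells).difference(pd.Index(metadata_cells))


# Toy barcodes (hypothetical); in the notebook these would be
# adata_tabib.obs_names and df_metadata_tabib.index.
matrix_cells = ["cell_a", "cell_b", "cell_c", "cell_d"]
metadata_cells = ["cell_a", "cell_c"]

missing = cells_without_metadata(matrix_cells, metadata_cells)
print(f"{len(missing)} cells lack metadata: {list(missing)}")
```

Cells returned by such a check would be the ones that receive no `res.0.6` value when the metadata is attached to the AnnData object.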

In [None]:
adata_tabib.raw = adata_tabib

In [None]:
dict_reverse_mappings = {'Fibroblast': ['0', '3', '4'], 
                 'Keratinocyte': ['1', '5', '7', '11', '14',], 
                 'Endothelial cell': ['2'], 
                 'Pericyte': ['6', '10'], 
                 'Macrophage/DC': ['8'], 
                 'Lymphocyte': ['9'], 
                 'Secretory Epith': ['12'], 
                 'Smooth Muscle': ['13'], 
                 'Melanocyte': ['15'], 
                 'Neural Cell': ['16'],
                 'Cornified Env': ['17'],
                 'B cell': ['18'], 
                 'Erithrocyte': [np.nan]}  # Our own label for cells missing from the metadata (np.NaN was removed in NumPy 2.0)

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key

In [None]:
adata_tabib.obs['res.0.6'] = df_metadata_tabib['res.0.6'].astype(str)
adata_tabib.obs['cluster'] = [dict_mappings[i] for i in adata_tabib.obs['res.0.6']]

Since we are interested in fibroblasts, we keep only their clusters (0, 3, 4).

In [None]:
adata_tabib_fb = adata_tabib[adata_tabib.obs['cluster'].isin(['Fibroblast']), :].copy()
adata_tabib_fb_raw = adata_tabib_fb.copy()

In [None]:
sc.pp.filter_genes(adata_tabib, min_counts=1)
sc.pp.log1p(adata_tabib)
sc.pp.normalize_total(adata_tabib)
tk.tl.triku(adata_tabib, n_procs=1, random_state=seed)
sc.pp.pca(adata_tabib, random_state=seed)
sc.pp.neighbors(adata_tabib, random_state=seed)
sc.tl.umap(adata_tabib, random_state=seed)

In [None]:
sc.pl.umap(adata_tabib, color=['cluster', 'LUM', 'PDGFRA', 'COL1A1', 'DCN', 'FBLN1'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
# Pericyte markers, for later
sc.pl.umap(adata_tabib, color=['cluster', 'MYL9', 'RGS5'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
sc.pp.filter_genes(adata_tabib_fb, min_counts=1)
sc.pp.log1p(adata_tabib_fb)
sc.pp.normalize_total(adata_tabib_fb)
tk.tl.triku(adata_tabib_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_tabib_fb, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_tabib_fb, random_state=seed, metric='cosine', knn=len(adata_tabib_fb) ** 0.5 // 2)

#### Labelling Tabib clusters

The data from Tabib contain no information about the fibroblast subclusters. To label them, we do a broad clustering with many clusters and gather them into bigger clusters that share the same patterns as the Tabib clusters. Those clusters will not be exactly the clusters from Tabib, but will be similar enough to draw the correct biological conclusions.

For this dataset there is a list of DEGs for the main clusters, but not for the subclusters. We therefore rely on Supplementary Figure 3, which is a heatmap of the subclusters, and use its recommended genes for the selection. Clusters 4A and 4B are not separated in that heatmap, so we separate them based on the genes mentioned in the paper.
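
The marker-based labelling described above can be sketched as follows. This is a minimal illustration on toy data, not the procedure actually used in this notebook (where clusters are assigned by visual inspection of the UMAP plots): the expression values are made up, and only one marker per published cluster is used.

```python
import pandas as pd

# Toy log-expression table (hypothetical values): one row per cell.
expr = pd.DataFrame(
    {
        "WIF1": [2.0, 1.8, 0.0, 0.1, 0.0, 0.0],
        "MYOC": [0.0, 0.1, 2.2, 1.9, 0.0, 0.2],
        "CXCL12": [0.1, 0.0, 0.0, 0.2, 2.5, 2.1],
    }
)
# Fine-grained cluster of each cell (stand-in for adata.obs['leiden']).
leiden = pd.Series(["0", "0", "1", "1", "2", "2"], name="leiden")

# One representative marker per published cluster (a subset of the markers
# plotted in this notebook).
markers = {"T0": "WIF1", "T1": "MYOC", "T5": "CXCL12"}

# Mean expression of every marker within every fine cluster.
cluster_means = expr.groupby(leiden).mean()

# Suggest, for each fine cluster, the label whose marker is highest on average.
gene_to_label = {gene: label for label, gene in markers.items()}
suggested = cluster_means.idxmax(axis=1).map(gene_to_label)
print(suggested.to_dict())
```

A suggestion like this is only a starting point; clusters with weak or mixed marker expression still need to be checked by eye.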

In [None]:
sc.tl.umap(adata_tabib_fb, min_dist=0.2, random_state=seed)
sc.tl.leiden(adata_tabib_fb, resolution=2, random_state=seed)
sc.pl.umap(adata_tabib_fb, color=['cluster', 'leiden'], legend_loc='on data')

In [None]:
sc.pl.umap(adata_tabib_fb, color=['leiden', 'APCDD1', 'WIF1', 'WISP2', 'COMP'], legend_loc='on data', cmap=magma, use_raw=False)

We want to map the Leiden clusters to the clusters from Tabib. To do that, we build a dictionary of correspondences between clusters based on markers and, once all correspondences are set, we use the renamed Tabib clusters. To simplify the processing, we run Leiden with a high resolution and merge several clusters into one.
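
Because these correspondence dictionaries are written by hand, a fine cluster can easily end up under two labels, in which case the last assignment silently wins. A minimal sketch of an inversion helper that rejects such duplicates (the name `invert_cluster_mapping` and the toy cluster numbers are hypothetical):

```python
def invert_cluster_mapping(reverse_mapping):
    """Turn {label: [clusters]} into {cluster: label}, rejecting duplicates."""
    mapping = {}
    for label, clusters in reverse_mapping.items():
        for cluster in clusters:
            if cluster in mapping:
                raise ValueError(
                    f"cluster {cluster!r} assigned to both "
                    f"{mapping[cluster]!r} and {label!r}"
                )
            mapping[cluster] = label
    return mapping


# Toy example (hypothetical cluster numbers).
toy = {"T0": ["0", "7"], "T1": ["2", "5"]}
print(invert_cluster_mapping(toy))
```

The same helper could replace each of the nested loops that build `dict_mappings` in this notebook.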

In [None]:
print('T0')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'WIF1', 'COMP'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T0
print('T1')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'MYOC', 'FMO1'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T1
print('T5')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'CXCL12', 'C7'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T5                               
print('T2')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'FBLN1', 'C1R', 'PI16'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T2
print('T3')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'IGFBP5', 'PLEKHH2'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T3
print('T4a')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'CRABP1', 'TNN'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T4A
print('T4b')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'COL11A1', 'UGT3A2'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T4B
print('T6')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'PCOLCE2', 'FBN1', 'SFRP4'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T6
print('T7')
sc.pl.umap(adata_tabib_fb, color=['leiden', 'ANGPTL7', 'C2orf40'], cmap=magma, legend_loc='on data', use_raw=False)  # Markers of T7

In [None]:
dict_reverse_mappings = {'T0': ['8', '0', '16', '7', '17', ], 
                 'T1': ['2', '5', '10', '20', '22'], 
                 'T2': ['1', '4', '12', '14'], 
                 'T3': ['11', '13'], 
                 'T4A': ['15', '21'], 
                 'T4B': ['3'], 
                 'T5': ['6', '9'], 
                 'T6': ['18'], 
                 'T7': ['19']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key

In [None]:
adata_tabib_fb.obs['tabib_clusters'] = [dict_mappings[i] for i in adata_tabib_fb.obs['leiden']]
sc.pl.umap(adata_tabib_fb, color=['tabib_clusters'], cmap=magma, legend_loc='on data')

**IMPORTANT: These clusters are not exactly the clusters from Tabib, but they are really close based on the expression of markers!**

### Philippeos et al. 2018

In [None]:
phil_dir = data_dir + '/Philippeos_2018'

In [None]:
adata_phil_1 = sc.read_csv(phil_dir + '/GSE109822_CD90.csv')
adata_phil_1 = adata_phil_1.transpose()

In [None]:
adata_phil_2 = sc.read_csv(phil_dir + '/GSE109822_CD3145.csv')
adata_phil_2 = adata_phil_2.transpose()

In [None]:
adata_phil = sc.AnnData.concatenate(adata_phil_1, adata_phil_2)

In [None]:
sc.pp.filter_genes(adata_phil, min_counts=1)
sc.pp.log1p(adata_phil)
sc.pp.pca(adata_phil, random_state=seed)
sc.pp.neighbors(adata_phil, random_state=seed, metric='cosine')
tk.tl.triku(adata_phil, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_phil, min_dist=0.5, random_state=seed)
sc.tl.leiden(adata_phil, resolution=1, random_state=seed)
sc.pl.umap(adata_phil, color=['leiden'], legend_loc='on data')

#### Labelling Philippeos clusters

In their paper, Philippeos et al. detect 5 subpopulations, but only 4 of them are really relevant (one has just 5 cells). Since we do not have their cluster assignments, we use their markers to map their populations onto ours. Population 2 also seems to split into two populations, which we name 2A and 2B.

![](images/Phil_F6.png)

In [None]:
fb_genes = ['DCN', 'LUM', 'RGS5', 'COL6A5', 'COL23A1', 'MFAP5', 'PRG4', 'DPP4', 'CD34', 'CD74', 'CLDN5']
sc.pl.umap(adata_phil, color=['leiden'] + fb_genes, legend_loc='on data', cmap=magma, )

In [None]:
dict_reverse_mappings = {'P2A': ['3'], 
                 'P2B': ['1'], 
                 'P3': ['4', '5'], 
                 'P4': ['0'], 
                 'P5': ['2'], }

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key

In [None]:
adata_phil.obs['philippeos_clusters'] = [dict_mappings[i] for i in adata_phil.obs['leiden']]
sc.pl.umap(adata_phil, color=['philippeos_clusters'], cmap=magma, legend_loc='on data')

#### Analyzing Philippeos subpopulations in more detail

We analyse some of the Philippeos populations in detail because we suspect they are not canonical fibroblasts but other cell types that were not filtered out.

In [None]:
sc.tl.rank_genes_groups(adata_phil, groupby='philippeos_clusters', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_phil, dendrogram=False, use_raw=False, n_genes=50)

##### P2A fibroblast cluster cells are perivascular cells, not fibroblasts

If we plot some of the P2A DEGs (RGS5, PDGFA, GJA4, NOTCH3, APOLD1, MT1A, PARM1) in the Tabib dataset, we clearly see that they colocalize within the cluster of perivascular cells (labelled 'Pericyte' by Tabib). They do not colocalize with the canonical fibroblast cluster, nor do they appear at an interface between the two.

In [None]:
peri_markers = ['RGS5', 'PDGFA', 'GJA4', 'NOTCH3', 'APOLD1', 'MT1A', 'PARM1']
sc.pl.umap(adata_phil, color=['philippeos_clusters'] + peri_markers, 
           cmap=magma, legend_loc='on data', use_raw=False) 

##### Cluster P2B does not express any relevant marker, either here or in the Philippeos publication. It may be contamination or, in any case, irrelevant cells.

##### Cluster P5 are endothelial cells based on DEGs

In [None]:
endo_markers = ['STC1', 'CD74', 'MCTP1', 'CTSH', 'CLDN5', 'FLT1']  # P5 DEGs (endothelial), kept separate from peri_markers
sc.pl.umap(adata_phil, color=['philippeos_clusters'] + endo_markers, 
           cmap=magma, legend_loc='on data', use_raw=False) 

With that in mind, we select only the P3 and P4 populations from Philippeos to narrow down the search for fibroblast subpopulations. However, the number of cells is so low that only limited resolution can be achieved.
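
With so few cells, the size of the neighborhood graph matters. Several cells in this notebook derive it from the dataset size with the expression `len(adata) ** 0.5 // 2`. A minimal sketch of that square-root heuristic as an integer-valued helper follows; the name `n_neighbors_for` and the clamping bounds are assumptions, not part of the original analysis.

```python
import math


def n_neighbors_for(n_cells, minimum=2, maximum=None):
    """Half the square root of the cell count, as an int clamped to [minimum, maximum]."""
    k = int(math.sqrt(n_cells) // 2)
    k = max(k, minimum)
    if maximum is not None:
        k = min(k, maximum)
    return k


for n in (60, 400, 10_000):
    print(n, n_neighbors_for(n))
```

In scanpy, the neighbor count of `sc.pp.neighbors` is set through the integer argument `n_neighbors`, so a value computed this way would be passed as `n_neighbors=n_neighbors_for(len(adata))`.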

In [None]:
adata_phil_fb = adata_phil[adata_phil.obs['philippeos_clusters'].isin(['P3', 'P4']), :].copy()

In [None]:
sc.pp.filter_genes(adata_phil_fb, min_counts=1)
tk.tl.triku(adata_phil_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_phil_fb, random_state=seed)
sc.pp.neighbors(adata_phil_fb, random_state=seed, n_neighbors=4, metric='cosine')  # the default of 15 would be too many; in scanpy `knn` is only a boolean flag
sc.tl.umap(adata_phil_fb, random_state=seed)

In [None]:
sc.tl.leiden(adata_phil_fb, resolution=0.8)
sc.pl.umap(adata_phil_fb, color='leiden')

In [None]:
sc.pl.umap(adata_phil_fb, color=['philippeos_clusters'] + ['APCDD1', 'COL18A1', 'NKD2', 'WISP2', 'PI16', 'IGFBP6', 'SLPI'], 
cmap=magma, legend_loc='on data', use_raw=False) 

### Solé-Boldo et al. 2020

In [None]:
sole_dir = data_dir + '/Sole-Boldo_2020'

In [None]:
adata_sole_young = sc.read_loom(sole_dir + '/SB2020.loom')
adata_sole_young.var_names_make_unique()

In [None]:
adata_sole_young.var_names = [dict_rep[i] if i in dict_rep else i for i in adata_sole_young.var_names ]

In [None]:
adata_sole_young.X = adata_sole_young.X.toarray()  # sparse matrix -> dense ndarray

In [None]:
# Basic QC filtering
adata_sole_young.var['mt'] = adata_sole_young.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata_sole_young, qc_vars=['mt'], percent_top=None, inplace=True)

In [None]:
sc.pl.violin(adata_sole_young, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

sc.pl.scatter(adata_sole_young, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata_sole_young, x='total_counts', y='n_genes_by_counts')

In [None]:
adata_sole_young = adata_sole_young[((adata_sole_young.obs.n_genes_by_counts < 2500) & 
                                    (adata_sole_young.obs.n_genes_by_counts > 200)).values, :]
adata_sole_young = adata_sole_young[adata_sole_young.obs.pct_counts_mt < 15, :].copy()  # copy so later in-place steps do not act on a view

In [None]:
sc.pp.filter_genes(adata_sole_young, min_counts=1)
sc.pp.log1p(adata_sole_young)
sc.pp.normalize_total(adata_sole_young)

In [None]:
tk.tl.triku(adata_sole_young, n_procs=1, random_state=seed)
sc.pp.pca(adata_sole_young, random_state=seed)
sc.pp.neighbors(adata_sole_young, random_state=seed, metric='cosine')

In [None]:
sc.tl.umap(adata_sole_young, min_dist=0.7, random_state=seed)
sc.tl.leiden(adata_sole_young, resolution=0.8, random_state=seed)
sc.pl.umap(adata_sole_young, color=['leiden'], legend_loc='on data')

In [None]:
# GOOD MARKERS FOR FIBROBLASTS LUM AND PDGFRA
sc.pl.umap(adata_sole_young, color=['PDGFRA', 'LUM', 'DCN', 'RGS5', 
                                    'VWF', 'HLA-DRA', 'KRT5', 'TRAC', 'HBB'], 
           legend_loc='on data', cmap=magma, ncols=3)

In [None]:
adata_sole_young_fb = adata_sole_young[adata_sole_young.obs['leiden'].isin(['2', '4', '5', '6', '8', '19'])].copy()
adata_sole_young_fb_raw = adata_sole_young_fb.copy()
adata_sole_young_fb_scvelo = adata_sole_young_fb.copy()

In [None]:
sc.pp.filter_genes(adata_sole_young_fb, min_counts=1)
tk.tl.triku(adata_sole_young_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_sole_young_fb, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_sole_young_fb, random_state=seed, metric='cosine', knn=len(adata_sole_young_fb) ** 0.5 // 2)

In [None]:
adata_sole_old_fb = adata_sole_old[adata_sole_old.obs['leiden'].isin(['0', '2', '10'])].copy()

In [None]:
sc.pp.filter_genes(adata_sole_old_fb, min_counts=1)
tk.tl.triku(adata_sole_old_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_sole_old_fb, random_state=seed, n_comps=30)
sce.pp.bbknn(adata_sole_old_fb, batch_key='id', metric='angular')

#### Labelling Solé-Boldo YOUNG clusters

In the dataset from Solé-Boldo there is no direct information about the fibroblast subclusters (there is an R object, but loading it fails). To label them, we do a broad clustering with many clusters and gather them into bigger clusters that share the same patterns as the Solé-Boldo clusters. Those clusters will not be exactly the clusters from Solé-Boldo, but will be similar enough to draw the correct biological conclusions.

To do the assignment, we take the list of DEGs from Supplementary Data 3 and apply it to both old and young samples. We use the first 10 DEGs of each cluster for plotting.
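
Selecting the first DEGs per cluster from a supplementary table can be sketched as follows. This is a toy illustration: the column names (`cluster`, `gene`, `avg_logFC`), the values and the helper `top_genes` are assumptions, and ranking by fold change is one possible reading of "first"; the real table may already be ordered.

```python
import pandas as pd

# Toy DEG table (hypothetical values); column names are assumptions.
deg = pd.DataFrame(
    {
        "cluster": ["SPap", "SPap", "SPap", "Mes", "Mes", "Mes"],
        "gene": ["APCDD1", "COMP", "WIF1", "COCH", "ASPN", "POSTN"],
        "avg_logFC": [1.2, 2.5, 0.8, 3.1, 1.0, 2.2],
    }
)


def top_genes(table, n=10, score="avg_logFC"):
    """Return {cluster: [top-n genes]} ranked by decreasing score."""
    ranked = table.sort_values(score, ascending=False)
    return {
        cluster: group["gene"].head(n).tolist()
        for cluster, group in ranked.groupby("cluster", sort=False)
    }


print(top_genes(deg, n=2))
```

The resulting lists play the same role as the hand-written `SPap`, `InfA`, `InfB`, `SRet` and `Mes` gene lists used for plotting.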

In [None]:
sc.tl.umap(adata_sole_young_fb, min_dist=0.35, random_state=seed)
sc.tl.leiden(adata_sole_young_fb, resolution=2, random_state=seed)
sc.pl.umap(adata_sole_young_fb, color=['leiden'], legend_loc='on data')

In [None]:
sc.pl.umap(adata_sole_young_fb, color=['CCL19', 'CXCL12', 'WIF1', 'COMP',
                                       'SLPI', 'SFRP2', 'DPP4',
                                       'WISP2', 'ANGPTL7', 'COCH', 'POSTN', 
                                       'APOE', 'TNN', 'ASPN', 'MYOC'], 
           legend_loc='on data', cmap=magma, use_raw=False, ncols=3)

In [None]:
SPap = ['APCDD1', 'COMP', 'RGS16', 'ID1', 'HBB', 'RGS2', 'COL18A1', 'DUSP1', 'WIF1', 'NKD2']
InfA = ['CXCL3', 'CXCL2', 'CXCL1', 'C11orf96', 'SOD2', 'CCL2', 'TNFAIP6', 'GEM', 'MEDAG', 'IL6']
InfB =['CCL19', 'APOE', 'PTGDS', 'APOD', 'SOCS3', 'CXCL12', 'EGR1', 'JUN', 'ABCA8', 'RARRES2']
SRet = ['WISP2', 'SLPI', 'CTHRC1', 'FBLN1', 'IGFBP6', 'PCSK1N', 'TSPAN8', 'MFAP5', 'DCN', 'CFD']
Mes = ['COCH', 'ASPN', 'POSTN', 'TNN', 'DPEP1', 'COL11A1', 'SFRP1', 'HTRA1', 'MRPS6', 'GPC3']

In [None]:
print('SPap')
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + SPap, cmap=magma, legend_loc='on data', use_raw=False)
print('Sret')
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + SRet, cmap=magma, legend_loc='on data', use_raw=False)  
print('InfA')
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + InfA, cmap=magma, legend_loc='on data', use_raw=False) 
print('InfB')
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + InfB, cmap=magma, legend_loc='on data', use_raw=False)                      
print('Mes')
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + Mes, cmap=magma, legend_loc='on data', use_raw=False)  


In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb, groupby='leiden', )
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb, dendrogram=False, use_raw=False)

In [None]:
dict_reverse_mappings = {'SPap': ['1', '6', '2'], 
                 'SRet': ['5', '7', '9', '17'], 
                 'InfA': ['12', '14', '0'], 
                 'InfB': ['3', '8', '10'], 
                 'Mes': ['4', '11', '13', '15', '16', '18',], }

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key

In [None]:
adata_sole_young_fb.obs['SB_clusters'] = [dict_mappings[i] for i in adata_sole_young_fb.obs['leiden']]
sc.pl.umap(adata_sole_young_fb, color=['leiden', 'SB_clusters'], cmap=magma, legend_loc='on data')

In [None]:
sc.pl.umap(adata_sole_young_fb, color=['leiden', 'C2orf40'], cmap=magma, legend_loc='on data')

#### "Labelling" Solé-Boldo OLD clusters

In this part we simply run dimensionality reduction and Leiden community detection; the communities will be merged later into the known axes/clusters. **WE HAVE REMOVED THE 3rd DONOR BECAUSE OF BATCH EFFECTS!**

In [None]:
sc.tl.umap(adata_sole_old_fb, min_dist=0.05, random_state=seed)
sc.tl.leiden(adata_sole_old_fb, resolution=1.3, random_state=seed)
sc.pl.umap(adata_sole_old_fb, color=['leiden', 'id'], legend_loc='on data')

In [None]:
sc.pl.umap(adata_sole_young_fb, color=['leiden', 'POSTN', 'COL5A2', 'CALD1', 'DPEP1'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
sc.pl.umap(adata_sole_old_fb, color=['leiden', 'WISP2', 'SLPI', 'APCDD1', 'COL18A1', 'CCL19', 'APOE', 
                                     'POSTN', 'COL11A1', 'DPEP1'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
sc.tl.rank_genes_groups(adata_sole_old_fb, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_sole_old_fb, dendrogram=False, use_raw=False)

In [None]:
sc.pl.umap(adata_sole_old_fb, color=['leiden', 'APOE'], legend_loc='on data', cmap=magma, use_raw=False)

### Vorstandlechner et al. 2020

In [None]:
vors_dir = data_dir + '/Vorstandlechner_2020'

In [None]:
adata_vors = sc.read(vors_dir + '/skin_vorstandlechner.loom', cache=True)

In [None]:
adata_vors.obsm['tsne']  = adata_vors.obsm['tsne_cell_embeddings'] 

In [None]:
sc.tl.rank_genes_groups(adata_vors, groupby='res_1_2', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_vors, dendrogram=False, use_raw=False)

In [None]:
sc.pl.tsne(adata_vors, color=['res_1_2', 'FBLN1', 'PDGFRA', 'LUM', 'DCN', 'COL1A1'], cmap=magma, legend_loc='on data')

In [None]:
adata_vors_fb = adata_vors[adata_vors.obs['res_1_2'].isin(['0', '5', '6', '11', '12', '14'])].copy()  # copy so later in-place steps do not act on a view
adata_vors_fb_raw = adata_vors_fb.copy()

In [None]:
sc.pp.filter_genes(adata_vors_fb, min_counts=1)
sc.pp.log1p(adata_vors_fb)
tk.tl.triku(adata_vors_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_vors_fb, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_vors_fb, random_state=seed, knn=len(adata_vors_fb) ** 0.5 // 2, metric='cosine')

In [None]:
sc.tl.umap(adata_vors_fb, min_dist=0.3, random_state=seed)
sc.tl.leiden(adata_vors_fb, resolution=2, random_state=seed)
sc.pl.umap(adata_vors_fb, color=['leiden'], legend_loc='on data')

#### Labelling Vorstandlechner clusters

Since we do not have the original labels, and most of the clusters do not have exclusively expressed genes, we use some of the most useful markers to map the clusters as faithfully to the original as possible.

![](images/FS1.png)

In [None]:
FB1 = ['SLPI', 'C1QTNF3', 'MFAP5', ]  # 'PI16', 'DCN', 'CTHRC1',  'TSPAN8', 'CD55', 'PCOLCE2', 'SPARC', 'IGFBP6', 'COL1A1', 'COL3A1']
FB2 = ['JUN','FOS', 'ADH1B',]  # 'APOE','CXCL12','IGFBP7',,'C7','CCL19','RARRES2','IER2','SOCS3','CD74','IGFBP3']
FB3 = ['HSPB3','COL6A5', 'POSTN', 'RGS2',]  # 'GPC3','APOD','APCDD1','F13A1','NKD2','COL18A1', 'TSC22D3','SPRY1','BST2', 'SPARCL1','DDIT4','STMN1','CTGF']
FB4 = ['B4GALT1','JUND','HNRNPH1']  # ,'PPP1CB','C1orf56','CTNNB1','SAR1A','PTPRS','TWIST1', 'INSR','C11orf96', 'SET', 'WTAP', 'CRISPLD2']
FB5 = ['CCL2',  'DNAJA1', 'IL6', 'H2AFZ', 'GEM', 'CXCL3','CXCL2','SOD2','CXCL1','PLAUR','BIRC3',]
FB6 = ['DUSP4', 'PTGS2']  # ,'COMP', 'TNFAIP6', 'NR4A2','CEBPB','TNFAIP3','ARID5A','TGIF1','PTGS2',]

In [None]:
sc.pl.umap(adata_vors_fb, color=['leiden'] + ['SFRP4', 'APCDD1', 'WIF1', 'WISP2', 'APOE', 'CCL19', 'COCH', 'ANGPTL7', 'TNN', 'APLN'], cmap=magma, legend_loc='on data', use_raw=False)  

In [None]:
print('FB1')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB1, cmap=magma, legend_loc='on data', use_raw=False)
print('FB2')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB2, cmap=magma, legend_loc='on data', use_raw=False)  
print('FB3')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB3, cmap=magma, legend_loc='on data', use_raw=False) 
print('FB4')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB4, cmap=magma, legend_loc='on data', use_raw=False)                      
print('FB5')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB5, cmap=magma, legend_loc='on data', use_raw=False)  
print('FB6')
sc.pl.umap(adata_vors_fb, color=['leiden'] + FB6, cmap=magma, legend_loc='on data', use_raw=False)  

In [None]:
dict_reverse_mappings = {
                 'FB1': ['1', '7', '8', '13', '4'], 
                 'FB2': ['6', '11'], 
                 'FB3': ['2', '3', '5', '14', '15', '16'], 
                 'FB4': ['9', '10'], 
                 'FB5': ['0', '12'], 
                 'FB6': ['3']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key

In [None]:
adata_vors_fb.obs['vors_clusters'] = [dict_mappings[i] for i in adata_vors_fb.obs['leiden']]
sc.pl.umap(adata_vors_fb, color=['vors_clusters'], cmap=magma, legend_loc='on data')

### He et al. 2020

**IMPORTANT: SOME GENE SYMBOLS DIFFER IN THIS DATASET! It uses current HGNC symbols, which we map back to the symbols used in the other datasets.**

**WISP2 = CCN5**
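
The renaming applied to this dataset can be sketched as a small helper. The dictionary is the `dict_rep` defined at the top of the notebook; the function name `harmonize_symbols` is hypothetical.

```python
# Symbol found in this dataset -> symbol used by the other datasets.
dict_rep = {"CCN5": "WISP2", "ECRG4": "C2orf40"}


def harmonize_symbols(var_names, replacements):
    """Replace symbols found in `replacements`; leave every other symbol untouched."""
    return [replacements.get(name, name) for name in var_names]


print(harmonize_symbols(["CCN5", "LUM", "ECRG4"], dict_rep))
```

Applying the renaming before any plot or marker lookup avoids missing-gene errors for key markers such as WISP2.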

#### Adata creation and metadata gathering (Control samples)

In [None]:
adata_he = sc.read_loom(he_dir + '/He2020.loom')
adata_he.var_names_make_unique()

In [None]:
# Replace CCN5 by WISP2 because it is a key gene
adata_he.var_names = [dict_rep[i] if i in dict_rep else i for i in adata_he.var_names]

In [None]:
# Basic QC filtering
adata_he.var['mt'] = adata_he.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata_he, qc_vars=['mt'], percent_top=None, inplace=True)

In [None]:
sc.pl.violin(adata_he, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

sc.pl.scatter(adata_he, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata_he, x='total_counts', y='n_genes_by_counts')

In [None]:
adata_he = adata_he[adata_he.obs.n_genes_by_counts < 5000, :]
adata_he = adata_he[adata_he.obs.pct_counts_mt < 30, :].copy()  # copy so later in-place steps do not act on a view

In [None]:
sc.pp.filter_cells(adata_he, min_genes=250)

In [None]:
sc.pp.filter_genes(adata_he, min_counts=1)
sc.pp.log1p(adata_he)
sc.pp.normalize_per_cell(adata_he)
tk.tl.triku(adata_he, n_procs=25, random_state=seed)
sc.pp.pca(adata_he, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_he, random_state=seed, knn=len(adata_he) ** 0.5 // 2, metric='cosine')

In [None]:
sc.tl.umap(adata_he, min_dist=0.1, random_state=seed)
sc.tl.leiden(adata_he, resolution=1, random_state=seed)
sc.pl.umap(adata_he, color=['leiden'], legend_loc='on data')

In [None]:
sc.pl.umap(adata_he, color=['leiden', 'ANGPTL7', 'PDGFRA'], legend_loc='on data', cmap=magma, )

In [None]:
sc.pl.umap(adata_he, color=['leiden', 'KRT5', 'COL1A1', 'RGS5', 'MLANA', 'VWF', 'S100B', 'NRXN1'], legend_loc='on data', cmap=magma, )

In [None]:
sc.pl.umap(adata_he, color=['leiden', 'PDGFRA', 'LUM', 'DCN', 'FBLN1', 'COL1A1'], legend_loc='on data', cmap=magma, )

In [None]:
sc.tl.rank_genes_groups(adata_he, groupby='leiden')
sc.pl.rank_genes_groups(adata_he)

In [None]:
adata_he_fb = adata_he[adata_he.obs['leiden'].isin(['0', '9', '12', '21', '25'])].copy()  # copy so later in-place steps do not act on a view
adata_he_fb_raw = adata_he_fb.copy()
adata_he_fb_scvelo = adata_he_fb.copy()

In [None]:
sc.pp.filter_genes(adata_he_fb, min_counts=1)
tk.tl.triku(adata_he_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_he_fb, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_he_fb, random_state=seed, knn=len(adata_he_fb) ** 0.5 // 2, metric='cosine')

In [None]:
sc.tl.umap(adata_he_fb, min_dist=0.1, random_state=seed)
sc.tl.leiden(adata_he_fb, resolution=1.5, random_state=seed)
sc.pl.umap(adata_he_fb, color=['leiden'], legend_loc='on data')

#### Adata creation and metadata gathering (Injury samples)

In [None]:
adata_he_inj = sc.read_loom(he_dir + '/He2020_inj.loom')
adata_he_inj.var_names_make_unique()

In [None]:
# Basic QC filtering
adata_he_inj.var['mt'] = adata_he_inj.var_names.str.startswith('MT-')  # annotate the group of mitochondrial genes as 'mt'
sc.pp.calculate_qc_metrics(adata_he_inj, qc_vars=['mt'], percent_top=None, inplace=True)

In [None]:
sc.pl.violin(adata_he_inj, ['n_genes_by_counts', 'total_counts', 'pct_counts_mt'],
             jitter=0.4, multi_panel=True)

sc.pl.scatter(adata_he_inj, x='total_counts', y='pct_counts_mt')
sc.pl.scatter(adata_he_inj, x='total_counts', y='n_genes_by_counts')

In [None]:
adata_he_inj = adata_he_inj[adata_he_inj.obs.n_genes_by_counts < 5000, :]
adata_he_inj = adata_he_inj[adata_he_inj.obs.pct_counts_mt < 30, :].copy()  # copy so later in-place steps do not act on a view

In [None]:
sc.pp.filter_cells(adata_he_inj, min_genes=250)

In [None]:
sc.pp.filter_genes(adata_he_inj, min_counts=1)
sc.pp.log1p(adata_he_inj)
sc.pp.normalize_per_cell(adata_he_inj)
tk.tl.triku(adata_he_inj, n_procs=1, random_state=seed)
sc.pp.pca(adata_he_inj, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_he_inj, random_state=seed, knn=len(adata_he_inj) ** 0.5 // 2, metric='cosine')

In [None]:
sc.tl.umap(adata_he_inj, min_dist=0.1, random_state=seed)
sc.tl.leiden(adata_he_inj, resolution=1, random_state=seed)
sc.pl.umap(adata_he_inj, color=['leiden'], legend_loc='on data')

In [None]:
# Replace CCN5 by WISP2 because it is a key gene
adata_he_inj.var_names = [dict_rep[i] if i in dict_rep else i for i in adata_he_inj.var_names]

In [None]:
sc.pl.umap(adata_he_inj, color=['leiden', 'KRT5', 'COL1A1', 'RGS5', 'MLANA', 'VWF', 'S100B', 'NRXN1'], legend_loc='on data', cmap=magma, )

In [None]:
sc.pl.umap(adata_he_inj, color=['leiden', 'PDGFRA', 'LUM', 'DCN', 'FBLN1', 'COL1A1'], legend_loc='on data', cmap=magma, )

In [None]:
sc.tl.rank_genes_groups(adata_he_inj, groupby='leiden')
sc.pl.rank_genes_groups(adata_he_inj)

In [None]:
adata_he_inj_fb = adata_he_inj[adata_he_inj.obs['leiden'].isin(['0', '9', '10', '12', '15'])].copy()  # copy so later in-place steps do not act on a view

In [None]:
sc.pp.filter_genes(adata_he_inj_fb, min_counts=1)
tk.tl.triku(adata_he_inj_fb, n_procs=1, random_state=seed)
sc.pp.pca(adata_he_inj_fb, random_state=seed, n_comps=30)
sc.pp.neighbors(adata_he_inj_fb, random_state=seed, knn=len(adata_he_inj_fb) ** 0.5 // 2, metric='cosine')

In [None]:
sc.tl.umap(adata_he_inj_fb, min_dist=0.1, random_state=seed)
sc.tl.leiden(adata_he_inj_fb, resolution=2, random_state=seed)
sc.pl.umap(adata_he_inj_fb, color=['leiden'], legend_loc='on data')

### Popescu et al. 2019 (This is not included in the datasets!)

This is a LARGE dataset covering skin, liver and kidney as main organs. The authors sequence each organ and split the cells into immune and non-immune populations. The non-immune part of skin is not analysed in the paper, so we start from a blank slate: there are no published populations to recover.

#### Raw data and metadata extraction

In [None]:
popescu_dir = data_dir + '/Popescu_2019'

In [None]:
# Start from a clean directory: remove any previous download, then recreate it
!rm -rf {popescu_dir}
os.makedirs(popescu_dir, exist_ok=True)

In [None]:
!wget -P {popescu_dir} https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-7407/E-MTAB-7407.processed.1.zip
!wget -P {popescu_dir} https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-7407/E-MTAB-7407.processed.2.zip
!wget -P {popescu_dir} https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-7407/E-MTAB-7407.processed.3.zip
!wget -P {popescu_dir} https://www.ebi.ac.uk/arrayexpress/files/E-MTAB-7407/E-MTAB-7407.processed.4.zip

In [None]:
!unzip -o {popescu_dir}/E-MTAB-7407.processed.1.zip -d {popescu_dir} 
!unzip -o {popescu_dir}/E-MTAB-7407.processed.2.zip -d {popescu_dir}
!unzip -o {popescu_dir}/E-MTAB-7407.processed.3.zip -d {popescu_dir} 
!unzip -o {popescu_dir}/E-MTAB-7407.processed.4.zip -d {popescu_dir} 

We select the samples that belong to skin and are CD45- (to exclude immune cells, since fibroblast populations are CD45-).
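
Since the unselected files are deleted from disk, it can help to preview the selection first. A minimal dry-run sketch follows: the helper `split_by_sample` is hypothetical, two of the sample IDs are taken from the list used in this notebook, and the remaining file names are made up.

```python
def split_by_sample(file_names, sample_ids):
    """Partition file names into (keep, drop) by whether they contain a selected sample ID."""
    keep, drop = [], []
    for name in file_names:
        if any(sample in name for sample in sample_ids):
            keep.append(name)
        else:
            drop.append(name)
    return keep, drop


sample_ids = ["4834STDY7002880", "FCAImmP7241241"]
# Hypothetical directory listing.
files = [
    "4834STDY7002880.tar.gz",
    "FCAImmP7241241.tar.gz",
    "OTHER_SAMPLE.tar.gz",
    "E-MTAB-7407.processed.1.zip",
]

keep, drop = split_by_sample(files, sample_ids)
print("keep:", keep)
print("drop:", drop)
```

Only once the `drop` list looks right would the files in it be removed.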

In [None]:
selected_files = ['4834STDY7002880', '4834STDY7038753', 'FCAImmP7241241', 'FCAImmP7316888', 'FCAImmP7316897', 'FCAImmP7352190', 'FCAImmP7352191', 
                  'FCAImmP7462241']

for file in os.listdir(popescu_dir):
    # Delete every file that does not belong to one of the selected samples
    if not any(f in file for f in selected_files):
        os.remove(popescu_dir + '/' + file)

In [None]:
!ls {popescu_dir}

In [None]:
for file in selected_files:
    os.system(f"tar -zxvf {popescu_dir}/{file}.tar.gz -C {popescu_dir}")

#### Adata creation and metadata gathering

In [None]:
adata_popescu_4834STDY7002880 = sc.read_10x_mtx(popescu_dir + '/4834STDY7002880/GRCh38')
adata_popescu_4834STDY7038753 = sc.read_10x_mtx(popescu_dir + '/4834STDY7038753/GRCh38')
adata_popescu_FCAImmP7241241 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7241241/GRCh38')
adata_popescu_FCAImmP7316888 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7316888/GRCh38')
adata_popescu_FCAImmP7316897 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7316897/GRCh38')
adata_popescu_FCAImmP7352190 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7352190/GRCh38')
adata_popescu_FCAImmP7352191 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7352191/GRCh38')
adata_popescu_FCAImmP7462241 = sc.read_10x_mtx(popescu_dir + '/FCAImmP7462241/GRCh38')

In [None]:
adata_popescu = sc.AnnData.concatenate(adata_popescu_4834STDY7002880, adata_popescu_4834STDY7038753, adata_popescu_FCAImmP7241241, 
                                       adata_popescu_FCAImmP7316888, adata_popescu_FCAImmP7316897, adata_popescu_FCAImmP7352190, 
                                       adata_popescu_FCAImmP7352191, adata_popescu_FCAImmP7462241)

In [None]:
adata_popescu.raw = adata_popescu

In [None]:
sc.pp.filter_genes(adata_popescu, min_counts=1)
sc.pp.log1p(adata_popescu)
tk.tl.triku(adata_popescu, n_procs=1, random_state=seed)
sc.pp.pca(adata_popescu, random_state=seed, n_comps=30)
sce.pp.bbknn(adata_popescu, metric='angular')

In [None]:
sc.tl.umap(adata_popescu, min_dist=0.3, random_state=seed)
sc.tl.leiden(adata_popescu, resolution=1.5, random_state=seed)
sc.pl.umap(adata_popescu, color=['leiden', 'batch'], legend_loc='on data')

In [None]:
sc.pl.umap(adata_popescu, color=['leiden', 'RGS5', 'MYL9', 'NDUFA4L2', 'NPY', 'CHRNA1', 'S100B', 'MPZ'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
sc.pl.umap(adata_popescu, color=['leiden', 'KRT18', 'KRT5', 'MLANA', 'CLDN5', 'PECAM1'], legend_loc='on data', cmap=magma, use_raw=False)

In [None]:
sc.pl.umap(adata_popescu, color=['leiden', 'GINS2', 'PCNA', 'MCM3', 'NASP', 'DEK', 'HELLS', 'DUT'], legend_loc='on data', cmap=magma, use_raw=False)

We see that the clusters on the outskirts of the UMAP (16, 17, 19, 20, 21, 22) barely express LUM or PDGFRA, so they are not good candidates for being fibroblasts.
* 16, 21: Pericytes (MYL9, RGS5, NDUFA4L2)
* 19: Neurons (NPY, CHRNA1, CHODL)
* 17, 22: Schwann cells (S100B, MPZ)
* 20, 21: Keratinocytes (KRT18, KRT5)
* 22: Melanocytes (MLANA, TYRP)
* 21: Endothelial cells (CLDN5, PECAM1, CDH5) + Keratinocytes

Additionally, clusters 3, 11, 12 and 18 express genes that are involved in the cell cycle:
* GINS2 - DNA Replication Complex GINS Protein PSF2
* MCM3/5/7 - Minichromosome Maintenance Complex Component: involved in the initiation of eukaryotic genome replication
* PCNA - Proliferating Cell Nuclear Antigen: increases the processivity of leading strand synthesis during DNA replication
* NASP - encodes an H1 histone binding protein that is involved in transporting histones into the nucleus of dividing cells
* DEK - binds to cruciform and superhelical DNA and induces positive supercoils into closed circular DNA
* HELLS - Helicase, Lymphoid Specific: involved in processes that require DNA strand separation, including replication, repair, recombination, and transcription
* DUT - provides a precursor (dUMP) for the synthesis of thymine nucleotides needed for DNA replication

We will remove those clusters from the analysis, because we want to focus on populations that are as similar as possible to the adult ones. Cluster 13 is intermixed with other fibroblast clusters, so we will keep it just in case.
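
The observation that some clusters "barely express" LUM or PDGFRA can be made numeric by computing, per cluster, the fraction of cells with non-zero expression. The sketch below is only an illustration in plain Python: the function name `fraction_expressing`, the 0.5 cutoff and the toy values are assumptions, not part of the original analysis.

```python
from collections import defaultdict


def fraction_expressing(cluster_labels, expression, threshold=0.0):
    """Return {cluster: fraction of cells with expression > threshold}."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for label, value in zip(cluster_labels, expression):
        totals[label] += 1
        if value > threshold:
            positives[label] += 1
    return {label: positives[label] / totals[label] for label in totals}


# Toy, made-up values for illustration only (not taken from the dataset)
labels = ['0', '0', '0', '16', '16', '19']
lum = [2.1, 1.4, 0.0, 0.0, 0.0, 0.3]

fractions = fraction_expressing(labels, lum)
# Clusters where fewer than half of the cells express the marker (arbitrary cutoff)
low_clusters = sorted(c for c, f in fractions.items() if f < 0.5)
```

With the notebook's objects, `cluster_labels` could be `adata_popescu.obs['leiden']` and `expression` a flat per-cell vector for LUM or PDGFRA; how that vector is extracted depends on whether `adata.X` is sparse.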


In [None]:
sc.tl.rank_genes_groups(adata_popescu, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_popescu, dendrogram=False, n_genes=20, use_raw=False)

In [None]:
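# Copy the subset so that the in-place preprocessing below does not operate on a view of adata_popescu
adata_popescu_fb = adata_popescu[adata_popescu.obs['leiden'].isin(['0', '1', '2', '4', '5', '6', '7', '8', '9', '10', '13', '14', '15'])].copy()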

In [None]:
sc.pp.filter_genes(adata_popescu_fb, min_counts=1)
sc.pp.pca(adata_popescu_fb, random_state=seed, n_comps=30)
sce.pp.bbknn(adata_popescu_fb, metric='angular')
tk.tl.triku(adata_popescu_fb, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_popescu_fb, min_dist=0.3, random_state=seed)
sc.tl.leiden(adata_popescu_fb, resolution=1.5, random_state=seed)
sc.pl.umap(adata_popescu_fb, color=['leiden'], legend_loc='on data')

## Getting robust clusters

We see that all datasets have similar cell types, but the number of clusters is inconsistent. In this section we recluster each dataset, starting from its leiden clustering. To do so, we compute the DEGs between all the small clusters and merge two or more clusters when no DEG distinguishes them. We repeat the procedure until every remaining cluster has distinct DEGs. We aim for 10 to 15 clusters per dataset, although later steps may reduce that number further.
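
The merging procedure described above can be sketched as a greedy loop. In this notebook the merges are chosen by hand from the DEG plots, so the code below is only a hypothetical illustration: `merge_clusters_without_degs`, the `n_degs` callback and the toy pairwise counts are all assumptions, and a real `n_degs` would have to re-run the differential expression test on the merged groups.

```python
def merge_clusters_without_degs(clusters, n_degs):
    """Greedily merge clusters until every remaining pair has at least one DEG.

    clusters: iterable of cluster labels.
    n_degs: callable(group_a, group_b) -> number of DEGs separating two groups,
            where each group is a frozenset of original cluster labels.
    """
    groups = [frozenset([c]) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if n_degs(groups[i], groups[j]) == 0:
                    # No distinct DEG: replace the two groups with their union
                    union = groups[i] | groups[j]
                    groups = [g for k, g in enumerate(groups) if k not in (i, j)]
                    groups.append(union)
                    merged = True
                    break
            if merged:
                break
    return groups


# Toy pairwise DEG counts between original clusters (made-up numbers)
toy_counts = {
    frozenset(['1', '4']): 0,
    frozenset(['1', '3']): 12,
    frozenset(['3', '4']): 9,
}


def toy_n_degs(group_a, group_b):
    # Stand-in for a real test: smallest pairwise count across the two groups
    return min(toy_counts[frozenset([a, b])] for a in group_a for b in group_b)


result = sorted(sorted(g) for g in merge_clusters_without_degs(['1', '3', '4'], toy_n_degs))
```

Each resulting group plays the role of one entry of the `dict_reverse_mappings` dictionaries used in the cells that follow.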

### Tabib et al. 2018

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_tabib_fb, dendrogram=False, n_genes=80, use_raw=False)

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_tabib_fb, dendrogram=False, n_genes=15, use_raw=False)

In [None]:
sc.pl.umap(adata_tabib_fb, color='leiden', legend_loc='on data')

In [None]:
dict_reverse_mappings = {'T0': ['1', '4'], 'T1': ['3'], 'T2': ['2'], 'T3': ['0', '8', '16'], 
                         'T4': ['11', '13', '21'], 'T5': ['5', '10'], 'T6': ['7', '17', '14'], 'T7': ['6', '9'], 
                         'T8': ['12'], 'T9':['15'], 'T10':['18'], 'T11':['19'], 'T12':['20'], 'T13':['22']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_tabib_fb.obs['robust_clustering_1'] = [dict_mappings[i] for i in adata_tabib_fb.obs['leiden']]
sc.pl.umap(adata_tabib_fb, color=['robust_clustering_1'], cmap=magma, legend_loc='on data')
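
The cell above inverts a `{new_label: [old_labels]}` dictionary with a nested loop, and the same pattern is repeated for every dataset. If an old label is listed under two new labels, the loop silently keeps the last one. A small helper could make that case explicit; `invert_mapping` is a hypothetical name used here for illustration and is not part of the original notebook.

```python
def invert_mapping(reverse_mapping):
    """Turn {new_label: [old_labels]} into {old_label: new_label}.

    Raises ValueError if an old label is assigned to more than one new label,
    instead of silently keeping the last assignment.
    """
    mapping = {}
    for new_label, old_labels in reverse_mapping.items():
        for old_label in old_labels:
            if old_label in mapping:
                raise ValueError(
                    f"{old_label!r} is assigned to both "
                    f"{mapping[old_label]!r} and {new_label!r}"
                )
            mapping[old_label] = new_label
    return mapping


# Small excerpt of a mapping, for illustration
example = invert_mapping({'T0': ['1', '4'], 'T1': ['3']})
```

The returned dictionary can be used in the same list comprehension as `dict_mappings` to build the new `obs` column.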

In [None]:
sc.pl.umap(adata_tabib_fb, color=['robust_clustering_1',  'CCL2', 'GGT5'], cmap=magma, legend_loc='on data', use_raw=False)

In [None]:
genes = ['GDF10', 'PLA2G2A']
sc.pl.umap(adata_tabib_fb, color=['robust_clustering_1'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups(adata_tabib_fb)

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_tabib_fb, dendrogram=False,  n_genes=60, use_raw=False)

In [None]:
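# T12 is listed only under axis D: a second entry under 'B' would be silently overwritten by the loop below
dict_reverse_mappings = {'A': ['T0', 'T3', 'T6', 'T8', 'T10'], 'B': ['T2', 'T5', 'T7'], 'C': ['T1', 'T4', 'T9', 'T11'], 'D': ['T13', 'T12']}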

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_tabib_fb.obs['axes'] = [dict_mappings[i] for i in adata_tabib_fb.obs['robust_clustering_1']]

adata_tabib_fb.uns['axes_colors'] = ['#9a1549', '#00764b', '#002562', '#606060']

sc.pl.umap(adata_tabib_fb, color=['axes'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'axes', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_tabib_fb, dendrogram=False,  n_genes=60, use_raw=False)

In [None]:
dict_reverse_mappings = {'A1': ['T8', 'T0'], 'A2': ['T3'], 'A3': ['T6'], 'A4': ['T10'], 
                         'B1': ['T2', 'T5'], 'B2': ['T7'], 
                         'C1': ['T1'], 'C2': ['T9'], 'C3': ['T4'], 'C4': ['T11'], 
                         'D1': ['T12'], 'D2': ['T13']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_tabib_fb.obs['clusters'] = [dict_mappings[i] for i in adata_tabib_fb.obs['robust_clustering_1']]

adata_tabib_fb.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
                                                    '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8']

sc.pl.umap(adata_tabib_fb, color=['clusters'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'clusters', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_tabib_fb, dendrogram=False,  n_genes=70, use_raw=False)

#### Getting DEGs for clusters D1 and D2

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, 'clusters', method='wilcoxon', groups=['D1', 'D2'], reference='rest', n_genes=200)

In [None]:
sc.pl.umap(adata_tabib_fb, color=['clusters'] + list(adata_tabib_fb.uns['rank_genes_groups']['names']['D1'][:100]), 
           cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.pl.umap(adata_tabib_fb, color=['clusters'] + list(adata_tabib_fb.uns['rank_genes_groups']['names']['D1'][100:]), 
           cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.pl.umap(adata_tabib_fb, color=['clusters'] + list(adata_tabib_fb.uns['rank_genes_groups']['names']['D2'][100:]), 
           cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.pl.umap(adata_tabib_fb, color=['clusters'] + list(adata_tabib_fb.uns['rank_genes_groups']['names']['D2'][:100]), 
           cmap=magma, legend_loc='on data', use_raw=False) 

### Philippeos et al. 2018

In [None]:
sc.pl.umap(adata_phil_fb, color=['leiden', 'MFAP5','ID3', 'POSTN', 'FGF7', 'COL18A1'], cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.tl.rank_genes_groups(adata_phil_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_phil_fb, dendrogram=False, n_genes=60, use_raw=False)

* Clusters 1 and 2 are interesting because they are the common COL18A1 and MFAP5 clusters, which appear in all datasets. 
* Cluster 0 does not have a clear transcriptomic profile. Based on its DEGs (ID3, MFAP5(lo), PLK2, MEOX2...) it could be the POSTN/ASPN cluster, but I am not sure.

In [None]:
dict_reverse_mappings = {'P0': ['0'], 'P1': ['1'], 'P2': ['2']}
 
dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_phil_fb.obs['robust_clustering_1'] = [dict_mappings[i] for i in adata_phil_fb.obs['leiden']]
sc.pl.umap(adata_phil_fb, color=['robust_clustering_1'], cmap=magma, legend_loc='on data')

In [None]:
sc.tl.rank_genes_groups(adata_phil_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_phil_fb, dendrogram=False, n_genes=60, use_raw=False)

In [None]:
sc.pl.umap(adata_phil_fb, color=['robust_clustering_1', 'ID3', 'CYR61', 'SPARCL1', 'CRABP1', 'FIBIN', 'COCH'], cmap=magma, legend_loc='on data', ncols=5)
sc.pl.umap(adata_tabib_fb, color=['robust_clustering_1', 'ID3', 'CYR61', 'SPARCL1'], cmap=magma, legend_loc='on data', ncols=5, use_raw=False)

In [None]:
fig = sc.pl.umap(adata_phil_fb, color=['robust_clustering_1', 
                                 'WISP2', 'SLPI', 'PI16', 'SFRP2', 'APCDD1', 'COL18A1', 'RGS3'], 
           cmap=magma, legend_loc='on data', ncols=4, return_fig=True)

plt.tight_layout()
plt.savefig('images/philippeos_markers.png', dpi=300)

### Solé-Boldo et al. 2020

In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb, dendrogram=False, n_genes=20, use_raw=False)

In [None]:
dict_reverse_mappings = {'S0': ['5', '7', '9', '17'], 
                         'S1': ['1', '2'],  # leiden cluster '6' is assigned to S4 below
                         'S2': ['3', '8', '10'], 
                         'S3': ['0', '12', '14'], 'S4': ['6'], 'S5': ['4', '13', '15'],
                         'S6': ['11'], 'S7': ['18'], 'S8': ['16']}
 
dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_sole_young_fb.obs['robust_clustering_1'] = [dict_mappings[i] for i in adata_sole_young_fb.obs['leiden']]
sc.pl.umap(adata_sole_young_fb, color=['robust_clustering_1'], cmap=magma, legend_loc='on data')

In [None]:
sc.pl.umap(adata_sole_young_fb, color=['robust_clustering_1', 'APOE', 'C7', 'MEDAG', 'CYGB', 'APOC1', 'CXCL2', 'GPC3'
], cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb, dendrogram=False, n_genes=70, use_raw=False)

In [None]:
dict_reverse_mappings = {'A': ['S0', 'S1', 'S4'], 
                         'B': ['S2', 'S3'], 
                         'C': ['S5', 'S6', 'S7', 'S8']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_sole_young_fb.obs['axes'] = [dict_mappings[i] for i in adata_sole_young_fb.obs['robust_clustering_1']]
adata_sole_young_fb.uns['axes_colors'] = ['#9a1549', '#00764b', '#002562', '#b21b95']


sc.pl.umap(adata_sole_young_fb, color=['axes'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb, 'axes', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb, dendrogram=False,  n_genes=60, use_raw=False)

In [None]:
dict_reverse_mappings = {'A1': ['S0'], 'A2': ['S1'], 'A3': ['S4'],  
                         'B1': ['S3'], 'B2': ['S2'], 
                         'C1': ['S5'], 'C2': ['S6'], 'C3': ['S7'], 'C4': ['S8'], 
                         }

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_sole_young_fb.obs['clusters'] = [dict_mappings[i] for i in adata_sole_young_fb.obs['robust_clustering_1']]

adata_sole_young_fb.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#009f61', '#54ab4c', 
                                                    '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#b21b95']

sc.pl.umap(adata_sole_young_fb, color=['clusters'], cmap=magma, legend_loc='on data', legend_fontsize=16)

### Vorstandlechner et al. 2020

In [None]:
sc.tl.rank_genes_groups(adata_vors_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_vors_fb, dendrogram=False, n_genes=50, use_raw=False)

In [None]:
sc.tl.rank_genes_groups(adata_vors_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_vors_fb, dendrogram=False, n_genes=15, use_raw=False)

In [None]:
sc.pl.umap(adata_vors_fb, color=['leiden', 'CCL19', 'PLA2G2A'], cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.pl.umap(adata_vors_fb, color=['leiden'], cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
dict_reverse_mappings = {'V0': ['1', '4', '7'], 'V1': ['5', '14'], 'V2': ['6'], 'V3': ['0'], 
                         'V4': ['3', '13'], 'V5': ['12'], 'V6': ['2'], 'V7': ['9', '10'], 
                         'V8': ['8'], 'V9':['11'], 'V10': ['15'], 'V11': ['16']}
 
dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_vors_fb.obs['robust_clustering_1'] = [dict_mappings[i] for i in adata_vors_fb.obs['leiden']]
sc.pl.umap(adata_vors_fb, color=['robust_clustering_1'], cmap=magma, legend_loc='on data')

In [None]:
sc.pl.umap(adata_vors_fb, color=['robust_clustering_1', 'CCL2', 'ITM2A', 'PLA2G2A', 'SOD2'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.pl.umap(adata_vors_fb, color=['robust_clustering_1', 'WIF1', 'PLA2G2A', 'APCDD1'], cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.tl.rank_genes_groups(adata_vors_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_vors_fb, dendrogram=False, n_genes=20, use_raw=False)

In [None]:
dict_reverse_mappings = {'A': ['V0', 'V1', 'V4', 'V6', 'V8'], 'B': ['V2', 'V3', 'V5', 'V9', 'V10'], 'C': ['V11'], 'F': ['V7']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_vors_fb.obs['axes'] = [dict_mappings[i] for i in adata_vors_fb.obs['robust_clustering_1']]
adata_vors_fb.uns['axes_colors'] = ['#9a1549', '#00764b', '#002562', '#6d6d6d']
sc.pl.umap(adata_vors_fb, color=['axes'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_vors_fb, 'axes', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_vors_fb, dendrogram=False,  n_genes=60, use_raw=False)

In [None]:
dict_reverse_mappings = {'A1': ['V0'], 'A3': ['V4'], 'A2': ['V1', 'V6'],  'A4': ['V8'],
                         'B1': ['V3', 'V5', 'V10'], 'B2': ['V2', 'V9'], 
                         'C': ['V11'],
                         'F': ['V7']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_vors_fb.obs['clusters'] = [dict_mappings[i] for i in adata_vors_fb.obs['robust_clustering_1']]

adata_vors_fb.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
                                                    '#002562', '#6d6d6d']

sc.pl.umap(adata_vors_fb, color=['clusters'], cmap=magma, legend_loc='on data', legend_fontsize=16)

#### Getting DEGs for cluster F

In [None]:
sc.tl.rank_genes_groups(adata_vors_fb, 'clusters', method='wilcoxon', groups=['F'], reference='rest', n_genes=200)

In [None]:
sc.pl.umap(adata_vors_fb, color=['clusters'] + list(adata_vors_fb.uns['rank_genes_groups']['names']['F'][:100]), 
           cmap=magma, legend_loc='on data', use_raw=False) 

### He et al. 2020

#### Colocalization of COL18A1, COL6A5, CCL19

In [None]:
genes = ['COL18A1', 'COL6A5', 'CCL19']
sc.pl.umap(adata_tabib_fb, color=['leiden'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 
sc.pl.umap(adata_vors_fb, color=['leiden'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 
sc.pl.umap(adata_sole_young_fb, color=['leiden'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 
sc.pl.umap(adata_he_fb, color=['leiden'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 

Interestingly, CCL19 barely colocalizes with COL6A5 in the rest of the datasets, but it does in He et al.
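
The colocalization is judged visually from the UMAPs. One possible way to put a number on it is the fraction of CCL19-positive cells that are also COL6A5-positive. The sketch below assumes per-cell expression vectors are available as flat sequences; the function name `coexpression_fraction` and the toy values are illustrative assumptions.

```python
def coexpression_fraction(gene_a, gene_b, threshold=0.0):
    """Fraction of cells expressing gene_a that also express gene_b."""
    # Keep the gene_b value of every cell that is positive for gene_a
    b_in_a_positive = [b for a, b in zip(gene_a, gene_b) if a > threshold]
    if not b_in_a_positive:
        return float('nan')
    return sum(b > threshold for b in b_in_a_positive) / len(b_in_a_positive)


# Toy, made-up expression values for five cells
ccl19 = [1.2, 0.0, 0.8, 0.0, 2.0]
col6a5 = [0.0, 1.5, 0.9, 0.0, 0.0]

fraction = coexpression_fraction(ccl19, col6a5)
```

Computing this value per dataset would give a number to compare across the four plots.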

In [None]:
sc.tl.rank_genes_groups(adata_he_fb, 'leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_he_fb, dendrogram=False, n_genes=70, use_raw=False)

In [None]:
sc.pl.umap(adata_he_fb, color=['leiden'] + genes, cmap=magma, legend_loc='on data', use_raw=False) 

In [None]:
sc.pl.umap(adata_he_fb, color=['leiden', 'COL11A1', 'DPEP1', 'STMN1', 'EDNRA'], cmap=magma, legend_loc='on data', use_raw=False, ncols=3) 

In [None]:
sc.pl.umap(adata_he_fb, color=['leiden', 'COCH', 'DPEP1', 'ANGPTL7', 'COL11A1'], cmap=magma, legend_loc='on data', use_raw=False, ncols=3) 

In [None]:
sc.pl.umap(adata_he_fb, color=['leiden', 'ANGPTL7', 'C3', 'ITM2A', 'SFRP2', 
                               'CCL2', 'CD74', 'CTSH', 'PSME2'], cmap=magma, legend_loc='on data', use_raw=False, ncols=3) 

In [None]:
sc.tl.rank_genes_groups(adata_he_fb, groupby='leiden')
sc.pl.rank_genes_groups(adata_he_fb)

In [None]:
adata_he_fb

In [None]:
sc.pl.umap(adata_he_fb, color=['leiden'], cmap=magma, legend_loc='on data', use_raw=False, ncols=1)

In [None]:
dict_reverse_mappings = {'H0': ['1', '7'], 'H1': ['13', '14', '4', '5', '0', '16'], 
                         'H2': ['8', '9', '11'], 'H3': ['2', '6'], 
                         'H4': ['3', '10'], 'H5': ['12'], 'H6': ['15'],}
 
dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_he_fb.obs['robust_clustering_1'] = [dict_mappings[i] for i in adata_he_fb.obs['leiden']]
sc.pl.umap(adata_he_fb, color=['robust_clustering_1'], cmap=magma, legend_loc='on data')

In [None]:
sc.pl.umap(adata_he_fb, color=['robust_clustering_1', 'AXIN2', 'PTK7'], cmap=magma, legend_loc='on data')

In [None]:
sc.tl.rank_genes_groups(adata_he_fb, 'robust_clustering_1', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_he_fb, dendrogram=False, n_genes=70, use_raw=False)

In [None]:
dict_reverse_mappings = {'A': ['H0', 'H1'], 'B': ['H2', 'H3'], 'C': ['H4', 'H6', 'H5']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_he_fb.obs['axes'] = [dict_mappings[i] for i in adata_he_fb.obs['robust_clustering_1']]
adata_he_fb.uns['axes_colors'] = ['#9a1549', '#00764b', '#002562', '#b21b95']

sc.pl.umap(adata_he_fb, color=['axes'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_he_fb, 'axes', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_he_fb, dendrogram=False, n_genes=70, use_raw=False)

In [None]:
dict_reverse_mappings = {'A': ['H0', 'H1'],
                         'B1': ['H3'], 'B2': ['H2'], 
                         'C1': ['H4'], 'C2': ['H6'], 'C4': ['H5']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_he_fb.obs['clusters'] = [dict_mappings[i] for i in adata_he_fb.obs['robust_clustering_1']]

adata_he_fb.uns['clusters_colors'] = ['#9a1549', '#009f61', '#54ab4c', '#008aac', '#006aad', '#3fb4c1', '#b21b95']

sc.pl.umap(adata_he_fb, color=['clusters'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
xx = ['ABCA10', 'ABI3BP', 'APOC1']
sc.pl.umap(adata_he_fb, color=['clusters'] + xx, cmap=magma, legend_loc='on data', legend_fontsize=16)
sc.pl.umap(adata_tabib_fb, color=['clusters'] + xx, cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.tl.rank_genes_groups(adata_he_fb, 'clusters', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_he_fb, dendrogram=False, n_genes=20, use_raw=False)

#### Labelling the injured samples

Now that we know the markers for each of the populations, we are going to apply this naming convention to the diseased sample.
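
In the cells that follow, labels are assigned by inspecting marker UMAPs. As a complementary, hypothetical sketch, each leiden cluster could instead be given the population whose markers have the highest average expression in it. The marker sets are the ones defined in the following cells; the function name `best_matching_population` and the toy cluster means are assumptions made for illustration.

```python
def best_matching_population(cluster_means, marker_sets):
    """Return (best population, scores) for one cluster.

    cluster_means: {gene: mean expression of that gene in the cluster}.
    marker_sets: {population: [marker genes]}.
    Genes missing from cluster_means count as 0.
    """
    scores = {
        population: sum(cluster_means.get(g, 0.0) for g in genes) / len(genes)
        for population, genes in marker_sets.items()
    }
    return max(scores, key=scores.get), scores


marker_sets = {
    'A1': ['WISP2', 'PI16', 'SLPI'],
    'A2': ['COL18A1', 'COMP', 'NKD2'],
    'B2': ['CCL19', 'CTSH', 'C3'],
}

# Toy, made-up mean expression values for a single cluster
toy_cluster_means = {'WISP2': 0.1, 'PI16': 0.2, 'SLPI': 0.0,
                     'COL18A1': 1.8, 'COMP': 1.1, 'NKD2': 0.7, 'CCL19': 0.0}

label, scores = best_matching_population(toy_cluster_means, marker_sets)
```

A score like this is only a first guess; ambiguous clusters still need the visual check done in the following cells.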

In [None]:
# A1, A2, A3, A4 markers

A_markers = ['SFRP2', 'CTHRC1', 'ELN']
A1_markers = ['WISP2', 'PI16', 'SLPI']
A2_markers = ['COL18A1', 'COMP', 'NKD2']
A3_markers = ['WIF1', 'RGCC', 'CD9']
A4_markers = ['IGFBP6', 'SFRP4', 'PCOLCE2']

sc.pl.umap(adata_he_inj_fb, color=['leiden'] + A_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + A1_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + A2_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + A3_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + A4_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 

In [None]:
# B markers

B_markers = ['APOE', 'IGFBP7', 'C7']
B1_markers = ['ITM2A', 'GPC3', 'CCL2']
B2_markers = ['CCL19', 'CTSH', 'C3']

sc.pl.umap(adata_he_inj_fb, color=['leiden'] + B_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + B1_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + B2_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 

In [None]:
# C markers

C_markers = ['DKK3', 'TNMD', 'TNN']
C1_markers = ['DPEP1', 'COL11A1', 'STMN1']
C2_markers = ['COCH', 'HSPA2', 'CRABP1']
C3_markers = ['POSTN', 'ASPN', 'GPM6B']
C4_markers = ['ANGPTL7', 'APOD', 'TENM2']

sc.pl.umap(adata_he_inj_fb, color=['leiden'] + C_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + C1_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + C2_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + C3_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_tabib_fb, color=['leiden'] + C3_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + C4_markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 

Cluster B1 can in fact be subdivided into two groups, similarly to Tabib: clusters 0/10 are more GPC3/MGST1-like, while clusters 1/2/3/17 are more C7-like.

In [None]:
markers = ['GPC3', 'MGST1', 'C7']
sc.pl.umap(adata_he_inj_fb, color=['leiden'] + markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 
sc.pl.umap(adata_tabib_fb, color=['leiden'] + markers, cmap=magma, legend_loc='on data', use_raw=False, ncols=4) 

In [None]:
dict_reverse_mappings = {'A': ['5', '11', '15', '13', '6', '7', '4', '8', '14'], 
                         'B': ['0', '10', '1', '2', '3', '17', '9', '12'], 'C': ['16']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_he_inj_fb.obs['axes'] = [dict_mappings[i] for i in adata_he_inj_fb.obs['leiden']]
adata_he_inj_fb.uns['axes_colors'] = ['#9a1549', '#00764b', '#002562']

sc.pl.umap(adata_he_inj_fb, color=['axes'], cmap=magma, legend_loc='on data', legend_fontsize=16)




dict_reverse_mappings = {'A1': ['6', '7', '13', '15', '5'], 'A2': ['4'], 'A3': ['8', '14'], 'A4': ['11'], 
                         'B1': ['0', '10', '1', '2', '3', '17'], 'B2': ['9', '12'], 
                         'C1': [], 'C2': [], 'C3': [], 'C4': ['16']}

dict_mappings = {}

for key, val in dict_reverse_mappings.items():
    for val_i in val:
        dict_mappings[val_i] = key
        
adata_he_inj_fb.obs['clusters'] = [dict_mappings[i] for i in adata_he_inj_fb.obs['leiden']]

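# C1-C3 are empty in this dataset, so only the colors of the seven populations present
# (A1-A4, B1, B2, C4) are listed; this keeps C4 on the color it has in the other datasets
adata_he_inj_fb.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
                                                    '#3fb4c1']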

sc.pl.umap(adata_he_inj_fb, color=['clusters'], cmap=magma, legend_loc='on data', legend_fontsize=16)

In [None]:
sc.pl.umap(adata_he_inj_fb, color=['clusters', 'COL18A1', 'WIF1', 'SLPI', 'SFRP4'], cmap=magma, legend_loc='on data', legend_fontsize=16, ncols=5)
sc.pl.umap(adata_he_fb, color=['clusters', 'COL18A1', 'WIF1', 'SLPI', 'SFRP4'], cmap=magma, legend_loc='on data', legend_fontsize=16, ncols=5)

# Integration of datasets
In this section we are going to integrate the different datasets into one, recluster it, and see how our populations merge. Because of the quality of its data, the He dataset will be alternately included and excluded to see how it changes the integration.

In [None]:
for adata, adata_raw in zip([adata_tabib_fb, adata_sole_young_fb, adata_vors_fb, adata_he_fb], 
                            [adata_tabib_fb_raw, adata_sole_young_fb_raw, adata_vors_fb_raw, adata_he_fb_raw]):
    adata_raw.obs['axes'] = adata.obs['axes']
    adata_raw.obs['clusters'] = adata.obs['clusters']
    adata_raw.obs = adata_raw.obs[['axes', 'clusters']].astype(str)
    
    # The He and SB adatas were already log-transformed during the first steps of their preprocessing.
    # To leave every dataset in the same state, the remaining ones are log-transformed now and not later.
    if 'log1p' not in adata_raw.uns:
        sc.pp.log1p(adata_raw)


We will create a function to get a simple evaluation of the batch effect correction using entropy. For each leiden cluster (the number of clusters may vary, but will generally be low), the function plots the proportion of cells coming from each dataset and computes the entropy of those proportions as a numeric measure of the "goodness" of the integration. We use a normalized entropy, that is, the entropy divided by a maximum entropy, to adjust for the number of datasets in the adata. That maximum would be log(n_datasets) for equally sized datasets; the function below computes it from the overall dataset proportions.
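
As a minimal numeric sketch of the normalized entropy, the code below divides the Shannon entropy of one cluster's dataset labels by log(n_datasets). The function names are illustrative assumptions. Note that the `plot_entropy` function that follows derives its denominator from the overall dataset proportions instead, so the two only coincide when all datasets have the same number of cells.

```python
import math
from collections import Counter


def shannon_entropy(labels):
    """Shannon entropy (natural log) of the label frequencies."""
    counts = Counter(labels)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())


def normalized_entropy(cluster_labels, n_datasets):
    """Entropy of one cluster's dataset labels divided by log(n_datasets)."""
    if n_datasets < 2:
        raise ValueError("need at least two datasets")
    return shannon_entropy(cluster_labels) / math.log(n_datasets)


# A cluster with one cell from each of four datasets is perfectly mixed
mixed = normalized_entropy(['T', 'S', 'V', 'H'], n_datasets=4)
# A cluster made of a single dataset is not mixed at all
single = normalized_entropy(['T', 'T', 'T', 'T'], n_datasets=4)
```

Values close to 1 indicate that a cluster mixes cells from all datasets, and values close to 0 indicate a cluster dominated by one dataset.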

In [None]:
def plot_entropy(adata, column='leiden', column_ref='dataset', list_colors = ['#94346e', '#e17c05', '#0F8554', '#1D6996'] * 10):
    fig, axs = plt.subplots(1, 2, figsize=(15, 5))
    
    clusters = sorted(list(set(adata.obs[column])))
    n_clusters = len(clusters)
    datasets = sorted(list(set(adata.obs[column_ref].values)))
    
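    entropies = []
    # Reference entropy: entropy of the overall proportions of column_ref in adata
    # (equal to log(n_datasets) only when all datasets have the same number of cells)
    max_entropy_vec = np.array([len(adata[adata.obs[column_ref] == d]) for d in datasets])/len(adata)
    max_entropy = -np.sum(max_entropy_vec * np.log(max_entropy_vec))
    print(max_entropy)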
    
    for cluster_idx, cluster in enumerate(clusters):
        labels = adata[adata.obs[column] == str(cluster)].obs[column_ref].values
        
        props = np.zeros(len(datasets))
        
        for dataset_idx, dataset in enumerate(datasets):
            props[dataset_idx] = np.sum(labels == dataset)/len(labels)
            
            axs[0].bar(cluster_idx, props[dataset_idx], bottom = np.sum(props[:dataset_idx]), 
                       label=dataset, color=list_colors[dataset_idx])    
        
        props = props[props > 0]
        entropy = -np.sum(props * np.log(props))/max_entropy
        entropies.append(entropy)
        
        # Use the numeric position so that the x axis matches the bar plot and the numeric mean line
        axs[1].scatter(cluster_idx, entropy, c='#808080')
            
    axs[1].set_ylim([0, 1])
    axs[1].plot([0, n_clusters], [np.mean(entropies), np.mean(entropies)], c='#007ab7')
    
    handles, labels = axs[0].get_legend_handles_labels()
    axs[0].legend(handles[:len(datasets)], labels[:len(datasets)], bbox_to_anchor=[-0.2, 1])
    
    for ax in axs:
        ax.set_xticks(range(n_clusters))
        ax.set_xticklabels(clusters)
                      
    plt.show()

## Adata raw concatenation

### With H dataset

In [None]:
adata_concat_raw = sc.AnnData.concatenate(adata_tabib_fb_raw, adata_sole_young_fb_raw, 
                                                 adata_vors_fb_raw, adata_he_fb_raw,
                                            batch_key='dataset', batch_categories=['T', 'S', 'V', 'H'])
sc.pp.filter_genes(adata_concat_raw, min_cells=10)

adata_concat_raw.uns['dataset_colors'] = ['#94346e', '#e17c05', '#0F8554', '#1D6996']


adata_concat_raw.uns['clusters_colors'] = ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
adata_concat_raw.obs['logcounts'] = np.log10(adata_concat_raw.X.sum(1))

In [None]:
sc.pp.normalize_total(adata_concat_raw)

In [None]:
adata_concat_raw.obs['logcounts_norm'] = np.log10(adata_concat_raw.X.sum(1))

In [None]:
sc.pp.pca(adata_concat_raw, random_state=seed, n_comps=30)
tk.tl.triku(adata_concat_raw, n_procs=1, random_state=seed, use_adata_knn=True)
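# In scanpy, `knn` is a boolean flag; the number of neighbors is set with `n_neighbors`
sc.pp.neighbors(adata_concat_raw, random_state=seed, metric='cosine', n_neighbors=int(len(adata_concat_raw) ** 0.5 // 2))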

In [None]:
sc.tl.umap(adata_concat_raw, min_dist=0.03, random_state=seed)
sc.tl.leiden(adata_concat_raw, resolution=1, random_state=seed)
sc.pl.umap(adata_concat_raw, color=['dataset', 'leiden', 'axes', 'clusters', 'logcounts', 'logcounts_norm'], ncols=2)

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'clusters', list_colors = 
            ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

### Without H dataset

In [None]:
adata_concat_raw = sc.AnnData.concatenate(adata_tabib_fb_raw, adata_sole_young_fb_raw, 
                                                 adata_vors_fb_raw, 
                                            batch_key='dataset', batch_categories=['T', 'S', 'V'])
sc.pp.filter_genes(adata_concat_raw, min_cells=10)

adata_concat_raw.uns['dataset_colors'] = ['#94346e', '#e17c05', '#0F8554']


adata_concat_raw.uns['clusters_colors'] = ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
adata_concat_raw.obs['logcounts'] = np.log10(adata_concat_raw.X.sum(1))

In [None]:
sc.pp.normalize_total(adata_concat_raw)

In [None]:
adata_concat_raw.obs['logcounts_norm'] = np.log10(adata_concat_raw.X.sum(1))

In [None]:
sc.pp.pca(adata_concat_raw, random_state=seed, n_comps=30)
tk.tl.triku(adata_concat_raw, n_procs=1, random_state=seed, use_adata_knn=True)
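# In scanpy, `knn` is a boolean flag; the number of neighbors is set with `n_neighbors`
sc.pp.neighbors(adata_concat_raw, random_state=seed, metric='cosine', n_neighbors=int(len(adata_concat_raw) ** 0.5 // 2))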

In [None]:
sc.tl.umap(adata_concat_raw, min_dist=0.03, random_state=seed)
sc.tl.leiden(adata_concat_raw, resolution=0.8, random_state=seed)
sc.pl.umap(adata_concat_raw, color=['dataset', 'leiden', 'axes', 'clusters', 'logcounts', 'logcounts_norm'], ncols=2)

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'clusters', list_colors = 
            ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

## bbknn
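
Each integration in this notebook is checked with `plot_entropy`, which is defined earlier and not shown in this section. As an illustration of the idea only (an assumption about what such a check measures, not the notebook's implementation), the Shannon entropy of the dataset labels inside each cluster is high when datasets mix well and zero when a cluster holds a single dataset. `cluster_label_entropy` and the toy labels are hypothetical:

```python
from collections import Counter, defaultdict
from math import log2


def cluster_label_entropy(clusters, labels):
    """Shannon entropy (in bits) of `labels` within each cluster."""
    per_cluster = defaultdict(Counter)
    for cluster, label in zip(clusters, labels):
        per_cluster[cluster][label] += 1

    entropies = {}
    for cluster, counts in per_cluster.items():
        total = sum(counts.values())
        # sum of p * log2(1 / p) over the labels seen in this cluster
        entropies[cluster] = sum(
            (n / total) * log2(total / n) for n in counts.values()
        )
    return entropies


# Toy example: cluster '0' mixes two datasets evenly, cluster '1' holds one.
toy_clusters = ['0', '0', '0', '0', '1', '1', '1', '1']
toy_datasets = ['T', 'S', 'T', 'S', 'V', 'V', 'V', 'V']
print(cluster_label_entropy(toy_clusters, toy_datasets))
```

The same function applied to the original `clusters` labels instead of `dataset` would indicate how well the per-paper populations are preserved after integration.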

### BBKNN on dataset with H dataset

In [None]:
adata_concat_raw = sc.AnnData.concatenate(adata_tabib_fb_raw, adata_sole_young_fb_raw, 
                                                 adata_vors_fb_raw, adata_he_fb_raw,
                                            batch_key='dataset', batch_categories=['T', 'S', 'V', 'H'])
sc.pp.filter_genes(adata_concat_raw, min_cells=10)

adata_concat_raw.uns['clusters_colors'] = ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
sc.pp.normalize_total(adata_concat_raw)

In [None]:
sc.pp.pca(adata_concat_raw, random_state=seed, n_comps=30)
sce.pp.bbknn(adata_concat_raw, approx=False, batch_key='dataset')
tk.tl.triku(adata_concat_raw, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_concat_raw, min_dist=0.03, random_state=seed)
sc.tl.leiden(adata_concat_raw, resolution=0.8, random_state=seed)
sc.pl.umap(adata_concat_raw, color=['dataset', 'leiden', 'axes', 'clusters'], ncols=2)

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'clusters', list_colors = 
            ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

In [None]:
sc.tl.rank_genes_groups(adata_concat_raw, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_concat_raw, dendrogram=False, n_genes=40)

### BBKNN on dataset without H dataset

In [None]:
adata_concat_not_he_raw = sc.AnnData.concatenate(adata_sole_young_fb_raw, adata_tabib_fb_raw,
                                                 adata_vors_fb_raw,
                                            batch_key='dataset', batch_categories=['S', 'T', 'V'])
sc.pp.filter_genes(adata_concat_not_he_raw, min_cells=10)

adata_concat_not_he_raw.uns['dataset_colors'] = ['#94346e', '#e17c05', '#0F8554']
adata_concat_not_he_raw.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
sc.pp.normalize_total(adata_concat_not_he_raw)

In [None]:
sc.pp.pca(adata_concat_not_he_raw, random_state=seed, n_comps=30)
sce.pp.bbknn(adata_concat_not_he_raw, approx=False, batch_key='dataset')
tk.tl.triku(adata_concat_not_he_raw, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_concat_not_he_raw, min_dist=0.03, random_state=seed)
sc.tl.leiden(adata_concat_not_he_raw, resolution=0.8, random_state=seed)
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'leiden', 'axes', 'clusters'], ncols=2)

In [None]:
plot_entropy(adata_concat_not_he_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_not_he_raw, 'leiden', 'clusters', list_colors = 
            ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

#### DEG and marker analysis 

In [None]:
# A AXIS
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'clusters', 'leiden', 
                                           'SLPI', 'PI16', 'WISP2', 
                                           'APCDD1', 'COL18A1', 'COMP', 
                                           'WIF1', 'RGCC', 'SGCA', 
                                          ], ncols=3, cmap=magma)

In [None]:
# B AXIS
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'clusters', 'leiden', 
                                           'APOE', 'C7', 'IGFBP7',
                                           'CCL2', 'ITM2A', 'SOD2',
                                           'CCL19', 'RBP5', 'CTSH'
                                          ], ncols=3, cmap=magma)

In [None]:
# C AXIS
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'clusters', 'leiden', 
                                           'COL11A1', 'DPEP1', 'WFDC1', 
                                           'COCH', 'CRABP1', 'RSPO4', 
                                            'ANGPTL7', 'TM4SF1', 'APOD', 
                                          ], ncols=3, cmap=magma)

In [None]:
sc.tl.rank_genes_groups(adata_concat_not_he_raw, groupby='clusters', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_concat_not_he_raw, dendrogram=False, n_genes=130)

In [None]:
genes = ['DPEP1', 'TNMD', 'COL11A1', 'MEF2C', 'WFDC1']

sc.pl.umap(adata_tabib_fb, color=['clusters'] + genes, ncols=3, cmap=magma)
sc.pl.umap(adata_sole_young_fb, color=['clusters'] + genes, ncols=3, cmap=magma)
sc.pl.umap(adata_vors_fb, color=['clusters'] + genes, ncols=3, cmap=magma)

In [None]:
sc.pl.umap(adata_concat_not_he_raw, color=['clusters', 'leiden', 'dataset'] + genes,
           ncols=3, cmap=magma)

In [None]:
sc.tl.rank_genes_groups(adata_concat_not_he_raw, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_concat_not_he_raw, dendrogram=False, n_genes=150)

## Harmony

### Harmony on dataset with H dataset

In [None]:
adata_concat_raw = sc.AnnData.concatenate(adata_he_fb_raw, adata_sole_young_fb_raw, adata_tabib_fb_raw,
                                                 adata_vors_fb_raw,
                                            batch_key='dataset', batch_categories=['H', 'S', 'T', 'V', ])
sc.pp.filter_genes(adata_concat_raw, min_cells=10)

adata_concat_raw.uns['dataset_colors'] = ['#94346e', '#e17c05', '#0F8554', '#1D6996']
adata_concat_raw.uns['clusters_colors'] = ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
sc.pp.normalize_total(adata_concat_raw)

In [None]:
sc.pp.pca(adata_concat_raw, random_state=seed, n_comps=30)
sce.pp.harmony_integrate(adata_concat_raw, 'dataset', epsilon_harmony = 1e-7)
sc.pp.neighbors(adata_concat_raw, random_state=seed, use_rep='X_pca_harmony', 
                metric='cosine', n_neighbors=int(len(adata_concat_raw) ** 0.5 // 2))
tk.tl.triku(adata_concat_raw, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_concat_raw, min_dist=0.05, random_state=seed)
sc.tl.leiden(adata_concat_raw, resolution=1, random_state=seed)
sc.pl.umap(adata_concat_raw, color=['dataset', 'leiden', 'axes', 'clusters'], ncols=2)

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_raw, 'leiden', 'clusters', list_colors = 
            ['#9a1549','#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

In [None]:
sc.tl.rank_genes_groups(adata_concat_raw, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_concat_raw, dendrogram=False, n_genes=40)

In [None]:
sc.pl.umap(adata_concat_raw, color=['dataset', 'leiden', 'axes', 'clusters', 'WIF1', 'WISP2', 'COMP', 'SLPI', 
                                    'APCDD1', 'COL18A1', 'CCL19', 'ITM2A', 'DNAJA1', 'SOD2', 'UAP1'], ncols=2, cmap=magma)

### Harmony on dataset without H dataset

In [None]:
adata_concat_not_he_raw = sc.AnnData.concatenate(adata_sole_young_fb_raw, adata_tabib_fb_raw,
                                                 adata_vors_fb_raw,
                                            batch_key='dataset', batch_categories=['S', 'T', 'V'])
sc.pp.filter_genes(adata_concat_not_he_raw, min_cells=10)

adata_concat_not_he_raw.uns['dataset_colors'] = ['#94346e', '#e17c05', '#0F8554']
adata_concat_not_he_raw.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d']

In [None]:
sc.pp.normalize_total(adata_concat_not_he_raw)

In [None]:
sc.pp.pca(adata_concat_not_he_raw, random_state=seed, n_comps=50)
sce.pp.harmony_integrate(adata_concat_not_he_raw, 'dataset', epsilon_harmony = 1e-7,)
sc.pp.neighbors(adata_concat_not_he_raw, random_state=seed, use_rep='X_pca_harmony', 
                metric='cosine', n_neighbors=int(len(adata_concat_not_he_raw) ** 0.5 // 2))
tk.tl.triku(adata_concat_not_he_raw, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
sc.tl.umap(adata_concat_not_he_raw, min_dist=0.15, random_state=seed)
sc.tl.leiden(adata_concat_not_he_raw, resolution=0.8, random_state=seed)
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'leiden', 'axes', 'clusters'], ncols=2)

In [None]:
plot_entropy(adata_concat_not_he_raw, 'leiden', 'dataset')

In [None]:
plot_entropy(adata_concat_not_he_raw, 'leiden', 'clusters', list_colors = 
            ['#e14b67', '#d98c58', '#e55e32', '#cd2333', '#009f61', '#54ab4c', 
            '#002562', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#9a9a9a', '#d8d8d8', '#6d6d6d'])

In [None]:
sc.tl.rank_genes_groups(adata_concat_not_he_raw, groupby='leiden', method='wilcoxon')
sc.pl.rank_genes_groups_tracksplot(adata_concat_not_he_raw, dendrogram=False, n_genes=40)

In [None]:
sc.pl.umap(adata_concat_not_he_raw, color=['dataset', 'clusters', 'leiden',  
                                           'TWIST2', 'SLPI', 'COMP', 
                                           'COL18A1', 'APCDD1', 'WIF1'], ncols=3, cmap=magma)

# Velocity on SB and H datasets

In this section we run scVelo on the SB and H datasets, first individually and then merged.
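
The neighbourhood size used throughout this notebook follows one rule of thumb: half the square root of the number of cells. A minimal sketch of that rule follows; the helper name `n_neighbors_for` is hypothetical, and the lower bound of 1 is an assumption added here for very small inputs:

```python
def n_neighbors_for(n_cells: int) -> int:
    """Half the square root of the cell count, floored, never below 1."""
    return max(1, int(n_cells ** 0.5 // 2))


# Neighbourhood sizes for a few illustrative dataset sizes.
for n_cells in (3, 1000, 5000):
    print(n_cells, n_neighbors_for(n_cells))
```

Computing the value from the object being processed, rather than from a variable left over from an earlier loop, keeps the neighbourhood size tied to the right dataset.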

In [None]:
adata_merge_scvelo = sc.AnnData.concatenate(adata_he_fb_scvelo, adata_sole_young_fb_scvelo, 
                                            batch_key='dataset', batch_categories=['H', 'S'])

In [None]:
for adata in [adata_sole_young_fb_scvelo, adata_he_fb_scvelo]:
    scv.pp.filter_and_normalize(adata, )
    scv.pp.neighbors(adata, n_neighbors=int(len(adata) ** 0.5 // 2), metric='cosine')
    scv.pp.moments(adata, mode='connectivities')

In [None]:
for adata in [adata_sole_young_fb_scvelo, adata_he_fb_scvelo]:
    scv.tl.recover_dynamics(adata, )
    scv.tl.velocity(adata, mode='dynamical', )

In [None]:
for adata in [adata_sole_young_fb_scvelo, adata_he_fb_scvelo]:
    scv.tl.velocity_graph(adata, n_neighbors=int(0.5 * (len(adata) ** 0.5)))
    scv.tl.velocity_confidence(adata, )

### Plotting embedding in SB dataset

In [None]:
adata_sole_young_fb_scvelo.obs['clusters'] = adata_sole_young_fb.obs['clusters']

adata_sole_young_fb_scvelo.uns['clusters_colors'] = ['#e14b67', '#d98c58', '#e55e32', '#009f61', '#54ab4c', 
                                                    '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#b21b95']

In [None]:
scv.tl.umap(adata_sole_young_fb_scvelo, min_dist=0.25)
scv.tl.velocity_embedding(adata_sole_young_fb_scvelo, basis='umap')

In [None]:
scv.pl.velocity_embedding_stream(adata_sole_young_fb_scvelo, 
                                 color=['velocity_confidence', 'velocity_confidence_transition', 'clusters'], 
                                ncols=2)


In [None]:
scv.tl.terminal_states(adata_sole_young_fb_scvelo)

In [None]:
# Root cells: cells from clusters C1 / B1 with a root-cell score above 0.35
adata_sole_young_fb_scvelo.obs['root_cells_bool'] = (
    adata_sole_young_fb_scvelo.obs['clusters'].isin(['C1', 'B1'])
    & (adata_sole_young_fb_scvelo.obs['root_cells'] > 0.35)
).values

In [None]:
scv.tl.latent_time(adata_sole_young_fb_scvelo)

In [None]:
scv.pl.velocity_embedding_stream(adata_sole_young_fb_scvelo, 
                                 color=['velocity_confidence_transition', 'clusters', 'latent_time', 
                                             'velocity_pseudotime'], 
                                ncols=2)

scv.pl.scatter(adata_sole_young_fb_scvelo, color=['root_cells', 'root_cells_bool'], 
               color_map='gnuplot', size=80, legend_loc='on data', ncols=2)

In [None]:
adata_sole_young_fb_scvelo.write_h5ad(sole_dir + '/adata_scvelo_SB_full.h5')

#### Subsetting the cells to axes B and C

In [None]:
adata_sole_young_fb_scvelo_B_C = adata_sole_young_fb_scvelo[\
        adata_sole_young_fb_scvelo.obs['clusters'].isin(['B1', 'B2', 'C1', 'C2', 'C3', 'C4'])].copy()

In [None]:
scv.pp.neighbors(adata_sole_young_fb_scvelo_B_C, 
                 n_neighbors=int(len(adata_sole_young_fb_scvelo_B_C) ** 0.5 // 2), metric='cosine')
scv.pp.moments(adata_sole_young_fb_scvelo_B_C, mode='connectivities')

In [None]:
scv.tl.recover_dynamics(adata_sole_young_fb_scvelo_B_C, )

In [None]:
scv.tl.velocity(adata_sole_young_fb_scvelo_B_C, mode='dynamical', )

In [None]:
scv.tl.velocity_graph(adata_sole_young_fb_scvelo_B_C, 
                      n_neighbors=int(0.5 * (len(adata_sole_young_fb_scvelo_B_C) ** 0.5)))
scv.tl.velocity_confidence(adata_sole_young_fb_scvelo_B_C, )

In [None]:
adata_sole_young_fb_scvelo_B_C.uns['clusters_colors'] = ['#009f61', '#54ab4c', '#008aac', '#006aad', '#2a358c', '#3fb4c1', '#b21b95']

In [None]:
scv.tl.umap(adata_sole_young_fb_scvelo_B_C, min_dist=0.25)
scv.tl.velocity_embedding(adata_sole_young_fb_scvelo_B_C, basis='umap')

In [None]:
scv.tl.latent_time(adata_sole_young_fb_scvelo_B_C)

In [None]:
scv.tl.terminal_states(adata_sole_young_fb_scvelo_B_C)

In [None]:
# Root cells: cells from clusters C1 / B1 with a root-cell score above 0.35
adata_sole_young_fb_scvelo_B_C.obs['root_cells_bool'] = (
    adata_sole_young_fb_scvelo_B_C.obs['clusters'].isin(['C1', 'B1'])
    & (adata_sole_young_fb_scvelo_B_C.obs['root_cells'] > 0.35)
).values

In [None]:
scv.pl.velocity_embedding_stream(adata_sole_young_fb_scvelo_B_C, 
                                 color=['velocity_confidence_transition', 'clusters', 'latent_time', 
                                             'velocity_pseudotime'], 
                                ncols=2)

scv.pl.scatter(adata_sole_young_fb_scvelo_B_C, color=['root_cells', 'root_cells_bool'], 
               color_map='gnuplot', size=80, legend_loc='on data', ncols=2)

In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb_scvelo_B_C, groupby='root_cells_bool')
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb_scvelo_B_C, n_genes=50, dendrogram=False)

In [None]:
sc.tl.rank_genes_groups(adata_sole_young_fb_scvelo, groupby='root_cells_bool')
sc.pl.rank_genes_groups_tracksplot(adata_sole_young_fb_scvelo, n_genes=50, dendrogram=False)

In [None]:
adata_sole_young_fb.obs['root_cells_bool'] = adata_sole_young_fb_scvelo.obs['root_cells_bool']

In [None]:
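# Root-cell markers ranked 50-100, kept for exploration
list_genes_ranked = list(adata_sole_young_fb_scvelo.uns['rank_genes_groups']['names']['True'])[50:100]
# Curated subset that is actually plotted below
list_genes = ['IGFBP7', 'CFD', 'APOD', 'CFH', 'GSN', 'ITM2A']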

sc.pl.umap(adata_sole_young_fb_scvelo, color=['clusters', 'root_cells_bool', 'latent_time', 
                                             ] + list_genes, legend_loc='on data', 
           cmap=magma, ncols=3)

In [None]:
scv.pl.scatter(adata_sole_young_fb_scvelo_B_C, basis=list_genes, ncols=3, frameon=False)

In [None]:
adata_sole_young_fb_scvelo_B_C.write_h5ad(sole_dir + '/adata_scvelo_SB_B_C.h5')

### Plotting embedding in H dataset

In [None]:
adata_he_fb_scvelo.obs['clusters'] = adata_he_fb.obs['clusters']

adata_he_fb_scvelo.uns['clusters_colors'] = ['#9a1549', '#009f61', '#54ab4c', '#008aac', '#006aad', '#3fb4c1', '#b21b95']

In [None]:
scv.tl.umap(adata_he_fb_scvelo, min_dist=0.3)
scv.tl.velocity_embedding(adata_he_fb_scvelo, basis='umap')

In [None]:
scv.pl.velocity_embedding_stream(adata_he_fb_scvelo, basis='umap',
                                 color=['velocity_confidence', 'velocity_confidence_transition', 'clusters'], 
                                ncols=2)


### Plotting embedding in merged dataset

In [None]:
sc.pp.filter_genes(adata_merge_scvelo, min_counts=1)
sc.pp.pca(adata_merge_scvelo, random_state=seed, n_comps=30)
sce.pp.harmony_integrate(adata_merge_scvelo, 'dataset', epsilon_harmony = 1e-7)
sc.pp.neighbors(adata_merge_scvelo, random_state=seed, use_rep='X_pca_harmony', 
                metric='cosine', n_neighbors=int(len(adata_merge_scvelo) ** 0.5 // 2))
tk.tl.triku(adata_merge_scvelo, n_procs=1, random_state=seed, use_adata_knn=True)

In [None]:
scv.pp.neighbors(adata_merge_scvelo, n_neighbors=int(len(adata_merge_scvelo) ** 0.5 // 2), 
                 metric='cosine', use_rep='X_pca_harmony')
scv.pp.moments(adata_merge_scvelo, mode='connectivities')
scv.tl.recover_dynamics(adata_merge_scvelo, )
scv.tl.velocity(adata_merge_scvelo, mode='dynamical', )
scv.tl.velocity_graph(adata_merge_scvelo, n_neighbors=int(0.5 * (len(adata_merge_scvelo) ** 0.5)))
scv.tl.velocity_confidence(adata_merge_scvelo, )

In [None]:
adata_merge_scvelo.obs['clusters'] = ''
# Assign through a single .loc call so that the values are written to .obs itself, not to a temporary copy
adata_merge_scvelo.obs.loc[[i + '-H' for i in adata_he_fb.obs_names], 'clusters'] = adata_he_fb.obs['clusters'].values
adata_merge_scvelo.obs.loc[[i + '-S' for i in adata_sole_young_fb.obs_names], 'clusters'] = adata_sole_young_fb.obs['clusters'].values

adata_merge_scvelo.uns['clusters_colors'] = ['#9a1549'] + list(adata_sole_young_fb.uns['clusters_colors'])

In [None]:
scv.tl.umap(adata_merge_scvelo, min_dist=0.3)
scv.tl.velocity_embedding(adata_merge_scvelo, basis='umap')

In [None]:
scv.pl.velocity_embedding_stream(adata_merge_scvelo, basis='umap',
                                 color=['velocity_confidence', 'velocity_confidence_transition', 'clusters'], 
                                ncols=2)


# Individual dataset figure production

In [None]:
def create_plot_axis(genes):
    names = ['Tabib', 'Solé-Boldo', 'Vorstandlechner', 'He']
    fig, axs = plt.subplots(1 + len(genes), 4, figsize=(4*4, (1 + len(genes))*3))
    for adata_ix, adata in enumerate([adata_tabib_fb, adata_sole_young_fb, adata_vors_fb, adata_he_fb]):
        sc.pl.umap(adata, color=['axes',] , cmap=magma, legend_loc='on data', ax=axs[0][adata_ix], show=False, title=names[adata_ix], 
                   legend_fontsize=16, legend_fontoutline=3)
        
        for gene_ix, gene in enumerate(genes):
            try:
                sc.pl.umap(adata, color=gene , cmap=magma, legend_loc='on data', ax=axs[gene_ix + 1][adata_ix], show=False, use_raw=False)
            except Exception:
                # gene not available in this dataset: leave the panel empty
                pass
            
    plt.tight_layout()
    fig.savefig('images/' + '_'.join(genes) + '.png', dpi=300)
    
def create_plot_cluster(genes, clusters=()):
    names = ['Tabib', 'Solé-Boldo', 'Vorstandlechner', 'He']
    
    fig, axs = plt.subplots(1 + len(genes), 4, figsize=(4*4, (1 + len(genes))*3))
    for adata_ix, adata in enumerate([adata_tabib_fb, adata_sole_young_fb, adata_vors_fb, adata_he_fb]):       
        sc.pl.umap(adata, color=['clusters',] , cmap=magma, legend_loc='on data', ax=axs[0][adata_ix], show=False, title=names[adata_ix], 
                  legend_fontsize=12, legend_fontoutline=2)
        
        for gene_ix, gene in enumerate(genes):
            try:
                if len(clusters):
                    sc.pl.umap(adata, color=gene , cmap=magma, legend_loc='on data', ax=axs[gene_ix + 1][adata_ix], show=False, use_raw=False, 
                           title = f'{gene} ({clusters[gene_ix]})')
                else:
                    sc.pl.umap(adata, color=gene , cmap=magma, legend_loc='on data', ax=axs[gene_ix + 1][adata_ix], show=False, use_raw=False, 
                           title = f'{gene}')
            except Exception:
                # gene not available in this dataset: leave the panel empty
                pass
            
    plt.tight_layout()
    fig.savefig('images/' + '_'.join(genes) + '.png', dpi=300)


### Figure 1: Representation of major axes with principal representative genes

In [None]:
create_plot_axis(['SFRP2', 'APOE', 'SFRP1'])

### Figure 2: Representation of markers of axis A populations

In [None]:
create_plot_cluster(['SLPI', 'COMP', 'WIF1', 'SFRP4'], ['A1', 'A2', 'A3', 'A4'])

### Figure 3: Representation of markers of axis B populations

In [None]:
create_plot_cluster(['ITM2A', 'CCL2', 'CCL19', 'CTSH'], ['B1', 'B1', 'B2', 'B2'])

### Figure 4: Representation of markers of axis C populations

In [None]:
create_plot_cluster(['COL11A1', 'COCH', 'POSTN', 'TM4SF1'], ['C1', 'C2', 'C3', 'C4'])

## Supplementary figures

In [None]:
create_plot_axis(['ELN', 'MMP2', 'QPCT', 'SFRP2'])
create_plot_axis(['APOE', 'C7', 'CYGB', 'IGFBP7'])
create_plot_axis(['DKK3', 'SFRP1', 'TNMD', 'TNN'])

In [None]:
create_plot_cluster(['IGFBP6', 'PI16', 'SLPI', 'WISP2'])
create_plot_cluster(['ELN', 'RGCC', 'SGCA', 'WIF1'])
create_plot_cluster(['APCDD1', 'COL18A1', 'COMP', 'NKD2'])
create_plot_cluster(['FBN1', 'PCOLCE2', 'PRG4', 'SFRP4'])

In [None]:
create_plot_cluster(['CCL2', 'ITM2A',  'SPSB1', 'TNFAIP6'])
create_plot_cluster(['CCDC146', 'CCL19', 'CD74', 'TNFSF13B',])

In [None]:
create_plot_cluster(['COL11A1', 'DPEP1', 'TNMD', 'WFDC1'])
create_plot_cluster(['COCH', 'CRABP1', 'FIBIN', 'RSPO4'])
create_plot_cluster(['ASPN', 'F2R', 'GPM6B', 'POSTN'])
create_plot_cluster(['ANGPTL7', 'APOD', 'C2orf40', 'TM4SF1'])

In [None]:
create_plot_cluster(['CHRDL1', 'GDF10', 'ITM2A', 'OGN'])
create_plot_cluster(['ADAMTS5', 'HSPH1', 'MIR22HG', 'MME'])
create_plot_cluster(['EEF1B2', 'FXYD3', 'KRT14'])
create_plot_cluster(['B4GALT1', 'CTNNB1', 'HNRNPH1', 'WTAP'])

In [None]:
create_plot_cluster(['FMO1', 'LSP1'])

## Table S1
This is the table of the *robust* DEGs of axes in all datasets.
We will select DEGs common between datasets, and manually curate them. The last three cells hold the curated gene lists, one per axis.
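
The selection of shared DEGs can be sketched as a set intersection over the per-dataset marker lists. The gene lists below are toy values, and `shared_degs` is a hypothetical helper, not a function defined in this notebook:

```python
def shared_degs(markers_by_dataset):
    """Return the genes present in every dataset's marker list, sorted by name."""
    gene_sets = [set(genes) for genes in markers_by_dataset.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


# Toy top-marker lists for one axis (illustrative values only).
toy_markers = {
    'T': ['SFRP2', 'ELN', 'MMP2', 'COL1A1'],
    'S': ['ELN', 'SFRP2', 'QPCT', 'MMP2'],
    'V': ['MMP2', 'SFRP2', 'ELN', 'THBS2'],
}
print(shared_degs(toy_markers))
```

The intersection only proposes candidates; the curated lists are then chosen by hand from them.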

In [None]:
def create_table(list_list_genes, list_columns, clus_ax, list_adatas=[adata_tabib_fb, adata_sole_young_fb, adata_vors_fb, adata_he_fb], 
                 list_adata_letters=['T', 'S', 'V', 'H']):
    df_supercolumns = [i for i in list_columns for j in range(len(list_adatas) + 1)]
    df_subcolumns = [i for j in range(len(list_columns)) for i in ['DEGs'] + list_adata_letters]
    
    cols = pd.MultiIndex.from_tuples(list(zip(*[df_supercolumns, df_subcolumns])), names=[clus_ax, 'dataset'])
    df = pd.DataFrame(index=range(max([len(i) for i in list_list_genes])), columns=cols)
    
    for column_idx, column in enumerate(list_columns):
        list_genes = list_list_genes[column_idx]
        
        for gene_idx, gene in enumerate(list_genes):
            df.loc[gene_idx, (column, 'DEGs')] = gene
            
            for adata, letter in zip(list_adatas, list_adata_letters):
                try:
                    gene_loc = np.argwhere(adata.uns['rank_genes_groups']['names'][column] == gene)[0][0]
                    pval = adata.uns['rank_genes_groups']['pvals'][column][gene_loc]
                    logfold = adata.uns['rank_genes_groups']['logfoldchanges'][column][gene_loc]
                    df.loc[gene_idx, (column, letter)] = f'{pval:.2e} ({logfold:.2f})'
                except Exception:
                    # gene (or group) not among the ranked DEGs of this dataset
                    df.loc[gene_idx, (column, letter)] = '-'
                
                
                
    return df

In [None]:
# Run this cell first: create_table reads adata.uns['rank_genes_groups'] as computed here (groupby='axes')
sc.tl.rank_genes_groups(adata_tabib_fb, groupby='axes', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_sole_young_fb, groupby='axes', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_vors_fb, groupby='axes', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_he_fb, groupby='axes', method='wilcoxon', n_genes=200)

In [None]:
list_genes_axis_A = set(adata_tabib_fb.uns['rank_genes_groups']['names']['A'])
list_genes_axis_B = set(adata_tabib_fb.uns['rank_genes_groups']['names']['B'])
list_genes_axis_C = set(adata_tabib_fb.uns['rank_genes_groups']['names']['C'])

for adata in [adata_sole_young_fb, adata_vors_fb]:    
    list_genes_axis_A = list_genes_axis_A & set(adata.uns['rank_genes_groups']['names']['A'])
    list_genes_axis_B = list_genes_axis_B & set(adata.uns['rank_genes_groups']['names']['B'])
    
for adata in [adata_sole_young_fb, adata_he_fb]:    
    list_genes_axis_C = list_genes_axis_C & set(adata.uns['rank_genes_groups']['names']['C'])

In [None]:
# AXIS A
list_genes_A = ['AEBP1','AQP1','CAPZB','CD9','COL1A1','COL1A2','COL6A1','COL6A2',
'CTHRC1','ELN','FBN1','LAMP1','MMP2','NBL1','PAM','QPCT','RGCC','SFRP2','SPARC','THBS2','TSPAN4',]
create_plot_axis(list_genes_A)

In [None]:
# AXIS B
list_genes_B = ['APOE','BTG1','C3','C7','CCDC146','CXCL12','CYGB','FGF7',
'GEM','GGT5','IGFBP7','IRF1','RARRES2','SOCS3','TMEM176A','TMEM176B','TNFSF13B',]
create_plot_axis(list_genes_B)

In [None]:
# AXIS C
list_genes_C = ['CDH11','COL1A1','COL1A2','DKK3','EMID1','GPM6B','INHBA', 'PRSS23', 'SFRP1','SPARCL1','TNMD','TNN',]
create_plot_axis(list_genes_C)

In [None]:
df_axis = create_table([list_genes_A, list_genes_B, list_genes_C], ['A', 'B', 'C'], 'axis')
df_axis.to_csv('tables/TableS1.csv', sep=';')

In [None]:
df_axis

## Table S2
This is the table of the *robust* DEGs of clusters in axis A in all datasets.
We will select DEGs common between datasets, and manually curate them. The cells after the intersection step hold the curated gene lists, one per cluster.

In [None]:
sc.tl.rank_genes_groups(adata_tabib_fb, groupby='clusters', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_sole_young_fb, groupby='clusters', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_vors_fb, groupby='clusters', method='wilcoxon', n_genes=200)
sc.tl.rank_genes_groups(adata_he_fb, groupby='clusters', method='wilcoxon', n_genes=200)

In [None]:
df_tabib = pd.DataFrame(adata_tabib_fb.uns['rank_genes_groups']['names'])
df_sole = pd.DataFrame(adata_sole_young_fb.uns['rank_genes_groups']['names'])
df_vors = pd.DataFrame(adata_vors_fb.uns['rank_genes_groups']['names'])
df_he = pd.DataFrame(adata_he_fb.uns['rank_genes_groups']['names'])

In [None]:
list_genes_A1 = set(df_tabib['A1'].values)
list_genes_A2 = set(df_tabib['A2'].values)
list_genes_A3 = set(df_tabib['A3'].values)
list_genes_A4 = set(df_tabib['A4'].values)

for df in [df_sole, df_vors]:    
    list_genes_A1 = list_genes_A1 & set(df['A1'].values)
    list_genes_A2 = list_genes_A2 & set(df['A2'].values)
    list_genes_A3 = list_genes_A3 & set(df['A3'].values)

for df in [df_vors]:
    list_genes_A4 = list_genes_A4 & set(df['A4'].values)

In [None]:
# CLUSTER A1
list_genes_A1 = ['ANGPTL5', 'C1QTNF3', 'CD151', 'CD55', 'CD99', 'CPE', 'CTSB', 'CYBRD1', 'DCN', 'FBLN1', 'FGL2', 
'GPX3', 'GSN', 'IGFBP6', 'LOX', 'MFAP5', 'MGST1', 'MMP2', 'OLFML3', 'PDGFRL', 'PI16', 'PIGT', 'PODN', 'REXO2', 'SCARA5', 
'SERPINF1', 'SLPI', 'TSPAN8', 'WISP2', 'XG'
]
create_plot_cluster(list_genes_A1)

In [None]:
# CLUSTER A2
list_genes_A2 = ['APCDD1', 'AXIN2', 'C1orf198', 'CLEC2A', 'COL13A1', 'COL18A1', 'COL23A1', 'COMP', 
'CTSC', 'CYP26B1', 'EMX2', 'F13A1', 'GNG11', 'GREM2', 'HSPB3', 'ID1', 'LAMC3', 'NKD2', 'NPTX2', 'PTK7', 'RGS2', 'RGS3', 'RSPO1', 'SPRY1', 'STC2', 'TGFBI', ]
create_plot_cluster(list_genes_A2)

In [None]:
# CLUSTER A3
list_genes_A3 = ['CD9', 'COL6A1', 'ELN', 'LEPR', 'RGCC', 'SGCA', 'WIF1', ]
create_plot_cluster(list_genes_A3)

In [None]:
# CLUSTER A4
list_genes_A4 = ['C1QTNF3','FBN1', 'FSTL1', 'HSD3B7', 'IGFBP6', 'ISLR', 'MFAP5', 'PCOLCE2', 'PRG4', 'PRSS23', 'SFRP4', 'TNXB', ]
create_plot_cluster(list_genes_A4)

In [None]:
df_axis = create_table([list_genes_A1, list_genes_A2, list_genes_A3, list_genes_A4], ['A1', 'A2', 'A3', 'A4'], 'cluster')
df_axis.to_csv('tables/TableS2.csv', sep=';')

## Table S3
This is the table of the *robust* DEGs of clusters in axis B in all datasets.
We will select DEGs common between datasets, and manually curate them. The cells after the intersection step hold the curated gene lists, one per cluster.

In [None]:
list_genes_B1 = set(df_tabib['B1'].values)
list_genes_B2 = set(df_tabib['B2'].values)

for df in [df_vors]:    
    list_genes_B1 = list_genes_B1 & set(df['B1'].values)
    list_genes_B2 = list_genes_B2 & set(df['B2'].values)


In [None]:
create_plot_cluster(['CXCL2', 'CXCL3', 'GEM', 'CXCL1'])

In [None]:
# CLUSTER B1
list_genes_B1 = ['ARL6IP1', 'CCL2', 'CXCL1', 'CXCL2', 'CXCL3', 'DNAJA1', 'ERRFI1', 'GPC3', 'ITM2A', 'MCL1', 'MYC', 'PLA2G2A', 'PLIN2', 'SOD2', 'UAP1', ]
create_plot_cluster(list_genes_B1)

In [None]:
# CLUSTER B2
list_genes_B2 = ['BIRC3', 'BTG1', 'C3', 'CCDC146', 'CCL19', 'CD74', 'CTSH', 'IGFBP3', 'OLFM2', 'PSME2', 'TNFSF13B' ]
create_plot_cluster(list_genes_B2)

In [None]:
df_axis = create_table([list_genes_B1, list_genes_B2, ], ['B1', 'B2'], 'cluster')
df_axis.to_csv('tables/TableS3.csv', sep=';')

## Table S4
This is the table of the *robust* DEGs of clusters in axis C in all datasets.
We will select DEGs common between datasets, and manually curate them. The cells after the intersection step hold the curated gene lists, one per cluster.

In [None]:
list_genes_C1 = set(df_tabib['C1'].values)
list_genes_C2 = set(df_tabib['C2'].values)
list_genes_C3 = set(df_tabib['C3'].values)
list_genes_C4 = set(df_tabib['C4'].values)

for df in [df_sole]:    
    list_genes_C1 = list_genes_C1 & set(df['C1'].values)
    list_genes_C2 = list_genes_C2 & set(df['C2'].values)
    list_genes_C4 = list_genes_C4 & set(df['C4'].values)   
    list_genes_C3 = list_genes_C3 & set(df['C3'].values)


In [None]:
# CLUSTER C1
list_genes_C1 = ['CCND1', 'CDH11', 'COL11A1', 'COL5A2', 'DPEP1', 'EDNRA', 'GPC3', 'LAMC3', 'MEF2C', 'MME', 'POSTN', 'SPARC', 'STMN1', ]
create_plot_cluster(list_genes_C1)

In [None]:
# CLUSTER C2
list_genes_C2 = ['ADAMTS6', 'ARHGAP15', 'CADM2', 'CCK', 'CHADL', 'CLEC14A', 'COCH', 'CRABP1', 'DKK2', 'EMID1', 
'FIBIN', 'FZD1', 'GAP43', 'HSPA2', 'LEPREL1', 'LIPA', 'MEIS2', 'MEOX2', 'MKX', 'NDNF', 'NECAB1', 'OGN', 
'PCSK1N', 'PLXDC1', 'PPAPDC1B', 'PRRG4', 'RHPN1', 'RSPO4', 'SLC22A16', 'SLITRK6', 'SYTL2',]
create_plot_cluster(list_genes_C2)

In [None]:
# CLUSTER C3
list_genes_C3 = ['ASPN', 'BGN', 'C9orf3', 'DIO2', 'DKK2', 'F2R', 'FIBIN', 'GPM6B', 'LRRC15', 'LTBP2', 'MARCKS', 'PLEKHH2', 'PMEPA1', 'POSTN', ]
create_plot_cluster(list_genes_C3)

In [None]:
# CLUSTER C4
list_genes_C4 = ['ANGPTL7', 'APOD', 'C2orf40', 'CLDN1', 'CYP1B1', 'EBF2', 'EIF4A3', 'FGFBP2', 'IFI27', 
'KLK1', 'PODNL1', 'SCN7A', 'SFRP4', 'TAGLN', 'TENM2', 'TM4SF1' ]
create_plot_cluster(list_genes_C4)

In [None]:
df_axis = create_table([list_genes_C1, list_genes_C2, list_genes_C3, list_genes_C4,], ['C1', 'C2', 'C3', 'C4'], 'cluster')
df_axis.to_csv('tables/TableS4.csv', sep=';')

## Export adatas

In [None]:
os.makedirs('adatas', exist_ok=True)

In [None]:
adata_he_fb.write_h5ad('adatas/he_fb.h5ad')
adata_phil_fb.write_h5ad('adatas/philippeos_fb.h5ad')
adata_tabib_fb.write_h5ad('adatas/tabib_fb.h5ad')
adata_vors_fb.write_h5ad('adatas/vorstandlechner_fb.h5ad')
adata_sole_young_fb.write_h5ad('adatas/sole_fb.h5ad')

In [None]:
df = pd.DataFrame(index=['A', 'A1', 'A2', 'A3', 'A4', 'B', 'B1', 'B2', 'C', 'C1', 'C2', 'C3', 'C4', 'Unassigned'], 
                  columns=['T', 'S', 'V', 'H', 'Consensus (%)'])


for dataset, dataset_name in zip([adata_tabib_fb, adata_sole_young_fb, adata_vors_fb, adata_he_fb], 
                                 ['T', 'S', 'V', 'H']):
    for axis in ['A', 'B', 'C']:
        num = len(dataset[dataset.obs['axes'] == axis])
        den = len(dataset)
        
        pstr = f'{num}/{den} ({100 * num/den:.1f}%)' if num > 0 else '-'
        df.loc[axis, dataset_name] = pstr
    
    num = len(dataset[~ dataset.obs['axes'].isin(['A', 'B', 'C'])])
    pstr = f'{num}/{den} ({100 * num/den:.1f}%)' if num > 0 else '-'
    df.loc['Unassigned', dataset_name] = pstr
        
    for cluster in ['A1', 'A2', 'A3', 'A4', 'B1', 'B2', 'C1', 'C2', 'C3', 'C4']:
        num = len(dataset[dataset.obs['clusters'] == cluster])
        den = len(dataset[dataset.obs['axes'] == cluster[0]])
        
        pstr = f'{num}/{den} ({100 * num/den:.1f}%)' if num > 0 else '-'
        df.loc[cluster, dataset_name] = pstr
        
for row in range(len(df)):
    vec_str = df[['T', 'S', 'V', 'H']].iloc[row]
    
    list_nums = []
    for i in vec_str:
        if not str(i) == '-':
            # take the number between '(' and '%)', e.g. '12/40 (30.0%)' -> 30.0
            list_nums.append(float(i.split('(')[1].rstrip('%)')))
    
    mean = np.mean(list_nums)
    std = np.std(list_nums)
    
    df.iloc[row, df.columns.get_loc('Consensus (%)')] = f'{mean:.1f} ± {std:.1f}'

In [None]:
from IPython.display import display

display(df)
df.to_csv('tables/Table_percentages.csv', sep=';')
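
The consensus column is derived from strings of the form `num/den (xx.x%)`. As a minimal sketch of one way to recover the percentage from such strings regardless of the number of digits, a regular expression can be used. The names `PERCENT_PATTERN`, `extract_percentage` and `consensus` are hypothetical helpers written for this illustration, and the example row contains made-up values.

```python
import re

import numpy as np

# Matches the percentage inside strings such as '12/240 (5.0%)'
PERCENT_PATTERN = re.compile(r'\(([\d.]+)%\)')


def extract_percentage(cell):
    """Return the percentage in a 'num/den (xx.x%)' string, or None for '-'."""
    match = PERCENT_PATTERN.search(str(cell))
    return float(match.group(1)) if match else None


def consensus(cells):
    """Format mean ± std (population std, as in np.std) of the parsed percentages."""
    values = [v for v in map(extract_percentage, cells) if v is not None]
    if not values:
        return '-'
    return f'{np.mean(values):.1f} ± {np.std(values):.1f}'


# Made-up row, including an empty entry and a 100.0% entry
example_row = ['12/240 (5.0%)', '-', '300/300 (100.0%)', '45/300 (15.0%)']
print([extract_percentage(c) for c in example_row])
print(consensus(example_row))
```

Keeping the numeric percentages in a separate table and formatting them only for display would avoid parsing strings altogether; the sketch above is only meant for the case where the formatted table is the starting point.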

# Depicting the proportion of cells with reticular / papillary signatures

In this section we determine which of the clusters/axes correspond to reticular or papillary fibroblast populations. To do that, we run two types of analysis: UMAPs of each individual signature gene, and UMAPs of scores that combine the genes of each signature and of all signatures together.

We take the papillary and reticular signatures from several published datasets and observe which of them show clear, cluster-specific expression patterns.

In [None]:
pap_haydont_2020 = ['CADM1', 'EFHD1', 'TOX', 'UCP2']
ret_haydont_2020 = ['ACAN', 'COL11A1', 'DIRAS3', 'EMCN', 'FGF9', 'LIMCH1', 'MGST1', 'NPR3', 'SOST', 'SOX11', 
                    'VCAM1']

pap_haydont_2019 = ['APCDD1', 'PLAC8', 'GPR126', 'TFAP2C', 'COLEC12', 'RAB11FIP1', 'VIT', 'COL10A1', 
                    'ADAMTS5', 'PSMB9', 'ENPP2', 'NFE2L3', 'RARRES2', 'SLC1A3']
ret_haydont_2019 = ['ZFPM2', 'APBB1IP', 'RGS4', 'TM4SF1', 'DSP', 'NTN4', 'PTGER3', 'ITGA2', 
                    'LRRC17', 'CXCL1', 'NCALD', 'CD9', 'NR2F2', 'WISP2', 'KLHL13', 'MSX1', 
                    'MARVELD2', 'SERPINB7', 'MYO10', 'PHLDB2']


pap_nauroy_2017 = ['BST2', 'IL8', 'CXCL1', 'MMP1', 'CTSC', 'PTGDS', 'CCL2', 'TNFRSF19', 'IGDCC4', 
                   'EFNB2', 'GPR37', 'INHBB', 'TRHDE', 'PLXNC1', 'PLCB4', 'COL18A1', 'PRDM1', 'GFRA1', 
                   'CCL8', 'IL15', 'TNFSF4', 'MLPH']
ret_nauroy_2017 = ['SCRG1', 'COMP', 'DEPTOR', 'A2M', 'MFAP5', 'CRLF1', 'RDH10', 'DOK5', 'SFRP4', 'PPP1R14A', 
                   'GPC4', 'GATA6', 'ROR1', 'LIMCH1', 'DACT1', 'ELN', 'SLC38A4']

pap_janson_2012_A = ['CCRL1', 'NTN1', 'PDPN',]
ret_janson_2012_A = ['CDH2', 'CNN1', 'TGM2']

pap_janson_2012_B = ['ITM2C', 'STEAP1', 'TNFRSF19']
ret_janson_2012_B = ['CNN1', 'MAP1B', 'MGP', 'PPP1R14A', 'TAGLN', 'TGM2', 'TMEM200A']
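
The plotting and scoring cells in this section keep only the signature genes that are present in each dataset (`if i in adata.var_names`), so symbols that are missing, misspelled or listed under an older alias (for example `IL8` instead of `CXCL8`) are dropped without notice. The following is a minimal sketch of a coverage check; `signature_coverage` is a hypothetical helper, and the toy signatures and gene names are placeholders rather than a real dataset.

```python
import pandas as pd


def signature_coverage(signatures, var_names):
    """Tabulate, per signature, how many of its genes are present in var_names."""
    available = set(var_names)
    rows = []
    for name, genes in signatures.items():
        unique_genes = sorted(set(genes))
        missing = [g for g in unique_genes if g not in available]
        rows.append({
            'signature': name,
            'n_genes': len(unique_genes),
            'n_found': len(unique_genes) - len(missing),
            'missing': ', '.join(missing),
        })
    return pd.DataFrame(rows).set_index('signature')


# Toy example: placeholder signatures and gene names, not a real dataset
toy_signatures = {'pap_toy': ['CADM1', 'TOX', 'UCP2'], 'ret_toy': ['ACAN', 'SOST']}
toy_var_names = ['CADM1', 'UCP2', 'ACAN', 'SOST', 'COL1A1']
coverage = signature_coverage(toy_signatures, toy_var_names)
print(coverage)
```

With the objects of this notebook, the same function could be called as `signature_coverage({'pap_haydont_2020': pap_haydont_2020, 'ret_haydont_2020': ret_haydont_2020}, adata_tabib_fb.var_names)`, repeating the call for each dataset.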

## UMAPs on particular datasets / genes

### Haydont 2020

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in pap_haydont_2020 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in ret_haydont_2020 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

### Haydont 2019

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in pap_haydont_2019 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in ret_haydont_2019 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

### Nauroy 2017

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in pap_nauroy_2017 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in ret_nauroy_2017 if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

### Janson 2012

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in pap_janson_2012_A + pap_janson_2012_B if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    genes = sorted(set([i for i in ret_janson_2012_A + ret_janson_2012_B if i in adata.var_names]))
    sc.pl.umap(adata, color=['clusters'] + genes, cmap=magma, legend_loc='on data', ncols=3, use_raw=False)

## Plotting gene scores per dataset / all datasets

In [None]:
# Score name -> gene list; the '*_all' entries pool the signatures of every publication
signatures = {
    'pap_haydont_2020': pap_haydont_2020,
    'ret_haydont_2020': ret_haydont_2020,
    'pap_haydont_2019': pap_haydont_2019,
    'ret_haydont_2019': ret_haydont_2019,
    'pap_nauroy_2017': pap_nauroy_2017,
    'ret_nauroy_2017': ret_nauroy_2017,
    'pap_janson_2012': pap_janson_2012_A + pap_janson_2012_B,
    'ret_janson_2012': ret_janson_2012_A + ret_janson_2012_B,
    'pap_all': pap_haydont_2020 + pap_haydont_2019 + pap_nauroy_2017 + pap_janson_2012_A + pap_janson_2012_B,
    'ret_all': ret_haydont_2020 + ret_haydont_2019 + ret_nauroy_2017 + ret_janson_2012_A + ret_janson_2012_B,
}

for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    for score_name, signature in signatures.items():
        # Keep only the genes present in this dataset
        genes = sorted(set([i for i in signature if i in adata.var_names]))
        sc.tl.score_genes(adata, genes, score_name=score_name)
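
The scores computed above with `sc.tl.score_genes` are, in essence, the average expression of the signature genes minus the average expression of a reference gene set. The sketch below illustrates only that core idea on a dense toy matrix. It is a simplification: scanpy samples the reference genes at random from expression bins, whereas here the control set is passed explicitly. `simple_gene_score` and the toy values are hypothetical and introduced only for this example.

```python
import numpy as np


def simple_gene_score(X, var_names, signature, control):
    """Per-cell mean expression of signature genes minus mean of control genes.

    X is a dense (cells x genes) array; genes absent from var_names are ignored.
    """
    index = {gene: i for i, gene in enumerate(var_names)}
    sig_idx = [index[g] for g in signature if g in index]
    ctrl_idx = [index[g] for g in control if g in index]
    return X[:, sig_idx].mean(axis=1) - X[:, ctrl_idx].mean(axis=1)


# Toy matrix: 3 cells x 4 genes (values are made up for illustration)
var_names = ['G1', 'G2', 'G3', 'G4']
X = np.array([[2.0, 4.0, 1.0, 1.0],
              [0.0, 0.0, 3.0, 1.0],
              [1.0, 1.0, 1.0, 1.0]])

# 'GX' is not in var_names and is therefore ignored
scores = simple_gene_score(X, var_names, signature=['G1', 'G2', 'GX'], control=['G3', 'G4'])
print(scores)
```

A positive score means the signature genes are expressed above the reference level in that cell, which is what the papillary and reticular score UMAPs display.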

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    sc.pl.umap(adata, color=['clusters', 'pap_haydont_2020', 'ret_haydont_2020'],
               legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    sc.pl.umap(adata, color=['clusters', 'pap_haydont_2019', 'ret_haydont_2019'],
               legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    sc.pl.umap(adata, color=['clusters', 'pap_nauroy_2017', 'ret_nauroy_2017'],
               legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    sc.pl.umap(adata, color=['clusters', 'pap_janson_2012', 'ret_janson_2012'],
               legend_loc='on data', ncols=3, use_raw=False)

In [None]:
for adata in [adata_tabib_fb, adata_vors_fb, adata_sole_young_fb]:
    sc.pl.umap(adata, color=['clusters', 'pap_all', 'ret_all'],
               legend_loc='on data', ncols=3, use_raw=False)

In [None]:
create_plot_axis(['EMILIN1', 'EMILIN3'])