# Annotating Cell Types

This notebook is meant to be run after the standard workflow.
It demonstrates how to use the annotation functions to annotate a dataset that was processed with the standard workflow.

In this notebook, we show how to use the annotation tools built into besca to assign cell types to clusters.
We focus on immune cell types and demonstrate the signature-scoring functions.


If an annotated training dataset already exists, an alternative is to use the auto-annot module. Please refer to the corresponding tutorial.

In [None]:
import besca as bc
import numpy as np
import pandas as pd
import scanpy as sc  # the scanpy.api module is deprecated and absent from recent scanpy releases
import matplotlib.pyplot as plt
from scipy import sparse, io
import os
import time
import logging
import seaborn as sns
sc.logging.print_versions()

# for standard processing, set verbosity to minimum
sc.settings.verbosity = 0  # verbosity: errors (0), warnings (1), info (2), hints (3)
sc.settings.set_figure_params(dpi=80)
version = '2.8'
start0 = time.time()

In [None]:
#define standardized filepaths based on above input
root_path = os.getcwd()
bescapath_full = os.path.dirname(bc.__file__)
bescapath = os.path.split(bescapath_full)[0]

### Name of the analysis; set this to the name used when running the standard workflow
analysis_name = 'standard_workflow_besca2.0'

clusters='leiden'




The code below is useful if this is a fresh installation of besca and you are running this notebook as a test. It downloads the test dataset if needed and exports the labelling. 
This export is usually done at the end of the standard workflow. The exported files are required for the annotation.

In [None]:
use_example_dataset = True
if use_example_dataset:
    analysis_name='pbmc3k_processed'
    results_folder = os.path.split(os.getcwd())[0] + '/besca/datasets/data/'
    clusters='leiden'
    # This line either downloads or loads the dataset
    adata = bc.datasets.pbmc3k_processed()
    # This line exports the labelling required for the annotation.
    adata = bc.st.additional_labeling(adata, labeling_to_use= clusters, labeling_name = clusters, 
                                      labeling_description = 'Exporting a posteriori the labels for annotation',
                                      labeling_author = 'Testing', 
                                      results_folder= results_folder)
else:
    results_folder = os.path.join(root_path, 'analyzed', analysis_name)
    adata = sc.read_h5ad(os.path.join(results_folder, analysis_name + '.h5ad') )


In [None]:
sc.pl.umap(adata, color= [clusters], legend_loc='on data')

In [None]:
# One can load besca-provided signatures using the function below
signature_dict = bc.datasets.load_immune_signatures(refined=False)

signature_dict

Additionally, it is possible to read a gmt file and compute scanpy scores using the function below.

If the gmt file contains combined signatures (UP and DN), a common score is computed: 
$$Total\_SCORE= Score_{UP} - Score_{DN}$$
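
As a minimal sketch of the formula above, the snippet below combines per-cell UP and DN scores into a single value. The cell names, score values and the helper `combine_scores` are hypothetical and only illustrate the arithmetic; they are not part of besca, where the computation is handled by `combined_signature_score`.

```python
# Hypothetical per-cell scores for the UP and DN halves of one signature.
score_up = {"cell_1": 0.8, "cell_2": 0.1, "cell_3": 0.5}
score_dn = {"cell_1": 0.2, "cell_2": 0.4, "cell_3": 0.5}


def combine_scores(up, dn):
    """Return Total_SCORE = Score_UP - Score_DN for every cell in `up`."""
    # Assumption of this sketch: a cell without a DN score gets a DN score of 0.
    return {cell: up[cell] - dn.get(cell, 0.0) for cell in up}


total_score = combine_scores(score_up, score_dn)
for cell, value in total_score.items():
    print(f"{cell}: {value:+.2f}")
```

A positive total indicates that the UP genes dominate in that cell, a negative total that the DN genes dominate.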

In [None]:

gmt_file= bescapath + '/besca/datasets/genesets/Immune.gmt'
bc.tl.sig.combined_signature_score(adata, gmt_file,
                             UP_suffix='_UP', DN_suffix='_DN', method='scanpy',
                             overwrite=False, verbose=False,
                             use_raw=True, conversion=None)

In [None]:
scores = [x for x in adata.obs.columns if 'scanpy' in x]

In [None]:
sc.pl.umap(adata, color= scores)

# Immune signatures for specific sub-populations

In [None]:
## PROVIDED WITH BESCA
gmt_file_anno= bescapath + '/besca/datasets/genesets/CellNames_scseqCMs6_sigs.gmt'
bc.tl.sig.combined_signature_score(adata, gmt_file_anno)


In [None]:
scores = [x for x in adata.obs.columns if 'scanpy' in x]
sc.pl.umap(adata, color= scores, color_map = 'viridis')

# Automated annotation

This is a decision-tree-based annotation. It reads signatures from a provided .gmt file, and the hierarchy, cutoffs and signature ordering from a configuration file, then attributes each cell to a specific type according to signature enrichment. 

This is an aid to start the annotation, which can then be refined by adding further signatures or adjusting the configuration files. It was tested mainly on PBMC and oncology (tumor biopsy) samples.
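
To make the signature input more concrete, here is a minimal sketch of how a .gmt file can be parsed and restricted to the genes present in a dataset. It assumes the common GMT layout (one signature per line, tab-separated: name, description, member genes). The helpers `read_gmt` and `keep_present_genes` and all gene sets are hypothetical; in this notebook the corresponding work is done by besca's `read_GMT_sign` and `filter_siggenes`, whose internals may differ.

```python
import io

# Hypothetical GMT content: one signature per line, tab-separated as
# name, description, then the member genes.
gmt_text = (
    "Bcell\tillustrative B cell markers\tCD19\tMS4A1\tCD79A\n"
    "Tcell\tillustrative T cell markers\tCD3E\tCD3D\n"
    "Absent\tmarkers missing from the dataset\tGENEX\n"
)


def read_gmt(handle):
    """Parse GMT lines into a {signature name: [genes]} dictionary."""
    signatures = {}
    for line in handle:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        signatures[name] = [g for g in genes if g]
    return signatures


def keep_present_genes(signatures, dataset_genes):
    """Drop genes absent from the dataset, then drop empty signatures."""
    filtered = {}
    for name, genes in signatures.items():
        present = [g for g in genes if g in dataset_genes]
        if present:
            filtered[name] = present
    return filtered


# Hypothetical set of gene names available in the dataset.
dataset_genes = {"CD19", "MS4A1", "CD3E", "CD3D", "B2M"}
markers = keep_present_genes(read_gmt(io.StringIO(gmt_text)), dataset_genes)
print(markers)
```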


## Loading markers and signature

In [None]:
mymarkers = bc.tl.sig.read_GMT_sign(gmt_file_anno,directed=False)
mymarkers = bc.tl.sig.filter_siggenes(adata, mymarkers) ### remove genes not present in dataset or empty signatures
mymarkers['Ubi'] = ['B2M','ACTB', 'GAPDH'] ### used for cutoff adjustment to individual dataset, can be modified

In [None]:
sc.pl.umap(adata, color= mymarkers['NClassMonocyte'])

## Configuration of the annotation

We read the configuration file, containing hierarchy, cutoff and signature priority information. 
A new version of this file should be created and maintained with each annotation. 
The included example is optimised for the annotation of the 6.6k PBMC dataset. 

In [None]:
configfile=bescapath + '/besca/datasets/genesets/CellNames_scseqCMs6_config.tsv' ### replace this with your config

In [None]:
sigconfig,levsk=bc.tl.sig.read_annotconfig(configfile)

The fract_pos file was exported by besca during the standard workflow; 
it contains the fraction of positive cells per gene per cluster.
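
The sketch below illustrates the file layout this relies on, assuming the usual GCT convention of a version line and a dimensions line before the tabular header (presumably why the file is read with `skiprows=2`). The file content, the column names `NAME` and `Description`, and the values are hypothetical.

```python
import csv
import io

# Hypothetical miniature GCT file: a version line, a dimensions line
# (rows, columns), then a header row followed by one row per gene.
gct_text = (
    "#1.2\n"
    "2\t3\n"
    "NAME\tDescription\t0\t1\t2\n"
    "CD3E\tCD3E\t0.90\t0.05\t0.10\n"
    "MS4A1\tMS4A1\t0.02\t0.85\t0.04\n"
)

handle = io.StringIO(gct_text)
next(handle)  # skip the version line
next(handle)  # skip the dimensions line
reader = csv.DictReader(handle, delimiter="\t")

# fract_pos[gene][cluster] = fraction of cells in the cluster expressing the gene
fract_pos = {}
for row in reader:
    gene = row.pop("NAME")
    row.pop("Description")
    fract_pos[gene] = {cluster: float(value) for cluster, value in row.items()}

print(fract_pos["CD3E"])
```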

We use these values as the basis for a Wilcoxon test per signature per cluster. 
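
As an illustration of the idea, the following sketch computes a Mann-Whitney U statistic (the rank-sum statistic behind the Wilcoxon rank-sum test) comparing the fraction-positive values of signature genes with those of the remaining genes within one cluster. The helper functions and all numbers are hypothetical. This is not besca's `score_mw`, which may transform or scale the result differently, and in practice a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
def rank_values(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value equals the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        average = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = average
        start = end + 1
    return ranks


def mann_whitney_u(sample, background):
    """U statistic of `sample` versus `background` (larger = sample ranks higher)."""
    ranks = rank_values(list(sample) + list(background))
    rank_sum = sum(ranks[: len(sample)])
    return rank_sum - len(sample) * (len(sample) + 1) / 2


# Hypothetical fraction-positive values within two clusters.
background = [0.10, 0.20, 0.05, 0.30]
u_enriched = mann_whitney_u([0.90, 0.80, 0.70], background)
u_depleted = mann_whitney_u([0.01, 0.02, 0.03], background)
print(u_enriched, u_depleted)
```

U ranges from 0 to the product of the two group sizes; a value near the maximum means the signature genes are expressed in a larger fraction of cells than the background genes.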

In [None]:

f=pd.read_csv(results_folder + "/labelings/"+clusters+"/fract_pos.gct",sep="\t",skiprows=2)
df=bc.tl.sig.score_mw(f,mymarkers)
myc=np.median(df.loc['Ubi',:]*0.5) ### Set a cutoff based on Ubi and scale with values from config file


In [None]:
df.iloc[0:3,0:7]

For each signature, positive and negative clusters are determined. Only positive clusters are maintained. Cutoffs can be individualised based on the config file (scaling factor) and myc, which is determined based on ubiquitously expressed genes. 
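
A minimal sketch of this thresholding, with hypothetical scores and scaling factors, is given below. It assumes that a cluster counts as positive when its score is strictly above the scaled cutoff; the exact rule applied by besca's `getset` may differ.

```python
from statistics import median

# Hypothetical signature-by-cluster scores; "Ubi" holds the ubiquitous genes.
scores = {
    "Ubi": {"0": 40.0, "1": 36.0, "2": 44.0},
    "Bcell": {"0": 2.0, "1": 30.0, "2": 5.0},
    "Tcell": {"0": 25.0, "1": 1.0, "2": 22.0},
}
# Hypothetical per-signature scaling factors, as a config file might list them.
cutoff_scale = {"Bcell": 1.0, "Tcell": 1.2}

# Reference level: half of each Ubi score, then the median across clusters.
myc = median(value * 0.5 for value in scores["Ubi"].values())

positive_clusters = {}
for signature, scale in cutoff_scale.items():
    threshold = scale * myc
    # Assumption: positive means strictly above the scaled cutoff.
    positive_clusters[signature] = {
        cluster for cluster, value in scores[signature].items() if value > threshold
    }

print(myc, positive_clusters)
```

Raising a signature's scaling factor makes it harder for clusters to be called positive for that signature without affecting the others.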

In [None]:
#Cluster attribution based on cutoff
df=df.drop('Ubi')
sigscores={}
for mysig in list(df.index):
    sigscores[mysig]=bc.tl.sig.getset(df,mysig,sigconfig.loc[mysig,'Cutoff']*myc)
    #sigscores[mysig]=bc.tl.sig.getset(df,mysig,10)

One can inspect the cluster attribution per cell type in the signature list and adjust cutoffs as required. 

In [None]:
sigscores

In [None]:
sc.pl.umap(adata, color= [clusters], legend_loc='on data')

Now each cluster gets annotated, according to the distinct levels specified in the config file. 
Note that in case a cluster is positive for multiple identities, only the first one is taken, 
in the order specified in the "Order" column in the config file. 
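
The priority rule can be sketched as follows. The signatures, clusters, the `unknown` placeholder and the convention that a lower order value is considered first are all assumptions made for this illustration; `make_anno` additionally handles the hierarchy of levels, which is not shown here.

```python
# Hypothetical positive clusters per signature and a hypothetical priority,
# where a lower "Order" value means the signature is considered first.
positive_clusters = {
    "Tcell": {"0", "2"},
    "NKcell": {"2", "3"},
    "Bcell": {"1"},
}
order = {"Tcell": 1, "NKcell": 2, "Bcell": 3}


def assign_by_priority(positive, priority, clusters):
    """Give each cluster the first positive signature in priority order."""
    ranked = sorted(priority, key=priority.get)
    assignment = {}
    for cluster in clusters:
        hits = [sig for sig in ranked if cluster in positive.get(sig, set())]
        # Clusters without any positive signature get a placeholder label.
        assignment[cluster] = hits[0] if hits else "unknown"
    return assignment


labels = assign_by_priority(positive_clusters, order, ["0", "1", "2", "3", "4"])
print(labels)
```

Here cluster 2 is positive for both Tcell and NKcell, and is labelled Tcell because that signature comes first in the order.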

To check the order at each level, you can inspect levsk.

In [None]:
levsk

## Obtained cluster assignment

In [None]:
cnames=bc.tl.sig.make_anno(df,sigscores,sigconfig,levsk)

We have now obtained a cell type attribution for each cluster at distinct levels. 

In [None]:
cnames

Export the annotation configuration that was used

In [None]:
bc.tl.sig.export_annotconfig(sigconfig, levsk, results_folder, analysis_name)

## Using db label convention

Only short names were used in the signature naming convention in this case. 
One can easily transform these to EFO terms if preferred; a conversion table comes with besca. 

This nomenclature is quite extensive, and the function 
**obtain_dblabel** can perform the conversion.
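
Conceptually, the conversion is a table lookup. The sketch below uses a hypothetical two-column table and hypothetical short names; the table shipped with besca has its own column names and many more entries, and **obtain_dblabel** should be used for the actual conversion.

```python
import csv
import io

# Hypothetical two-column conversion table (tab-separated).
table_text = (
    "short_name\tdblabel\n"
    "Bcell\tB cell\n"
    "CD8Tcell\tCD8-positive, alpha-beta T cell\n"
)
conversion = {
    row["short_name"]: row["dblabel"]
    for row in csv.DictReader(io.StringIO(table_text), delimiter="\t")
}

# Hypothetical short names assigned to three clusters.
cluster_short_names = {"0": "CD8Tcell", "1": "Bcell", "2": "Mystery"}

# Assumption of this sketch: names without a table entry are kept unchanged.
cluster_dblabels = {
    cluster: conversion.get(name, name)
    for cluster, name in cluster_short_names.items()
}
print(cluster_dblabels)
```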

In [None]:
### transform these short forms to dblabel - EFO standard nomenclature
cnamesDBlabel = bc.tl.sig.obtain_dblabel(bescapath+'/besca/datasets/nomenclature/CellTypes_v1.tsv', cnames )
cnamesDBlabel

Finally, one can add the new labels to adata.obs as annotation. 

In [None]:
adata.obs['celltype0']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype0',clusters)
adata.obs['celltype2']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype2',clusters)
adata.obs['celltype3']=bc.tl.sig.add_anno(adata,cnamesDBlabel,'celltype3',clusters)

In [None]:
sc.pl.umap(adata,color=['celltype2']) #,'celltype3'

In [None]:
sc.pl.umap(adata,color=['celltype3'])

# Reclustering sub-clusters 

Sometimes, a cluster appears to contain a mix of cell types. For example, for PBMC3k, the lymphocyte clusters are mixed. In this case, one can try to increase the clustering resolution or recluster only those clusters.
Below we show an example.

The main steps are:
+ Save the previous clustering and annotation for comparison purposes (advised)
+ Recluster
+ Export the new labelling (see function additional_labeling)
+ Read the new labelling information, including the fract_pos files
+ Recompute signature/marker values
+ Reannotate
+ Convert the annotation to dblabel
+ Transfer the annotation of the data subset to the larger adata object
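
The first and last steps of this list can be sketched with plain dictionaries, as below. The cell barcodes and labels are hypothetical; in the notebook itself this is done on `adata.obs` with besca's `annotate_new_cellnames`.

```python
# Hypothetical labels for all cells and refined labels for the reclustered subset.
celltype_all = {
    "AAAC": "T cell",
    "AAAG": "T cell",
    "AACT": "B cell",
    "AAGT": "monocyte",
}
celltype_subset = {"AAAC": "CD4 T cell", "AAAG": "CD8 T cell"}

# Keep the previous annotation for comparison before overwriting anything.
celltype_original = dict(celltype_all)

# Cells in the subset take their refined label; all others keep the old one.
celltype_all.update(celltype_subset)

changed = sorted(c for c in celltype_all if celltype_all[c] != celltype_original[c])
print(changed)
```

Keeping the original labels alongside the refined ones makes it easy to plot both and check which cells changed.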

In [None]:
recluster = False

if use_example_dataset:
    recluster = True
    celltype_label = 'celltype2_original'
    to_recluster =  ('CD8-positive, alpha-beta T cell','CD4-positive, alpha-beta T cell',
                                         'cytotoxic CD56-dim natural killer cell')

In [None]:
if recluster:
    # Save the previous clustering and annotation for comparison
    adata.obs['leiden_original'] = adata.obs['leiden'].copy()
    adata.obs['celltype2_original']  = adata.obs['celltype2'].copy() 

    # Calling reclustering
    adata_rc = bc.tl.rc.recluster ( adata, celltype_label = celltype_label, 
                               celltype=to_recluster, resolution=1.3)
    # The Leiden reclustering has to be exported to use the annotation functions
    cluster_renamed = 'Leiden_Reclustering'
    adata_rc = bc.st.additional_labeling(adata_rc, 'leiden', cluster_renamed, 'Leiden Reclustering on Lymphocytes', 'author', results_folder)
   
    # Reading additional labelling
    f=pd.read_csv(results_folder + "/labelings/"+cluster_renamed+"/fract_pos.gct",sep="\t",skiprows=2)
    df=bc.tl.sig.score_mw(f,mymarkers)
    myc=np.median(df.loc['Ubi',:]*0.5) ### Set a cutoff based on Ubi and scale with values from config file
    # RECOMPUTING SIG SCORE WITH NEW CUTOFF
    df=df.drop('Ubi')
    sigscores={}
    for mysig in list(df.index):
        sigscores[mysig]=bc.tl.sig.getset(df,mysig,sigconfig.loc[mysig,'Cutoff']*myc)
    cnames=bc.tl.sig.make_anno(df,sigscores,sigconfig,levsk)
    cnamesDBlabel = bc.tl.sig.obtain_dblabel(bescapath+'/besca/datasets/nomenclature/CellTypes_v1.tsv', cnames )
    
    adata_rc.obs['celltype0']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype0','leiden')
    adata_rc.obs['celltype2']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype2','leiden')
    adata_rc.obs['celltype3']=bc.tl.sig.add_anno(adata_rc,cnamesDBlabel,'celltype3','leiden')
    # Lexicographic order needed.
    names_2 = []
    names_3 = []
    for i in range( cnames.shape[0]) :
        names_2 += [cnames['celltype2'][str(i)]]
        names_3 += [cnames['celltype3'][str(i)]]
    
    bc.tl.rc.annotate_new_cellnames( adata, adata_rc, names = names_2, new_label='celltype2', method = 'leiden')

    bc.tl.rc.annotate_new_cellnames( adata, adata_rc, names = names_3, new_label='celltype3', method = 'leiden')
    
    sc.pl.umap(adata,color=['celltype2', 'celltype2_original',
                       'celltype3'], ncols=1) 

### Export labelling

Chosen labels can also be exported as a new folder in labelings/

In [None]:
### Save labelling
adata = bc.st.additional_labeling(adata, 'celltype3', 'celltype3', 'Major cell types attributed based on HumanCD45p_scseqCMs8', 'schwalip', results_folder)
