## Cellarium Cell Annotation Service (CAS) Quickstart Tutorial

<img src="https://cellarium.ai/wp-content/uploads/2024/07/cellarium-logo-medium.png" alt="drawing" width="96"/>

This notebook is a short tutorial on using Cellarium CAS. Please read the instructions and run each cell in the order presented. Once you have finished the tutorial, feel free to go back and modify it as needed for annotating your own datasets.

### Installing Cellarium CAS client library

As a first step, we need to install the Cellarium CAS client library, ``cellarium-cas``, along with all dependencies needed for visualizations. To this end, run the next cell.

> **Note:**
> If you have already installed ``cellarium-cas`` without the visualization dependencies, you should still run the next cell.

In [None]:
!pip install --force-reinstall cellarium-cas[vis]
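
If you would like to check whether the client library is already present in the current environment, the following minimal sketch uses only the Python standard library. The helper name `installed_version` is our own and is not part of ``cellarium-cas``. Note that this check says nothing about whether the optional visualization dependencies are installed.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# "cellarium-cas" is the distribution name used in the pip command above
version = installed_version("cellarium-cas")
if version is None:
    print("cellarium-cas is not installed in this environment")
else:
    print(f"cellarium-cas {version} is installed")
```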

### Load the AnnData file

In this tutorial, we will annotate a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset from 10x Genomics.

>**Note:** The original dataset, _"10k PBMCs from a Healthy Donor (v3 chemistry)"_, can be found [here](https://www.10xgenomics.com/datasets/10-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0).

For the purpose of this tutorial, we have selected 4,000 cells at random from the full dataset. We have additionally precomputed UMAP embeddings of these cells using a standard scanpy workflow and performed unsupervised Leiden clustering.

>**Note:** For a quick tutorial on scRNA-seq data quality control, preprocessing, embedding, and clustering using scanpy, we recommend this [tutorial](https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html).

>**Note:** We emphasize that CAS requires raw integer mRNA counts. If you are adapting this tutorial to your own dataset and your data is already normalized and/or restricted to a small gene set (e.g. highly variable genes), it is not suitable for CAS. If you have the raw counts in an AnnData layer or stored in the ``.raw`` attribute, please make sure that the ``.X`` attribute of your AnnData file is populated with the raw counts. 
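
As a quick sanity check for the raw-count requirement, the sketch below tests whether a collection of expression values consists only of non-negative whole numbers. It is a small, dependency-free illustration; the function name `looks_like_raw_counts` and the toy values are ours.

```python
from typing import Iterable


def looks_like_raw_counts(values: Iterable[float]) -> bool:
    """Return True if every value is a non-negative whole number."""
    return all(v >= 0 and float(v).is_integer() for v in values)


# Toy examples (hypothetical values, not taken from the tutorial dataset)
raw_counts = [0.0, 3.0, 1.0, 12.0]
log_normalized = [0.0, 1.386, 0.693, 2.565]

print(looks_like_raw_counts(raw_counts))      # True
print(looks_like_raw_counts(log_normalized))  # False
```

For an AnnData object with a sparse ``.X``, the stored values are available as ``adata.X.data``; for a dense ``.X``, one could pass ``adata.X.ravel()``. On large matrices a vectorized NumPy check would be considerably faster than this element-by-element loop.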

In [None]:
import scanpy as sc
import warnings

# suppressing some of the informational warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# set default figure resolution and size
sc.set_figure_params(dpi=80)

In [None]:
# Download the sample AnnData file
!curl -O https://storage.googleapis.com/cellarium-file-system-public/cellarium-cas-tutorial/pbmc_10x_v3_4k.h5ad

In [None]:
# Read the sample AnnData object
adata = sc.read('pbmc_10x_v3_4k.h5ad')

Let us inspect the loaded AnnData file:

In [None]:
adata

The AnnData file contains 4000 (cells) x 33538 (genes), a ``cluster_label`` attribute (under ``.obs``), and PCA and UMAP embeddings (under ``.obsm``).

In [None]:
adata.obs

Let us inspect the UMAP embedding already available in the AnnData file:

In [None]:
sc.pl.umap(adata)

Also, let us inspect the unsupervised Leiden clustering of the PCA embeddings for a sanity check:

In [None]:
sc.pl.umap(adata, color='cluster_label')

>**Note:** The UMAP embeddings and unsupervised clustering of the data are both **optional** and are not needed by CAS. However, these attributes are **required** for visualizing and inspecting the CAS output using our visualization tools.

Finally, let us inspect the ``.var`` attribute of the loaded AnnData file:

In [None]:
adata.var

We notice that Gene Symbols (names) serve as the index of the ``.var`` DataFrame, and Ensembl Gene IDs are provided under the ``gene_ids`` column. We take note of both for the next steps.

>**Note:** CAS requires both Gene Symbols and Ensembl Gene IDs. If either is missing from your AnnData file, please update your AnnData file before proceeding to the next steps. We recommend using [BioMart](http://www.ensembl.org/info/data/biomart/index.html) for converting Gene Symbols to Ensembl Gene IDs or vice versa.
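
Before submitting your own data, it can help to confirm that the column you intend to use for gene IDs really contains Ensembl identifiers rather than gene symbols. The sketch below relies on the fact that human Ensembl gene IDs consist of `ENSG` followed by 11 digits, optionally followed by a version suffix such as `.23`. The helper name is ours, and whether CAS accepts versioned IDs is not something this tutorial states, so treat the pattern as an assumption.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, optionally followed by ".<version>"
ENSEMBL_HUMAN_GENE_ID = re.compile(r"ENSG\d{11}(\.\d+)?")


def invalid_ensembl_gene_ids(gene_ids):
    """Return the entries that do not look like human Ensembl gene IDs."""
    return [g for g in gene_ids if not ENSEMBL_HUMAN_GENE_ID.fullmatch(str(g))]


# Toy input: two well-formed IDs, one gene symbol, one empty entry
example_ids = ["ENSG00000141510", "ENSG00000012048.23", "TP53", ""]
print(invalid_ensembl_gene_ids(example_ids))  # ['TP53', '']
```

With the tutorial dataset, one could call `invalid_ensembl_gene_ids(adata.var["gene_ids"])` and expect an empty list.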

### Submit the loaded AnnData file to Cellarium CAS for annotation

As a first step, please populate your CAS API token in the next cell:

In [None]:
api_token = "<your_api_token>"
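
If you prefer not to store the token in the notebook itself, one option is to read it from an environment variable. The sketch below is only a suggestion: the variable name `CAS_API_TOKEN` and the helper `load_api_token` are our own choices and are not defined by ``cellarium-cas``.

```python
import os


def load_api_token(env_var: str = "CAS_API_TOKEN") -> str:
    """Read the API token from an environment variable (the name is our choice)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Set the {env_var} environment variable to your CAS API token")
    return token


# Usage (uncomment once the environment variable is set):
# api_token = load_api_token()
```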

You can now connect to the Cellarium CAS backend and authenticate the session with your API token:

In [None]:
from cellarium.cas.client import CASClient

cas = CASClient(api_token=api_token)

The response will contain a list of annotation models and their brief descriptions. You need to choose the model that is suitable for your dataset. For this tutorial, choose a model that is suitable for annotating human scRNA-seq datasets:

In [None]:
# Select the annotation model
cas_model_name = '<model_name>'

At this point, we are ready to submit our AnnData file to CAS for annotation.

>**Note:** Before you proceed, you may need to modify the next cell for your dataset. CAS must be pointed to the appropriate columns in the ``.var`` DataFrame for fetching Gene Symbols and Ensembl Gene IDs. This is done by setting the ``feature_names_column_name`` and ``feature_ids_column_name`` arguments accordingly. If either appears as the index of the ``.var`` DataFrame, use ``'index'`` as the argument value. Otherwise, use the appropriate column name.

In [None]:
# Submit AnnData to CAS for ontology-aware cell type query
cas_ontology_aware_response = cas.annotate_matrix_cell_type_ontology_aware_strategy(
    matrix=adata,
    chunk_size=500,
    feature_ids_column_name='gene_ids',
    feature_names_column_name='index',
    cas_model_name=cas_model_name)
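
To give a feel for what ``chunk_size=500`` implies for this dataset, the sketch below splits 4,000 cells into consecutive row ranges. It assumes that ``chunk_size`` is the number of cells sent per request, which is our reading of the parameter name; the client's actual batching logic may differ, and `chunk_ranges` is a hypothetical helper.

```python
def chunk_ranges(n_cells: int, chunk_size: int):
    """Yield (start, stop) row ranges that cover n_cells in chunks of chunk_size."""
    for start in range(0, n_cells, chunk_size):
        yield start, min(start + chunk_size, n_cells)


n_cells, chunk_size = 4000, 500  # values used in this tutorial
ranges = list(chunk_ranges(n_cells, chunk_size))

print(len(ranges))   # 8
print(ranges[0])     # (0, 500)
print(ranges[-1])    # (3500, 4000)
```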

Let us take a quick look at the anatomy of the CAS ontology-aware cell type query response. In brief, the response is a Python list with as many elements as the number of cells in the queried AnnData file:

In [None]:
type(cas_ontology_aware_response)

In [None]:
len(cas_ontology_aware_response)

The list entry at position _i_ is a dictionary that contains a number of cell type ontology terms and their relevance scores for the _i_-th cell:

In [None]:
cas_ontology_aware_response[2425]

By glancing at the above response, we may infer that the cell at index 2425 is a _natural killer cell_.
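
The same kind of inspection can be done programmatically. The sketch below filters and sorts the terms of a single response entry by relevance score. The entry shown is a hand-written mock: its field names (`matches`, `cell_type`, `cell_type_ontology_term_id`, `score`) and all values are assumptions made for illustration, so compare them against the actual output of the previous cell before reusing the code.

```python
# Hypothetical response entry; the field names and values below are assumptions
# made for illustration and may not match the actual CAS response schema.
example_entry = {
    "query_cell_id": "cell_2425",
    "matches": [
        {"cell_type": "lymphocyte", "cell_type_ontology_term_id": "CL:0000542", "score": 0.98},
        {"cell_type": "natural killer cell", "cell_type_ontology_term_id": "CL:0000623", "score": 0.91},
        {"cell_type": "T cell", "cell_type_ontology_term_id": "CL:0000084", "score": 0.04},
    ],
}


def terms_above(entry, min_score):
    """Return (cell_type, score) pairs with score >= min_score, highest first."""
    hits = [(m["cell_type"], m["score"]) for m in entry["matches"] if m["score"] >= min_score]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


for name, score in terms_above(example_entry, min_score=0.5):
    print(f"{name}: {score:.2f}")
```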

### Exploring the Cellarium CAS response

We recommend exploring the CAS response using our provided ``CASCircularTreePlotUMAPDashApp`` Dash App for a more streamlined and holistic visualization of the CAS response. The visualization is self-explanatory.

>**Note:** In a nutshell, the visualization shows various cell type ontology terms as colored circles. The size of a circle signifies the occurrence of the term in the entire dataset (or over the chosen group of cells). The color of a circle signifies the relevance score of the term over the cells in which the term was found to have any degree of relevance. You can highlight the ontology term relevance scores over the UMAP scatter plot by clicking on the circles. You can also show the terms relevant to individual cells and their scores by clicking on a cell in the UMAP scatter plot.

In [None]:
from cellarium.cas._io import suppress_stderr
from cellarium.cas.visualization import CASCircularTreePlotUMAPDashApp

DASH_SERVER_PORT = 8050

with suppress_stderr():
    CASCircularTreePlotUMAPDashApp(
        adata,  # the AnnData file
        cas_ontology_aware_response,  # CAS response
        cluster_label_obs_column="cluster_label",  # (optional) The .obs column name containing cluster labels 
    ).run(port=DASH_SERVER_PORT, debug=False, jupyter_width="100%")
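
To make the two quantities encoded by each circle more concrete, the sketch below computes, for a toy set of cells, how often each term occurs (circle size) and its mean relevance score over the cells in which it occurs (circle color). This is our reading of the description in the note above, not the Dash App's actual implementation; the function name and the scores are hypothetical.

```python
from collections import defaultdict

# Toy per-cell relevance scores (hypothetical values): one dict per cell
cell_scores = [
    {"T cell": 1.0, "lymphocyte": 1.0},
    {"B cell": 0.75, "lymphocyte": 1.0},
    {"T cell": 0.5, "lymphocyte": 0.5},
    {"monocyte": 0.5},
]


def summarize_terms(cell_scores):
    """Per term: fraction of cells in which it occurs, and its mean score over those cells."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for scores in cell_scores:
        for term, score in scores.items():
            if score > 0:
                totals[term] += score
                counts[term] += 1
    n_cells = len(cell_scores)
    return {
        term: {"occurrence": counts[term] / n_cells, "mean_score": totals[term] / counts[term]}
        for term in counts
    }


summary = summarize_terms(cell_scores)
print(summary["T cell"])    # {'occurrence': 0.5, 'mean_score': 0.75}
print(summary["monocyte"])  # {'occurrence': 0.25, 'mean_score': 0.5}
```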

### Best cell type label assignment

Often, we just want the best cell type labels rather than a scored cell type ontology graph. Such a call _can_ be made, with the understanding that the best cell type call for any given cell is **not** well-defined in general. Coarse ontology terms (e.g. T cell, B cell) often have higher relevance scores, whereas more granular labels (e.g. CD8-positive, alpha-beta T cell; IgG-negative class switched memory B cell) often have lower relevance scores. If the _best call_ is construed as _the most confident call_, then such a call will naturally be too coarse and uninformative. Therefore, there is an inherent trade-off between accuracy and cell type call granularity.

We have implemented some basic functionality to help users navigate the scored ontology graph and make cell type calls. Our current notion of the best cell type call is the one that is furthest away from the root node (here, eukaryotic cell) while having a relevance score above a user-provided threshold. This definition allows us to sort the cell type ontology terms and report the top-_k_ calls for each cell. For demonstration, we show the top-3 calls for each cell and each cluster.
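
The selection rule described above can be illustrated on a toy ontology. The sketch below is not the library's implementation: the ontology, the scores, and the function names are hypothetical, distance from the root is measured as the number of hops found by breadth-first search, and ties in depth are broken by the higher score. Both of these choices are our assumptions.

```python
from collections import deque

# Toy ontology (hypothetical, heavily simplified): parent -> children
ONTOLOGY = {
    "cell": ["lymphocyte", "myeloid cell"],
    "lymphocyte": ["T cell", "B cell"],
    "T cell": ["CD8-positive, alpha-beta T cell"],
    "myeloid cell": ["monocyte"],
}


def depths_from_root(ontology, root):
    """Breadth-first search giving each term's hop distance from the root."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in ontology.get(term, []):
            if child not in depth:
                depth[child] = depth[term] + 1
                queue.append(child)
    return depth


def top_k_granular_calls(scores, depth, min_score, top_k):
    """Keep terms scoring >= min_score, then rank deepest first (ties broken by score)."""
    kept = [t for t, s in scores.items() if s >= min_score and t in depth]
    kept.sort(key=lambda t: (depth[t], scores[t]), reverse=True)
    return kept[:top_k]


depth = depths_from_root(ONTOLOGY, "cell")
# Toy relevance scores for one cell (hypothetical values)
scores = {"cell": 1.0, "lymphocyte": 0.95, "T cell": 0.9,
          "CD8-positive, alpha-beta T cell": 0.4, "B cell": 0.05}

print(top_k_granular_calls(scores, depth, min_score=0.1, top_k=3))
# ['CD8-positive, alpha-beta T cell', 'T cell', 'lymphocyte']
print(top_k_granular_calls(scores, depth, min_score=0.5, top_k=3))
# ['T cell', 'lymphocyte', 'cell']
```

Raising the threshold from 0.1 to 0.5 drops the most granular term and shifts every call one level toward the root, which is the trade-off between accuracy and granularity in miniature.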

In [None]:
import cellarium.cas.postprocessing.ontology_aware as pp
from cellarium.cas.postprocessing.cell_ontology import CellOntologyCache

with suppress_stderr():
    cl = CellOntologyCache()

#### Assign cell type calls to individual cells

In [None]:
pp.compute_most_granular_top_k_calls_single(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.1,  # minimum acceptable score for a call
    top_k=3,  # how many top calls to make?
    obs_prefix="cas_cell_type"  # prefix of the .obs columns to write the top-k calls to
)

In [None]:
adata.obs

In [None]:
sc.pl.umap(adata, color='cas_cell_type_label_1')
sc.pl.umap(adata, color='cas_cell_type_label_2')
sc.pl.umap(adata, color='cas_cell_type_label_3')

#### Assign cell type calls to predefined cell clusters

In [None]:
pp.compute_most_granular_top_k_calls_cluster(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.1,  # minimum acceptable score for a call
    cluster_label_obs_column='cluster_label',  # .obs column containing cluster labels
    top_k=3,  # how many top calls to make?
    obs_prefix='cas_cell_type_cluster'  # prefix of the .obs columns to write the top-k calls to
)

In [None]:
adata.obs

In [None]:
sc.pl.umap(adata, color='cas_cell_type_cluster_label_1')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_2')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_3')
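
For intuition about cluster-level calls, one plausible way to combine per-cell results is to average each term's relevance score over the cells of a cluster and then apply the same selection rule to the averaged scores. The sketch below shows only the averaging step on toy data. It is not necessarily what ``compute_most_granular_top_k_calls_cluster`` does internally; the function name, the inputs, and the choice of the mean are all our assumptions.

```python
from collections import defaultdict

# Toy inputs (hypothetical): a cluster label and per-term relevance scores per cell
cluster_labels = ["0", "0", "1"]
cell_scores = [
    {"T cell": 1.0, "lymphocyte": 1.0},
    {"T cell": 0.5, "lymphocyte": 1.0},
    {"monocyte": 0.75},
]


def mean_scores_per_cluster(cluster_labels, cell_scores):
    """Average each term's score over all cells of a cluster (missing terms count as 0)."""
    sums = defaultdict(lambda: defaultdict(float))
    sizes = defaultdict(int)
    for label, scores in zip(cluster_labels, cell_scores):
        sizes[label] += 1
        for term, score in scores.items():
            sums[label][term] += score
    return {
        label: {term: total / sizes[label] for term, total in terms.items()}
        for label, terms in sums.items()
    }


print(mean_scores_per_cluster(cluster_labels, cell_scores))
# {'0': {'T cell': 0.75, 'lymphocyte': 1.0}, '1': {'monocyte': 0.75}}
```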