## Cellarium Cell Annotation Service (CAS) Quickstart Tutorial

<img src="https://cellarium.ai/wp-content/uploads/2024/07/cellarium-logo-medium.png" alt="drawing" width="96"/>

_(Last Modified: 11/1/2024)_

This Notebook is a short tutorial on using Cellarium CAS. Please read the instructions and run each cell in the presented order. Once you have finished going through the tutorial, please feel free to go back and modify it as needed for annotating your own datasets.

> **Note:**
> If you are running this Notebook inside Google Colab, please note that you will not be able to save your changes. If you wish to save your changes, please make a personal copy of this Notebook by navigating to `File` -> `Save a copy in Drive`.

### Installing Cellarium CAS client library

As a first step, we need to install the Cellarium CAS client library, ``cellarium-cas``, along with all dependencies needed for visualizations. To do so, run the next cell.

> **Note:**
> If you have already installed ``cellarium-cas`` without the visualization dependencies, you should still run the next cell.

> **Note:**
> If you are running this Notebook inside Google Colab, you may be prompted to restart your session after running the next cell. Please continue with `Restart session`. You do not need to run the next cell again and can proceed directly to the next section ("Load the AnnData file").

In [None]:
!pip install --no-cache-dir cellarium-cas[vis]

### Load the AnnData file

In this tutorial, we will annotate a peripheral blood mononuclear cell (PBMC) scRNA-seq dataset from 10x Genomics.

>**Note:** The original dataset, _"10k PBMCs from a Healthy Donor (v3 chemistry)"_, can be found [here](https://www.10xgenomics.com/datasets/10-k-pbm-cs-from-a-healthy-donor-v-3-chemistry-3-standard-3-0-0).

For the purpose of this tutorial, we have selected 4,000 cells at random from the full dataset. We have additionally precomputed UMAP embeddings of these cells using a standard scanpy workflow and performed unsupervised Leiden clustering.

>**Note:** For a quick tutorial on scRNA-seq data quality control, preprocessing, embedding, and clustering using scanpy, we recommend this [tutorial](https://scanpy.readthedocs.io/en/stable/tutorials/basics/clustering-2017.html).

>**Note:** We emphasize that CAS requires raw integer mRNA counts. If you are adapting this tutorial to your own dataset and your data is already normalized and/or restricted to a small gene set (e.g. highly variable genes), it is not suitable for CAS. If you have the raw counts in an AnnData layer or stored in the ``.raw`` attribute, please make sure that the ``.X`` attribute of your AnnData file is populated with the raw counts.
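
Before submitting your own data, it can help to verify that ``.X`` really holds raw counts. The following is a minimal sketch and not part of the CAS client: the helper name `looks_like_raw_counts` is hypothetical, and the layer name `"counts"` in the comments is an assumption about how your AnnData file is organized.

```python
import numpy as np


def looks_like_raw_counts(x) -> bool:
    """Return True if every stored value is a non-negative whole number.

    Accepts a dense array or a scipy sparse matrix (hypothetical helper,
    not part of cellarium-cas).
    """
    # scipy sparse matrices keep their explicitly stored values in `.data`;
    # implicit zeros are valid counts, so checking the stored values suffices
    values = np.asarray(x.data if hasattr(x, "tocsr") else x)
    if values.size == 0:
        return True
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))


# Usage with an AnnData object (assumes `adata` is already loaded):
# if not looks_like_raw_counts(adata.X):
#     adata = adata.raw.to_adata()                # if the raw counts live in `.raw`
#     # adata.X = adata.layers["counts"].copy()   # or in a layer; "counts" is an assumed name
```

A check like this only detects normalized or log-transformed values. It cannot tell whether the gene set has been restricted, so the number of genes in ``.var`` should be inspected separately.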

In [None]:
import scanpy as sc
import warnings

# suppressing some of the informational warnings
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UserWarning)

# set default figure resolution and size
sc.set_figure_params(dpi=80)

In [None]:
# Download the sample AnnData file
!curl -O https://storage.googleapis.com/cellarium-file-system-public/cellarium-cas-tutorial/pbmc_10x_v3_4k.h5ad

In [None]:
# Read the sample AnnData object
adata = sc.read('pbmc_10x_v3_4k.h5ad')

Let us inspect the loaded AnnData file:

In [None]:
adata

The AnnData file contains a count matrix of 4000 cells x 33538 genes, a ``cluster_label`` attribute (under ``.obs``), and PCA and UMAP embeddings (under ``.obsm``).

In [None]:
adata.obs

Let us inspect the UMAP embedding already available in the AnnData file:

In [None]:
sc.pl.umap(adata)

Also, let us inspect the unsupervised Leiden clustering of the PCA embeddings for a sanity check:

In [None]:
sc.pl.umap(adata, color='cluster_label')

>**Note:** The UMAP embeddings and unsupervised clustering of the data are both **optional** and are not needed by CAS. However, these attributes are **required** for visualizing and inspecting the CAS output using our visualization tools.

Finally, let us inspect the ``.var`` attribute of the loaded AnnData file:

In [None]:
adata.var

We notice that Gene Symbols (names) serve as the index of the ``.var`` DataFrame, and Ensembl Gene IDs are provided under the ``gene_ids`` column. We take note of both for the next steps.

>**Note:** CAS requires both Gene Symbols and Ensembl Gene IDs. If you do not have either available in your AnnData file, please update your AnnData file before proceeding to the next steps. We recommend using [BioMart](http://www.ensembl.org/info/data/biomart/index.html) for converting Gene Symbols to Ensembl Gene IDs or vice versa.
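
When adapting the tutorial to your own data, a quick format check can reveal whether the column you believe holds Ensembl Gene IDs actually does. The sketch below is illustrative only: the helper `non_ensembl_ids` is hypothetical, and the pattern assumes human gene IDs (`ENSG` followed by 11 digits, optionally with a version suffix), so it would need adjusting for other species.

```python
import re

# Human Ensembl gene IDs: "ENSG" + 11 digits, with an optional ".version" suffix
ENSEMBL_HUMAN_GENE_ID = re.compile(r"^ENSG\d{11}(\.\d+)?$")


def non_ensembl_ids(ids):
    """Return the entries that do not look like human Ensembl gene IDs."""
    return [gene_id for gene_id in ids if not ENSEMBL_HUMAN_GENE_ID.match(str(gene_id))]


# Usage with the tutorial dataset (assumes `adata` is already loaded):
# suspicious = non_ensembl_ids(adata.var["gene_ids"])
# print(f"{len(suspicious)} entries do not look like Ensembl gene IDs")
```

An empty result suggests the column is suitable for ``feature_ids_column_name``. A long list usually means the column holds Gene Symbols instead.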

### Submit the loaded AnnData file to Cellarium CAS for annotation

As a first step, please populate your CAS API token in the next cell:

In [None]:
api_token = "<your-cellarium-cas-api-key>"
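
If you prefer not to paste the token into a Notebook that may be shared, one option is to read it from an environment variable. This is a minimal sketch under stated assumptions: the helper `load_api_token` and the variable name `CAS_API_TOKEN` are hypothetical and are not defined by the CAS client.

```python
import os


def load_api_token(env_var: str = "CAS_API_TOKEN") -> str:
    """Read the API token from an environment variable (the name is an assumption)."""
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(f"Environment variable {env_var} is not set or is empty")
    return token


# Usage (after exporting CAS_API_TOKEN in your shell or Colab session):
# api_token = load_api_token()
```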

You can now connect to the Cellarium CAS backend and authenticate the session with your API token:

In [None]:
from cellarium.cas.client import CASClient

cas = CASClient(api_token=api_token)

The response will contain a list of annotation models and their brief descriptions. You need to choose the model that is suitable for your dataset. For this tutorial, we set `cas_model_name` to `None`, which implies choosing the default model. The default model is suitable for annotating human scRNA-seq datasets.

In [None]:
# Select the annotation model; 'None' for choosing the default model
cas_model_name = None

At this point, we are ready to submit our AnnData file to CAS for annotation.

>**Note:** Before you proceed, you may need to modify the next cell as necessary for your dataset. CAS must be pointed to the appropriate columns in the ``.var`` DataFrame for fetching Gene Symbols and Ensembl Gene IDs. This is done by setting the ``feature_names_column_name`` and ``feature_ids_column_name`` arguments accordingly. If either appears as the index of the ``.var`` DataFrame, pass `index` as the argument value. Otherwise, pass the appropriate column name.

In [None]:
# Submit AnnData to CAS for ontology-aware cell type query
cas_ontology_aware_response = cas.annotate_matrix_cell_type_ontology_aware_strategy(
    matrix=adata,
    chunk_size=500,
    feature_ids_column_name='gene_ids',
    feature_names_column_name='index',
    cas_model_name=cas_model_name)

Let us take a quick look at the anatomy of the CAS ontology-aware cell type query response. In brief, the response is a Python object of type `CellTypeOntologyAwareResults` whose `data` attribute contains as many elements as the number of cells in the queried AnnData file:

In [None]:
type(cas_ontology_aware_response)

In [None]:
len(cas_ontology_aware_response.data)

The entry at position _i_ holds, under its `matches` attribute, a number of cell type ontology terms and their relevance scores for the _i_'th cell. Let us explore the output for one particular cell:

In [None]:
cell_index = 2425

for matching_term in cas_ontology_aware_response.data[cell_index].matches:
    print(matching_term)

Let's sort the matching cell ontology terms by relevance scores:

In [None]:
import numpy as np

sort_order = np.argsort([matching_term.score for matching_term in cas_ontology_aware_response.data[cell_index].matches])
for idx in sort_order[::-1]:
    print(cas_ontology_aware_response.data[cell_index].matches[idx])

Based on the above response, we can confidently infer that cell number 2425 is a _natural killer cell_ and, with even greater confidence, a _hematopoietic cell_. Generally, there is an inherent trade-off between the specificity of a term and its relevance score. Higher-level terms (e.g., _mononuclear cell_ or _hematopoietic cell_) tend to have stronger association confidence, while lower-level terms (e.g., _group 1 innate lymphoid cell_) typically have weaker confidence levels.

### Exploring the Cellarium CAS response

To streamline further exploration of the CAS response, we will _insert_ the response into the AnnData object using the following helper method:

In [None]:
from cellarium.cas.postprocessing import insert_cas_ontology_aware_response_into_adata

# Insert the CAS ontology-aware cell type query response into the AnnData object
insert_cas_ontology_aware_response_into_adata(cas_ontology_aware_response, adata)

This method will add the following keys to the AnnData object:

- `cas_cl_scores` added to `adata.obsm`: a relevance score matrix of type `np.ndarray` and of shape (number of cells) x (number of cell ontology terms).
- `cas_metadata` added to `adata.uns`: a dictionary containing names and labels of each cell ontology term.

Let us briefly study the corresponding values:

In [None]:
adata

In [None]:
# We expect a NumPy array with shape (4000, 2914), corresponding to 4000 cells and 2914 cell type ontology terms
print(adata.obsm['cas_cl_scores'].shape)

In [None]:
# We expect a dictionary with two keys, `cl_names` and `cl_labels`, corresponding to names and human-readable labels of each of the 2914 cell type ontology terms
print(adata.uns['cas_metadata'].keys())

In [None]:
print(f"Number of cell ontology term names: {len(adata.uns['cas_metadata']['cl_names'])}")
print("The first 10 cell ontology term names:")
for cl_name in adata.uns['cas_metadata']['cl_names'][:10]:
    print('- ' + cl_name)


In [None]:
print(f"Number of cell ontology term labels: {len(adata.uns['cas_metadata']['cl_labels'])}")
print("The first 10 cell ontology term labels:")
for cl_label in adata.uns['cas_metadata']['cl_labels'][:10]:
    print('- ' + cl_label)
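
Because the columns of the score matrix follow the order of the term labels, the scores of a single term can be pulled out for all cells at once. The following is a minimal sketch, assuming a dense score matrix as described above; the helper `scores_for_label` is hypothetical and not part of the CAS client.

```python
import numpy as np


def scores_for_label(scores, cl_labels, label):
    """Return the relevance score of one ontology term for every cell.

    `scores` is a dense (cells x terms) array; `cl_labels` lists the term
    labels in column order. Raises ValueError if the label is absent.
    """
    column = list(cl_labels).index(label)
    return np.asarray(scores)[:, column]


# Usage with the tutorial dataset (assumes the CAS response has been inserted):
# adata.obs["nk_score"] = scores_for_label(
#     adata.obsm["cas_cl_scores"],
#     adata.uns["cas_metadata"]["cl_labels"],
#     "natural killer cell",
# )
# sc.pl.umap(adata, color="nk_score")
```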

For a more streamlined and holistic visualization of the CAS response, we recommend using our provided ``CASCircularTreePlotUMAPDashApp`` Dash App.

>**Note:** The app requires the CAS response to be already inserted into the AnnData file. If you have not run the previous cells, please make sure you do so!

>**Tooltip:** The visualization displays various cell type ontology terms as colored circles in a circular dendrogram. The relationships underlying this dendrogram correspond to "_is_a_" relationships from [Cell Ontology](https://obofoundry.org/ontology/cl.html) (CL). Since these relationships are not mutually exclusive, a term can have multiple parent terms, meaning the same term can appear along different branches of the tree representation. The radius of each circle (whether it is a clade or a leaf node) signifies the occurrence of the term in the entire dataset, regardless of its relevance score. The color of the circle indicates the relevance score of the term in cells where it was found to have non-vanishing relevance.
>
> Here are some of the interactive capabilities of the visualization app:
> - **Cell selection:** By default, all cells are selected, and the cell type ontology dendrogram shows an aggregated summary over all cells. You can restrict the aggregation to a subset of cells by selecting your desired subset over the UMAP scatter plot, either by clicking a single cell or by using the rectangular select or lasso select tool. The dendrogram will react to your custom cell selection. If your input AnnData file includes clustering, you can restrict score aggregation to each cluster by selecting your cluster in the Settings panel (accessible via the gear icon in the upper right of the app).
> - **Highlighting ontology term relevance scores:** You can highlight cell type ontology term relevance scores over the UMAP scatter plot by clicking on the circles in the dendrogram. Only the selected cells will be scored, and the rest will be grayed out. You can revert to selecting all cells from the settings panel or by using the rectangular select tool to select all cells.
> - **Studying the ontology term relevance scores for a single cell:** You can display the term relevance scores for individual cells by clicking on a single cell in the UMAP scatter plot.
> - **Advanced settings:** By default, only terms above a specified relevance threshold with occurrence above another threshold over the selected cells are shown. You can modify these thresholds in the Settings panel (accessible via the gear icon in the upper right of the app).
>
>**Note**: The number of cells displayed should be limited to roughly 50K. Beyond that, performance of the Dash App may suffer. If you need to visualize more cells, consider downsampling them first.
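
For datasets above the suggested limit, uniform random downsampling is one simple option. The sketch below is illustrative: the helper `downsample_indices` is hypothetical, and uniform sampling may under-represent rare cell populations, so a cluster-aware scheme could be preferable for some datasets.

```python
import numpy as np


def downsample_indices(n_cells: int, max_cells: int = 50_000, seed: int = 0) -> np.ndarray:
    """Return sorted indices of at most `max_cells` cells, drawn without replacement."""
    if n_cells <= max_cells:
        return np.arange(n_cells)
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_cells, size=max_cells, replace=False))


# Usage with an AnnData object (assumes `adata` is already loaded):
# adata_small = adata[downsample_indices(adata.n_obs)].copy()
```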

In [None]:
from cellarium.cas._io import suppress_stderr
from cellarium.cas.visualization import CASCircularTreePlotUMAPDashApp

DASH_SERVER_PORT = 8050

with suppress_stderr():
    CASCircularTreePlotUMAPDashApp(
        adata=adata,  # the AnnData file
        root_node="CL_0000255",  # set the CL root node to "eukaryotic cell"
        cluster_label_obs_column="cluster_label",  # (optional) The .obs column name containing cluster labels
    ).run(port=DASH_SERVER_PORT, debug=False, jupyter_width="100%")

### Best cell type label assignment

Oftentimes, we just want the best cell type labels rather than a scored cell type ontology graph. Such a call _can_ be made, but with the understanding that the best cell type call for any given cell is **not** well-defined in general. As mentioned earlier, coarse ontology terms (e.g. T cell, B cell) often have higher relevance scores, whereas more granular labels (e.g. CD8-positive, alpha-beta T cell, IgG-negative class switched memory B cell) often have lower relevance scores. If the _best call_ is construed as _the most confident call_, then such a call will naturally be too coarse and uninformative. Therefore, the best cell type call must be understood as a decision made given the inherent trade-off between confidence and granularity.

CAS provides a simple mechanism for navigating this trade-off. Our current notion of the best cell type call is the one that is furthest away from the root node (i.e. _cell_) while at the same time having a relevance score above a user-provided threshold. This definition allows us to sort the cell type ontology terms and report the top-_k_ calls for each cell.
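
To make this definition concrete, here is a toy sketch of the selection rule. It is not the CAS implementation: the function names, the tiny hierarchy, and the scores are made up for illustration, and distance from the root is assumed to be the shortest hop count (the real ontology is a graph in which a term can have several parents).

```python
from collections import deque


def depth_from_root(children, root):
    """Shortest hop count from the root to every reachable term (breadth-first search)."""
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in depth:
                depth[child] = depth[node] + 1
                queue.append(child)
    return depth


def most_granular_top_k(scores, children, root, min_acceptable_score, top_k):
    """Keep terms scoring at or above the threshold, deepest first (ties broken by score)."""
    depth = depth_from_root(children, root)
    eligible = [t for t, s in scores.items() if s >= min_acceptable_score and t in depth]
    eligible.sort(key=lambda t: (depth[t], scores[t]), reverse=True)
    return eligible[:top_k]


# Toy hierarchy and scores for one cell (illustrative values only)
children = {
    "cell": ["hematopoietic cell"],
    "hematopoietic cell": ["lymphocyte"],
    "lymphocyte": ["natural killer cell", "T cell"],
}
scores = {
    "cell": 1.0,
    "hematopoietic cell": 0.95,
    "lymphocyte": 0.9,
    "natural killer cell": 0.6,
    "T cell": 0.1,
}

print(most_granular_top_k(scores, children, "cell", min_acceptable_score=0.2, top_k=3))
# → ['natural killer cell', 'lymphocyte', 'hematopoietic cell']
```

Raising the threshold to 0.7 in this toy example excludes _natural killer cell_, so the top call falls back to the coarser _lymphocyte_. This is the confidence versus granularity trade-off in miniature.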

In the next cells, we obtain and visualize the top-3 cell type calls for each cell and each cluster.

In [None]:
import cellarium.cas.postprocessing.ontology_aware as pp
from cellarium.cas._io import suppress_stderr  # repeated here so this cell runs on its own
from cellarium.cas.postprocessing.cell_ontology import CellOntologyCache

with suppress_stderr():
    cl = CellOntologyCache()

>**Note:** The following steps assume that the CAS response has already been inserted into the AnnData file. If you skipped any previous steps, please ensure you run the next cell:

In [None]:
from cellarium.cas.postprocessing import insert_cas_ontology_aware_response_into_adata

# Insert the CAS ontology-aware cell type query response into the AnnData object
insert_cas_ontology_aware_response_into_adata(cas_ontology_aware_response, adata)

#### Assign cell type calls to individual cells

In [None]:
# Make the top-3 call for each cell and add the results to adata.obs
pp.compute_most_granular_top_k_calls_single(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.2,  # minimum acceptable evidence score for a cell type call
    top_k=3,  # how many top calls to make?
    obs_prefix="cas_cell_type"  # .obs column to write the top-k calls to
)

>**Note:** If you are running this tutorial on your own dataset, you may need to tune the parameter `min_acceptable_score` to obtain the optimal annotations for your dataset.

>**Note:** Calling the method `compute_most_granular_top_k_calls_single` adds the top-_k_ cell type ontology names and labels to `adata.obs` for each cell. Let us inspect the resulting `adata.obs` DataFrame:

In [None]:
adata.obs

In [None]:
sc.pl.umap(adata, color='cas_cell_type_label_1')
sc.pl.umap(adata, color='cas_cell_type_label_2')
sc.pl.umap(adata, color='cas_cell_type_label_3')

#### Assign cell type calls to predefined cell clusters

In [None]:
# Make the top-3 call for each cluster and add the results to adata.obs
pp.compute_most_granular_top_k_calls_cluster(
    adata=adata,
    cl=cl,
    min_acceptable_score=0.2,  # minimum acceptable evidence score for a cell type call
    cluster_label_obs_column='cluster_label',  # .obs column containing cluster labels
    top_k=3,  # how many top calls to make?
    obs_prefix='cas_cell_type_cluster'  # .obs column to write the top-k calls to
)

>**Note:** If you are running this tutorial on your own dataset, you may need to tune the parameter `min_acceptable_score` to obtain the optimal annotations for your dataset.

>**Note:** Calling the method `compute_most_granular_top_k_calls_cluster` adds the top-_k_ cell type ontology names and labels to `adata.obs` for each cell. These labels are derived by aggregating CAS relevance scores across user-defined cell clusters, assigning the same labels to all cells within the same cluster. Let us inspect the resulting `adata.obs` DataFrame:

In [None]:
adata.obs

In [None]:
sc.pl.umap(adata, color='cas_cell_type_cluster_label_1')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_2')
sc.pl.umap(adata, color='cas_cell_type_cluster_label_3')
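
The cluster-level calls above rely on aggregating relevance scores across the cells of each cluster. As a rough illustration of that idea, the sketch below averages a dense score matrix per cluster. The helper `mean_scores_per_cluster` is hypothetical, and the use of the mean is an assumption: the aggregation actually performed by `compute_most_granular_top_k_calls_cluster` may differ.

```python
import numpy as np


def mean_scores_per_cluster(scores, cluster_labels):
    """Average a (cells x terms) score matrix over the cells of each cluster.

    Returns a dict mapping each cluster label to a vector of per-term means.
    """
    scores = np.asarray(scores, dtype=float)
    cluster_labels = np.asarray(cluster_labels)
    return {
        label: scores[cluster_labels == label].mean(axis=0)
        for label in np.unique(cluster_labels)
    }


# Usage with the tutorial dataset (assumes the CAS response has been inserted):
# per_cluster = mean_scores_per_cluster(
#     adata.obsm["cas_cl_scores"], adata.obs["cluster_label"].to_numpy()
# )
```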