## Using scMulan to annotate cell types in Heart, Lung, Liver, Bone marrow, Blood, Brain, and Thymus

#### We provide a liver dataset subsampled at 20% from Suo C, 2022 (doi/10.1126/science.abo0510)
You can download the sampled dataset for this tutorial from: https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1  
The model checkpoint (ckpt) can be downloaded from: https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1

In [1]:
import torch
torch.cuda.is_available()

True

In [2]:
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0"  # set your available devices; each uses ~2 GB of GPU memory
# os.environ["CUDA_VISIBLE_DEVICES"] = "-1"  # uncomment to use the CPU only
import scanpy as sc

import scMulan
from scMulan import GeneSymbolUniform

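Note: this cell fails with `ModuleNotFoundError: No module named 'transformers'` when the `transformers` package is not installed. Install it (for example with `pip install transformers`) and rerun the cell.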

## 1. load h5ad
We recommend using an h5ad file that contains raw counts and has already passed your quality control (QC).
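
A quick sanity check can tell you whether a matrix still holds raw counts. The sketch below is not part of scMulan; `looks_like_raw_counts` is a hypothetical helper that assumes raw counts are non-negative whole numbers, and it accepts either a dense array or a SciPy sparse matrix such as `adata.X`.

```python
import numpy as np

def looks_like_raw_counts(X, n_check=100_000):
    """Heuristic: raw counts are non-negative whole numbers (hypothetical helper)."""
    # Sparse matrices (anything exposing .tocsr) keep their non-zero values in .data
    values = X.data if hasattr(X, "tocsr") else np.asarray(X)
    values = np.asarray(values).ravel()[:n_check]  # inspect only a prefix for speed
    if values.size == 0:
        return True  # nothing stored, so nothing contradicts the assumption
    return bool(np.all(values >= 0) and np.all(np.mod(values, 1) == 0))

# Toy examples; in the notebook you would call looks_like_raw_counts(adata.X)
print(looks_like_raw_counts(np.array([[0, 3], [1, 0]])))
print(looks_like_raw_counts(np.array([[0.0, 1.39], [0.69, 0.0]])))
```

If the check returns False, the matrix has probably been normalized or log-transformed already, and you may want to look for the raw counts elsewhere in your file before continuing.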

In [None]:
## adata = sc.read('Data/liver.h5ad', backup_url='https://cloud.tsinghua.edu.cn/f/45a7fd2a27e543539f59/?dl=1')

In [2]:
adata = sc.read('liver_test.h5ad')
adata

AnnData object with n_obs × n_vars = 27436 × 43878
    obs: 'cid', 'seq_tech', 'donor_ID', 'donor_gender', 'donor_age', 'donor_status', 'original_name', 'organ', 'region', 'subregion', 'sample_status', 'treatment', 'ethnicity', 'cell_type', 'cell_id', 'study_id'
    var: 'vst.mean', 'vst.variance', 'vst.variance.expected', 'vst.variance.standardized', 'vst.variable'
    obsm: 'umap'

## 2. transform original h5ad with uniformed genes (42117 genes)

This step maps the genes in the input adata onto a unified list of 42117 gene symbols and preserves the corresponding gene expression values.

In [None]:
adata_GS_uniformed = GeneSymbolUniform(input_adata=adata,
                                 output_dir="Data/",
                                 output_prefix='liver')
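
Conceptually, mapping a matrix onto a fixed gene list means reordering its columns and filling genes that are absent from the input with zeros. The sketch below illustrates only that idea on a small dense matrix. `align_to_reference` is a hypothetical helper, not the implementation of `GeneSymbolUniform`, which may handle gene names differently or do more than this.

```python
import numpy as np

def align_to_reference(X, genes, reference_genes):
    """Reorder and pad the columns of a dense cells x genes matrix (illustration only)."""
    column_of = {gene: j for j, gene in enumerate(genes)}
    out = np.zeros((X.shape[0], len(reference_genes)), dtype=X.dtype)
    for j, gene in enumerate(reference_genes):
        if gene in column_of:
            # Gene exists in the input: copy its expression values unchanged
            out[:, j] = X[:, column_of[gene]]
        # Genes missing from the input stay at zero
    return out

X = np.array([[1, 2, 3],
              [4, 5, 6]])
aligned = align_to_reference(X, ["A", "B", "C"], ["C", "X", "A"])
print(aligned)
```

Input genes that are not in the reference list (here `B`) are dropped, which matches the change in shape from 43878 to 42117 columns only in spirit.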

## 3. process uniformed data (simply norm and log1p)

In [3]:
## you can read the saved uniformed adata

adata_GS_uniformed=sc.read_h5ad('Data/liver_uniformed.h5ad')

In [4]:
adata_GS_uniformed

AnnData object with n_obs × n_vars = 27436 × 42117
    obs: 'cid', 'seq_tech', 'donor_ID', 'donor_gender', 'donor_age', 'donor_status', 'original_name', 'organ', 'region', 'subregion', 'sample_status', 'treatment', 'ethnicity', 'cell_type', 'cell_id', 'study_id'

In [5]:
# normalize and log1p-transform the count matrix (skipped if the values already look log-transformed)
if adata_GS_uniformed.X.max() > 10:
    sc.pp.normalize_total(adata_GS_uniformed, target_sum=1e4) 
    sc.pp.log1p(adata_GS_uniformed)
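
For reference, the two scanpy calls above scale every cell to a total of 10,000 counts and then apply log(1 + x). The NumPy sketch below reproduces that arithmetic on a toy dense matrix (`normalize_and_log1p` is a hypothetical name). It also suggests why a threshold of 10 is a reasonable guard: after this transform no value can exceed log(1 + 10,000) ≈ 9.21, so a maximum above 10 indicates that the matrix has not been log-transformed yet. That reading of the guard is an inference from the arithmetic, not something the tutorial states.

```python
import numpy as np

def normalize_and_log1p(counts, target_sum=1e4):
    """Scale each row (cell) to target_sum total counts, then apply log(1 + x)."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # avoid division by zero for empty cells
    return np.log1p(counts / totals * target_sum)

counts = np.array([[1, 3, 0],
                   [10, 0, 10]])
normed = normalize_and_log1p(counts)

# Undoing the log recovers rows that each sum to target_sum
print(np.expm1(normed).sum(axis=1))
# The largest possible value after the transform
print(np.log1p(1e4))
```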

## 4. load scMulan

In [6]:
# first download the checkpoint from https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1
# and save it as ckpt/ckpt_scMulan.pt, for example with:
# wget 'https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1' -O ckpt/ckpt_scMulan.pt

ckp_path = 'ckpt/ckpt_scMulan.pt'
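
If you prefer to fetch the checkpoint from Python instead of wget, the following is a minimal sketch using only the standard library. `ensure_checkpoint` is a hypothetical helper; it downloads the file only when nothing is present at the given path, and it does not verify the file's integrity.

```python
import os
import urllib.request

CKPT_URL = "https://cloud.tsinghua.edu.cn/f/2250c5df51034b2e9a85/?dl=1"

def ensure_checkpoint(path="ckpt/ckpt_scMulan.pt", url=CKPT_URL):
    """Return path, downloading the checkpoint first if it is missing (hypothetical helper)."""
    if os.path.isfile(path) and os.path.getsize(path) > 0:
        return path  # already downloaded, nothing to do
    # Create the target folder (e.g. ckpt/) if it does not exist yet
    os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
    urllib.request.urlretrieve(url, path)
    return path

# In the notebook:
# ckp_path = ensure_checkpoint('ckpt/ckpt_scMulan.pt')
```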

In [7]:
scml = scMulan.model_inference(ckp_path, adata_GS_uniformed)
base_process = scml.cuda_count()

2024-10-02 17:40:25.153 | INFO     | scMulan.model.model:__init__:119 - number of parameters: 368.80M

✅ adata passed check
👸 scMulan is ready
scMulan is currently available to 0 GPUs.


In [8]:
import torch

torch.cuda.is_available()

False

In [None]:
scml.get_cell_types_and_embds_for_adata(parallel=True, n_process=base_process)
# scml.get_cell_types_and_embds_for_adata(parallel=False)  # CPU-only alternative; much slower
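
The output above reports 0 available GPUs, in which case the CPU-only call may be the one you need. The sketch below picks the call arguments from the value returned by `scml.cuda_count()`. It assumes that this value is the number of usable GPU processes and is 0 on a CPU-only machine, which the tutorial does not state explicitly. `inference_kwargs` is a hypothetical helper, and it only produces the two call forms shown above.

```python
def inference_kwargs(n_gpu_processes):
    """Choose arguments for get_cell_types_and_embds_for_adata (hypothetical helper)."""
    if n_gpu_processes and n_gpu_processes > 0:
        # At least one GPU process is available: run in parallel
        return {"parallel": True, "n_process": n_gpu_processes}
    # Assumed CPU-only case: fall back to the slow, non-parallel call
    return {"parallel": False}

print(inference_kwargs(0))
print(inference_kwargs(2))

# In the notebook:
# scml.get_cell_types_and_embds_for_adata(**inference_kwargs(base_process))
```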

The predicted cell types are stored in scml.adata.obs['cell_type_from_scMulan']. The cell embeddings, intended for multi-batch integration and not used in this tutorial, are stored in scml.adata.obsm['X_scMulan'].
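
Because this dataset also carries existing annotations, you can get a rough sense of the predictions by comparing columns. The sketch below is one simple way to do that. `label_agreement` is a hypothetical helper that counts exact string matches, so it is only meaningful when both columns use the same cell type names.

```python
from collections import Counter

def label_agreement(reference, predicted):
    """Return (share of exact matches, Counter of mismatching (reference, predicted) pairs)."""
    pairs = list(zip(reference, predicted))
    if not pairs:
        return 0.0, Counter()
    mismatches = Counter((r, p) for r, p in pairs if r != p)
    matched = len(pairs) - sum(mismatches.values())
    return matched / len(pairs), mismatches

# Toy labels; in the notebook you could pass
# scml.adata.obs['cell_type'] and scml.adata.obs['cell_type_from_scMulan']
share, mismatches = label_agreement(
    ["NK cell", "NK cell", "Kupffer cell", "Kupffer cell"],
    ["NK cell", "T cell", "Kupffer cell", "Kupffer cell"],
)
print(share)
print(mismatches.most_common(3))
```

The most common mismatching pairs are usually more informative than the overall share, since they show which cell types are being confused with each other.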

## 5. visualization

In [13]:
adata_mulan = scml.adata.copy()

In [None]:
sc.pp.pca(adata_mulan)
sc.pl.pca_variance_ratio(adata_mulan)
sc.pp.neighbors(adata_mulan,n_pcs=10)
sc.tl.umap(adata_mulan)
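
The value `n_pcs=10` above is presumably read off the variance ratio plot. If you want a numeric rule instead, one option is to keep the smallest number of PCs that covers a chosen share of the variance explained by the computed PCs. This is a sketch of that heuristic, not a recommendation from the scMulan authors; `n_pcs_for_share` and its 0.9 default are assumptions. In scanpy, the ratios are stored in `adata.uns['pca']['variance_ratio']` after `sc.pp.pca`.

```python
import numpy as np

def n_pcs_for_share(variance_ratio, share=0.9):
    """Smallest number of leading PCs whose variance covers `share` of the computed PCs' total."""
    ratios = np.asarray(variance_ratio, dtype=float)
    cumulative = np.cumsum(ratios) / ratios.sum()
    # First position where the cumulative share reaches the target, as a 1-based count
    return int(np.searchsorted(cumulative, share) + 1)

toy_ratios = [0.5, 0.3, 0.1, 0.1]
print(n_pcs_for_share(toy_ratios, share=0.85))

# In the notebook:
# n_pcs = n_pcs_for_share(adata_mulan.uns['pca']['variance_ratio'])
```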

In [None]:
# optionally run the smoothing function to filter out likely false positives
scMulan.cell_type_smoothing(adata_mulan, threshold=0.1)
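
The tutorial does not describe how the smoothing works internally. As a rough intuition only, the sketch below shows one generic form of label smoothing: within each cluster, labels that make up less than `threshold` of the cells are replaced by the cluster's majority label. This is an assumption for illustration and may differ from what `scMulan.cell_type_smoothing` actually does; `smooth_labels` and its inputs are hypothetical.

```python
from collections import Counter, defaultdict

def smooth_labels(labels, clusters, threshold=0.1):
    """Replace labels that are rare within their cluster by the cluster majority (illustration)."""
    labels = list(labels)
    members = defaultdict(list)
    for i, cluster in enumerate(clusters):
        members[cluster].append(i)
    smoothed = list(labels)
    for indices in members.values():
        counts = Counter(labels[i] for i in indices)
        majority = counts.most_common(1)[0][0]
        for i in indices:
            # A label counts as rare if its share of the cluster is below the threshold
            if counts[labels[i]] / len(indices) < threshold:
                smoothed[i] = majority
    return smoothed

# Cluster 0 has 19 "A" cells and a single "B" cell (5%); cluster 1 is all "B"
toy_labels = ["A"] * 19 + ["B"] + ["B"] * 5
toy_clusters = [0] * 20 + [1] * 5
result = smooth_labels(toy_labels, toy_clusters, threshold=0.1)
print(Counter(result))
```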

In [None]:
# cell_type_from_scMulan: scMulan's prediction
# cell_type_from_mulan_smoothing: prediction after smoothing
# original_name: original annotations by the authors
# cell_type: cell types in hECA-10M, obtained by mapping original_name to uHAF

sc.pl.umap(adata_mulan,color=["cell_type_from_scMulan","cell_type_from_mulan_smoothing",'cell_type','original_name'],ncols=1)

In [17]:
top_celltypes = adata_mulan.obs.cell_type_from_scMulan.value_counts().index[:20]

In [None]:
# you can select some cell types of interest (from scMulan's prediction) for visualization
# selected_cell_types = ["NK cell", "Kupffer cell", "Conventional dendritic cell 2"]  # as an example
selected_cell_types = top_celltypes
scMulan.visualize_selected_cell_types(adata_mulan,selected_cell_types,smoothing=True)