This notebook uses an automated pipeline to label spatial transcriptomic datasets. Here's how it works:
1. Use UCE (Universal Cell Embedding) to embed both single cell and spatial transcriptomic data
2. Train a classifier (on the UCE embeddings of the single cell data) to learn cell type labels
3. Use the classifier to predict the cell type of each spatial bin
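
To make the three steps concrete, here is a minimal, self-contained sketch on toy data. It is not the `anndict` implementation: the 2-D vectors stand in for UCE embeddings, and a simple nearest-centroid rule (the hypothetical helpers `fit_centroids` and `predict`) stands in for the classifier.

```python
import math
from collections import defaultdict


def fit_centroids(embeddings, labels):
    """Average the embedding vectors of each label into one centroid per label."""
    sums = {}
    counts = defaultdict(int)
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in total] for label, total in sums.items()}


def predict(centroids, embeddings):
    """Assign each embedding the label of its nearest centroid (Euclidean distance)."""
    return [
        min(centroids, key=lambda label: math.dist(centroids[label], vec))
        for vec in embeddings
    ]


# Step 1 (assumed done): toy 2-D "embeddings" standing in for UCE vectors
sc_embeddings = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [4.9, 5.0]]
sc_labels = ["hepatocyte", "hepatocyte", "endothelial cell", "endothelial cell"]
spatial_bins = [[0.1, 0.2], [5.2, 4.8]]

# Step 2: learn labels from the single cell embeddings
centroids = fit_centroids(sc_embeddings, sc_labels)

# Step 3: predict a cell type for each spatial bin
predicted = predict(centroids, spatial_bins)
print(predicted)
```

The key point is that both datasets live in the same embedding space, so a classifier fit on one can be applied to the other.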

This tutorial uses mouse single cell data and mouse spatial data (MERSCOPE).

To start, we need the UCE embeddings of the single cell data. The CELLxGENE Census already provides UCE embeddings for its single cell data, so we can request them when downloading the data.

In [None]:
import anndict as adt

import cellxgene_census

census = cellxgene_census.open_soma(census_version="2023-12-15")
adata = cellxgene_census.get_anndata(
    census,
    organism = "mus_musculus",
    measurement_name = "RNA",
    obs_value_filter = "(tissue_general == 'heart') |  (tissue_general == 'liver')",
    obs_embeddings = ["uce"]
)

Next, we split the single cell adata into a per-tissue adata_dict so that a separate classifier can be trained on each tissue.

In [None]:
#build dict
adata_dict = adt.build_adata_dict(adata, ['tissue'], ['heart', 'liver'])

#Downsample dict and remove celltypes with a small number of cells
#This helps speed up classifier training
adata_dict = adt.resample_adata_dict(adata_dict, strata_keys=['cell_type'], min_num_cells=50, n_obs=1000)
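
The idea behind the downsampling step can be sketched in plain Python. This is an illustration only, not `anndict`'s code: the helper `stratified_downsample` is hypothetical, and it assumes that `n_obs` caps the number of cells per cell type (whether `anndict` applies the cap per stratum or overall is not stated here).

```python
import random
from collections import defaultdict


def stratified_downsample(labels, min_num_cells, n_obs, seed=0):
    """Return sorted indices to keep: drop rare labels, then cap each label at n_obs."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    keep = []
    for label, indices in by_label.items():
        if len(indices) < min_num_cells:
            continue  # too few cells for this label, so drop it entirely
        if len(indices) > n_obs:
            indices = rng.sample(indices, n_obs)  # random subset without replacement
        keep.extend(indices)
    return sorted(keep)


# Toy labels: one common type, one moderately sized type, one rare type
labels = ["hepatocyte"] * 10 + ["Kupffer cell"] * 4 + ["rare type"] * 1
kept = stratified_downsample(labels, min_num_cells=2, n_obs=5)
kept_labels = [labels[i] for i in kept]
print(kept_labels.count("hepatocyte"), kept_labels.count("Kupffer cell"), kept_labels.count("rare type"))
```

Capping the common types and dropping the rare ones keeps training fast and avoids fitting classes that have too few examples.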

Now, train a classifier (logistic regression in this case, but any classifier could be used; see the docs for more info).

In [None]:
#If the SLURM_CPUS_PER_TASK and/or SLURM_NTASKS environment variables are set, adt.get_slurm_cores() will automatically determine the number of cores to use for multithreading
from sklearn.linear_model import LogisticRegression

#When setting max_iterations=1, this function will simply train a classifier independently on each adata in adata_dict
stable_label_results = adt.stable_label_adata_dict(adata_dict,
                        feature_key='uce',
                        label_key='cell_type',
                        classifier_class=LogisticRegression,
                        max_iterations=1,
                        stability_threshold=0.01,
                        moving_average_length=5,
                        random_state=42,
                        max_iter = 1000, #args for classifier construction (here and below)
                        n_jobs=adt.get_slurm_cores())
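
For readers not running under SLURM, the following sketch shows one way a core count could be derived from those environment variables. It is a guess at the general idea, not the source of `adt.get_slurm_cores()`: the helper `guess_core_count`, the product rule, and the fallback to `os.cpu_count()` are all assumptions.

```python
import os


def guess_core_count(environ=None):
    """Hypothetical helper: estimate usable cores from SLURM variables."""
    environ = os.environ if environ is None else environ
    cpus_per_task = environ.get("SLURM_CPUS_PER_TASK")
    ntasks = environ.get("SLURM_NTASKS")
    if cpus_per_task is None and ntasks is None:
        # Not inside a SLURM job: fall back to the machine's core count
        return os.cpu_count() or 1
    # Assumption: a missing variable counts as 1, and the total is the product
    return int(cpus_per_task or 1) * int(ntasks or 1)


print(guess_core_count({"SLURM_CPUS_PER_TASK": "4", "SLURM_NTASKS": "2"}))
```

The returned integer is the kind of value that would be passed as `n_jobs` to a scikit-learn estimator.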

Great, now the cell type classifiers are trained on each tissue and ready to be applied to spatial transcriptomic data.

Next, we need to calculate the UCE embeddings of the MERSCOPE data. `anndict` has functions for that.

The first step here is to create anndata from the raw spatial data (i.e. transcript coordinates and identities stored in a file called tissue_positions.csv or tissue_positions.parquet).

In [None]:
#This dictionary should be {input_path: output_path}, where input_path is a parquet or csv file path, and output_path is where the anndata will be written
#Note, input paths can be .csv or .parquet!
paths_dict = {
    '~/dat/detected_transcripts_liver.csv': '~/dat/liver_st_merscope.h5ad',
    '~/dat/detected_transcripts_heart.csv': '~/dat/heart_st_merscope.h5ad'
    }

#This function should be used to generate adata from merscope or xenium output. For Visium use adt.build_adata_from_visium(paths_dict, hd=False) (see docs, set hd=True for Visium HD)
adt.build_adata_from_transcript_positions(paths_dict, box_size=16, step_size=16, platform="Merscope")

#Commented-out example for Visium HD
# paths_dict = {
#     '~/visium_hd_runs/liver/16_micron_binsize': '~/dat/liver_visium_hd.h5ad',
#     '~/visium_hd_runs/heart/16_micron_binsize': '~/dat/heart_visium_hd.h5ad'
#     }

#Generate adata from visium
# adt.build_adata_from_visium(paths_dict, hd=True)
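
Conceptually, building an anndata from transcript positions means grouping transcripts into spatial bins and counting genes per bin. The sketch below illustrates that idea only; it is not `anndict`'s implementation. The helper `bin_transcripts` is hypothetical, the coordinates are made up, and it assumes that a `step_size` equal to `box_size` yields non-overlapping square bins.

```python
from collections import Counter, defaultdict


def bin_transcripts(transcripts, box_size):
    """Count transcripts per gene inside non-overlapping square bins of side box_size."""
    bins = defaultdict(Counter)
    for x, y, gene in transcripts:
        # Integer tile coordinates of the bin that contains this transcript
        key = (int(x // box_size), int(y // box_size))
        bins[key][gene] += 1
    return bins


# (x, y, gene) tuples in the same length units as box_size; values are made up
transcripts = [
    (3.0, 4.0, "Alb"),
    (10.5, 15.9, "Alb"),
    (17.0, 2.0, "Myh6"),
    (31.9, 0.0, "Alb"),
]
counts = bin_transcripts(transcripts, box_size=16)
for key in sorted(counts):
    print(key, dict(counts[key]))
```

Each bin then plays the role of one observation (one row of the count matrix), with genes as the variables.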

Next, we need to compute UCE embeddings for the spatial data. Note that the function below works, but is included for demonstration purposes only; it is much slower than running UCE properly from the command line on a GPU.

In [None]:
adt.UCE_adata(['~/dat/liver_st_merscope.h5ad',
               '~/dat/heart_st_merscope.h5ad'])

In [6]:
import scanpy as sc

#Load the UCE embeddings of the spatial data as an adata_dict
#Note: it's important that the keys of st_dict match the keys of adata_dict
st_dict = {'heart' : sc.read_h5ad('~/UCE/uce_wd/heart_st_merscope_uce_adata.h5ad'),
            'liver' : sc.read_h5ad('~/UCE/uce_wd/liver_st_merscope_uce_adata.h5ad')
            }
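
Because the classifiers are matched to the spatial data by key, a quick check before predicting can catch mismatches early. The helper `check_matching_keys` below is a hypothetical convenience, not part of `anndict`, and the `None` values are placeholders for the real adata objects.

```python
def check_matching_keys(reference_dict, query_dict):
    """Raise a ValueError listing keys present in one dict but not the other."""
    missing = set(reference_dict) - set(query_dict)
    extra = set(query_dict) - set(reference_dict)
    if missing or extra:
        raise ValueError(
            f"missing keys: {sorted(missing)}; unexpected keys: {sorted(extra)}"
        )


# Placeholder dicts; in the notebook these would be adata_dict and st_dict
check_matching_keys({"heart": None, "liver": None}, {"liver": None, "heart": None})
print("keys match")
```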

In [None]:
#apply trained classifiers to the st adata to annotate it.
predicted_labels_dict = adt.predict_labels_adata_dict(
    st_dict,
    stable_label_results,
    feature_key='uce'
)

#actually assign labels back to the st adata
adt.update_adata_labels_with_predictions_dict(st_dict, predicted_labels_dict, new_label_key='predicted_cell_type')
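
The two calls above follow a predict-then-write-back pattern, applied per tissue. A minimal sketch of that pattern with plain Python objects is shown here. Everything in it is a stand-in: the classifiers are simple callables, the records are dicts rather than adata, and `predict_per_key` is a hypothetical helper, not an `anndict` function.

```python
def predict_per_key(classifiers, features_by_key):
    """Apply classifiers[key] to every feature vector stored under the same key."""
    return {
        key: [classifiers[key](vec) for vec in vectors]
        for key, vectors in features_by_key.items()
    }


# Stand-in classifiers: plain callables that threshold the first feature
classifiers = {
    "heart": lambda vec: "cardiac muscle cell" if vec[0] > 0 else "fibroblast",
    "liver": lambda vec: "hepatocyte" if vec[0] > 0 else "Kupffer cell",
}
features_by_key = {"heart": [[1.0], [-1.0]], "liver": [[0.5]]}
records = {"heart": [{}, {}], "liver": [{}]}  # one record per spatial bin

# Step 1: predict labels for each tissue with that tissue's classifier
predictions = predict_per_key(classifiers, features_by_key)

# Step 2: write the predictions back under a new label key
for key, labels in predictions.items():
    for record, label in zip(records[key], labels):
        record["predicted_cell_type"] = label
print(records)
```

Keeping prediction and assignment as separate steps makes it easy to inspect the predicted labels before they are stored.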

In [8]:
#Plot the results
adt.plot_spatial_adata_dict(st_dict, ['predicted_cell_type'])