# Getting started with Starling (ST)


In [1]:
%pip install https://github.com/camlab-bioml/starling/archive/main.zip
%pip install lightning_lite

import anndata as ad
import pandas as pd
import pytorch_lightning as pl
import torch
from lightning_lite import seed_everything
from pytorch_lightning.callbacks import EarlyStopping  # ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

from starling import starling, utility




## Setting seed for everything


In [2]:
# pl.utilities.seed.seed_everything(10, workers=True)
seed_everything(10, workers=True)


Global seed set to 10


10

## Loading annData objects


The example below reads the AnnData object stored in "sample_input.h5ad" and initializes the clustering with k-means (KM) using 10 clusters.


In [3]:
!wget https://github.com/camlab-bioml/starling/raw/main/docs/source/tutorial/sample_input.h5ad

adata = utility.init_clustering("KM", ad.read_h5ad("sample_input.h5ad"), k=10)


NotImplementedError: Equality comparisons are not supported for AnnData objects, instead compare the desired attributes.

- Users might want to arcsinh-transform the protein expression values stored in \*.h5ad (for example, 'sample_input.h5ad').
- utility.py provides an easy setup of GMM, KM (k-means), or PG (PhenoGraph) as the initial clustering.
- Default settings are applied to each method.
- k can be omitted when PG is used.
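
The arcsinh transformation mentioned above can be sketched as follows. This is a minimal illustration, not part of Starling: the helper name `arcsinh_transform`, the cofactor of 5, and the toy intensity values are all assumptions made for the example, and the input is assumed to be a dense array.

```python
import numpy as np


def arcsinh_transform(x, cofactor=5.0):
    """Return arcsinh(x / cofactor) element-wise as a float array.

    The cofactor is a user choice (5.0 here is only an assumed example value).
    """
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)


# Hypothetical raw intensities: 3 cells x 2 proteins.
raw = np.array([[0.0, 5.0], [50.0, 500.0], [1.0, 10.0]])
transformed = arcsinh_transform(raw)

# With a dense AnnData expression matrix, the same helper could be applied
# before clustering (not run here):
# adata.X = arcsinh_transform(adata.X)
```

The transformation compresses large intensities while staying close to linear near zero, so the ordering of values within each protein is preserved.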


## Setting initializations


The example below uses the default parameter settings, which are based on benchmarking results (more details in the manuscript).


In [4]:
st = starling.ST(adata)




The parameters are listed below:

- adata: AnnData object of the sample
- dist_option (default: 'T'): T for Student-T (df=2) and N for Normal (Gaussian)
- the proportion of cells anticipated to be free of segmentation errors (default: 0.6)
- model_cell_size (default: 'Y'): Y for incorporating cell size in the model and N otherwise
- cell_size_col_name (default: 'area'): the name of the cell size column in the anndata.obs dataframe
- model_zplane_overlap (default: 'Y'): Y for modeling z-plane overlap when cell size is modelled and N otherwise
  Note: if the user sets model_cell_size = 'N', then model_zplane_overlap is ignored
- model_regularizer (default: 1): Regularizer term imposed on the synthetic doublet loss (BCE)
- learning_rate (default: 1e-3): The learning rate of the Adam optimizer for STARLING

Equivalent to the example above, with the arguments passed in the order listed:
st = starling.ST(adata, 'T', 0.6, 'Y', 'area', 'Y', 1, 1e-3)


## Setting training log


Once training starts, a new directory 'log' will be created.


In [5]:
## log training results via tensorboard
log_tb = TensorBoardLogger(save_dir="log")


The training information can be viewed in TensorBoard. Please refer to the PyTorch Lightning documentation (https://lightning.ai/docs/pytorch/stable/api_references.html#profiler) for other possible loggers.


## Setting early stopping criterion


In [6]:
## set early stopping criterion
cb_early_stopping = EarlyStopping(monitor="train_loss", mode="min", verbose=False)


Training loss is monitored.
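
To make the stopping rule concrete, here is a minimal pure-Python sketch of patience-based early stopping in "min" mode. It is not the PyTorch Lightning implementation; the function name and the loss values are hypothetical, and `patience=3` with `min_delta=0.0` are used on the assumption that they match the `EarlyStopping` defaults, since the cell above does not set them.

```python
def epoch_of_early_stop(losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None.

    A loss counts as an improvement only if it is lower than the best
    value seen so far by more than min_delta ("min" mode).
    """
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Hypothetical per-epoch training losses.
losses = [90.0, 88.0, 87.0, 87.5, 87.2, 87.1, 86.0]
stop_epoch = epoch_of_early_stop(losses)
```

In this toy run the best loss is reached at epoch 2 and the following three epochs do not improve on it, so training would stop at epoch 5 and the final value would never be seen. A larger patience trades longer training for tolerance of noisy losses.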


## Training Starling


In [15]:
## train ST
trainer = pl.Trainer(
    max_epochs=100,
    accelerator="auto",
    devices="auto",
    deterministic=True,
    callbacks=[cb_early_stopping],
    logger=[log_tb],
)
trainer.fit(st)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
/poetry-env/lib/python3.10/site-packages/pytorch_lightning/callbacks/model_checkpoint.py:630: Checkpoint directory log/lightning_logs/version_2/checkpoints exists and is not empty.

  | Name | Type | Params
------------------------------
------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)


/poetry-env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (27) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 2: 100%|██████████| 27/27 [00:01<00:00, 25.18it/s, v_num=2, train_loss_step=86.10, train_loss_epoch=86.80]


## Appending STARLING results to annData object


In [8]:
## retrieve starling results
st.result()


## The following information can be retrieved from annData object:

- st.adata.varm['init_exp_centroids'] -- initial expression cluster centroids (P x C matrix)
- st.adata.varm['st_exp_centroids'] -- ST expression cluster centroids (P x C matrix)
- st.adata.uns['init_cell_size_centroids'] -- initial cell size centroids if STARLING models cell size
- st.adata.uns['st_cell_size_centroids'] -- ST cell size centroids if ST models cell size
- st.adata.obsm['assignment_prob_matrix'] -- cell assignment probability (N x C matrix)
- st.adata.obsm['gamma_assignment_prob_matrix'] -- gamma probability of two cells (N x C x C matrix)
- st.adata.obs['doublet'] -- doublet indicator
- st.adata.obs['doublet_prob'] -- doublet probabilities
- st.adata.obs['init_label'] -- initial assignments
- st.adata.obs['st_label'] -- ST assignments
- st.adata.obs['max_assign_prob'] -- ST max probabilities of assignments
  - N: # of cells; C: # of clusters; P: # of proteins


## Saving the model


In [9]:
## st object can be saved
torch.save(st, "model.pt")


model.pt will be saved in the same directory as this notebook.


## Showing STARLING results


In [10]:
st.adata


AnnData object with n_obs × n_vars = 13685 × 24
    obs: 'sample', 'id', 'x', 'y', 'area', 'area_convex', 'neighbor', 'init_label', 'st_label', 'doublet_prob', 'doublet', 'max_assign_prob'
    uns: 'init_cell_size_centroids', 'init_cell_size_variances', 'st_cell_size_centroids'
    obsm: 'assignment_prob_matrix', 'gamma_assignment_prob_matrix'
    varm: 'init_exp_centroids', 'init_exp_variances', 'st_exp_centroids'

Further analyses, such as co-occurrence or enrichment analysis, can be performed directly on this object.


In [11]:
st.adata.obs


Unnamed: 0,sample,id,x,y,area,area_convex,neighbor,init_label,st_label,doublet_prob,doublet,max_assign_prob
4_1,4,1,0.785714,7.785714,14,14,0,4,7,0.068696,0,0.931299
4_2,4,2,0.823529,22.294117,17,17,0,8,1,0.265032,0,0.455922
4_3,4,3,0.875000,79.500000,16,16,1,5,2,0.018245,0,0.981755
4_4,4,4,0.666667,270.500000,12,12,0,0,7,0.074352,0,0.925553
4_5,4,5,0.823529,279.294130,17,17,1,6,7,0.099415,0,0.852824
...,...,...,...,...,...,...,...,...,...,...,...,...
4_13681,4,13681,997.769200,754.500000,26,26,0,6,6,0.059487,0,0.940513
4_13682,4,13682,998.153900,127.615390,13,13,0,0,6,0.077304,0,0.922694
4_13683,4,13683,998.153900,160.000000,13,13,1,1,7,0.079220,0,0.919557
4_13684,4,13684,997.580600,242.580640,31,33,1,6,6,0.064449,0,0.935550


For each cell, Starling provides a doublet probability and the cluster assignment the cell would receive if it were a singlet.
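
As one possible follow-up, the sketch below keeps only cells that are not flagged as doublets and whose assignment is reasonably confident. It rebuilds the first five rows of the table above (a subset of columns) so that it runs on its own; the 0.9 cutoff on `max_assign_prob` is an assumption chosen purely for illustration.

```python
import pandas as pd

# First five rows of st.adata.obs as displayed above (subset of columns).
obs = pd.DataFrame(
    {
        "st_label": [7, 1, 2, 7, 7],
        "doublet_prob": [0.068696, 0.265032, 0.018245, 0.074352, 0.099415],
        "doublet": [0, 0, 0, 0, 0],
        "max_assign_prob": [0.931299, 0.455922, 0.981755, 0.925553, 0.852824],
    },
    index=["4_1", "4_2", "4_3", "4_4", "4_5"],
)

# Keep cells not flagged as doublets with a confident assignment.
# The 0.9 threshold is a hypothetical cutoff, not a Starling default.
confident = obs[(obs["doublet"] == 0) & (obs["max_assign_prob"] >= 0.9)]

# Number of confident cells per Starling cluster.
counts = confident["st_label"].value_counts()
```

With the real object, `obs` would simply be `st.adata.obs`; the appropriate cutoff depends on the dataset and downstream analysis.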


## Showing initial expression centroids:


In [12]:
## initial expression centroids (p x c) matrix
pd.DataFrame(st.adata.varm["init_exp_centroids"], index=st.adata.var_names)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
SMA,0.481423,1.309391,2.126096,0.579704,1.034955,2.213308,0.709806,1.934577,1.527963,0.709294
ECadherin,5.337153,0.895151,0.917277,0.957158,1.015014,0.994053,2.059105,0.89487,0.895395,0.903859
Cytokeratin,67.96431,8.256815,7.940267,12.393278,9.756522,7.235658,21.381472,8.161931,8.062714,10.28524
HLADR,9.514902,25.786489,19.902126,100.617966,22.652782,16.925838,31.314178,24.423939,27.201952,107.71537
Vimentin,26.117706,288.943787,613.416626,59.960083,207.838837,854.635254,118.783218,474.137756,373.660522,195.967804
CD28,0.274576,0.364751,0.187042,0.429406,0.396811,0.12542,0.37401,0.261919,0.32306,0.400113
CD15,11.647052,3.405212,12.111903,1.71882,5.436856,8.430137,23.184681,5.207158,2.906089,1.07273
CD45RA,2.80931,9.079441,5.92695,19.871086,9.039453,4.521348,10.214595,7.081024,8.2053,25.244133
CD66b,0.71164,0.398465,0.927381,0.232724,0.447859,0.943825,1.139344,0.591593,0.432269,0.286939
CD20,4.245859,9.033882,6.129274,68.748039,8.606109,4.270152,16.159504,6.945665,8.628428,50.045906


There are 10 centroids because k-means (KM) was initialized with k = 10 earlier.


## Showing Starling expression centroids:


In [13]:
## starling expression centroids (p x c) matrix
pd.DataFrame(st.adata.varm["st_exp_centroids"], index=st.adata.var_names)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
SMA,0.326149,0.712488,2.28981,0.648445,1.437518,2.698419,0.50052,0.667774,2.347505,0.831485
ECadherin,5.52227,0.893311,0.967771,0.702459,1.416449,1.128502,4.662328,0.654967,0.788167,0.957973
Cytokeratin,109.540916,6.996323,7.713223,8.725094,9.962206,8.033975,49.659325,6.500076,7.363036,8.508376
HLADR,4.203127,16.77821,11.952487,64.990593,25.20392,14.187275,12.846581,13.579684,20.612143,46.192272
Vimentin,19.65023,334.947998,567.298767,178.533249,312.820282,675.612854,52.012482,298.594849,362.261017,383.508514
CD28,0.215896,0.340179,0.028124,0.4447,0.363145,0.095138,0.257389,0.109894,0.24894,0.991879
CD15,8.032538,0.880836,19.500408,0.349458,9.159814,12.177804,17.571045,0.976252,2.238995,0.560232
CD45RA,0.999976,6.828957,1.569541,16.214994,10.408425,3.298193,2.739264,4.872447,6.245378,9.806373
CD66b,0.230382,0.324594,0.618504,0.226765,0.615165,0.98074,0.417512,0.260921,0.341933,0.367371
CD20,1.486347,3.349393,1.546079,36.554585,8.142458,3.129491,3.294809,3.762761,6.254664,17.027361


From here, the cluster centroids can be annotated with cell types.
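
One simple starting point for annotation, sketched below under stated assumptions, is to look up which cluster has the highest centroid for each marker. The values are copied from the Starling centroid table above, restricted to three proteins and clusters 0 to 3 to keep the example short; deciding which cell type a marker pattern represents is left to the user and is not something this snippet does.

```python
import pandas as pd

# Subset of the Starling centroid table above:
# rows are proteins, columns are clusters 0-3.
centroids = pd.DataFrame(
    {
        0: [109.540916, 4.203127, 1.486347],
        1: [6.996323, 16.77821, 3.349393],
        2: [7.713223, 11.952487, 1.546079],
        3: [8.725094, 64.990593, 36.554585],
    },
    index=["Cytokeratin", "HLADR", "CD20"],
)

# For each protein, the cluster with the largest centroid value.
top_cluster = centroids.idxmax(axis=1)

# Fold change of each centroid relative to the protein's median across
# clusters, as a rough indication of how specific a marker is to a cluster.
fold_over_median = centroids.div(centroids.median(axis=1), axis=0)
```

With the full result, `centroids` would be the data frame built from `st.adata.varm["st_exp_centroids"]` in the cell above.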


## Showing Assignment Distributions:


In [14]:
## assignment distributions (n x c matrix)
pd.DataFrame(st.adata.obsm["assignment_prob_matrix"], index=st.adata.obs.index)


Unnamed: 0,0,1,2,3,4,5,6,7,8,9
4_1,1.113865e-10,5.264620e-06,2.516918e-08,9.541354e-09,2.082846e-17,1.410759e-18,9.617835e-09,9.312990e-01,1.275393e-15,4.946894e-20
4_2,1.285833e-12,4.559219e-01,3.993468e-05,8.624844e-03,7.427827e-13,1.459287e-15,7.324902e-07,2.703802e-01,1.466999e-10,8.923895e-13
4_3,8.933782e-19,1.680738e-13,9.817547e-01,6.278157e-18,3.019483e-19,1.835814e-12,1.671272e-13,4.093614e-10,3.552699e-17,5.236710e-22
4_4,6.753823e-09,3.635960e-06,5.880216e-07,8.840111e-07,6.167183e-18,6.442944e-16,8.980181e-05,9.255529e-01,8.516341e-15,8.427887e-18
4_5,4.819373e-09,4.774594e-02,2.864405e-08,1.546968e-05,2.115557e-11,6.512540e-14,2.986656e-08,8.528239e-01,4.656289e-09,9.990017e-14
...,...,...,...,...,...,...,...,...,...,...
4_13681,1.318026e-07,6.869133e-12,1.236429e-08,2.839130e-15,3.862945e-13,8.793653e-16,9.405133e-01,1.908021e-10,3.212980e-16,4.997553e-20
4_13682,2.015888e-06,9.250998e-12,8.058909e-15,1.616779e-13,1.055600e-18,4.173725e-22,9.226938e-01,1.019577e-11,3.841519e-19,1.762351e-20
4_13683,1.464692e-11,2.885174e-05,1.192254e-03,1.018574e-06,1.798452e-19,5.618906e-16,1.272385e-06,9.195569e-01,2.865188e-15,2.730282e-19
4_13684,8.244882e-07,4.049017e-12,1.211718e-10,1.891516e-12,3.930970e-16,9.496104e-18,9.355499e-01,7.185550e-10,8.250338e-18,3.494605e-21


Currently, each cell is labeled with the cluster that has the maximum assignment probability. However, a cell may be mislabeled when its highest and second-highest probabilities are very close, so users might want to inspect such cases.
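
A minimal sketch of how such ambiguous cells could be flagged is shown below. It uses the first two rows of the assignment table above (cells 4_1 and 4_2) and compares the highest and second-highest probabilities; the 0.2 margin threshold is a hypothetical value chosen for illustration only.

```python
import numpy as np

# Assignment probabilities for cells 4_1 and 4_2, copied from the table above.
probs = np.array(
    [
        [1.113865e-10, 5.264620e-06, 2.516918e-08, 9.541354e-09, 2.082846e-17,
         1.410759e-18, 9.617835e-09, 9.312990e-01, 1.275393e-15, 4.946894e-20],
        [1.285833e-12, 4.559219e-01, 3.993468e-05, 8.624844e-03, 7.427827e-13,
         1.459287e-15, 7.324902e-07, 2.703802e-01, 1.466999e-10, 8.923895e-13],
    ]
)

# Cluster indices sorted by descending probability for each cell.
order = np.argsort(probs, axis=1)[:, ::-1]
best, second = order[:, 0], order[:, 1]

# Gap between the highest and second-highest probability per cell.
rows = np.arange(probs.shape[0])
margin = probs[rows, best] - probs[rows, second]

# Flag cells whose top two clusters are close (hypothetical threshold).
ambiguous = margin < 0.2
```

Here cell 4_1 is assigned to cluster 7 with a large margin, while cell 4_2 is assigned to cluster 1 with cluster 7 as a fairly close runner-up, so only 4_2 is flagged. On the full result, `probs` would be `st.adata.obsm["assignment_prob_matrix"]`.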
