# Getting started with Starling (ST)


In [5]:
import anndata as ad
import pandas as pd
import pytorch_lightning as pl
import torch
from lightning_lite import seed_everything
from pytorch_lightning.callbacks import EarlyStopping  # ModelCheckpoint
from pytorch_lightning.loggers import TensorBoardLogger

from starling import starling, utility


## Setting seed for everything


In [6]:
# pl.utilities.seed.seed_everything(10, workers=True)
seed_everything(10, workers=True)


Global seed set to 10


10

## Loading AnnData objects


The example below reads the "sample_input.h5ad" object and runs k-means on it with 10 clusters.


In [7]:
adata = utility.init_clustering(ad.read_h5ad("sample_input.h5ad"), "KM", k=10)


- Users may want to arcsinh-transform the protein expression values in the \*.h5ad file (for example, 'sample_input.h5ad').
- utility.py provides an easy setup of GMM, KM (k-means) or PG (PhenoGraph) for the initial clustering.
- Default settings are applied to each method.
- k can be omitted when PG is used.
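
As a minimal sketch of the arcsinh transform mentioned in the first note, the snippet below applies it to a toy expression matrix with NumPy. The helper name `arcsinh_transform`, the toy values, and the cofactor of 5 are assumptions for illustration, not Starling defaults; with an AnnData object, the same call would typically be applied to `adata.X` before the initial clustering.

```python
import numpy as np


def arcsinh_transform(x, cofactor=5.0):
    """Return arcsinh(x / cofactor); the cofactor is a hypothetical choice."""
    return np.arcsinh(np.asarray(x, dtype=float) / cofactor)


# Toy expression matrix (cells x proteins), made up for illustration.
raw = np.array([[0.0, 5.0], [50.0, 500.0]])
transformed = arcsinh_transform(raw)

# Large values are compressed while zeros stay at zero.
print(transformed.round(3))
```

The transform is roughly linear near zero and logarithmic for large values, which is why it is a common choice for skewed expression data.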


## Setting initializations


The example below uses default parameter settings based on benchmarking results (more details in the manuscript).


In [8]:
st = starling.ST(adata)




The parameters are listed below:

- adata: AnnData object of the sample
- dist_option (default: 'T'): T for Student-T (df=2) and N for Normal (Gaussian)
- the proportion of cells anticipated to be free of segmentation errors (default: 0.6)
- model_cell_size (default: 'Y'): Y for incorporating cell size in the model and N otherwise
- cell_size_col_name (default: 'area'): area is the column name in the anndata.obs dataframe
- model_zplane_overlap (default: 'Y'): Y for modeling z-plane overlap when cell size is modelled and N otherwise
  Note: if the user sets model_cell_size = 'N', then model_zplane_overlap is ignored
- model_regularizer (default: 1): Regularizer term imposed on the synthetic doublet loss (BCE)
- learning_rate (default: 1e-3): The learning rate of the Adam optimizer for STARLING

This is equivalent to the example above, with every default passed explicitly in the order listed:
st = starling.ST(adata, 'T', 0.6, 'Y', 'area', 'Y', 1, 1e-3)


## Setting training log


Once training starts, a new directory 'log' will be created.


In [9]:
## log training results via tensorboard
log_tb = TensorBoardLogger(save_dir="log")


The training information can be viewed with TensorBoard. Please refer to the PyTorch Lightning documentation (https://lightning.ai/docs/pytorch/stable/api_references.html#profiler) for other available loggers.


## Setting early stopping criterion


In [10]:
## set early stopping criterion
cb_early_stopping = EarlyStopping(monitor="train_loss", mode="min", verbose=False)


Training loss is monitored.
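
To make the stopping rule concrete, here is a minimal pure-Python sketch of patience-based early stopping in "min" mode. The function name `first_stop_epoch` and the toy loss values are hypothetical, and `patience=3` with `min_delta=0.0` are assumed here to match the callback's defaults; this illustrates the idea rather than reproducing the callback's implementation.

```python
def first_stop_epoch(losses, patience=3, min_delta=0.0):
    """Return the epoch index at which training would stop, or None."""
    best = float("inf")
    wait = 0  # consecutive epochs without improvement
    for epoch, loss in enumerate(losses):
        if loss < best - min_delta:
            best = loss
            wait = 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None


# Toy loss curve (made up): the loss plateaus at 3.0 for three epochs.
print(first_stop_epoch([5.0, 4.0, 3.0, 3.0, 3.0, 3.0, 2.0]))
# A strictly decreasing curve never triggers the rule.
print(first_stop_epoch([5.0, 4.0, 3.0, 2.0]))
```

If the loss keeps improving, the rule never fires and training runs until `max_epochs` is reached.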


## Training Starling


In [11]:
## train ST
trainer = pl.Trainer(
    max_epochs=100,
    accelerator="auto",
    devices="auto",
    deterministic=True,
    callbacks=[cb_early_stopping],
    logger=[log_tb],
)
trainer.fit(st)


GPU available: False, used: False
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
Missing logger folder: log/lightning_logs



  | Name | Type | Params
------------------------------
------------------------------
0         Trainable params
0         Non-trainable params
0         Total params
0.000     Total estimated model params size (MB)
/poetry-env/lib/python3.10/site-packages/pytorch_lightning/loops/fit_loop.py:293: The number of training batches (27) is smaller than the logging interval Trainer(log_every_n_steps=50). Set a lower value for log_every_n_steps if you want to see logs for the training epoch.


Epoch 0:   0%|          | 0/27 [00:00<?, ?it/s] 



Epoch 99: 100%|██████████| 27/27 [00:01<00:00, 24.34it/s, v_num=0, train_loss_step=58.10, train_loss_epoch=58.40]

`Trainer.fit` stopped: `max_epochs=100` reached.


Epoch 99: 100%|██████████| 27/27 [00:01<00:00, 24.24it/s, v_num=0, train_loss_step=58.10, train_loss_epoch=58.40]


## Appending STARLING results to AnnData object


In [12]:
## retrieve starling results
st.result()


## The following information can be retrieved from the AnnData object:

- st.adata.varm['init_exp_centroids'] -- initial expression cluster centroids (P x C matrix)
- st.adata.varm['st_exp_centroids'] -- ST expression cluster centroids (P x C matrix)
- st.adata.uns['init_cell_size_centroids'] -- initial cell size centroids if STARLING models cell size
- st.adata.uns['st_cell_size_centroids'] -- ST cell size centroids if ST models cell size
- st.adata.obsm['assignment_prob_matrix'] -- cell assignment probability (N x C matrix)
- st.adata.obsm['gamma_assignment_prob_matrix'] -- gamma probability of two cells (N x C x C matrix)
- st.adata.obs['doublet'] -- doublet indicator
- st.adata.obs['doublet_prob'] -- doublet probabilities
- st.adata.obs['init_label'] -- initial assignments
- st.adata.obs['st_label'] -- ST assignments
- st.adata.obs['max_assign_prob'] -- ST max probabilities of assignments

Here N is the number of cells, C is the number of clusters, and P is the number of proteins.
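
As a sketch of how these fields can be used, the snippet below compares the initial and ST assignments with plain pandas. The nine rows are copied from the `st.adata.obs` preview in this notebook; in practice the same two lines would be applied to `st.adata.obs` directly. The variable names are illustrative.

```python
import pandas as pd

# init_label / st_label values copied from the st.adata.obs preview.
obs = pd.DataFrame(
    {
        "init_label": [4, 8, 5, 0, 6, 6, 0, 1, 6],
        "st_label": [7, 7, 2, 7, 7, 6, 6, 7, 6],
    },
    index=["4_1", "4_2", "4_3", "4_4", "4_5",
           "4_13681", "4_13682", "4_13683", "4_13684"],
)

# Cells whose cluster changed between initialization and Starling.
changed = obs["init_label"] != obs["st_label"]
print(f"{changed.sum()} of {len(obs)} previewed cells changed cluster")

# Contingency table: rows are initial clusters, columns are ST clusters.
table = pd.crosstab(obs["init_label"], obs["st_label"])
print(table)
```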


## Saving the model


In [13]:
## st object can be saved
torch.save(st, "model.pt")


model.pt will be saved in the same directory as this notebook.


## Showing STARLING results


In [14]:
st.adata


AnnData object with n_obs × n_vars = 13685 × 24
    obs: 'sample', 'id', 'x', 'y', 'area', 'area_convex', 'neighbor', 'init_label', 'st_label', 'doublet_prob', 'doublet', 'max_assign_prob'
    uns: 'init_cell_size_centroids', 'init_cell_size_variances', 'st_cell_size_centroids'
    obsm: 'assignment_prob_matrix', 'gamma_assignment_prob_matrix'
    varm: 'init_exp_centroids', 'init_exp_variances', 'st_exp_centroids'

One could easily perform further analyses, such as co-occurrence or enrichment analysis.


In [15]:
st.adata.obs


,sample,id,x,y,area,area_convex,neighbor,init_label,st_label,doublet_prob,doublet,max_assign_prob
4_1,4,1,0.785714,7.785714,14,14,0,4,7,0.124439,0,0.875560
4_2,4,2,0.823529,22.294117,17,17,0,8,7,0.213168,0,0.783933
4_3,4,3,0.875000,79.500000,16,16,1,5,2,0.039462,0,0.960538
4_4,4,4,0.666667,270.500000,12,12,0,0,7,0.123766,0,0.876190
4_5,4,5,0.823529,279.294130,17,17,1,6,7,0.445688,0,0.506368
...,...,...,...,...,...,...,...,...,...,...,...,...
4_13681,4,13681,997.769200,754.500000,26,26,0,6,6,0.134972,0,0.865028
4_13682,4,13682,998.153900,127.615390,13,13,0,0,6,0.133451,0,0.866547
4_13683,4,13683,998.153900,160.000000,13,13,1,1,7,0.125980,0,0.873849
4_13684,4,13684,997.580600,242.580640,31,33,1,6,6,0.146351,0,0.853648


For each cell, Starling provides a doublet probability and the cluster assignment the cell would receive if it were a singlet.
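
One way to use these two columns together is to keep only cells that are unlikely to be doublets and are confidently assigned. The sketch below uses the first five rows of the table above; both cutoffs (0.5 and 0.8) are hypothetical values chosen for illustration, not recommendations from Starling.

```python
import pandas as pd

# First five rows of the st.adata.obs preview (relevant columns only).
obs = pd.DataFrame(
    {
        "doublet_prob": [0.124439, 0.213168, 0.039462, 0.123766, 0.445688],
        "max_assign_prob": [0.875560, 0.783933, 0.960538, 0.876190, 0.506368],
        "st_label": [7, 7, 2, 7, 7],
    },
    index=["4_1", "4_2", "4_3", "4_4", "4_5"],
)

DOUBLET_CUTOFF = 0.5  # hypothetical threshold
ASSIGN_CUTOFF = 0.8   # hypothetical threshold

# Keep cells that look like singlets and have a confident assignment.
confident = obs[
    (obs["doublet_prob"] < DOUBLET_CUTOFF)
    & (obs["max_assign_prob"] >= ASSIGN_CUTOFF)
]
print(confident)
```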


## Showing initial expression centroids:


In [16]:
## initial expression centroids (p x c) matrix
pd.DataFrame(st.adata.varm["init_exp_centroids"], index=st.adata.var_names)


,0,1,2,3,4,5,6,7,8,9
SMA,0.481423,1.309391,2.126096,0.579704,1.034955,2.213308,0.709806,1.934577,1.527963,0.709294
ECadherin,5.337152,0.895151,0.917277,0.957158,1.015015,0.994053,2.059105,0.89487,0.895395,0.90386
Cytokeratin,67.96431,8.256815,7.940268,12.393278,9.756522,7.235658,21.381472,8.161932,8.062714,10.285238
HLADR,9.514898,25.786489,19.902126,100.617966,22.652782,16.925837,31.314178,24.423939,27.201952,107.71537
Vimentin,26.117706,288.943756,613.416626,59.960068,207.838837,854.635315,118.783218,474.137756,373.660522,195.967804
CD28,0.274576,0.364751,0.187042,0.429406,0.396811,0.12542,0.37401,0.261919,0.32306,0.400113
CD15,11.647053,3.405212,12.111904,1.71882,5.436856,8.430137,23.184679,5.207158,2.906089,1.07273
CD45RA,2.809309,9.079441,5.92695,19.871088,9.039453,4.521349,10.214595,7.081024,8.2053,25.244133
CD66b,0.71164,0.398465,0.927381,0.232724,0.447859,0.943825,1.139344,0.591593,0.432269,0.286939
CD20,4.245859,9.033882,6.129276,68.748039,8.606109,4.270152,16.159504,6.945663,8.628428,50.045906


There are 10 centroids because we ran k-means (KM) with k = 10 earlier.


## Showing Starling expression centroids:


In [17]:
## starling expression centroids (p x c) matrix
pd.DataFrame(st.adata.varm["st_exp_centroids"], index=st.adata.var_names)


,0,1,2,3,4,5,6,7,8,9
SMA,0.325653,0.708183,2.34341,0.626486,1.393898,2.696629,0.50161,0.678427,2.363899,0.767806
ECadherin,5.488147,0.785653,0.965674,0.692457,1.440384,1.142736,4.687112,0.721282,0.785316,0.884134
Cytokeratin,110.252174,7.032565,7.622612,9.494033,9.43802,8.034536,49.734077,6.832853,7.360841,9.409131
HLADR,4.15969,19.417757,11.987859,84.36129,30.836338,14.175329,12.590346,13.645985,20.510256,45.548115
Vimentin,19.75206,302.11084,588.966492,125.840752,315.902466,673.767517,53.500988,316.413086,364.053619,300.06955
CD28,0.21498,0.471468,0.028023,0.364153,0.363831,0.094744,0.259105,0.0994,0.247456,1.049877
CD15,7.884834,0.643064,21.034025,0.360096,9.176603,12.357427,17.212612,0.924107,2.218135,0.544504
CD45RA,0.99914,7.084351,1.495777,22.826448,13.214501,3.289731,2.703801,5.045905,6.228513,9.315036
CD66b,0.22807,0.300999,0.661562,0.202338,0.649516,0.987546,0.410725,0.276702,0.341298,0.330011
CD20,1.475706,3.539735,1.488153,55.294258,9.70499,3.117444,3.187722,3.615022,6.227522,16.325693


From here, one can annotate each cluster centroid with a cell type.
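
A simple starting point for annotation is to find, for each cluster, the protein on which it stands out most. The sketch below uses an excerpt of the Starling centroid table above (three proteins, clusters 0, 2 and 3) and scales each protein by its maximum across the selected clusters; this scaling is an illustrative choice, not part of Starling, and mapping markers to cell types remains up to the user.

```python
import pandas as pd

# Excerpt of st_exp_centroids: rows are proteins, columns are clusters.
centroids = pd.DataFrame(
    {
        0: [110.252174, 19.75206, 1.475706],
        2: [7.622612, 588.966492, 1.488153],
        3: [9.494033, 125.840752, 55.294258],
    },
    index=["Cytokeratin", "Vimentin", "CD20"],
)

# Scale each protein to [0, 1] so that proteins with large raw values
# (such as Vimentin) do not dominate every cluster.
scaled = centroids.div(centroids.max(axis=1), axis=0)

# For each cluster, the protein with the highest scaled expression.
top_marker = scaled.idxmax(axis=0)
print(top_marker)
```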


## Showing Assignment Distributions:


In [18]:
## assignment distributions (n x c matrix)
pd.DataFrame(st.adata.obsm["assignment_prob_matrix"], index=st.adata.obs.index)


,0,1,2,3,4,5,6,7,8,9
4_1,4.052796e-11,1.060743e-07,3.221775e-09,3.175447e-09,2.794050e-18,7.233951e-19,3.422844e-09,8.755605e-01,5.064281e-16,2.695361e-20
4_2,2.571622e-13,2.702302e-03,3.943888e-06,1.924882e-04,1.490247e-13,3.698919e-16,1.516396e-07,7.839328e-01,3.329107e-11,1.434967e-13
4_3,8.414618e-19,1.676079e-15,9.605382e-01,1.789659e-18,2.945540e-19,2.934438e-12,1.746481e-13,8.463262e-09,3.611835e-17,6.279392e-23
4_4,3.952922e-09,3.104086e-07,1.217351e-07,9.558353e-08,1.102919e-18,4.459546e-16,4.362102e-05,8.761897e-01,5.462336e-15,1.231569e-17
4_5,9.505909e-09,4.794304e-02,2.450668e-08,4.090878e-07,1.361369e-11,1.396174e-13,5.839240e-08,5.063681e-01,8.817012e-09,2.074688e-13
...,...,...,...,...,...,...,...,...,...,...
4_13681,1.118555e-07,7.125587e-14,7.956857e-09,8.648547e-15,1.801619e-13,1.214445e-15,8.650281e-01,2.200646e-09,3.344286e-16,1.368419e-20
4_13682,1.676069e-06,5.446406e-13,3.697609e-15,3.618865e-13,3.906404e-19,4.878642e-22,8.665472e-01,9.159413e-11,3.623991e-19,3.265095e-20
4_13683,5.850244e-12,4.349267e-07,1.699824e-04,1.906558e-07,2.773070e-20,2.508562e-16,5.499972e-07,8.738493e-01,1.297691e-15,1.374228e-19
4_13684,8.008877e-07,2.470488e-12,7.188219e-11,4.937634e-13,1.390242e-16,1.486102e-17,8.536483e-01,1.712942e-09,9.804426e-18,1.179606e-20


Currently, each cell is labeled with the cluster that has the maximum assignment probability. However, a cell could be mislabeled when its highest and second-highest probabilities are very close, so the user might want to inspect such cases.
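
Such borderline cells can be found by comparing the two largest probabilities in each row of the assignment matrix. The following is a minimal NumPy sketch; the helper name `top_two`, the toy probabilities, and the 0.1 margin cutoff are assumptions for illustration. In practice, `prob` would be `st.adata.obsm["assignment_prob_matrix"]`.

```python
import numpy as np


def top_two(prob):
    """Return the best label, runner-up label and their probability gap."""
    prob = np.asarray(prob, dtype=float)
    order = np.argsort(prob, axis=1)          # ascending per row
    best, second = order[:, -1], order[:, -2]
    rows = np.arange(prob.shape[0])
    margin = prob[rows, best] - prob[rows, second]
    return best, second, margin


# Toy assignment matrix (2 cells x 3 clusters), made up for illustration.
prob = np.array([[0.50, 0.45, 0.05],
                 [0.02, 0.08, 0.90]])
best, second, margin = top_two(prob)

# Flag cells whose top two clusters are nearly tied (hypothetical cutoff).
ambiguous = margin < 0.1
print(best, second, margin.round(2), ambiguous)
```

For flagged cells, the runner-up label in `second` is a natural candidate to review alongside the assigned label.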
