<a href="https://www.kaggle.com/code/dalloliogm/virtual-cell-challenge-state-via-helical?scriptVersionId=266860559" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Arc Virtual Cell Challenge 

This notebook relates to the Arc Institute Virtual Cell Challenge (https://virtualcellchallenge.org/).

The challenge itself is not hosted on Kaggle. However, I thought it would be useful to share this notebook here.

Here we use the helical library (https://github.com/helicalAI/helical), a framework that wraps several biological foundation models, to install STATE, along with the other models it bundles, and compute embeddings with it.

# Install helical and STATE

Here we install the STATE model through its integration in Helical, a library that supports several foundation models.

STATE is not yet officially supported in helical, but it is available through a pull request, so we install from that PR's branch.

In [1]:
# Install helical from the fork branch that carries the STATE integration
!pip install -q git+https://github.com/raschedh/helical.git@state_integration




### Install other libraries

We also need scanpy to read the single-cell H5AD file.

In [2]:
%%capture
# !pip install -q pytorch-lightning
# !pip install -q lightning
!pip install -q scanpy

# Reading Data

We uploaded the Virtual Cell Challenge data as a Kaggle Dataset. It cannot be made public, so you will have to upload it yourself.
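Because the dataset has to be re-uploaded privately, the input path may differ from the one used in this notebook. A minimal sketch of locating the file, assuming it keeps the name `adata_Training.h5ad`; the helper `find_h5ad` is hypothetical and not part of scanpy or helical.

```python
from pathlib import Path


def find_h5ad(root, filename="adata_Training.h5ad"):
    """Return the first file called `filename` found under `root`.

    Hypothetical helper: it only searches the directory tree and does not
    open the file.
    """
    matches = sorted(p for p in Path(root).rglob(filename) if p.is_file())
    if not matches:
        raise FileNotFoundError(f"{filename} not found under {root}")
    return matches[0]


# Possible use on Kaggle (the mount point is an assumption):
# adata = sc.read(find_h5ad("/kaggle/input"))
```

If several copies exist, the sketch returns the first one in sorted path order, which may not be the one you want.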

In [3]:
import scanpy as sc

In [4]:
adata = sc.read("/kaggle/input/arc-virtual-cell-challenge/vcc_data/adata_Training.h5ad")

In [5]:
adata

AnnData object with n_obs × n_vars = 221273 × 18080
    obs: 'target_gene', 'guide_id', 'batch'
    var: 'gene_id'

# Compute embeddings using STATE

We use the STATE model to transform the gene expression data from the adata object into embeddings, based on the STATE weights.

From the Helical STATE notebook: https://github.com/raschedh/helical/blob/state_integration/docs/notebooks/STATE-tutorial.ipynb

In [6]:
# We subset to 10 cells and 2000 genes; it is unclear whether Kaggle's free GPU quota can handle the whole dataset
n_cells = 10
n_genes = 2000
adata = adata[:n_cells, :n_genes].copy()

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 2000)
10
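Slicing with `[:n_cells]` keeps whichever cells happen to be stored first in the file. A sketch of drawing a reproducible random subset of row indices instead, using only the standard library; `sample_indices` is a hypothetical helper, and 221273 is the cell count reported by the `adata` summary above.

```python
import random


def sample_indices(n_total, n_keep, seed=0):
    """Return `n_keep` distinct row indices in [0, n_total), sorted ascending."""
    if n_keep > n_total:
        raise ValueError("cannot sample more rows than available")
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return sorted(rng.sample(range(n_total), n_keep))


cell_idx = sample_indices(n_total=221273, n_keep=10)
print(cell_idx)

# Possible use with the AnnData object, in place of the positional slice:
# adata = adata[cell_idx, :n_genes].copy()
```

Fixing the seed keeps the subset identical across notebook runs.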


In [7]:
from helical.models.state import StateConfig, StateEmbed

state_config = StateConfig(batch_size=16)
state_embed = StateEmbed(configurer=state_config)

INFO:datasets:PyTorch version 2.6.0+cu124 available.
INFO:datasets:Polars version 1.25.0 available.
INFO:datasets:Duckdb version 1.3.2 available.
INFO:datasets:TensorFlow version 2.18.0 available.
INFO:datasets:JAX version 0.5.2 available.
2025-10-09 16:11:57.009821: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760026317.262742      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760026317.336675      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:helical.utils.downloader:Downloading 'state/state_embed/protein_embeddings.pt'
INFO:helical.utils.downloader:Starting to download: 'https://helicalpackage.s3.eu-west-2.amazonaws.com/state/stat

In [8]:
processed_data = state_embed.process_data(adata=adata)
embeddings = state_embed.get_embeddings(processed_data)

# the STATE model returns a numpy array of shape (n_cells, embedding_dim); this run printed (10, 2058)
print(embeddings.shape)
print(type(embeddings))

INFO:helical.models.state.state_embeddings:Auto-detected gene column: var.index (overlap: 1967/19790 protein embeddings, 98.4% of genes)
INFO:/usr/local/lib/python3.11/dist-packages/helical/models/state/model_dir/embed_utils/loader.py:1967 genes mapped to embedding file (out of 2000)
INFO:/usr/local/lib/python3.11/dist-packages/helical/models/state/model_dir/embed_utils/loader.py:1967 genes mapped to embedding file (out of 2000)
Encoding: 100%|██████████| 1/1 [03:33<00:00, 213.00s/it]

(10, 2058)
<class 'numpy.ndarray'>

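To embed more than a handful of cells, the data could be processed in consecutive chunks rather than in one call. A sketch under assumptions: `iter_chunks` is a hypothetical helper, and the commented usage only reuses the `process_data` and `get_embeddings` calls from the cell above. Whether chunked results match a single full call has not been checked here.

```python
def iter_chunks(n_rows, chunk_size):
    """Yield (start, stop) pairs that cover range(n_rows) in order."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, n_rows, chunk_size):
        yield start, min(start + chunk_size, n_rows)


print(list(iter_chunks(10, 4)))  # [(0, 4), (4, 8), (8, 10)]

# Possible use (untested sketch; the chunk size of 1000 is arbitrary):
# import numpy as np
# parts = []
# for start, stop in iter_chunks(adata.n_obs, 1000):
#     chunk = adata[start:stop].copy()
#     parts.append(state_embed.get_embeddings(state_embed.process_data(adata=chunk)))
# embeddings = np.concatenate(parts, axis=0)
```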




In [9]:
embeddings

array([[-0.00153183,  0.02503779,  0.01310981, ...,  0.18632716,
         0.159468  , -0.004082  ],
       [-0.00370645,  0.0213707 ,  0.01175022, ...,  0.19970626,
         0.17452598, -0.02008047],
       [-0.00711047,  0.03064047,  0.00859526, ...,  0.16809492,
         0.1414577 , -0.02194577],
       ...,
       [-0.01033795,  0.03326062,  0.01219721, ...,  0.16603705,
         0.16525808, -0.02181541],
       [-0.00432739,  0.03346467,  0.01912838, ...,  0.19267794,
         0.1264076 , -0.00334789],
       [-0.01584144,  0.02332704,  0.02281047, ...,  0.17064488,
         0.15859129, -0.0430087 ]], dtype=float32)

# STATE Perturbation

In [10]:
adata.obs

                                    target_gene    guide_id                                           batch
AAACAAGCAACCTTGTACTTTAGG-Flex_1_01  CHMP3          CHMP3_P1P2_A|CHMP3_P1P2_B                          Flex_1_01
AAACAAGCATTGCCGCACTTTAGG-Flex_1_01  AKT2           AKT2_P1P2_A|AKT2_P1P2_B                            Flex_1_01
AAACCAATCAATGTTCACTTTAGG-Flex_1_01  SHPRH          SHPRH_P1P2_A|SHPRH_P1P2_B                          Flex_1_01
AAACCAATCCCTCGCTACTTTAGG-Flex_1_01  TMSB4X         TMSB4X_P1_A|TMSB4X_P1_B                            Flex_1_01
AAACCAATCTAAATCCACTTTAGG-Flex_1_01  KLF10          KLF10_P2_A|KLF10_P2_B                              Flex_1_01
AAACGGGCACCTAAGAACTTTAGG-Flex_1_01  TARBP2         TARBP2_P1P2_A|TARBP2_P1P2_B                        Flex_1_01
AAACGTTCACTAAGGCACTTTAGG-Flex_1_01  KDM2B          KDM2B_ENST00000377071.4_A|KDM2B_ENST0000037707...  Flex_1_01
AAACTGGGTAACCCATACTTTAGG-Flex_1_01  non-targeting  non-targeting_00021|non-targeting_03430            Flex_1_01
AAACTGGGTAATGTCCACTTTAGG-Flex_1_01  SV2A           SV2A_P1P2_A|SV2A_P1P2_B                            Flex_1_01
AAACTGGGTTGCATCGACTTTAGG-Flex_1_01  CLDN6          CLDN6_P1P2_A|CLDN6_P1P2_B                          Flex_1_01


In [11]:
import numpy as np

# Example perturbation labels in (drug, dose, unit) form.
# They are not used below: this dataset already labels each cell in 'target_gene'.
perturbations = [
    "[('DMSO_TF', 0.0, 'uM')]",  # Control
    "[('Aspirin', 0.5, 'uM')]",
    "[('Dexamethasone', 1.0, 'uM')]",
]

n_cells = adata.n_obs
# Random assignment of perturbations, cell types, and batches is left commented out,
# because the dataset already provides 'target_gene' and 'batch'.
# adata.obs['target_gene'] = np.random.choice(perturbations, size=n_cells)
# adata.obs['cell_type'] = adata.obs['LVL1']  # Use your cell type column
# batch_labels = np.random.choice(['batch_1', 'batch_2', 'batch_3', 'batch_4'], size=n_cells)
# Use the dataset's own batch column as the batch variable.
adata.obs['batch_var'] = adata.obs["batch"]

config = StateConfig(
    pert_col="target_gene",
    # celltype_col="cell_type",
    control_pert="non-targeting",
    output_path="yolksac_perturbed.h5ad",
)
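The value passed as `control_pert` presumably needs to match a label that actually occurs in the `pert_col` column. A sketch of that check, using the ten `target_gene` values printed by `adata.obs` above; `check_control` is a hypothetical helper, and in the notebook the labels would come from `adata.obs['target_gene']`.

```python
from collections import Counter


def check_control(labels, control):
    """Count cells per perturbation label and confirm the control is present."""
    counts = Counter(labels)
    if control not in counts:
        raise ValueError(f"control label {control!r} not found in labels")
    return counts


# The ten target_gene values from the adata.obs output above.
labels = ["CHMP3", "AKT2", "SHPRH", "TMSB4X", "KLF10",
          "TARBP2", "KDM2B", "non-targeting", "SV2A", "CLDN6"]
counts = check_control(labels, control="non-targeting")
print(counts["non-targeting"])  # 1
```

With only ten cells, the subset contains a single control cell, which is worth keeping in mind when reading the perturbation output.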

In [12]:
from helical.models.state import StatePerturb

state_perturb = StatePerturb(configurer=config)

# again we process the data and get the perturbed embeddings
processed_data = state_perturb.process_data(adata)
perturbed_embeds = state_perturb.get_embeddings(processed_data)

print(perturbed_embeds.shape)

INFO:helical.utils.downloader:Downloading 'state/state_transition/pert_onehot_map.pt'
INFO:helical.utils.downloader:Starting to download: 'https://helicalpackage.s3.eu-west-2.amazonaws.com/state/state_transition/pert_onehot_map.pt'
pert_onehot_map.pt: 100%|██████████| 5.50M/5.50M [00:00<00:00, 55.8MB/s]
INFO:helical.utils.downloader:File saved to: '/root/.cache/helical/models/state/state_transition/pert_onehot_map.pt'
INFO:helical.utils.downloader:Downloading 'state/state_transition/batch_onehot_map.pkl'
INFO:helical.utils.downloader:Starting to download: 'https://helicalpackage.s3.eu-west-2.amazonaws.com/state/state_transition/batch_onehot_map.pkl'
batch_onehot_map.pkl: 100%|██████████| 16.0k/16.0k [00:00<00:00, 29.4MB/s]
INFO:helical.utils.downloader:File saved to: '/root/.cache/helical/models/state/state_transition/batch_onehot_map.pkl'
INFO:helical.utils.downloader:Downloading 'state/state_transition/ST_all.pt'
INFO:helical.utils.downloader:Starting to download: 'https://helicalpac

(10, 2000)


In [13]:
import os

import pandas as pd
import torch
from helical.constants.paths import CACHE_DIR_HELICAL

# Load the perturbation-name -> one-hot tensor map downloaded by StatePerturb.
perturbation_path = os.path.join(CACHE_DIR_HELICAL, "state", "state_transition", "pert_onehot_map.pt")
data = torch.load(perturbation_path, weights_only=False)

rows = []
for key, value in data.items():
    rows.append({
        'perturbation_name': key,
        # one-hot vector for this perturbation (despite the 'batch_encoding' column name)
        'batch_encoding': value.numpy()
    })

df = pd.DataFrame(rows)
print(df.head())

                                 perturbation_name  \
0  [('(R)-Verapamil (hydrochloride)', 0.05, 'uM')]   
1   [('(R)-Verapamil (hydrochloride)', 0.5, 'uM')]   
2   [('(R)-Verapamil (hydrochloride)', 5.0, 'uM')]   
3                 [('(S)-Crizotinib', 0.05, 'uM')]   
4                  [('(S)-Crizotinib', 0.5, 'uM')]   

                                      batch_encoding  
0  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  
1  [0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  
2  [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  
3  [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, ...  
4  [0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, ...  
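The keys printed above are drug treatments with doses, whereas this dataset labels cells by target gene. A sketch of checking which dataset labels appear among the map keys; `split_by_coverage` is a hypothetical helper, the toy keys are copied from the output above, and how `StatePerturb` treats labels that are missing from the map is not established here.

```python
def split_by_coverage(labels, known_keys):
    """Split unique labels into those present in `known_keys` and those absent."""
    known = set(known_keys)
    unique = sorted(set(labels))
    covered = [lab for lab in unique if lab in known]
    missing = [lab for lab in unique if lab not in known]
    return covered, missing


# Keys copied from the df.head() output above; labels taken from adata.obs.
map_keys = [
    "[('(R)-Verapamil (hydrochloride)', 0.05, 'uM')]",
    "[('(S)-Crizotinib', 0.05, 'uM')]",
]
labels = ["CHMP3", "AKT2", "non-targeting"]
covered, missing = split_by_coverage(labels, map_keys)
print(covered)  # []
print(missing)  # ['AKT2', 'CHMP3', 'non-targeting']

# In the notebook: split_by_coverage(adata.obs["target_gene"], data.keys())
```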
