<a href="https://www.kaggle.com/code/dalloliogm/virtual-cell-challenge-state-via-helical?scriptVersionId=266828291" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Arc Virtual Cell Challenge 

This notebook accompanies the Arc Institute Virtual Cell Challenge (https://virtualcellchallenge.org/).

The challenge itself is not hosted on Kaggle. However, I thought it would be useful to share this notebook here.

In particular, we use the helical library (https://github.com/helicalAI/helical), a framework that wraps several biological foundation models, to install STATE and compute embeddings with it.

# Install helical and STATE

Here we install the STATE model through its Helical integration. Helical does not officially support STATE yet, but the integration is available through a pull request (PR), so the cell below installs directly from that PR's branch.

In [1]:
%%capture
!pip install -q git+https://github.com/raschedh/helical.git@state_integration
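
Because the install cell runs under `%%capture`, a failed install is easy to miss. A minimal sketch of a follow-up check, using only the standard library; the helper name `is_importable` is my own and not part of helical:

```python
import importlib.util


def is_importable(module_name: str) -> bool:
    """Return True if `module_name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(module_name) is not None
    except (ImportError, ValueError):
        # Raised, for example, when the parent package of a dotted name is missing.
        return False


# "helical" is the package installed by the cell above.
status = "found" if is_importable("helical") else "MISSING"
print(f"helical: {status}")
```

If the package is reported as missing, rerunning the install cell without `%%capture` should show the pip error.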


### Install other libraries

We also need scanpy, which reads the single-cell H5AD file (.h5ad) and provides other single-cell analysis tools.

In [2]:
%%capture
# !pip install -q pytorch-lightning
# !pip install -q lightning
!pip install -q scanpy

In [3]:
from helical.models.state import StateConfig, StateEmbed

state_config = StateConfig(batch_size=16)
state_embed = StateEmbed(configurer=state_config)

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
INFO:datasets:PyTorch version 2.6.0+cu124 available.
INFO:datasets:Polars version 1.25.0 available.
INFO:datasets:Duckdb version 1.3.2 available.
INFO:datasets:TensorFlow version 2.18.0 available.
INFO:datasets:JAX version 0.5.2 available.
2025-10-09 13:40:15.587886: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760017215.826669      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760017215.890849      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:helical.utils.downloader:Downloading 'state/state_embed/protein_embeddings.pt'
INFO:helical.utils.downloader:Starting to download: 'https://

# Reading Data

We uploaded the Virtual Cell Challenge data as a Kaggle Dataset. It cannot be made public, so you will have to upload your own copy.

In [4]:
import scanpy as sc

In [5]:
adata = sc.read("/kaggle/input/arc-virtual-cell-challenge/vcc_data/adata_Training.h5ad")

In [6]:
adata

AnnData object with n_obs × n_vars = 221273 × 18080
    obs: 'target_gene', 'guide_id', 'batch'
    var: 'gene_id'
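
To get a feeling for why the full dataset may be heavy, here is a back-of-the-envelope estimate of the memory needed if the whole expression matrix were held as a dense float32 array. This is only an upper-bound sketch: I am assuming float32 values and full densification, while the matrix on disk may well be stored sparse and take far less. The helper `dense_size_gib` is hypothetical.

```python
def dense_size_gib(n_obs: int, n_vars: int, bytes_per_value: int = 4) -> float:
    """Size in GiB of a dense n_obs x n_vars matrix (float32 by default)."""
    return n_obs * n_vars * bytes_per_value / 1024**3


# Shape reported by the AnnData summary above: 221273 cells x 18080 genes.
full = dense_size_gib(221_273, 18_080)
print(f"dense float32 matrix: {full:.1f} GiB")
```

That comes to roughly 15 GiB before any model activations are counted, which is one reason to start with a small subset.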

# Compute embeddings using STATE

We use the pretrained STATE weights to transform the gene expression data in the adata object into per-cell embeddings.

The code follows the Helical STATE tutorial notebook: https://github.com/raschedh/helical/blob/state_integration/docs/notebooks/STATE-tutorial.ipynb

In [7]:
# Subset to 10 cells and 2000 genes; it is unclear whether Kaggle's free GPU quota can handle the full dataset
n_cells = 10
n_genes = 2000
adata = adata[:n_cells, :n_genes].copy()

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 2000)
10
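
The cell above keeps the first 10 cells, which may not be representative if the file happens to be ordered by batch or perturbation (I have not checked). A possible alternative is to draw a reproducible random sample of row indices with NumPy. The helper `sample_indices` is my own name, and the total of 221273 cells comes from the AnnData summary printed earlier.

```python
import numpy as np


def sample_indices(n_total: int, n_keep: int, seed: int = 0) -> np.ndarray:
    """Pick up to `n_keep` distinct indices out of `n_total`, sorted ascending."""
    rng = np.random.default_rng(seed)
    n_keep = min(n_keep, n_total)  # never ask for more rows than exist
    return np.sort(rng.choice(n_total, size=n_keep, replace=False))


cell_idx = sample_indices(n_total=221_273, n_keep=10, seed=0)
print(cell_idx)
```

Applied to the full object before it is overwritten, this would look like `adata[cell_idx, :n_genes].copy()`. Sorting the indices keeps the cells in their original file order.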


In [8]:
from helical.models.state import StateConfig, StateEmbed

state_config = StateConfig(batch_size=16)
state_embed = StateEmbed(configurer=state_config)

INFO:helical.models.state.state_embeddings:Using model checkpoint: /root/.cache/helical/models/state/state_embed/se600m_model_weights.pt
INFO:helical.models.state.state_embeddings:Successfully loaded model


In [9]:
processed_data = state_embed.process_data(adata=adata)
embeddings = state_embed.get_embeddings(processed_data)

# get_embeddings returns a numpy array with one row per cell; in this run the shape is (10, 2058)
print(embeddings.shape)
print(type(embeddings))

INFO:helical.models.state.state_embeddings:Auto-detected gene column: var.index (overlap: 1967/19790 protein embeddings, 98.4% of genes)
INFO:/usr/local/lib/python3.11/dist-packages/helical/models/state/model_dir/embed_utils/loader.py:1967 genes mapped to embedding file (out of 2000)
Encoding: 100%|██████████| 1/1 [03:40<00:00, 220.16s/it]

(10, 2058)
<class 'numpy.ndarray'>
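
Encoding 10 cells took several minutes here, so for a larger subset it may help to embed the cells in chunks and stack the results, which also makes it possible to save partial progress. Below is a minimal sketch of that pattern. `embed_in_chunks` and `fake_embed` are hypothetical helpers, and the stand-in embedder only returns zeros so that the example runs without a GPU or model weights.

```python
from typing import Callable

import numpy as np


def embed_in_chunks(
    embed_fn: Callable[[slice], np.ndarray],
    n_rows: int,
    chunk_size: int,
) -> np.ndarray:
    """Call `embed_fn` on consecutive row slices and stack the results."""
    if n_rows <= 0 or chunk_size <= 0:
        raise ValueError("n_rows and chunk_size must be positive")
    parts = []
    for start in range(0, n_rows, chunk_size):
        rows = slice(start, min(start + chunk_size, n_rows))
        parts.append(np.asarray(embed_fn(rows)))
    return np.concatenate(parts, axis=0)


# Stand-in embedder: one 4-dimensional row of zeros per requested cell.
def fake_embed(rows: slice) -> np.ndarray:
    return np.zeros((rows.stop - rows.start, 4), dtype=np.float32)


out = embed_in_chunks(fake_embed, n_rows=10, chunk_size=4)
print(out.shape)
```

With the objects from this notebook, `embed_fn` could be something like `lambda rows: state_embed.get_embeddings(state_embed.process_data(adata=adata[rows].copy()))`. I have not verified that per-chunk calls give exactly the same embeddings as a single call, so treat that as an assumption to test on a small example first.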





In [10]:
embeddings

array([[-0.00121741,  0.0258666 ,  0.01389548, ...,  0.18738097,
         0.15659331, -0.01879586],
       [-0.00254047,  0.02223591,  0.01457261, ...,  0.21039957,
         0.17376716, -0.01824644],
       [-0.00770989,  0.03015961,  0.00734743, ...,  0.1572932 ,
         0.16152251, -0.02392773],
       ...,
       [-0.01062024,  0.03381889,  0.01018388, ...,  0.15798411,
         0.1774211 , -0.01821146],
       [-0.00690952,  0.03273775,  0.01650602, ...,  0.19858414,
         0.14159259, -0.00183264],
       [-0.01504915,  0.02380545,  0.02135004, ...,  0.17902738,
         0.15516698, -0.03782727]], dtype=float32)
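
The rows printed above look quite close to one another. One simple way to quantify this is a pairwise cosine similarity matrix between cells. The sketch below uses plain NumPy on a toy array; `cosine_similarity_matrix` is my own helper, not something provided by helical or STATE.

```python
import numpy as np


def cosine_similarity_matrix(x: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Pairwise cosine similarity between the rows of `x`."""
    x = np.asarray(x, dtype=np.float64)
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    unit = x / np.maximum(norms, eps)  # guard against all-zero rows
    return unit @ unit.T


# Toy stand-in for the (n_cells, n_features) embeddings array.
toy = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]])
sim = cosine_similarity_matrix(toy)
print(np.round(sim, 3))
```

Calling `cosine_similarity_matrix(embeddings)` on the array above would give a 10 × 10 matrix, where values near 1 mean two cells point in almost the same direction in embedding space.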