
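# Install Helical and STATE

We install the STATE model through its integration in Helical, a library that supports several foundation models. The integration is installed from the `state_integration` branch of the `raschedh/helical` repository on GitHub.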

In [1]:
!pip install -q git+https://github.com/raschedh/helical.git@state_integration




In [2]:
# !pip install -q pytorch-lightning
# !pip install -q lightning
!pip install -q scanpy
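
Before loading the model, it can help to confirm that both installs succeeded. The snippet below is a minimal sketch that uses only the standard library; the helper name `missing_packages` is hypothetical and not part of Helical or scanpy.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "helical" and "scanpy" are the two packages installed in the cells above.
missing = missing_packages(["helical", "scanpy"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are importable.")
```

`find_spec` only looks the package up and does not import it, so the check is fast and does not trigger any model download.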

In [3]:
from helical.models.state import StateConfig, StateEmbed

# Creating the model here triggers the download of the STATE files
# (see the downloader messages in the log below).
state_config = StateConfig(batch_size=16)
state_embed = StateEmbed(configurer=state_config)

INFO:numexpr.utils:NumExpr defaulting to 4 threads.
INFO:datasets:PyTorch version 2.6.0+cu124 available.
INFO:datasets:Polars version 1.25.0 available.
INFO:datasets:Duckdb version 1.3.2 available.
INFO:datasets:TensorFlow version 2.18.0 available.
INFO:datasets:JAX version 0.5.2 available.
2025-10-09 11:43:57.993515: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:477] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
E0000 00:00:1760010238.267315      13 cuda_dnn.cc:8310] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
E0000 00:00:1760010238.339982      13 cuda_blas.cc:1418] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
INFO:helical.utils.downloader:Downloading 'state/state_embed/protein_embeddings.pt'
INFO:helical.utils.downloader:Starting to download: 'https://

# Reading Data

We uploaded the Virtual Cell Challenge data to Kaggle as a Dataset. It cannot be made public, so to run this notebook you will have to upload the data yourself and adjust the input path accordingly.
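
Because the dataset has to be re-uploaded, its folder name under `/kaggle/input` may differ from the one used in this notebook. The following is a small sketch, using only the standard library, for locating `.h5ad` files under an input directory; the helper name `find_h5ad_files` is hypothetical.

```python
from pathlib import Path


def find_h5ad_files(root):
    """Return all .h5ad files below `root`, sorted by path (empty list if `root` is missing)."""
    root = Path(root)
    if not root.is_dir():
        return []
    return sorted(root.rglob("*.h5ad"))


# On Kaggle, attached datasets are mounted under /kaggle/input.
for path in find_h5ad_files("/kaggle/input"):
    print(path)
```

Any path printed this way can be passed to `sc.read` in place of the hard-coded one.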

## From the Helical STATE notebook

The steps below follow the STATE tutorial notebook from the Helical integration branch:

https://github.com/raschedh/helical/blob/state_integration/docs/notebooks/STATE-tutorial.ipynb

In [4]:
import scanpy as sc

In [5]:
adata = sc.read("/kaggle/input/arc-virtual-cell-challenge/vcc_data/adata_Training.h5ad")

In [6]:
adata

AnnData object with n_obs × n_vars = 221273 × 18080
    obs: 'target_gene', 'guide_id', 'batch'
    var: 'gene_id'

In [7]:
# We subset to the first 10 cells and the first 2000 genes
n_cells = 10
n_genes = 2000
adata = adata[:n_cells, :n_genes].copy()

print(adata.shape)
n_cells = adata.n_obs
print(n_cells)

(10, 2000)
10
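
The slice above keeps the first 10 cells and the first 2000 genes in file order, which may not be representative of the whole dataset. As an alternative, here is a minimal sketch that draws a reproducible random subset of indices with NumPy. The helper name `random_subset_indices` and the seed are assumptions for illustration, and whether random gene subsetting suits the model is not something this notebook establishes.

```python
import numpy as np


def random_subset_indices(n_obs, n_vars, n_cells, n_genes, seed=0):
    """Draw sorted, non-repeating random row and column indices."""
    rng = np.random.default_rng(seed)
    cell_idx = np.sort(rng.choice(n_obs, size=n_cells, replace=False))
    gene_idx = np.sort(rng.choice(n_vars, size=n_genes, replace=False))
    return cell_idx, gene_idx


# Dimensions taken from the AnnData summary printed earlier (221273 x 18080).
cell_idx, gene_idx = random_subset_indices(221273, 18080, n_cells=10, n_genes=2000)
print(cell_idx)

# On the full AnnData object (before the slice above), the subset would be:
# adata_small = adata[cell_idx][:, gene_idx].copy()
```

Sorting the indices keeps the selected cells and genes in their original order.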


# Compute embeddings using STATE

We use the STATE model to turn the gene expression data in the `adata` object into one embedding vector per cell, using the downloaded STATE weights.

In [8]:
from helical.models.state import StateConfig, StateEmbed

# Re-create the config and model; the checkpoint is now loaded from the local cache.
state_config = StateConfig(batch_size=16)
state_embed = StateEmbed(configurer=state_config)

INFO:helical.models.state.state_embeddings:Using model checkpoint: /root/.cache/helical/models/state/state_embed/se600m_model_weights.pt
INFO:helical.models.state.state_embeddings:Successfully loaded model


In [9]:
processed_data = state_embed.process_data(adata=adata)
embeddings = state_embed.get_embeddings(processed_data)

# note that the STATE model returns a numpy array of shape (n_cells, embedding_dim);
# in this run the output below is (10, 2058)
print(embeddings.shape)
print(type(embeddings))

INFO:helical.models.state.state_embeddings:Auto-detected gene column: var.index (overlap: 1967/19790 protein embeddings, 98.4% of genes)
INFO:/usr/local/lib/python3.11/dist-packages/helical/models/state/model_dir/embed_utils/loader.py:1967 genes mapped to embedding file (out of 2000)
Encoding: 100%|██████████| 1/1 [04:12<00:00, 252.19s/it]

(10, 2058)
<class 'numpy.ndarray'>
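
Since the embedding width depends on the model and its configuration, it is safer to check the returned array than to assume a fixed size. Below is a minimal sketch of such a check; the helper name `check_embeddings` is hypothetical, and the zero-filled demo array only stands in for the real `embeddings`.

```python
import numpy as np


def check_embeddings(embeddings, n_cells):
    """Validate an embedding matrix and return its embedding dimension."""
    emb = np.asarray(embeddings)
    if emb.ndim != 2:
        raise ValueError(f"expected a 2-D array, got {emb.ndim} dimension(s)")
    if emb.shape[0] != n_cells:
        raise ValueError(f"expected {n_cells} rows, got {emb.shape[0]}")
    if not np.isfinite(emb).all():
        raise ValueError("embeddings contain NaN or infinite values")
    return emb.shape[1]


# Stand-in array with the same shape as the output printed above.
demo = np.zeros((10, 2058), dtype=np.float32)
print(check_embeddings(demo, n_cells=10))
```

In the notebook, the call would be `check_embeddings(embeddings, n_cells)`.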





In [10]:
embeddings

array([[-0.00141672,  0.02656306,  0.01338097, ...,  0.19579908,
         0.17032143, -0.01344345],
       [-0.00465809,  0.02243981,  0.0129386 , ...,  0.21297266,
         0.16331726, -0.01320603],
       [-0.00734512,  0.02889392,  0.00781043, ...,  0.17525868,
         0.1301231 , -0.02221236],
       ...,
       [-0.01067398,  0.03327807,  0.00886754, ...,  0.16581737,
         0.18036225, -0.03029826],
       [-0.00707141,  0.03258602,  0.01579126, ...,  0.18635646,
         0.1444946 , -0.00632011],
       [-0.01486783,  0.02287571,  0.02098707, ...,  0.18536985,
         0.15529218, -0.05187327]], dtype=float32)
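
With one embedding vector per cell, a common next step is to compare cells to each other. The following is a minimal sketch of a pairwise cosine similarity computed with NumPy; the helper name `cosine_similarity_matrix` is hypothetical, and the small toy matrix stands in for the real `embeddings` array.

```python
import numpy as np


def cosine_similarity_matrix(embeddings, eps=1e-12):
    """Return the (n_cells, n_cells) matrix of cosine similarities between rows."""
    emb = np.asarray(embeddings, dtype=np.float64)
    norms = np.linalg.norm(emb, axis=1, keepdims=True)
    # Guard against division by zero for all-zero rows.
    unit = emb / np.maximum(norms, eps)
    return unit @ unit.T


# Toy example: rows 0 and 2 point in the same direction, row 1 is orthogonal to both.
toy = np.array([[1.0, 0.0], [0.0, 2.0], [3.0, 0.0]])
print(cosine_similarity_matrix(toy))
```

In the notebook, `cosine_similarity_matrix(embeddings)` would give a 10 × 10 matrix for the 10 cells embedded above.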