<a href="https://www.kaggle.com/code/dalloliogm/computing-embeddings-using-helix-mrna?scriptVersionId=225740287" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

# Computing Embeddings Using the Helix-mRNA Model from HelicalAI

This notebook uses the Helix-mRNA model from HelicalAI (https://arxiv.org/abs/2502.13785) to compute embeddings for the training sequences. The model is pretrained on a large dataset of sequences and can capture features that are important for RNA structure. The resulting embeddings can then be used as input features for further training.

## Install libraries

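Installing the libraries is tricky: `helical` pulls in many dependencies, and some of them rely on older versions of pandas and other packages. The cells below install `helical`, then remove any existing `cupy` and `cupy-cuda12x` packages and install `cupy-cuda11x` instead.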

In [1]:
!pip install helical -q


In [2]:
!pip uninstall -y cupy -q
!pip uninstall -y cupy-cuda12x -q
!pip install cupy-cuda11x -q
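
Because the install step can silently change versions of packages that were already present, it may help to check what ended up in the environment. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the list of packages to inspect are illustrative choices, not part of the original notebook.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Return {package name: installed version, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


# Hypothetical selection of packages worth checking after the install cells.
to_check = ["helical", "pandas", "torch", "cupy-cuda11x"]
for name, ver in installed_versions(to_check).items():
    print(f"{name}: {ver or 'not installed'}")
```

Running this right after the install cells shows which pandas version the dependency resolution left behind, before any later cell relies on it.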



## Import Helix and compute embeddings

In [3]:
from helical import HelixmRNA, HelixmRNAConfig


INFO:numexpr.utils:NumExpr defaulting to 4 threads.
INFO:datasets:PyTorch version 2.5.1+cu121 available.
INFO:datasets:Polars version 1.9.0 available.
INFO:datasets:TensorFlow version 2.17.1 available.
INFO:datasets:JAX version 0.4.33 available.
INFO:helical:Caduceus not available: If you want to use this model, ensure you have a CUDA GPU and have installed the optional helical[mamba-ssm] dependencies.


In [4]:
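import torch

# Use the GPU when one is available; otherwise fall back to the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"

# Configure the model: batch size, maximum sequence length, and target device.
helix_mrna_config = HelixmRNAConfig(batch_size=5, max_length=100, device=device)
helix_mrna = HelixmRNA(configurer=helix_mrna_config)

# Toy input: five copies of the same short sequence.
rna_sequences = ["EACUEGGG", "EACUEGGG", "EACUEGGG", "EACUEGGG", "EACUEGGG"]
dataset = helix_mrna.process_data(rna_sequences)

# Run the model and collect the embeddings as an array.
rna_embeddings = helix_mrna.get_embeddings(dataset)

print("Helix-mRNA embeddings shape: ", rna_embeddings.shape)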


INFO:helical.models.helix_mrna.model:Helix-mRNA initialized successfully.
INFO:helical.models.helix_mrna.model:'helical-ai/Helix-mRNA' model is in 'eval' mode, on device 'cpu'.
INFO:helical.models.helix_mrna.model:Processing data for Helix-mRNA.
INFO:helical.models.helix_mrna.model:Successfully processed the data for Helix-mRNA.
INFO:helical.models.helix_mrna.model:Started getting embeddings:
Getting embeddings: 100%|██████████| 1/1 [00:17<00:00, 17.31s/it]
INFO:helical.models.helix_mrna.model:Finished getting embeddings.


Helix-mRNA embeddings shape:  (5, 100, 256)
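
The shape `(5, 100, 256)` appears to correspond to 5 input sequences, 100 positions per sequence (matching `max_length=100`), and a 256-dimensional vector at each position. Many downstream models expect one fixed-size vector per sequence, and a common way to get one is to average over the position axis. The sketch below uses a random stand-in array with the same shape rather than the real model output. The helper `mean_pool` and its optional `lengths` argument are hypothetical, and the masking assumes that real tokens occupy the first positions of each row, which should be verified against the tokenizer's padding behaviour.

```python
import numpy as np


def mean_pool(embeddings, lengths=None):
    """Average per-position embeddings into one vector per sequence.

    embeddings: array of shape (n_sequences, n_positions, hidden_size).
    lengths: optional count of positions to keep for each sequence.
             ASSUMPTION: kept positions are the first `lengths[i]` entries
             of each row; check this against the tokenizer before relying on it.
    """
    embeddings = np.asarray(embeddings)
    if lengths is None:
        # Plain mean over every position, padding included.
        return embeddings.mean(axis=1)
    lengths = np.asarray(lengths)
    positions = np.arange(embeddings.shape[1])
    # mask[i, j] is True when position j is kept for sequence i.
    mask = positions[None, :] < lengths[:, None]
    summed = (embeddings * mask[:, :, None]).sum(axis=1)
    # Guard against division by zero for empty sequences.
    return summed / np.maximum(lengths, 1)[:, None]


# Stand-in data with the same shape as the output above (not real embeddings).
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(5, 100, 256)).astype(np.float32)

pooled = mean_pool(fake_embeddings)
print(pooled.shape)
```

With the real output, `mean_pool(rna_embeddings)` would give one 256-dimensional feature vector per sequence. Whether mean pooling is the best summary for a given task is an open design choice.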


In [5]:
rna_embeddings

array([[[-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        [-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        [-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        ...,
        [-0.00111076,  0.00482725,  0.01049498, ...,  0.00259772,
         -0.01508733,  0.00096362],
        [-0.00142994,  0.00497706,  0.01132332, ...,  0.00226957,
         -0.01381793,  0.00071597],
        [-0.00117821,  0.00489485,  0.01262088, ...,  0.00183635,
         -0.0125157 ,  0.00090083]],

       [[-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        [-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        [-0.00111404,  0.00493733,  0.01273472, ...,  0.00239108,
         -0.01228457,  0.00081159],
        ...,
        [-0.00111076,  0.00482725,  0.01049498, ...,  