# Myllia - echoes of silenced genes
---

The task is to train a model that predicts *expression changes in scRNA-seq data induced by CRISPRi perturbations*. For that, we have a dataset with 80 different perturbations, giving the *average expression value* of each gene under each perturbation, plus an unperturbed control (*non-targeting sgRNA*).

## Introduction

This problem essentially maps string inputs, given by the `pert_symbol` column, to output vectors whose dimension equals the number of columns minus the `pert_symbol` column itself (i.e., a 5127-dimensional space).

Clearly, this dataset needs to be preprocessed. In particular, we are going to:

1. treat each of the genes in the sample as tokens and get embeddings from them using a pre-trained neural network,
2. compute delta averages (perturbed mean expression minus control mean expression) for each of the output vectors,
3. train a machine learning (ML) model using the preprocessed datasets created during steps 1 and 2,
4. predict delta averages on **new** genes using the validation set (`pert_ids_val.csv`).

First we import libraries:

In [1]:
import joblib
import requests
import time
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel as C, WhiteKernel, Matern
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import torch
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm

Import custom-made metric:

In [2]:
import sys
sys.path.append("/kaggle/usr/lib/relative-wmae-multiplied-by-wcosine")

from metric import _score_impl

Define some configuration options:

In [3]:
root_dir = "/kaggle/input/myllia-echoes-of-silenced-genes-competition-data"
model_name = "facebook/esm2_t33_650M_UR50D"
device = "cuda" if torch.cuda.is_available() else "cpu"

## Data loading

We start by loading the required datasets:

1. training dataset,
2. ground truth values for training dataset (used during metric evaluation),
3. validation dataset (with **new** genes used during inference),
4. submission sample.

In [4]:
# train dataset
train_df = pd.read_csv(f"{root_dir}/training_data_means.csv")

# ground truth
ground_truth = pd.read_csv(f"{root_dir}/training_data_ground_truth_table.csv")

# validation dataset
val_df = pd.read_csv(f"{root_dir}/pert_ids_val.csv")

# submission sample
sample_sub = pd.read_csv(f"{root_dir}/sample_submission.csv")

And we split the rows into the perturbed cases and the non-targeting control:

In [5]:
pert_df = train_df[train_df.pert_symbol != "non-targeting"]
cntrl_df = train_df[train_df.pert_symbol == "non-targeting"]
pert_df.shape, cntrl_df.shape

((80, 5128), (1, 5128))

Let's now define a canonical gene order to maintain the same row order across different files:

In [6]:
pert_ids = pert_df["pert_symbol"].tolist()

# order dataframes based on `pert_ids`
ground_truth = ground_truth.set_index("pert_id").loc[pert_ids].reset_index()

assert list(ground_truth["pert_id"]) == pert_ids, "ground_truth order does not match embedding order"
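
The reordering step above relies on `.loc` returning rows in the order of the label list it receives. A minimal sketch on toy data (the gene names and values below are made up for illustration and are not from the competition files):

```python
import pandas as pd

# Toy stand-ins for the real tables (hypothetical values)
pert_ids = ["GENE_B", "GENE_A", "GENE_C"]
ground_truth = pd.DataFrame(
    {"pert_id": ["GENE_A", "GENE_B", "GENE_C"], "baseline_wmae": [0.1, 0.2, 0.3]}
)

# .loc with a list of labels returns rows in the order of that list
# and raises KeyError if any label is missing from the index
aligned = ground_truth.set_index("pert_id").loc[pert_ids].reset_index()

print(aligned["pert_id"].tolist())
print(aligned["baseline_wmae"].tolist())
```

Because a missing label raises `KeyError`, this pattern also acts as a check that every perturbation in the training table has a ground truth row.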

Now we need to get ground truth values for later usage:

In [7]:
# align gene column labels to what will be saved in the submission file (what the metric expects to see)
gene_cols = sample_sub.columns[1:]

# save ground truth values
Y_true = ground_truth[gene_cols].values
W_true = ground_truth[[f"w_{g}" for g in gene_cols]].values
baseline_true = ground_truth["baseline_wmae"].values

## Feature engineering

We now work with the input data (`pert_symbol` column of the `training_data_means.csv` file). Each `pert_symbol` is a gene symbol, so we look up the protein sequence of that gene and pass it through a pre-trained protein language model to retrieve an embedding, with the aim that genes with similar molecular roles end up having similar embeddings.

In this case, we use **ESM-2** from *Meta AI* (`esm2_t33_650M_UR50D`), loaded through the `transformers` library, to get embeddings of size 1280.

First, we define some utility functions that will:

1. get the **ESM-2** embeddings,
2. fetch protein sequences from the *UniProt* database.

In [8]:
def get_esm_embedding(sequence: str, tokenizer, model, device):
    """Return the mean-pooled ESM-2 embedding of a single protein sequence"""
    
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True).to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # last hidden state: (batch, tokens, dim)
    last_hidden = outputs.last_hidden_state

    # Remove special tokens (<cls>, <eos>)
    # Token positions: 0 = <cls>, last = <eos> (holds for a single, unpadded sequence)
    residue_embeddings = last_hidden[:, 1:-1, :]

    # Mean pool over residues
    protein_embedding = residue_embeddings.mean(dim=1)

    return protein_embedding.squeeze().cpu().numpy()

def fetch_protein_sequence(gene_symbol):
    """Fetch protein sequence from the UniProt database"""
    url = (
        "https://rest.uniprot.org/uniprotkb/search"
        f"?query=gene:{gene_symbol}+AND+organism_id:9606+AND+reviewed:true"
        "&fields=sequence&format=fasta&size=1"
    )
    # a timeout keeps the loop from hanging on an unresponsive request
    r = requests.get(url, timeout=30)
    if r.status_code != 200 or not r.text.startswith(">"):
        return None
    
    lines = r.text.strip().split("\n")
    sequence = "".join(lines[1:])
    return sequence
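
`get_esm_embedding` handles one sequence per call, so slicing `[:, 1:-1, :]` removes exactly the two special tokens. If several sequences were batched and padded to a common length, that slice would also average over padding positions. The sketch below shows one way to pool with an attention mask instead. It is written in plain NumPy for illustration, `masked_mean_pool` is a hypothetical helper that is not part of the notebook, and it assumes right padding with one special token at each end of every sequence.

```python
import numpy as np

def masked_mean_pool(hidden, attention_mask):
    """Mean over residue positions only.

    hidden: (batch, tokens, dim) array of per-token embeddings.
    attention_mask: (batch, tokens) array, 1 for real tokens, 0 for padding.
    Assumes each row is laid out as: special token, residues, special token, padding.
    """
    attention_mask = np.asarray(attention_mask, dtype=int)
    lengths = attention_mask.sum(axis=1)      # number of real tokens per sequence
    mask = attention_mask.astype(float)       # astype returns a copy
    rows = np.arange(mask.shape[0])
    mask[:, 0] = 0.0                          # drop the leading special token
    mask[rows, lengths - 1] = 0.0             # drop the trailing special token
    summed = (hidden * mask[:, :, None]).sum(axis=1)
    counts = mask.sum(axis=1, keepdims=True)  # residues per sequence
    return summed / counts

# Toy batch: 2 sequences, 5 token slots, embedding size 2
hidden = np.arange(20, dtype=float).reshape(2, 5, 2)
attention_mask = np.array([[1, 1, 1, 1, 1],   # 3 residues, no padding
                           [1, 1, 1, 1, 0]])  # 2 residues, 1 padding slot
pooled = masked_mean_pool(hidden, attention_mask)
print(pooled)
```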

Now we can get the embeddings for every gene in the training dataset:

In [9]:
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.to(device)

# get embeddings for each of the genes
embeddings = {}
failed = []
for gene in tqdm(pert_df["pert_symbol"].unique()):
    seq = fetch_protein_sequence(gene)
    if seq is None:
        failed.append(gene)
        continue

    emb = get_esm_embedding(seq, tokenizer, model, device)
    embeddings[gene] = emb

if failed:
    print(f"failed genes: {failed}")

Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/80 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 80/80 [01:03<00:00,  1.26it/s]
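
Since each gene requires one HTTP request, a transient network error would put a gene in the `failed` list. The following is a minimal retry sketch, not part of the original notebook: `fetch_with_retry` is a hypothetical wrapper, and the fetch function and the sleep function are passed in as arguments so the example runs offline with a fake fetcher.

```python
import time

def fetch_with_retry(fetch, gene_symbol, retries=3, delay=1.0, sleep=time.sleep):
    """Call fetch(gene_symbol) up to `retries` times, pausing between attempts.

    `fetch` is any callable that returns a sequence string or None.
    Exceptions raised by `fetch` are treated like a None result.
    """
    for attempt in range(retries):
        try:
            seq = fetch(gene_symbol)
        except Exception:
            seq = None
        if seq is not None:
            return seq
        if attempt < retries - 1:
            sleep(delay * (2 ** attempt))  # exponential backoff: delay, 2*delay, ...
    return None

# Offline demonstration with a fake fetcher that fails twice, then succeeds
calls = {"n": 0}

def flaky_fetch(gene_symbol):
    calls["n"] += 1
    return None if calls["n"] < 3 else "MKT"

waits = []  # record the requested pauses instead of actually sleeping
result = fetch_with_retry(flaky_fetch, "GENE_A", sleep=waits.append)
print(result, calls["n"], waits)
```

In the notebook this could be used as `fetch_with_retry(fetch_protein_sequence, gene)`. Retrying only helps with transient errors; a symbol without a matching UniProt entry would still return `None`.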


Finally, we can create our input and output (`X`, `Y`) datasets to be used for training a machine learning model:

In [10]:
# stack embeddings
X = np.vstack(pert_df["pert_symbol"].map(embeddings).values)

# create array of delta averages
Y_raw = pert_df[gene_cols].values
Y_control = cntrl_df[gene_cols].values
Y = Y_raw - Y_control

X.shape, Y.shape

((80, 1280), (80, 5127))

We now have the input data as a NumPy array of shape (80, 1280) and the targets (delta averages) as an array of shape (80, 5127). The 5127-dimensional targets are reduced to a low-dimensional latent space as part of the modeling step.
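
The subtraction `Y_raw - Y_control` above works because NumPy broadcasts the single control row across all perturbation rows. A small sketch with made-up numbers (3 perturbations and 4 genes instead of 80 and 5127):

```python
import numpy as np

# Toy values: 3 perturbations x 4 genes, plus one control row
Y_raw = np.array([[1.0, 2.0, 3.0, 4.0],
                  [0.5, 2.5, 3.0, 3.0],
                  [1.0, 1.0, 1.0, 1.0]])
Y_control = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)

# Broadcasting stretches the (1, 4) control row to match the (3, 4) matrix
Y = Y_raw - Y_control

print(Y.shape)
print(Y[1])
```

A row identical to the control yields a delta of zero for every gene, which is the expected target for a perturbation with no effect.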

## Modeling

Next in line is training a set of *Gaussian process* models (`GaussianProcessRegressor`) on the preprocessed dataset, one model per latent dimension of a PCA fitted on the targets.

In [11]:
def make_gp():
    """Build a GP regressor; kernel parameters were suggested by ChatGPT for biological data"""
    kernel = (
        C(1.0, (1e-2, 1e2)) *
        Matern(length_scale=1.0, length_scale_bounds=(1e-2, 1e2), nu=1.5)
        + WhiteKernel(noise_level=1e-3, noise_level_bounds=(1e-6, 1e-1))
    )

    return GaussianProcessRegressor(kernel=kernel, alpha=1e-5, normalize_y=True, n_restarts_optimizer=3, random_state=42)
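
To make the kernel choice more concrete, the sketch below evaluates the Matern covariance for `nu=1.5`, which has the closed form $k(d) = (1 + \sqrt{3}\,d/\ell)\,\exp(-\sqrt{3}\,d/\ell)$ with $d$ the Euclidean distance and $\ell$ the length scale. It is a NumPy illustration only: `matern32` is a hypothetical helper, and it leaves out the constant factor and the white-noise term that `make_gp` adds around the Matern part.

```python
import numpy as np

def matern32(x1, x2, length_scale=1.0):
    """Matern kernel with nu = 1.5 between two sets of row vectors."""
    # pairwise Euclidean distances, shape (len(x1), len(x2))
    d = np.sqrt(((x1[:, None, :] - x2[None, :, :]) ** 2).sum(axis=-1))
    s = np.sqrt(3.0) * d / length_scale
    return (1.0 + s) * np.exp(-s)

# Three toy points: two close together, one far away
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
K = matern32(X_toy, X_toy)
print(np.round(K, 4))
```

The covariance is 1 on the diagonal and decays with distance, so the GP's prediction for a new gene is driven mostly by training genes whose embeddings lie within a few length scales of it.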

In [12]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for K_latent in [5, 8, 12]:
    scores = []
    for tr, va in kf.split(X):
        X_tr, X_va = X[tr], X[va]
        Y_tr, Y_va = Y[tr], Y_true[va]
        W_va = W_true[va]
        B_va = baseline_true[va]

        # reduce dimensions of latent space of the true deltas (ground_truth values)
        pca = PCA(n_components=K_latent)
        pca.fit(Y_true[tr])
        Z_tr = pca.transform(Y_tr)

        # train one GP per latent dimension
        gps = []
        for k in range(K_latent):
            gp = make_gp()
            gp.fit(X_tr, Z_tr[:, k])
            gps.append(gp)

        # predict on latent space
        Z_val_pred = np.column_stack([gp.predict(X_va) for gp in gps])
        # and jump back to gene space
        Y_pred = pca.inverse_transform(Z_val_pred)

        # metric evaluation
        fold_score = _score_impl(Y_va, Y_pred, W_va, B_va, eps=1e-12, max_log2=5.0, cos_left=0.0, cos_right=0.2)
        scores.append(fold_score)

    results[K_latent] = np.mean(scores)
    print(f"- components: {K_latent} :: metric: {np.mean(scores):.4f}")

best_n = max(results, key=results.get)
print(f"\n- best number of components: {best_n}")

- components: 5 :: metric: -0.0895
- components: 8 :: metric: -0.0892
- components: 12 :: metric: -0.0975

- best number of components: 8
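
The loop above projects the targets to a latent space, predicts there, and maps the predictions back to gene space. The round trip itself can be illustrated with plain NumPy on synthetic data. This is a sketch of what PCA projection and inverse transformation compute (without whitening), not a replacement for the `PCA` class used in the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
Y_toy = rng.normal(size=(20, 50))      # 20 samples x 50 "genes" (synthetic)
K_latent = 5

# PCA via SVD of the centered matrix
mean = Y_toy.mean(axis=0)
U, S, Vt = np.linalg.svd(Y_toy - mean, full_matrices=False)
components = Vt[:K_latent]             # (K_latent, 50)

Z = (Y_toy - mean) @ components.T      # project to latent space: (20, K_latent)
Y_back = Z @ components + mean         # map back to gene space: (20, 50)

# fraction of variance kept by the first K_latent components
kept = (S[:K_latent] ** 2).sum() / (S ** 2).sum()
# relative squared reconstruction error of the round trip
rel_err = np.linalg.norm(Y_toy - Y_back) ** 2 / np.linalg.norm(Y_toy - mean) ** 2
print(Z.shape, Y_back.shape)
print(kept, rel_err)
```

The kept variance and the relative reconstruction error sum to one, so any signal outside the first `K_latent` components cannot be recovered by the GPs, no matter how well they fit the latent targets.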


We can now use this exploratory result to create the final model with the best number of components and train it with the entire dataset:

In [13]:
pca = PCA(n_components=best_n)
pca.fit(Y_true)
Z = pca.transform(Y)

# train
gps = []
for k in range(best_n):
    gp = make_gp()
    gp.fit(X, Z[:, k])
    gps.append(gp)

## Inference

We are now ready to use the model to make predictions on **new** genes that were not seen during the training stage.

We first load the **unseen** genes from the validation set and compute their embeddings:

In [14]:
val_genes = val_df["pert"].tolist()
val_ids = val_df["pert_id"].tolist()

val_embeddings = {}
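for gene in tqdm(val_genes):
    seq = fetch_protein_sequence(gene)
    if seq is None:
        # fail early with a clear message instead of passing None to the tokenizer
        raise ValueError(f"no protein sequence found for {gene}")
    emb = get_esm_embedding(seq, tokenizer, model, device)
    val_embeddings[gene] = emb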

X_val = np.vstack([val_embeddings[g] for g in val_genes])

100%|██████████| 60/60 [00:48<00:00,  1.22it/s]


And we make predictions with our best pipeline model:

In [15]:
Z_val_pred = np.column_stack([gp.predict(X_val) for gp in gps])
Y_preds = pca.inverse_transform(Z_val_pred)

Finally, we create the submission file with the 60 predictions for the validation genes and 60 more dummy predictions for the not-yet available test genes:

In [16]:
preds = pd.DataFrame(Y_preds, columns=gene_cols)
preds.insert(0, "pert_id", val_ids)

# grab dummy values as in the submission script example
pad = pd.DataFrame(0.0, index=range(60), columns=gene_cols)
pad.insert(0, "pert_id", [f"pert_{i}" for i in range(61, 121)])

submission = pd.concat([preds, pad], ignore_index=True)
submission.to_csv("submission.csv", index=False)
submission.head()

| | pert_id | A1BG | A1CF | AADAC | AAK1 | AARS1 | AASS | ABCA1 | ABCA12 | ABCA5 | ... | ZP3 | ZPBP | ZRANB3 | ZSCAN18 | ZSCAN31 | ZSWIM5 | ZSWIM6 | ZSWIM7 | ZWINT | ZYX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pert_1 | 0.000864 | -0.001118 | -0.00395 | 0.005061 | -0.012558 | -0.000404 | -0.004013 | -0.000816 | 0.01079 | ... | 0.024222 | -0.000229 | 0.003173 | 0.002918 | 0.007432 | 0.010506 | -0.013054 | -0.021379 | -0.005087 | -0.010961 |
| 1 | pert_2 | 0.000978 | -0.001257 | -0.004093 | 0.00557 | -0.012135 | -0.000456 | -0.004269 | -0.000822 | 0.010892 | ... | 0.024626 | -0.000193 | 0.002987 | 0.002882 | 0.007173 | 0.01004 | -0.013159 | -0.02123 | -0.005564 | -0.011393 |
| 2 | pert_3 | 0.008357 | 0.000341 | -0.003132 | 0.000949 | -0.009704 | -0.000177 | -0.002616 | -0.000424 | 0.010696 | ... | 0.025947 | -0.000239 | 0.001596 | 0.002749 | 0.0076 | 0.011755 | -0.016253 | -0.022938 | -0.01232 | -0.00842 |
| 3 | pert_4 | 0.003785 | -0.00063 | -0.003606 | 0.004455 | -0.012795 | -0.000373 | -0.00329 | -0.00049 | 0.011097 | ... | 0.025035 | -0.000195 | 0.001212 | 0.003061 | 0.00749 | 0.010816 | -0.013767 | -0.022406 | -0.007876 | -0.009488 |
| 4 | pert_5 | 0.008547 | 0.000378 | -0.003263 | 0.001018 | -0.008712 | -0.000153 | -0.002579 | -0.000499 | 0.010696 | ... | 0.026213 | -0.00024 | 0.002045 | 0.002571 | 0.007399 | 0.011661 | -0.016607 | -0.022575 | -0.012781 | -0.008163 |
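
Before uploading, it may be worth running a few sanity checks on the submission table. The sketch below is not part of the original notebook: `check_submission` is a hypothetical helper, and it assumes that the sample submission has the same columns, column order, and number of rows that the final submission is expected to have.

```python
import pandas as pd

def check_submission(submission, sample):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if list(submission.columns) != list(sample.columns):
        problems.append("column names or order differ from the sample")
    if len(submission) != len(sample):
        problems.append(f"expected {len(sample)} rows, got {len(submission)}")
    if "pert_id" in submission.columns:
        if submission["pert_id"].duplicated().any():
            problems.append("duplicate pert_id values")
        if submission.drop(columns="pert_id").isna().any().any():
            problems.append("NaN values in predictions")
    return problems

# Toy tables with two perturbations and two genes (hypothetical values)
sample = pd.DataFrame({"pert_id": ["pert_1", "pert_2"], "G1": [0.0, 0.0], "G2": [0.0, 0.0]})
good = pd.DataFrame({"pert_id": ["pert_1", "pert_2"], "G1": [0.1, -0.2], "G2": [0.0, 0.3]})
bad = good[["pert_id", "G2", "G1"]].iloc[:1]  # wrong column order, one row missing

print(check_submission(good, sample))
print(check_submission(bad, sample))
```

In the notebook, the call would be `check_submission(submission, sample_sub)`, under the assumption stated above.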
