# Myllia - echoes of silenced genes
---

The task is to train a model that predicts *expression changes in scRNA-seq data induced by CRISPRi perturbations*. The training data contain the *average expression values* of genes for 80 different perturbations, plus one unperturbed control (*non-targeting sgRNA*).

## Introduction

This problem essentially maps a string input, the gene symbol in the `pert_symbol` column, to an output vector with one entry per remaining column (i.e., a 5127-dimensional space).

Clearly, this dataset needs to be preprocessed. In particular, we are going to:

1. represent each perturbed gene by an embedding of its protein sequence, obtained from a pre-trained neural network,
2. compute delta averages (perturbed minus control) for each of the output vectors,
3. train a machine learning (ML) model using the preprocessed datasets created during steps 1 and 2,
4. predict delta averages on **new** genes using the validation set (`pert_ids_val.csv`).

First we import libraries:

In [1]:
import joblib
import requests
import time
import warnings
warnings.filterwarnings("ignore")

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from scipy.stats import pearsonr
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import torch
from transformers import AutoTokenizer, AutoModel
from tqdm import tqdm

Import the custom-made metric:

In [2]:
import sys
sys.path.append("/kaggle/usr/lib/relative-wmae-multiplied-by-wcosine")

from metric import _score_impl

Define some configuration options:

In [3]:
model_name = "facebook/esm2_t33_650M_UR50D"
device = "cuda" if torch.cuda.is_available() else "cpu"

## Data loading

We start by loading the required datasets:

In [4]:
train_df = pd.read_csv("/kaggle/input/myllia-echoes-of-silenced-genes-competition-data/training_data_means.csv")
sample_sub = pd.read_csv("/kaggle/input/myllia-echoes-of-silenced-genes-competition-data/sample_submission.csv")
train_df.shape, sample_sub.shape

((81, 5128), (120, 5128))

We also load the ground truth values for the training set. These values are used later, during cross-validation, to score candidate models with the competition metric and pick the best one:

In [5]:
ground_truth = pd.read_csv("/kaggle/input/myllia-echoes-of-silenced-genes-competition-data/training_data_ground_truth_table.csv")

genes = [c for c in ground_truth.columns if not c.startswith("w_") and c not in ["pert_id", "baseline_wmae"]]
weight_cols = [f"w_{g}" for g in genes]

# true values (used later on, when computing metric)
Y_true = ground_truth[genes].values
W_true = ground_truth[weight_cols].values
baseline_true = ground_truth["baseline_wmae"].values

ground_truth.shape

(80, 10256)

And we split the training table into the perturbed rows and the non-targeting control row:

In [6]:
pert_df = train_df[train_df.pert_symbol != "non-targeting"]
cntrl_df = train_df[train_df.pert_symbol == "non-targeting"]
pert_df.shape, cntrl_df.shape

((80, 5128), (1, 5128))

## Feature engineering

We now work with the input data (`pert_symbol` column of the `training_data_means.csv` file). We are going to map each gene symbol (`pert_symbol`) to its protein sequence and use a trained protein language model to retrieve one embedding per sequence, with the expectation that genes with similar molecular roles end up having similar embeddings.

In this case, we use **ESM-2** from *Meta AI* (`esm2_t33_650M_UR50D`), loaded through the `transformers` library, to get embeddings of size 1280.
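
To make the pooling idea concrete, here is a minimal NumPy sketch of how a single fixed-size vector can be obtained from per-token model outputs: the two special tokens at the ends are dropped and the remaining per-residue vectors are averaged. The array `hidden` is random stand-in data, not real ESM-2 output, and `mean_pool_residues` is a hypothetical helper name used only for illustration.

```python
import numpy as np

def mean_pool_residues(hidden: np.ndarray) -> np.ndarray:
    """Average per-residue vectors, skipping the first and last (special) tokens.

    hidden has shape (tokens, dim), where tokens = sequence length + 2.
    """
    if hidden.shape[0] < 3:
        raise ValueError("need at least one residue between the special tokens")
    return hidden[1:-1].mean(axis=0)

# Stand-in for the model output of a 10-residue protein (random, illustration only).
rng = np.random.default_rng(0)
hidden = rng.normal(size=(12, 1280))

embedding = mean_pool_residues(hidden)
print(embedding.shape)  # (1280,)
```

Whatever the protein length, the pooled vector always has the model's hidden size, which is what lets proteins of different lengths share one feature matrix.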

First, we define some utility functions that will:

1. get the **ESM-2** embeddings,
2. fetch protein sequences from the *UniProt* database.

In [7]:
def get_esm_embedding(sequence: str, tokenizer, model, device):
    """Return the mean-pooled ESM embedding of a single protein sequence"""
    
    inputs = tokenizer(sequence, return_tensors="pt", padding=True, truncation=True).to(device)
    model.eval()
    with torch.no_grad():
        outputs = model(**inputs)

    # last hidden state: (batch, tokens, dim)
    last_hidden = outputs.last_hidden_state

    # Remove special tokens (<cls>, <eos>)
    # Token positions: 0 = <cls>, last = <eos>
    residue_embeddings = last_hidden[:, 1:-1, :]

    # Mean pool over residues
    protein_embedding = residue_embeddings.mean(dim=1)

    return protein_embedding.squeeze().cpu().numpy()

def fetch_protein_sequence(gene_symbol):
    """Fetch protein sequence from the UniProt database"""
    url = (
        "https://rest.uniprot.org/uniprotkb/search"
        f"?query=gene:{gene_symbol}+AND+organism_id:9606+AND+reviewed:true"
        "&fields=sequence&format=fasta&size=1"
    )
    r = requests.get(url)
    if r.status_code != 200 or not r.text.startswith(">"):
        return None
    
    lines = r.text.strip().split("\n")
    sequence = "".join(lines[1:])
    return sequence
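
The FASTA handling above can be checked offline. The sketch below is a small, hypothetical variant (`parse_fasta_sequence` is not part of the notebook) that parses a FASTA string without any network call and stops at a second record if one is present. The example record is made up and is not a real UniProt entry.

```python
def parse_fasta_sequence(fasta_text: str):
    """Return the first sequence in a FASTA string, or None if there is none."""
    text = fasta_text.strip()
    if not text.startswith(">"):
        return None
    residues = []
    for line in text.splitlines()[1:]:
        if line.startswith(">"):  # a second record starts here; stop
            break
        residues.append(line.strip())
    sequence = "".join(residues)
    return sequence or None

# Made-up record for illustration; not a real UniProt entry.
example = ">sp|X00000|EXAMPLE_HUMAN hypothetical protein\nMKTAYIAKQR\nQISFVKSHFS\n"
print(parse_fasta_sequence(example))  # MKTAYIAKQRQISFVKSHFS
```

Returning `None` for an empty or malformed response keeps the same "missing sequence" convention that `fetch_protein_sequence` uses.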

Now we can get the embeddings for every gene in the training dataset:

In [8]:
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)
model.to(device)

# get embeddings for each of the genes
embeddings = {}
failed = []
for gene in tqdm(pert_df["pert_symbol"].unique()):
    seq = fetch_protein_sequence(gene)
    if seq is None:
        failed.append(gene)
        continue

    emb = get_esm_embedding(seq, tokenizer, model, device)
    embeddings[gene] = emb

if failed:
    print(f"failed genes: {failed}")


Some weights of EsmModel were not initialized from the model checkpoint at facebook/esm2_t33_650M_UR50D and are newly initialized: ['pooler.dense.bias', 'pooler.dense.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
  0%|          | 0/80 [00:00<?, ?it/s]Asking to truncate to max_length but no maximum length is provided and the model has no predefined maximum length. Default to no truncation.
100%|██████████| 80/80 [01:06<00:00,  1.21it/s]
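
All 80 lookups succeeded in this run, but remote queries can fail transiently. One possible safeguard, sketched here under the assumption that a failed lookup either returns `None` or raises, is a small retry wrapper. `fetch_with_retries` and `flaky_fetch` are hypothetical names; the toy fetcher stands in for a real network call.

```python
import time

def fetch_with_retries(fetch, gene_symbol, attempts=3, delay=1.0):
    """Call fetch(gene_symbol) up to `attempts` times, sleeping between failures.

    A result of None or a raised exception counts as a failure.
    Returns the first non-None result, or None if every attempt fails.
    """
    for attempt in range(attempts):
        try:
            result = fetch(gene_symbol)
        except Exception:
            result = None
        if result is not None:
            return result
        if attempt < attempts - 1:
            time.sleep(delay)
    return None

# Toy fetcher that fails twice before succeeding (stands in for a flaky network call).
calls = {"n": 0}

def flaky_fetch(gene_symbol):
    calls["n"] += 1
    return None if calls["n"] < 3 else "MKT"

print(fetch_with_retries(flaky_fetch, "GENE1", attempts=3, delay=0.0))  # MKT
```

In the notebook this could wrap `fetch_protein_sequence`, so that a gene only lands in `failed` after several attempts.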


Finally, we can create our input and output (`X`, `Y`) datasets to be used for training a machine learning model:

In [9]:
# stack embeddings
X = np.vstack(pert_df["pert_symbol"].map(embeddings).values)

# create array of delta averages
Y_raw = pert_df.iloc[:,1:].values
Y_control = cntrl_df.iloc[:,1:].values
Y = Y_raw - Y_control

X.shape, Y.shape

((80, 1280), (80, 5127))
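
The subtraction `Y_raw - Y_control` relies on NumPy broadcasting: the single control row of shape (1, 5127) is subtracted from each of the 80 perturbation rows. A toy sketch with made-up numbers (the `_toy` arrays are illustrative only) shows the same mechanics on a smaller scale:

```python
import numpy as np

# Toy stand-ins: 3 perturbations x 4 genes, and one control row (values are made up).
Y_raw_toy = np.array([
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 2.5, 3.0, 3.0],
    [1.5, 1.0, 2.0, 4.5],
])
Y_control_toy = np.array([[1.0, 2.0, 3.0, 4.0]])  # shape (1, 4)

# Broadcasting subtracts the single control row from every perturbation row.
Y_delta_toy = Y_raw_toy - Y_control_toy

print(Y_delta_toy.shape)  # (3, 4)
print(Y_delta_toy[0])     # [0. 0. 0. 0.]
```

A perturbation whose profile equals the control gets an all-zero delta, so the model's targets are changes relative to the unperturbed state rather than absolute expression levels.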

Now we have the input data as a NumPy array of shape (80, 1280) and the output as an array of delta averages of shape (80, 5127), so both sides of the regression problem are ready.

## Modeling

Next in line is training a *partial least squares* regression model (`PLSRegression`) with the preprocessed dataset.

Given that there are very few training points (only 80 of them), we use 5-fold cross-validation (`KFold`) to choose the number of `PLSRegression` components, which project the data onto a low-dimensional latent space, and we score each fold with the custom-made metric proposed by the host of the competition.

This [example](https://scikit-learn.org/stable/auto_examples/cross_decomposition/plot_pcr_vs_pls.html) by the *scikit-learn* team compares PLS with principal component regression and helps explain why PLS is useful in this context of a small sample size and a very high-dimensional (possibly correlated) output space.

In [10]:
from sklearn.model_selection import KFold

kf = KFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for n_comp in [3, 5, 7, 10, 15]:
    scores = []
    for tr, va in kf.split(X):
        X_tr, X_va = X[tr], X[va]
        Y_tr, Y_va = Y_true[tr], Y_true[va]
        W_va = W_true[va]
        B_va = baseline_true[va]

        # train model pipeline
        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("pls", PLSRegression(n_components=n_comp))
        ])
        pipeline.fit(X_tr, Y_tr)
        Y_pred = pipeline.predict(X_va)

        # metric evaluation
        fold_score = _score_impl(Y_va, Y_pred, W_va, B_va, eps=1e-12, max_log2=5.0, cos_left=0.0, cos_right=0.2)
        scores.append(fold_score)

    results[n_comp] = np.mean(scores)
    print(f"- components: {n_comp} :: metric: {np.mean(scores):.4f}")

best_n = max(results, key=results.get)
print(f"\n- best number of components: {best_n}")

- components: 3 :: metric: -0.0374
- components: 5 :: metric: 0.0546
- components: 7 :: metric: 0.0583
- components: 10 :: metric: 0.0007
- components: 15 :: metric: -0.0289

- best number of components: 7


We can now use this exploratory result to create the final model with the best number of components and train it with the entire dataset:

In [11]:
final_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("pls", PLSRegression(n_components=best_n))
])

final_pipeline.fit(X, Y)

# save it as a joblib file
joblib.dump(final_pipeline, "pls_model.joblib")

['pls_model.joblib']

## Inference

We are now ready to use the model to make predictions on **new** genes.

We first load the **unseen** genes from the validation set:

In [12]:
val_df = pd.read_csv("/kaggle/input/myllia-echoes-of-silenced-genes-competition-data/pert_ids_val.csv")
val_genes = val_df["pert"].tolist()
val_ids = val_df["pert_id"].tolist()
val_df.shape

(60, 3)

Compute their embeddings:

In [13]:
val_embeddings = {}

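for gene in tqdm(val_genes):
    seq = fetch_protein_sequence(gene)
    if seq is None:
        # fail early with a clear message instead of passing None to the tokenizer
        raise ValueError(f"no protein sequence found for gene: {gene}")
    emb = get_esm_embedding(seq, tokenizer, model, device)
    val_embeddings[gene] = emb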

X_val = np.vstack([val_embeddings[g] for g in val_genes])

100%|██████████| 60/60 [00:50<00:00,  1.18it/s]


And we make predictions with our best pipeline model:

In [14]:
Y_preds = final_pipeline.predict(X_val)

Finally, we create the submission file with the 60 predictions for the validation genes and 60 zero-filled placeholder rows for the not-yet-available test genes:

In [15]:
preds = pd.DataFrame(
    Y_preds,
    columns=train_df.columns[1:],
)

preds.insert(0, "pert_id", val_ids)

pad = pd.DataFrame(0, index=range(60), columns=train_df.columns[1:])
pad.insert(
    0,
    "pert_id",
    [f"pert_{i}" for i in range(61, 121)],
)

submission = pd.concat([preds, pad], ignore_index=True)
submission.to_csv("submission.csv", index=False)
submission.head()

| | pert_id | A1BG | A1CF | AADAC | AAK1 | AARS1 | AASS | ABCA1 | ABCA12 | ABCA5 | ... | ZP3 | ZPBP | ZRANB3 | ZSCAN18 | ZSCAN31 | ZSWIM5 | ZSWIM6 | ZSWIM7 | ZWINT | ZYX |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | pert_1 | 0.00166 | -0.000629 | -0.013861 | 0.022698 | -0.030428 | -0.000633 | -0.024339 | -0.004135 | 0.034498 | ... | 0.029567 | -0.000542 | 0.008004 | 0.004664 | 0.004154 | 0.013844 | -0.026778 | -0.029954 | 0.01314 | -0.003059 |
| 1 | pert_2 | -0.002818 | -0.000289 | -0.003878 | 0.013061 | -0.023784 | 0.000229 | -0.005068 | 0.0023 | 0.019087 | ... | 0.023223 | -0.000433 | -0.000249 | 8.1e-05 | 0.004436 | 0.00846 | -0.008515 | -0.047856 | 0.004898 | 0.015301 |
| 2 | pert_3 | 0.004208 | 0.001828 | -0.004489 | 0.005775 | -0.02007 | -0.000915 | -0.001749 | 0.001641 | 0.011315 | ... | 0.033968 | 0.0003 | 0.000525 | -0.000969 | 0.001798 | 0.012857 | -0.025819 | -0.035501 | -0.010345 | -0.006904 |
| 3 | pert_4 | 0.009696 | -0.002796 | -0.004809 | 0.019337 | -0.018152 | 0.000598 | -0.013174 | -0.001985 | 0.02503 | ... | 0.019953 | -0.000526 | 0.001986 | 0.005343 | 0.009924 | 0.009311 | -0.019517 | -0.045759 | 0.00307 | 0.015191 |
| 4 | pert_5 | 0.020579 | -0.000406 | -0.001561 | -0.01756 | 0.004744 | -0.000507 | -0.025558 | -0.005731 | 0.028293 | ... | 0.019539 | -0.000488 | 0.008905 | 0.007291 | 0.014021 | 0.009139 | -0.025452 | -0.025134 | -0.016098 | -0.035905 |
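
Before uploading, it can be worth checking that the assembled table has the expected layout. The sketch below rebuilds the predictions-plus-padding pattern on a tiny made-up example and runs a few sanity checks. The gene names and values are hypothetical, and the checks are only a suggestion, not a requirement stated by the competition.

```python
import pandas as pd

gene_cols = ["GENE_A", "GENE_B", "GENE_C"]  # hypothetical gene names

# Toy predictions for two "validation" perturbations.
preds_toy = pd.DataFrame([[0.1, -0.2, 0.0], [0.3, 0.0, -0.1]], columns=gene_cols)
preds_toy.insert(0, "pert_id", ["pert_1", "pert_2"])

# Zero-filled placeholder rows for two "test" perturbations.
pad_toy = pd.DataFrame(0.0, index=range(2), columns=gene_cols)
pad_toy.insert(0, "pert_id", ["pert_3", "pert_4"])

submission_toy = pd.concat([preds_toy, pad_toy], ignore_index=True)

# Basic checks before writing the file.
assert list(submission_toy.columns) == ["pert_id"] + gene_cols
assert submission_toy["pert_id"].is_unique
assert not submission_toy[gene_cols].isna().any().any()
print(submission_toy.shape)  # (4, 4)
```

On the real data, the analogous check would compare `submission.shape` and its column order against `sample_sub`, which was loaded at the start of the notebook.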
