## Introduction

Linear models are often the first step in machine learning for biological sequence-to-function prediction. Despite their simplicity, they can provide strong baseline performance, interpretability, and insight into key features driving biological outcomes. In this notebook, we demonstrate how to apply linear regression and regularized models (like Lasso and Ridge) to predict sequence-derived properties.

The data for sequence-to-fitness prediction come from ProteinGym (https://proteingym.org), a collection of benchmarks for comparing how well models predict the effects of protein mutations. We first import the essential libraries, then load the data from the `data_fitness` folder.

We use scikit-learn, a widely used Python package for building and evaluating machine learning models.

In [4]:
import sklearn
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LinearRegression, Lasso, Ridge, ElasticNet
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.preprocessing import StandardScaler

In [None]:
CAPSD = pd.read_csv("data_fitness/CAPSD_AAV2S_Sinai_2021.csv")
PHOT = pd.read_csv("data_fitness/PHOT_CHLRE_Chen_2023.csv")
POLG = pd.read_csv("data_fitness/POLG_DEN26_Suphatrakul_2023.csv")

We start with the CAPSD dataset: we one-hot encode the amino acid sequences and then perform the train-test split. We take the first 5000 entries to keep runtime manageable.

`mutated_sequence` contains the amino acid sequences.

`DMS_score` is the experimentally measured fitness.

In [15]:
sequences = CAPSD["mutated_sequence"].values[:5000]
scores = CAPSD["DMS_score"].values[:5000]

In [16]:
amino_acids = list("ACDEFGHIKLMNPQRSTVWY") 
encoder = OneHotEncoder(categories=[amino_acids] * len(sequences[0]))

In [17]:
seq_list = [list(seq) for seq in sequences]
X = encoder.fit_transform(seq_list)
X.shape

(5000, 14700)
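
To make the encoding concrete, here is a minimal sketch on two made-up three-residue sequences. The names `toy_sequences`, `toy_encoder`, and `toy_X` are illustrative and not part of the dataset. Each position contributes 20 columns, which is consistent with the 14700 columns above being 735 positions times 20 amino acids.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

amino_acids = list("ACDEFGHIKLMNPQRSTVWY")

# Hypothetical toy sequences of length 3 that differ only at the middle position
toy_sequences = ["ACD", "AYD"]

# One category list per sequence position, as in the notebook
toy_encoder = OneHotEncoder(categories=[amino_acids] * 3)
toy_X = toy_encoder.fit_transform([list(s) for s in toy_sequences])

dense = toy_X.toarray()
print(toy_X.shape)        # 3 positions x 20 amino acids per row
print(dense.sum(axis=1))  # one active column per position

# Columns where the two encodings differ: the block for the middle position
diff_cols = np.flatnonzero(dense[0] != dense[1])
print(diff_cols)
```

A single substitution changes exactly two columns: the one for the original residue and the one for the new residue at that position.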

We use an 80/20 train-test split. Since fitness values vary in scale, we standardize the labels with StandardScaler, fitted on the training labels only and then applied to the test labels.

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train_noscale, y_test_noscale = train_test_split(
    X, scores, test_size=0.2, random_state=42
)

# Fit the scaler on the training labels only, then reuse it for the test labels
y_scaler = StandardScaler()

y_train = y_scaler.fit_transform(y_train_noscale.reshape(-1, 1)).ravel()
y_test = y_scaler.transform(y_test_noscale.reshape(-1, 1)).ravel()

X_train.shape, X_test.shape

((4000, 14700), (1000, 14700))
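
The label scaling can be illustrated on a handful of made-up numbers. This is only a sketch: `toy_train`, `toy_test`, and `toy_scaler` are hypothetical names and values. It shows that the scaler learns its mean and standard deviation from the training labels, and that `inverse_transform` maps standardized values back to the original units, which is one way predictions could be reported on the original fitness scale.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical label values standing in for fitness scores
toy_train = np.array([1.0, 2.0, 3.0, 4.0])
toy_test = np.array([2.5, 5.0])

toy_scaler = StandardScaler()
# Mean and standard deviation are estimated from the training labels only
toy_train_scaled = toy_scaler.fit_transform(toy_train.reshape(-1, 1)).ravel()
# Test labels are shifted and scaled with the training statistics
toy_test_scaled = toy_scaler.transform(toy_test.reshape(-1, 1)).ravel()

# Undo the scaling to recover values in the original units
toy_test_restored = toy_scaler.inverse_transform(
    toy_test_scaled.reshape(-1, 1)
).ravel()

print(toy_train_scaled.mean(), toy_train_scaled.std())
print(toy_test_scaled)
print(toy_test_restored)
```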

We then define a helper function that prints three metrics for evaluating a model on the held-out test set:

- R2 score (coefficient of determination)
- Mean Squared Error (MSE)
- Mean Absolute Error (MAE)

A higher R2 score and lower MSE and MAE indicate a better model. Because the labels are standardized, MSE and MAE are expressed in units of the training-label standard deviation.

In [19]:
def evaluate_model(name, y_true, y_pred):
    print(f"--- {name} ---")
    print("R2:", r2_score(y_true, y_pred))
    print("MSE:", mean_squared_error(y_true, y_pred))
    print("MAE:", mean_absolute_error(y_true, y_pred))
    print()
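
As a sanity check on what these metrics measure, the sketch below computes them by hand on four made-up values and compares the results with scikit-learn. The arrays `y_true_toy` and `y_pred_toy` are hypothetical.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error

# Hypothetical targets and predictions
y_true_toy = np.array([1.0, 2.0, 3.0, 4.0])
y_pred_toy = np.array([1.5, 2.0, 2.5, 4.0])

residuals = y_true_toy - y_pred_toy

# MSE: mean of squared residuals; MAE: mean of absolute residuals
mse_manual = np.mean(residuals ** 2)
mae_manual = np.mean(np.abs(residuals))

# R2: 1 minus (residual sum of squares / total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print("R2:", r2_manual, r2_score(y_true_toy, y_pred_toy))
print("MSE:", mse_manual, mean_squared_error(y_true_toy, y_pred_toy))
print("MAE:", mae_manual, mean_absolute_error(y_true_toy, y_pred_toy))
```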

We now train four linear regression models and evaluate each.

#### 1. Linear Regression
A baseline model with no regularization.

In [20]:
lr = LinearRegression()
lr.fit(X_train, y_train)
y_pred_lr = lr.predict(X_test)

evaluate_model("Linear Regression", y_test, y_pred_lr)

--- Linear Regression ---
R2: 0.5500420178990411
MSE: 0.4567800668557709
MAE: 0.5093609866269868



#### 2. Ridge Regression (L2 Regularization)
Improves stability by adding an L2 penalty that shrinks coefficients toward zero.

In [21]:
ridge = Ridge(alpha=0.1)
ridge.fit(X_train, y_train)
y_pred_ridge = ridge.predict(X_test)

evaluate_model("Ridge Regression (L2)", y_test, y_pred_ridge)

--- Ridge Regression (L2) ---
R2: 0.5662414139903585
MSE: 0.4403350619353897
MAE: 0.5054188511765942
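
The shrinkage effect can be seen on synthetic data. The sketch below is not run on the CAPSD data: the random features, the `true_coef` vector, and the list of `alphas` are all assumptions chosen for illustration. It fits Ridge with increasing `alpha` and reports the Euclidean norm of the coefficient vector, which gets smaller as the penalty grows.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Synthetic regression problem (assumed for illustration only)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 20))
true_coef = np.zeros(20)
true_coef[:3] = [2.0, -1.0, 0.5]
y_toy = X_toy @ true_coef + rng.normal(scale=0.5, size=50)

# Larger alpha means a stronger L2 penalty
alphas = [0.1, 10.0, 1000.0]
coef_norms = []
for a in alphas:
    ridge_toy = Ridge(alpha=a).fit(X_toy, y_toy)
    coef_norms.append(np.linalg.norm(ridge_toy.coef_))
    print(f"alpha={a}: coefficient norm = {coef_norms[-1]:.3f}")
```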



#### 3. Lasso Regression (L1 Regularization)
Performs feature selection by driving some coefficients to zero.

In [22]:
lasso = Lasso(alpha=0.0001, max_iter=500)
lasso.fit(X_train, y_train)
y_pred_lasso = lasso.predict(X_test)

evaluate_model("Lasso Regression (L1)", y_test, y_pred_lasso)

--- Lasso Regression (L1) ---
R2: 0.5996939056106281
MSE: 0.40637537688335224
MAE: 0.4959062595602137
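
The feature-selection behaviour can be checked by counting non-zero coefficients. The sketch below uses synthetic data in which only the first three features carry signal; the data, `true_coef`, and `alpha=0.1` are assumptions for illustration and differ from the settings used above.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic problem: 30 features, only the first 3 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 30))
true_coef = np.zeros(30)
true_coef[:3] = [2.0, -1.5, 1.0]
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=100)

lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)

# The L1 penalty sets many coefficients exactly to zero
n_nonzero = int(np.count_nonzero(lasso_toy.coef_))
print(f"{n_nonzero} of {lasso_toy.coef_.size} coefficients are non-zero")
print("Indices kept:", np.flatnonzero(lasso_toy.coef_))
```

The same count on the model fitted in this notebook would be `np.count_nonzero(lasso.coef_)`.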





#### 4. Elastic Net (L1 + L2)
Combines L1 regularization with L2 regularization; `l1_ratio` controls the balance between the two penalties.

In [23]:
elastic = ElasticNet(alpha=0.0001, l1_ratio=0.5, max_iter=500)
elastic.fit(X_train, y_train)
y_pred_elastic = elastic.predict(X_test)
evaluate_model("Elastic Net (L1 + L2)", y_test, y_pred_elastic)

--- Elastic Net (L1 + L2) ---
R2: 0.5948361151840126
MSE: 0.4113068192048775
MAE: 0.49773791995906425
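
To illustrate the role of `l1_ratio`, the sketch below fits Elastic Net on synthetic data (an assumption for illustration, not the CAPSD data). With `l1_ratio=1.0` the penalty is purely L1, so the fit matches Lasso with the same `alpha`; with `l1_ratio=0.5` part of the penalty is L2.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

# Synthetic problem: 30 features, only the first 3 influence the target
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 30))
true_coef = np.zeros(30)
true_coef[:3] = [2.0, -1.5, 1.0]
y_toy = X_toy @ true_coef + rng.normal(scale=0.1, size=100)

lasso_toy = Lasso(alpha=0.1).fit(X_toy, y_toy)
enet_pure_l1 = ElasticNet(alpha=0.1, l1_ratio=1.0).fit(X_toy, y_toy)
enet_mixed = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X_toy, y_toy)

# l1_ratio=1.0 reduces Elastic Net to Lasso
same_as_lasso = np.allclose(lasso_toy.coef_, enet_pure_l1.coef_)
print("l1_ratio=1.0 matches Lasso:", same_as_lasso)

# Compare how many coefficients each penalty mix leaves non-zero
print("Non-zero (Lasso):", np.count_nonzero(lasso_toy.coef_))
print("Non-zero (l1_ratio=0.5):", np.count_nonzero(enet_mixed.coef_))
```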





The results show that plain Linear Regression performs the worst (R2 of about 0.550). Ridge gives a modest gain (about 0.566), while Lasso (about 0.600) and Elastic Net (about 0.595) improve the most. With 14700 one-hot features and only 4000 training sequences, this suggests that the unregularized model likely overfits due to the high dimensionality of the one-hot encoded sequences.
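
One way to probe for overfitting is to compare R2 on the training set with R2 on the test set. The sketch below does this on synthetic data with more features than training samples, mimicking the regime of this notebook; the data, the split, and `alpha=10.0` are assumptions for illustration. An unregularized model in this regime can fit the training labels almost perfectly while scoring lower on held-out data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

# Synthetic problem with far more features (200) than samples (60)
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(60, 200))
true_coef = np.zeros(200)
true_coef[:5] = 1.0
y_toy = X_toy @ true_coef + rng.normal(scale=1.0, size=60)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.33, random_state=0
)

# Store (train R2, test R2) for each model
results = {}
for name, model in [
    ("Linear Regression", LinearRegression()),
    ("Ridge", Ridge(alpha=10.0)),
]:
    model.fit(X_tr, y_tr)
    results[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train R2 = {results[name][0]:.3f}, "
          f"test R2 = {results[name][1]:.3f}")
```

A large gap between training and test R2 is the usual signature of overfitting. Applying the same comparison to `lr`, `ridge`, `lasso`, and `elastic` with `X_train` and `y_train` would show whether it holds for the CAPSD data.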