# Let's train something

Now that we have our data, let's train a model. 

But before we train anything, we should discuss what kind of model we are training. Most typical ML problems fall into one of two categories: classification and regression. Classification models try to predict a class for each datum; for example, what dog breed is present in an image. Regression tasks try to predict a number. Our problem is a classical regression task: given a sequence, the model tries to predict the temperature.

Machine learning models are trained by minimizing how far their predictions are from the desired values. That distance is measured by a `loss` metric, and choosing it is one of the most important aspects of training. A typical loss metric in regression tasks is the root mean square deviation ([RMSD](https://en.wikipedia.org/wiki/Root_mean_square_deviation)); the cells below report its square, the mean squared error (MSE).
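
To make the loss concrete, here is a minimal, dependency-free sketch of RMSD. The helper name `rmsd` and the three temperatures are made up for illustration and are not part of the notebook or the dataset.

```python
import math

def rmsd(y_true, y_pred):
    """Root mean square deviation between targets and predictions."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical temperatures (not taken from the dataset)
y_true = [30.0, 40.0, 50.0]
y_pred = [33.0, 40.0, 46.0]

# Errors are 3, 0 and 4, so the mean squared error is 25 / 3
rmsd_value = rmsd(y_true, y_pred)
print(round(rmsd_value, 4))  # 2.8868
```

Because RMSD is just the square root of MSE, ranking models by either one gives the same order; RMSD has the advantage of being in the same units as the target (here, degrees).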

Another important concept is the performance metric: how do we determine whether a model is better or worse? For regression problems, a good candidate is the [Pearson correlation coefficient](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html#pearsonr) (also known as Pearson's r; for a simple linear fit, its square equals the coefficient of determination $R^2$). Let's use this.
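
Pearson's r and $R^2$ are related but not interchangeable, which matters when reading the outputs later in this notebook. The sketch below uses hand-written helpers (`pearson_r` and `r2` are illustrative names, and the numbers are invented) to show predictions that follow the right trend but are shifted by a constant.

```python
import math

def pearson_r(x, y):
    """Pearson correlation: strength of the linear relationship between x and y."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical values: the predictions track the trend but are offset by +50
y_true = [10.0, 20.0, 30.0, 40.0]
y_pred = [60.0, 70.0, 80.0, 90.0]

r_value = pearson_r(y_true, y_pred)
r2_value = r2(y_true, y_pred)
print(r_value)   # 1.0
print(r2_value)  # -19.0
```

Pearson's r ignores offset and scale, so it rewards getting the ordering and trend right. $R^2$ as computed here penalizes every degree of error and drops below zero whenever the predictions are worse than always guessing the mean.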


In [20]:
# Load data
import pandas as pd

test_df = pd.read_csv('../data/test_df.csv')
train_df = pd.read_csv('../data/train_df.csv')

train_df.head()


|   | sequence | temperature |
|---|---|---|
| 0 | MKGIRARLTANFMIIIIITVTILEVLLIYTVRQNYYGSLEGSLTNQ... | 28.0 |
| 1 | MGGVGVLTLMVGVRVSPEPAVLGLLERYRDALNYSIRVLIESKTIS... | 77.333333 |
| 2 | MAKKKDTPGDGEFPGFSDTLQRTPKLEKPHYAGHRDRLKQRFRDAP... | 30.0 |
| 3 | MNFGDKVRYVRKKLSLSTEQLAKLLDVTQSYISHIENNRRLLGRDK... | 35.0 |
| 4 | MKNDINIKNKRAYFDYNLLDKYVAGIALLGTEIKAIRQGKANMTDA... | 18.0 |


Let's use an [XGBoost](https://xgboost.readthedocs.io/en/latest/index.html) model. XGBoost is a gradient-boosted decision tree library. It excels at tabular data and is pretty foolproof on its own (assuming your dataset is well prepared). It doesn't require a GPU and can handle thousands of columns in a dataset, which will become important in the next notebook.

Let's start with random inputs and attempt to predict the result. This will tell us what "the floor" is. It's quite common for a model to find a way to "cheat" the predictions. For example, in our case (based on the distributions of temperatures we explored in the previous notebook), the model could just predict `36` and it would give pretty good results. We avoided this outcome by balancing our dataset, but such patterns are always something to keep an eye on.

Training on random data will show what score such a "cheat" would get. Our goal is to meaningfully improve on it.
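
A complementary way to see the "cheat" is to score the constant prediction directly. The sketch below does that on a handful of invented, deliberately imbalanced temperatures; the helper names and the data are assumptions for illustration, not the notebook's dataset.

```python
def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination; 0 means 'no better than predicting the mean'."""
    mean_true = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - mse(y_true, y_pred) * len(y_true) / ss_tot

# Hypothetical, deliberately imbalanced temperatures (not the real dataset)
temps = [35.0, 36.0, 37.0, 36.0, 80.0, 36.0, 35.0, 37.0]
constant_pred = [36.0] * len(temps)  # the "cheat": always answer 36

within_one_degree = sum(
    abs(t - p) <= 1.0 for t, p in zip(temps, constant_pred)
) / len(temps)
baseline_r2 = r2(temps, constant_pred)

print(f"Share of predictions within 1 degree: {within_one_degree}")  # 0.875
print(f"R2 of the constant prediction: {baseline_r2:.3f}")  # negative
# Pearson correlation is undefined here: a constant prediction has zero variance.
```

On imbalanced data the constant looks accurate for most rows, yet it carries no information about any individual sequence, which is why variance-aware metrics such as $R^2$ and Pearson correlation expose it.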

In [21]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Create random features as input (same number of rows as training data)
n_features = 200  # arbitrary number of random features
random_features = np.random.normal(size=(train_df.shape[0], n_features))
random_feature_names = [f'random_{i}' for i in range(n_features)]

random_df = pd.DataFrame(random_features, columns=random_feature_names)
random_df["temperature"] = train_df["temperature"]

# Hold out the last 20% of rows; shuffle=False keeps the original row order
random_df_train, random_df_test = train_test_split(random_df, shuffle=False, test_size=0.2)


Now let's train the model on our random data. As you'll see, XGBoost is really fast to train.

In [22]:
from xgboost import XGBRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Specify model hyperparameters
model = XGBRegressor(
    random_state=42
)

X_train = random_df_train[random_feature_names]
y_train = random_df_train['temperature']
X_test = random_df_test[random_feature_names]
y_test = random_df_test['temperature']

# Train the model
model.fit(X_train, y_train)

# Make predictions and evaluate
train_pred = model.predict(X_train) 
test_pred = model.predict(X_test)

print("Train MSE:", mean_squared_error(y_train, train_pred))  
print("Test MSE:", mean_squared_error(y_test, test_pred))
print("Train R2:", r2_score(y_train, train_pred))
print("Test R2:", r2_score(y_test, test_pred))

# Calculate Pearson correlation
pearson_corr = np.corrcoef(y_test, test_pred)[0,1]
print(f"\nPearson correlation on the test set: {pearson_corr:.4f}")

Train MSE: 0.011074818505570464
Test MSE: 992.6919691817417
Train R2: 0.9999857180017291
Test R2: -0.15596125491738877

Pearson correlation on the test set: 0.0057


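Our correlation with random data on the held-out split is almost `0`, and the test $R^2$ is negative, even though the model fits the training split almost perfectly. That's a good sign: there is nothing to learn from noise, and the model doesn't score well by accident.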

## Train on some data

Next, let's go for the simplest approach to training: an integer representation of amino acids. This is similar to the original dataset representation, so in a way we are reversing our processing. The model is unlikely to perform well here either, but the exercise will show us whether there are any further issues with our dataset. For example, the model might learn that proteins starting with `M` have higher temperature stability, which doesn't make much sense on its own but could be a sign of an imbalanced dataset.

In [23]:
# Convert sequences to integer encoding
def sequence_to_integers(sequence, max_length=650):
    # Pad or truncate sequence to max_length
    sequence = sequence[:max_length].ljust(max_length, '-')
    # Convert to list of integers (using ord() for simple integer encoding)
    return [ord(aa) for aa in sequence]

# Apply encoding to train sequences
train_sequences = train_df['sequence'].apply(sequence_to_integers)
# Convert to DataFrame with columns aa_1, aa_2, etc
train_encoded = pd.DataFrame(train_sequences.tolist(), 
                           columns=[f'aa_{i+1}' for i in range(650)])
# Add temperature column back
train_encoded['temperature'] = train_df['temperature']

# Repeat for test set
test_sequences = test_df['sequence'].apply(sequence_to_integers)
test_encoded = pd.DataFrame(test_sequences.tolist(),
                          columns=[f'aa_{i+1}' for i in range(650)])
test_encoded['temperature'] = test_df['temperature']

# Get feature names (all columns except temperature)
feature_names = [col for col in train_encoded.columns if col != 'temperature']

train_encoded

Unnamed: 0,aa_1,aa_2,aa_3,aa_4,aa_5,aa_6,aa_7,aa_8,aa_9,aa_10,...,aa_642,aa_643,aa_644,aa_645,aa_646,aa_647,aa_648,aa_649,aa_650,temperature
0,77,75,71,73,82,65,82,76,84,65,...,45,45,45,45,45,45,45,45,45,28.000000
1,77,71,71,86,71,86,76,84,76,77,...,45,45,45,45,45,45,45,45,45,77.333333
2,77,65,75,75,75,68,84,80,71,68,...,45,45,45,45,45,45,45,45,45,30.000000
3,77,78,70,71,68,75,86,82,89,86,...,45,45,45,45,45,45,45,45,45,35.000000
4,77,75,78,68,73,78,73,75,78,75,...,45,45,45,45,45,45,45,45,45,18.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1515,77,73,72,75,82,82,82,69,82,73,...,45,45,45,45,45,45,45,45,45,75.000000
1516,77,83,84,76,73,70,71,72,75,78,...,45,45,45,45,45,45,45,45,45,30.000000
1517,77,75,76,73,73,71,86,76,77,84,...,45,45,45,45,45,45,45,45,45,9.500000
1518,77,83,69,80,73,80,65,80,65,84,...,45,45,45,45,45,45,45,45,45,45.000000
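
To see where the numbers in this table come from, the sketch below runs the same `sequence_to_integers` logic on a three-residue toy sequence. It also shows one possible alternative, an explicit alphabet lookup with `0` reserved for padding; `AMINO_ACIDS`, `AA_TO_INDEX` and `sequence_to_indices` are hypothetical names that do not appear in the notebook.

```python
def sequence_to_integers(sequence, max_length=650):
    # Same logic as the notebook cell: pad or truncate, then take code points
    sequence = sequence[:max_length].ljust(max_length, '-')
    return [ord(aa) for aa in sequence]

short = sequence_to_integers("MKG", max_length=5)
print(short)  # [77, 75, 71, 45, 45]  ('-' padding has code point 45)

# Hypothetical alternative: index into an explicit alphabet of the
# 20 standard amino acids, keeping 0 for padding.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_TO_INDEX = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}

def sequence_to_indices(sequence, max_length=650):
    trimmed = sequence[:max_length]
    # Simplification: unknown residues share the padding value 0
    encoded = [AA_TO_INDEX.get(aa, 0) for aa in trimmed]
    return encoded + [0] * (max_length - len(trimmed))

compact = sequence_to_indices("MKG", max_length=5)
print(compact)  # [11, 9, 6, 0, 0]
```

Either way, the integers are arbitrary labels: the ordering implies that `K` is "between" `G` and `M`, which has no biochemical meaning. That is one reason to expect little from this representation.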


In [24]:
# Split features and targets
X_train = train_encoded[feature_names]
y_train = train_encoded['temperature']
X_test = test_encoded[feature_names]
y_test = test_encoded['temperature']

model = XGBRegressor(
    random_state=42
)

# Train the model
model.fit(X_train, y_train)

# Make predictions and evaluate
train_pred = model.predict(X_train) 
test_pred = model.predict(X_test)

print("Train MSE:", mean_squared_error(y_train, train_pred))  
print("Test MSE:", mean_squared_error(y_test, test_pred))
print("Train R2:", r2_score(y_train, train_pred))
print("Test R2:", r2_score(y_test, test_pred))

# Calculate Pearson correlation
pearson_corr = np.corrcoef(y_test, test_pred)[0,1]
print(f"\nPearson correlation on the test set: {pearson_corr:.4f}")

Train MSE: 0.022066787655929985
Test MSE: 842.4654707183843
Train R2: 0.9999721433627133
Test R2: -0.07565621068268258

Pearson correlation on the test set: 0.1607


To illustrate the need for a train/test split, let's compare the Pearson correlation on the training and test sets of this somewhat nonsensical dataset.

In [25]:
print(f"Train Pearson correlation: {np.corrcoef(y_train, train_pred)[0,1]}")
print(f"Test Pearson correlation: {np.corrcoef(y_test, test_pred)[0,1]}")

Train Pearson correlation: 0.9999885866632156
Test Pearson correlation: 0.16068236013273213


As you can see, the model almost perfectly predicts the training set but completely fails on the test set.
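
The same gap can be reproduced without XGBoost. The sketch below swaps in a 1-nearest-neighbour "memorizer" (a stand-in chosen for simplicity, not the model used in this notebook) and trains it on purely random features and targets; all names and sizes are illustrative assumptions.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

def make_rows(n_rows, n_features):
    """Random features paired with random targets: nothing real to learn."""
    return [
        ([random.gauss(0, 1) for _ in range(n_features)], random.gauss(40, 20))
        for _ in range(n_rows)
    ]

def predict_1nn(train_rows, features):
    """Return the target of the closest training row (pure memorization)."""
    def squared_distance(row):
        return sum((a - b) ** 2 for a, b in zip(row[0], features))
    return min(train_rows, key=squared_distance)[1]

def mse(rows, train_rows):
    """Mean squared error of the memorizer on the given rows."""
    errors = [(target - predict_1nn(train_rows, feats)) ** 2 for feats, target in rows]
    return sum(errors) / len(errors)

train_rows = make_rows(80, 5)
test_rows = make_rows(20, 5)

train_mse = mse(train_rows, train_rows)  # each row finds itself: error is 0
test_mse = mse(test_rows, train_rows)    # unseen rows: error is large

print(f"Train MSE: {train_mse}")
print(f"Test MSE: {test_mse:.1f}")
```

A score computed on the training rows only measures how well the model can memorize. Only the held-out rows tell us whether it learned anything that transfers.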

Alright, raw amino acid integers do only marginally better than random (a test correlation of about 0.16 versus about 0.006). This is expected and good news. In the next part, we'll train an actual working model by feeding deep learning embeddings into XGBoost.