# Feature engineering

We use the following features for the Random Forest model:

+ **Amino acid composition**: The percentage of each standard amino acid in the protein sequence.
+ **Dipeptide frequencies**: The frequency of each dipeptide (pair of amino acids) in the protein sequence.
+ **GRAVY**: The grand average of hydropathy (GRAVY) of the protein sequence, i.e. the mean Kyte-Doolittle hydropathy value over all residues. Higher values indicate a more hydrophobic protein.
+ **pI**: The isoelectric point of the protein sequence, which is the pH at which the protein has no net charge.
+ **Sequence length**: The length of the protein sequence.
+ **Original length**: The length of the original protein sequence (before cleaning).
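
To make the list above concrete, here is a minimal sketch of how most of these features could be computed for a single sequence using only the standard library. It is an illustration, not the implementation behind `build_feature_matrix`: the helper name `sequence_features` and the column prefixes `aa_` and `dp_` are hypothetical, and pI is left out because it needs a pKa table and an iterative charge calculation. The hydropathy values are the standard Kyte-Doolittle scale.

```python
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def sequence_features(seq: str) -> dict:
    """Hypothetical helper: composition, dipeptide, GRAVY and length features."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")

    feats = {"length": float(n)}

    # Amino acid composition as a percentage of the sequence length.
    counts = Counter(seq)
    for aa in AMINO_ACIDS:
        feats[f"aa_{aa}"] = 100.0 * counts[aa] / n

    # Dipeptide frequencies over the n - 1 overlapping residue pairs.
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    n_pairs = max(n - 1, 1)
    for a, b in product(AMINO_ACIDS, repeat=2):
        feats[f"dp_{a}{b}"] = pairs[a + b] / n_pairs

    # GRAVY: mean hydropathy over the standard residues that have a score.
    scored = [KYTE_DOOLITTLE[aa] for aa in seq if aa in KYTE_DOOLITTLE]
    feats["gravy"] = sum(scored) / len(scored) if scored else 0.0
    return feats


example = sequence_features("AAGG")
print(len(example))  # 1 length + 20 composition + 400 dipeptide + 1 GRAVY = 422
```

Under these assumptions each sequence yields 422 numeric values before pI and the original length are added. The actual column count of the feature matrix depends on `build_feature_matrix`.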

In [None]:
import os

import pandas as pd
from sklearn.model_selection import train_test_split

from mapexploc.features import build_feature_matrix

In [None]:
INPUT_ANN = "data/processed/annotations.csv"
INPUT_FASTA = "data/processed/nonredundant.fasta"
OUTPUT_FEAT = "data/processed/features.csv"

OUTPUT_X_TRAIN = "data/processed/X/train.csv"
OUTPUT_X_TMP   = "data/processed/X/tmp.csv"
OUTPUT_X_VAL   = "data/processed/X/val.csv"
OUTPUT_X_TEST  = "data/processed/X/test.csv"

OUTPUT_Y_TRAIN = "data/processed/Y/train.csv"
OUTPUT_Y_TMP   = "data/processed/Y/tmp.csv"
OUTPUT_Y_VAL   = "data/processed/Y/val.csv"
OUTPUT_Y_TEST  = "data/processed/Y/test.csv"

In [None]:
os.makedirs(os.path.dirname(OUTPUT_X_TRAIN), exist_ok=True)
os.makedirs(os.path.dirname(OUTPUT_Y_TRAIN), exist_ok=True)

In [None]:
df = build_feature_matrix(INPUT_FASTA, INPUT_ANN)
df.to_csv(OUTPUT_FEAT, index=False)
print(f"Wrote features to {OUTPUT_FEAT}")
print(f"Feature matrix shape: {df.shape}")
print(f"Features per sequence: {df.shape[1] - 2}")

We stratify only the first split (70% train, 30% held out) so that the training set preserves the overall class balance. The held-out data is then divided evenly into validation and test sets with a plain random split (no `stratify`), giving roughly 15% of the data to each.

**Why not stratify the held-out data?** Even after removing global singletons, some classes can still end up with only one member in the temporary pool. scikit-learn's `train_test_split` raises a `ValueError` when any class passed to `stratify` has fewer than 2 samples, so a stratified split of that pool would fail.
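
One quick way to check whether this applies to a given pool is to count the members of each class and list those with fewer than 2. The sketch below uses only the standard library. The helper name `classes_blocking_stratify` and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def classes_blocking_stratify(labels, min_count=2):
    """Return the classes that have fewer than `min_count` members."""
    counts = Counter(labels)
    return sorted(cls for cls, k in counts.items() if k < min_count)


# Toy held-out pool (made-up labels): one class has a single member.
y_tmp_example = ["cytoplasm"] * 4 + ["nucleus"] * 3 + ["peroxisome"]
print(classes_blocking_stratify(y_tmp_example))  # ['peroxisome']
```

In the notebook, the same check could be run as `classes_blocking_stratify(y_tmp)` after the first split. An empty list would mean a stratified second split is possible.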

In [None]:
df = pd.read_csv(OUTPUT_FEAT)
y = df.pop("localization")

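# First split: 70% train, 30% held out, stratified on the localization label.
X_train, X_tmp, y_train, y_tmp = train_test_split(
    df, y, test_size=0.30, stratify=y, random_state=42
)

# Second split: halve the held-out pool into validation and test sets
# (about 15% of the data each), without stratification.
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, random_state=42
)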

print("Train dist:\n", y_train.value_counts(normalize=True), "\n")
print("Val dist (approx):\n", y_val.value_counts(normalize=True), "\n")
print("Test dist (approx):\n", y_test.value_counts(normalize=True))
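
Because the validation and test sets come from an unstratified split, their class proportions can drift from the training set, and a rare class can be missing from one of them entirely. The sketch below shows one way to quantify that drift with the standard library. The helper name `proportion_gap` and the toy labels are hypothetical.

```python
from collections import Counter


def proportion_gap(train_labels, other_labels):
    """Per-class proportion in `other_labels` minus proportion in `train_labels`.

    Classes missing from one side are treated as having proportion 0 there.
    """
    train_counts = Counter(train_labels)
    other_counts = Counter(other_labels)
    n_train = sum(train_counts.values())
    n_other = sum(other_counts.values())
    gaps = {}
    for cls in sorted(set(train_counts) | set(other_counts)):
        p_train = train_counts[cls] / n_train if n_train else 0.0
        p_other = other_counts[cls] / n_other if n_other else 0.0
        gaps[cls] = p_other - p_train
    return gaps


# Toy example (made-up labels): class "b" is absent from the smaller split.
toy_train = ["a"] * 6 + ["b"] * 4
toy_val = ["a"] * 3
for cls, gap in proportion_gap(toy_train, toy_val).items():
    print(f"{cls}: {gap:+.2f}")
```

Applied to the notebook's variables, `proportion_gap(y_train, y_val)` and `proportion_gap(y_train, y_test)` would show which classes are over- or under-represented. A gap equal to minus the training proportion means the class does not appear in that split at all.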

In [None]:
X_train.to_csv(OUTPUT_X_TRAIN, index=False)
X_tmp.to_csv(OUTPUT_X_TMP, index=False)
X_val.to_csv(OUTPUT_X_VAL, index=False)
X_test.to_csv(OUTPUT_X_TEST, index=False)

y_train.to_csv(OUTPUT_Y_TRAIN, index=False)
y_tmp.to_csv(OUTPUT_Y_TMP, index=False)
y_val.to_csv(OUTPUT_Y_VAL, index=False)
y_test.to_csv(OUTPUT_Y_TEST, index=False)

print("Split data into train/val/test sets:")
print(f"X_train: {X_train.shape}, y_train: {y_train.shape}")
print(f"X_val: {X_val.shape}, y_val: {y_val.shape}")
print(f"X_test: {X_test.shape}, y_test: {y_test.shape}")
print(f"Data written to {OUTPUT_X_TRAIN}, {OUTPUT_Y_TRAIN}, etc.")