# Finetuning AptaNet for aptamer-protein binding prediction

## Overview
This notebook demonstrates how to finetune the `AptaNetClassifier` and `AptaNetRegressor` components using scikit-learn's hyperparameter search utilities, `GridSearchCV` and `RandomizedSearchCV`. It covers:

1. How to set up hyperparameter grids for AptaNet components
2. Using `GridSearchCV` for exhaustive hyperparameter search
3. Using `RandomizedSearchCV` for efficient search in larger spaces
4. Evaluating and selecting the best model

## Data Preparation
This notebook uses the same data setup as the main AptaNet tutorial:

* Aptamer sequences longer than 30 nucleotides
* Amino acid sequences from the 1GNH protein molecule
* Binary labels indicating binding (randomly generated placeholders in this notebook)

In [40]:
import warnings
warnings.filterwarnings("ignore", category=UserWarning)

In [41]:
import numpy as np
import torch
from pyaptamer.datasets import load_1gnh

In [42]:
aptamer_sequence = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

gnh = load_1gnh()
protein_sequence = gnh.to_df_seq()["sequence"].tolist()

# Build all combinations (aptamer, protein), duplicated to increase dataset size
X = [(a, p) for a in aptamer_sequence for p in protein_sequence] * 5

# Dummy binary labels for the pairs
y = torch.randint(0, 2, (len(X),), dtype=torch.float32)

print(f"Number of samples: {len(X)}")

Number of samples: 25
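
The printed count is worth unpacking: five aptamers are paired with the protein sequences and the list is then repeated five times, so 25 samples implies only five distinct pairs, assuming `load_1gnh()` yields a single protein sequence (which the count suggests). The sketch below reproduces that arithmetic without `pyaptamer`; `PLACEHOLDER_PROTEIN` is a hypothetical stand-in, not the real 1GNH sequence.

```python
from collections import Counter

aptamer_sequence = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

# Hypothetical stand-in for the single protein sequence (assumption, not real data)
protein_sequence = ["PLACEHOLDER_PROTEIN"]

# Same construction as in the notebook: all pairs, repeated five times
X = [(a, p) for a in aptamer_sequence for p in protein_sequence] * 5

# Count how often each distinct (aptamer, protein) pair occurs
pair_counts = Counter(X)

print(f"Total samples: {len(X)}")
print(f"Distinct pairs: {len(pair_counts)}")
print(f"Copies of each pair: {sorted(set(pair_counts.values()))}")
```

Since every copy receives its own random label and copies of the same pair can land on both sides of a cross-validation split, the scores later in this notebook are best read as a demonstration of the search API rather than as a measure of model quality.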


## Feature extraction
Before hyperparameter tuning, we convert the sequence pairs to feature vectors. This step is done once, up front, so that the features are not recomputed for every fold and candidate during cross-validation.

In [43]:
from pyaptamer.utils._aptanet_utils import pairs_to_features

# Convert sequence pairs to feature vectors
X_features = pairs_to_features(X)

print(f"Feature matrix shape: {X_features.shape}")

Feature matrix shape: (25, 690)


# 1. AptaNetClassifier

## Hyperparameter tuning with GridSearchCV
`GridSearchCV` evaluates every combination in the grid, so it is best suited to a small set of hyperparameters with only a few candidate values each.

In [44]:
from sklearn.model_selection import GridSearchCV
from pyaptamer.aptanet import AptaNetClassifier

In [45]:
# Define the hyperparameter grid
param_grid = {
    "hidden_dim": [64, 128],
    "n_hidden": [5, 7],
    "dropout": [0.2, 0.3],
    "lr": [0.0001, 0.0005],
}

# Create the classifier
clf = AptaNetClassifier(max_epochs=50, verbose=0)

# Set up GridSearchCV
grid_search = GridSearchCV(
    estimator=clf,
    param_grid=param_grid,
    cv=3,
    scoring="accuracy",
    n_jobs=1,  # set to -1 to use all available CPU cores
    verbose=1,
)

print(f"Total combinations to try: {np.prod([len(v) for v in param_grid.values()])}")

Total combinations to try: 16
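
To make the count concrete, the grid can be expanded by hand. The following is a minimal sketch using only `itertools.product`; it mirrors what an exhaustive search enumerates but is not scikit-learn's own implementation. The names `candidates`, `cv_folds`, and `total_fits` are introduced here for illustration.

```python
from itertools import product

# Same grid as defined above
param_grid = {
    "hidden_dim": [64, 128],
    "n_hidden": [5, 7],
    "dropout": [0.2, 0.3],
    "lr": [0.0001, 0.0005],
}
cv_folds = 3  # matches cv=3 in the search above

# Cartesian product of all value lists, one dict per candidate
names = list(param_grid)
candidates = [
    dict(zip(names, values)) for values in product(*param_grid.values())
]

# Each candidate is trained once per fold
total_fits = len(candidates) * cv_folds

print(f"Candidates: {len(candidates)}")
print(f"Cross-validation fits: {total_fits}")
print(f"First candidate: {candidates[0]}")
```

Adding one more value to any list multiplies the number of candidates, which is why the cost of a grid search grows quickly. With scikit-learn's default `refit=True`, one additional fit on the full data follows the cross-validation fits.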


In [46]:
# Run grid search
grid_search.fit(X_features, y)

print(f"\nBest parameters: {grid_search.best_params_}")
print(f"Best cross-validation score: {grid_search.best_score_:.4f}")

Fitting 3 folds for each of 16 candidates, totalling 48 fits



Best parameters: {'dropout': 0.2, 'hidden_dim': 64, 'lr': 0.0005, 'n_hidden': 7}
Best cross-validation score: 0.6019


## Hyperparameter tuning with RandomizedSearchCV
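`RandomizedSearchCV` samples a fixed number of candidates (`n_iter`) from the given lists and distributions instead of trying every combination, which makes it better suited to larger search spaces and continuous hyperparameters.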

In [47]:
from scipy.stats import loguniform, randint, uniform
from sklearn.model_selection import RandomizedSearchCV

In [48]:
# Define parameter distributions for random search
param_distributions = {
    "hidden_dim": [32, 64, 128, 256],
    "n_hidden": randint(3, 10),
    "dropout": uniform(0.1, 0.4),  # Uniform between 0.1 and 0.5
    "lr": loguniform(1e-5, 1e-3),  # Log-uniform for learning rate
    "max_epochs": [50, 100, 150],
}

# Create the classifier
clf = AptaNetClassifier(verbose=0)

# Set up RandomizedSearchCV
random_search = RandomizedSearchCV(
    estimator=clf,
    param_distributions=param_distributions,
    n_iter=10,  # Number of random combinations to try
    cv=3,
    scoring="accuracy",
    n_jobs=1,
    verbose=1,
    random_state=42,
)
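
The distribution arguments are easy to misread: `scipy.stats.uniform(loc, scale)` covers `[loc, loc + scale]`, `randint(low, high)` excludes `high`, and `loguniform(a, b)` is uniform in the logarithm of the value. The sketch below imitates one draw per parameter with the standard library only. It is an illustration of the ranges, not the sampler scikit-learn uses, so the values differ from those of the search above; `sample_candidate` is a hypothetical helper.

```python
import math
import random


def sample_candidate(rng):
    """Draw one hyperparameter set from ranges matching param_distributions."""
    return {
        # A list means: pick one entry uniformly at random
        "hidden_dim": rng.choice([32, 64, 128, 256]),
        # Integers 3..9; the upper bound 10 is excluded, as in scipy's randint
        "n_hidden": rng.randrange(3, 10),
        # loc + scale * U[0, 1) gives values between 0.1 and 0.5
        "dropout": 0.1 + 0.4 * rng.random(),
        # Uniform in log space between 1e-5 and 1e-3
        "lr": math.exp(rng.uniform(math.log(1e-5), math.log(1e-3))),
        "max_epochs": rng.choice([50, 100, 150]),
    }


rng = random.Random(42)  # fixed seed so the draws are repeatable
candidates = [sample_candidate(rng) for _ in range(10)]  # n_iter=10

for candidate in candidates[:3]:
    print(candidate)
```

Sampling the learning rate in log space spreads the draws evenly across orders of magnitude, so values near `1e-5` are as likely to be tried as values near `1e-3`.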

In [49]:
# Run the random search
random_search.fit(X_features, y)

print(f"\nBest parameters: {random_search.best_params_}")
print(f"Best cross-validation score: {random_search.best_score_:.4f}")

Fitting 3 folds for each of 10 candidates, totalling 30 fits

Best parameters: {'dropout': np.float64(0.30990986410335564), 'hidden_dim': 128, 'lr': np.float64(1.9010245319870364e-05), 'max_epochs': 150, 'n_hidden': 8}
Best cross-validation score: 0.6019


## Evaluating the best model
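After hyperparameter tuning, use the best estimator for final predictions. With scikit-learn's default `refit=True`, `best_estimator_` has already been refit on all the data passed to `fit`. The cell below scores it on that same data, so the result is a training accuracy and not an estimate of generalization.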

In [50]:
from sklearn.metrics import accuracy_score, classification_report

In [51]:
# Get the best estimator from random search
best_clf = random_search.best_estimator_

# Predict on the same data that was used for the search (training data)
y_pred = best_clf.predict(X_features)

print(f"Training Accuracy: {accuracy_score(y, y_pred):.4f}")
print("\nClassification Report:")
print(classification_report(y, y_pred))

Training Accuracy: 0.6000

Classification Report:
              precision    recall  f1-score   support

         0.0       0.60      1.00      0.75        15
         1.0       0.00      0.00      0.00        10

    accuracy                           0.60        25
   macro avg       0.30      0.50      0.38        25
weighted avg       0.36      0.60      0.45        25
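
The report above can be checked by hand. A recall of 1.00 for class 0 and 0.00 for class 1 means every sample was predicted as class 0, so the accuracy equals the share of the majority class. The sketch below rebuilds that case from the class supports in the report (15 and 10); the label lists are reconstructions for illustration, not the notebook's actual `y` and `y_pred` ordering.

```python
from collections import Counter

# Class supports taken from the report: 15 samples of class 0, 10 of class 1
y_true = [0] * 15 + [1] * 10

# Recall 1.00 for class 0 and 0.00 for class 1 implies all-zero predictions
y_pred = [0] * len(y_true)

# Accuracy of the predictions
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Accuracy obtained by always predicting the most frequent class
majority_class, majority_count = Counter(y_true).most_common(1)[0]
baseline_accuracy = majority_count / len(y_true)

print(f"Accuracy: {accuracy:.4f}")
print(f"Majority-class baseline: {baseline_accuracy:.4f}")
```

Comparing a tuned model against this baseline is a quick way to tell whether it has learned anything beyond the class balance. With random labels, as in this notebook, matching the baseline is the expected outcome.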



## Using the full AptaNetPipeline with tuned parameters
Once the best hyperparameters are found, we can use them with the full pipeline.

In [52]:
from pyaptamer.aptanet import AptaNetClassifier, AptaNetPipeline

# Create a classifier with the best parameters
best_params = random_search.best_params_
tuned_clf = AptaNetClassifier(**best_params, verbose=0)

# Use it in the full pipeline (which handles feature extraction)
pipeline = AptaNetPipeline(estimator=tuned_clf)

# Fit on the original sequence pairs (not pre-extracted features)
pipeline.fit(X, y)

# Predict
y_pred = pipeline.predict(X)
print(f"Pipeline Training Accuracy: {accuracy_score(y, y_pred):.4f}")

Pipeline Training Accuracy: 0.4000


# 2. AptaNetRegressor

The same finetuning approach works for `AptaNetRegressor`. Only two changes are needed:

1. Replace `AptaNetClassifier()` with `AptaNetRegressor()`.
2. Replace `scoring="accuracy"` with `scoring="r2"` (or `"neg_mean_squared_error"`).

The hyperparameter grid stays the same (`hidden_dim`, `n_hidden`, `dropout`, `lr`, etc.).
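
The two regression scoring options follow one convention: scikit-learn's search utilities always maximize the score, which is why the mean squared error is used in negated form. The sketch below computes both quantities with plain Python so the sign convention is visible. The target and prediction values are made-up toy numbers, and the helper functions are written here for illustration rather than imported from scikit-learn.

```python
def mean_squared_error(y_true, y_pred):
    """Average of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - (residual SS / total SS)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values for illustration only (not binding data)
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

# Negated so that "higher is better" holds, as with scoring="neg_mean_squared_error"
neg_mse = -mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)

print(f"neg_mean_squared_error: {neg_mse:.3f}")
print(f"r2: {r2:.3f}")
```

With either option, `best_score_` is the highest value found, so a `neg_mean_squared_error` closer to zero indicates a better candidate.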