# Benchmarking aptamer evaluation algorithms
Step-by-step guide for using `AptaNet` for benchmarking.

## Overview
This notebook introduces the `Benchmarking` class, a utility for systematically comparing machine learning estimators on a given dataset using cross-validation. It evaluates each estimator on one or more metrics and reports train and test scores in a single table.

The output is a summary table that makes it easy to compare different models and metrics at a glance.
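As a rough mental model of what such a comparison involves, the sketch below builds a similar train/test table with plain scikit-learn and pandas. It is not the `Benchmarking` implementation: the synthetic data, the `DummyClassifier`, and the names `X_demo`, `y_demo`, and `summary` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, make_scorer
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (hypothetical, not the aptamer dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(40, 3))
y_demo = np.array([0, 1] * 20)

estimators = [DummyClassifier(strategy="most_frequent")]
metrics = [accuracy_score]
cv = KFold(n_splits=4, shuffle=True, random_state=0)

keys, records = [], []
for est in estimators:
    # One scorer per metric, keyed by the metric function's name
    scoring = {m.__name__: make_scorer(m) for m in metrics}
    scores = cross_validate(
        est, X_demo, y_demo, cv=cv, scoring=scoring, return_train_score=True
    )
    for name in scoring:
        keys.append((type(est).__name__, name))
        # Average each score over the folds
        records.append(
            {
                "train": scores[f"train_{name}"].mean(),
                "test": scores[f"test_{name}"].mean(),
            }
        )

# One row per (estimator, metric) pair, one column per data partition
index = pd.MultiIndex.from_tuples(keys, names=["estimator", "metric"])
summary = pd.DataFrame(records, index=index)
print(summary)
```

Averaging over folds is one reasonable aggregation choice for this sketch. How `Benchmarking` aggregates fold scores is not stated in this notebook.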

## Data preparation
To train the `AptaNetPipeline`, this notebook uses the same dataset that was used to train the `AptaTrans` algorithm. The dataset can be found in `pyaptamer/datasets/data/train_li2014`.

In [1]:
# Data imports
import numpy as np

from pyaptamer.datasets import load_csv_dataset


In [4]:
# Load dataset (returns DataFrames)
X_raw, y_raw = load_csv_dataset("train_li2014", "label", return_X_y=True)

# Build combinations in the form of (aptamer, protein)
X = list(zip(X_raw.iloc[:, 0], X_raw.iloc[:, 1], strict=False))[:100]

# Binary labels for the pairs, flattened to 1-D as scikit-learn expects
y = np.where(y_raw == "positive", 1, 0).ravel()[:100]
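The two preparation steps can be tried on a toy example. The sketch below uses a small hypothetical DataFrame (the column names and sequences are invented) to show how the `(aptamer, protein)` pairs are built and how string labels become 0/1. It also shows that `np.where` applied to a single-column DataFrame returns a column vector, which can be flattened with `ravel()` when a 1-D label array is needed.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for the real dataset
X_toy = pd.DataFrame(
    {"aptamer": ["ACGU", "GGCA", "UUAC"], "protein": ["MKT", "MLS", "MAA"]}
)
y_toy = pd.DataFrame({"label": ["positive", "negative", "positive"]})

# Pair the first column (aptamer) with the second column (protein)
pairs = list(zip(X_toy.iloc[:, 0], X_toy.iloc[:, 1]))

# Encode "positive" as 1 and everything else as 0
labels_2d = np.where(y_toy == "positive", 1, 0)  # shape (3, 1)
labels = labels_2d.ravel()                       # shape (3,)

print(pairs)
print(labels_2d.shape, labels.shape)
```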

## Different workflows
`Benchmarking` supports two main workflows, which differ in the `cv` (cross-validation) strategy you pass to the experiment:
1. Using normal k-fold cross-validation
2. Using `PredefinedSplit` to create a fixed train/test split

### 1. Using normal k-fold cross-validation for benchmarking

In [5]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold

from pyaptamer.aptanet import AptaNetPipeline
from pyaptamer.benchmarking import Benchmarking

# Example estimator
clf = AptaNetPipeline(k=4)

# Define a 5-fold CV strategy
cv = KFold(n_splits=5, shuffle=True, random_state=42)

# Run benchmarking with CV
bench = Benchmarking(
    estimators=[clf],
    metrics=[accuracy_score],
    X=X,
    y=y,
    cv=cv,
)
results_cv = bench.run()
print(results_cv)


                                train  test
estimator       metric                     
AptaNetPipeline accuracy_score    1.0   1.0
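For intuition about the `KFold` strategy used above, the following sketch (independent of `pyaptamer`) splits 100 placeholder samples with the same settings and records the size of each fold. The array of zeros is only a stand-in for the real data, since `KFold` needs nothing but the number of samples.

```python
import numpy as np
from sklearn.model_selection import KFold

n_samples = 100  # same number of pairs as kept in the data preparation step
cv = KFold(n_splits=5, shuffle=True, random_state=42)

fold_sizes = []
seen_test = []
for train_idx, test_idx in cv.split(np.zeros(n_samples)):
    # Each split trains on 4 folds and tests on the remaining one
    fold_sizes.append((len(train_idx), len(test_idx)))
    seen_test.extend(test_idx.tolist())

print(fold_sizes)
# Every sample lands in exactly one test fold
print(sorted(seen_test) == list(range(n_samples)))
```

With 100 samples and 5 splits, each model is fitted on 80 pairs and evaluated on the remaining 20.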


### 2. Using PredefinedSplit for benchmarking with a fixed train/test split

In [6]:
from sklearn.model_selection import PredefinedSplit

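# Define a custom train/test split.
# In PredefinedSplit, -1 keeps a sample in the training set for every split,
# while 0 assigns it to test fold 0. Here, the last 10 samples form the test set.
test_fold = np.full(len(y), -1)
test_fold[-10:] = 0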
cv = PredefinedSplit(test_fold)

# Run benchmarking with fixed split
bench_fixed = Benchmarking(
    estimators=[clf],
    metrics=[accuracy_score],
    X=X,
    y=y,
    cv=cv,
)
results_fixed = bench_fixed.run()
print(results_fixed)


                                train  test
estimator       metric                     
AptaNetPipeline accuracy_score    1.0   1.0
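To check which samples end up in each partition, the `test_fold` array can be inspected on its own. The sketch below rebuilds it for 100 samples (an assumption matching the subset used in this notebook) and lists the single split that `PredefinedSplit` produces.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_samples = 100  # assumed to match len(y) in this notebook

test_fold = np.full(n_samples, -1)  # -1: never used for testing
test_fold[-10:] = 0                 # 0: belongs to the only test fold

ps = PredefinedSplit(test_fold)
n_splits = ps.get_n_splits()
train_idx, test_idx = next(ps.split())

print(n_splits)
print(len(train_idx), len(test_idx))
print(test_idx)
```

Because only one fold index (0) is present, the experiment runs a single fit on the first 90 pairs and a single evaluation on the last 10, rather than averaging over several folds.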
