# Binding prediction using AptaNet
Step-by-step guide to using AptaNet for binary aptamer–protein binding prediction.

## Overview
- **pairs_to_features**: converts `(aptamer_seq, protein_seq)` pairs into feature vectors using k-mer + PSeAAC.
- **FeatureSelector**: a Random Forest-based transformer that selects important features.
- **SkorchAptaNet**: a PyTorch MLP wrapped in Skorch for binary classification with a configurable threshold.
- **load_1gnh_structure**: helper to load the 1GNH protein structure from a PDB file.
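
To make the k-mer part of `pairs_to_features` more concrete, here is a minimal sketch of k-mer frequency counting for an RNA aptamer. The helper `kmer_frequencies` is hypothetical and is not pyaptamer's implementation; the library's actual feature layout and normalization may differ.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return the normalized frequency of every possible k-mer in seq."""
    # Enumerate all k-mers so the vector has a fixed length of len(alphabet) ** k
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Count overlapping windows of length k
    n_windows = len(seq) - k + 1
    counts = Counter(seq[i : i + k] for i in range(n_windows))
    total = max(n_windows, 1)
    return [counts[kmer] / total for kmer in all_kmers]


vector = kmer_frequencies("GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA", k=2)
print(len(vector))  # 16 possible 2-mers over A, C, G, U
```

A fixed-length vector like this is what lets sequences of different lengths feed the same downstream model.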

## Imports
Import the core functions and classes.

In [42]:
# General imports
import torch  # noqa: I001
import torch.optim as optim

# Data imports
from pyaptamer.datasets.loader import load_1gnh_structure
from pyaptamer.utils.struct_to_aaseq import struct_to_aaseq

# To use the pre-defined AptaNet pipeline, import it directly
from pyaptamer.aptanet.pipeline import pipe

# If you want to build your own aptamer pipeline, you should use the following imports
from sklearn.pipeline import Pipeline
from pyaptamer.aptanet import FeatureSelector, SkorchAptaNet
from pyaptamer.utils._aptanet_utils import pairs_to_features

## Data preparation
To train the skorch network, this notebook uses:
* As `X`:
    * 5 random aptamer sequences of length > 30 (to satisfy the default lambda value of 30 set in the PSeAAC algorithm).
    * Amino-acid sequences extracted from the 1GNH protein structure.
* As `y`:
    * One random binary label (0/1) per `(aptamer_seq, protein_seq)` pair, used as dummy data.

In [43]:
aptamer_sequences = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

gnh = load_1gnh_structure()
protein_sequence = struct_to_aaseq(gnh)

unique_proteins = list(set(protein_sequence))
unique_aptamers = list(set(aptamer_sequences))

# Build all (aptamer, protein) combinations
X = [(a, p) for a in unique_aptamers for p in unique_proteins]

# Dummy binary labels for the pairs
y = torch.randint(0, 2, (len(X),))
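
The length requirement and the pairing step can be checked with the standard library alone. The sketch below assumes the default lambda of 30 stated above; `PSEAAC_LAMBDA` and the placeholder protein string are hypothetical and are not taken from pyaptamer or from 1GNH.

```python
from itertools import product

PSEAAC_LAMBDA = 30  # default lambda value stated in this notebook

aptamers = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
]
# Hypothetical placeholder protein, NOT the real 1GNH sequence
proteins = ["MKTAYIAKQR" * 4]

# Collect any aptamer that is not strictly longer than lambda
too_short = [s for s in aptamers if len(s) <= PSEAAC_LAMBDA]
if too_short:
    raise ValueError(f"Sequences must be longer than {PSEAAC_LAMBDA}: {too_short}")

# Cartesian product in (aptamer, protein) order, matching the list comprehension above
pairs = list(product(aptamers, proteins))
print(len(pairs))  # one pair per aptamer/protein combination
```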

## Build your own pipeline
To build a scikit-learn pipeline, follow these steps:
1. Convert the input to the `(aptamer_sequence, protein_sequence)` format.
    * OPTIONAL: As mentioned in the paper, perform under-sampling using the Neighborhood Cleaning Rule to balance the classes.
2. Compute the k-mer and PSeAAC feature vectors for the converted input (using `pairs_to_features`).
3. Select the most important features from the feature vector (using `FeatureSelector`).
4. Define the skorch neural network (using `SkorchAptaNet`).
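
The steps above can be sketched with scikit-learn components only. In this sketch, `toy_features` is a hypothetical stand-in for `pairs_to_features`, `SelectFromModel` with a random forest stands in for `FeatureSelector`, and `LogisticRegression` stands in for `SkorchAptaNet`. None of them reproduce the AptaNet implementations, and the sequences and labels are made-up toy data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def toy_features(pairs):
    """Hypothetical featurizer: nucleotide fractions plus protein length."""
    rows = []
    for aptamer, protein in pairs:
        fractions = [aptamer.count(base) / len(aptamer) for base in "ACGU"]
        rows.append(fractions + [len(protein)])
    return np.asarray(rows, dtype=float)


# Toy (aptamer, protein) pairs with made-up labels
pairs = [
    ("GGGAGGACGA", "MKTAYIAK"),
    ("AAGCGUCGGA", "MKTAYIAK"),
    ("CGGUAUCGAG", "MKTAYIAK"),
    ("UAGCUAGCGA", "MKTAYIAK"),
    ("GCUAGGACGA", "MKTAYIAK"),
    ("UUUUAACCGG", "MKTAYIAK"),
]
labels = np.array([0, 1, 0, 1, 1, 0])

toy_pipeline = Pipeline(
    [
        # A plain function has no fit/transform, so wrap it in a FunctionTransformer
        ("features", FunctionTransformer(toy_features, validate=False)),
        # Keep the 3 features the random forest ranks as most important
        (
            "selector",
            SelectFromModel(
                RandomForestClassifier(n_estimators=20, random_state=0),
                threshold=-np.inf,
                max_features=3,
            ),
        ),
        ("clf", LogisticRegression()),
    ]
)

toy_pipeline.fit(pairs, labels)
preds = toy_pipeline.predict(pairs)
print(preds.shape)
```

Wrapping the featurizer in `FunctionTransformer` is what lets the pipeline accept raw sequence pairs at `fit` and `predict` time.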

In [44]:
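# OPTIONAL: If you want to use the Neighborhood Cleaning Rule for under-sampling
# %pip install imbalanced-learn
# from imblearn.under_sampling import NeighbourhoodCleaningRule

# Note: NCR is nearest-neighbour based and needs numeric feature vectors,
# so resample the output of pairs_to_features rather than the raw sequence pairs.
# The resampled features would then go straight to the selector and classifier steps.
# ncr = NeighbourhoodCleaningRule()
# X_features = pairs_to_features(X)
# X_resampled, y_resampled = ncr.fit_resample(X_features, y)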

In [45]:
from sklearn.preprocessing import FunctionTransformer

# Define the classifier
net = SkorchAptaNet(
    module__hidden_dim=128,
    module__n_hidden=7,
    module__dropout=0.3,
    max_epochs=200,
    lr=1.4e-4,
    batch_size=310,
    optimizer=optim.RMSprop,
    device="cuda" if torch.cuda.is_available() else "cpu",
    threshold=0.5,
    verbose=1,
)

# Option 1: build a new pipeline
# Pipeline steps must be estimators, so wrap the plain function in a FunctionTransformer
pipeline = Pipeline(
    [
        ("features", FunctionTransformer(pairs_to_features, validate=False)),
        ("selector", FeatureSelector()),
        ("clf", net),
    ]
)

# Option 2: import the pre-defined pipeline (which does the same)
pipeline = pipe
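
The `threshold=0.5` argument controls how predicted probabilities become binary labels. The sketch below shows the general idea with a hypothetical helper, `apply_threshold`. It assumes a probability at or above the threshold maps to class 1; whether `SkorchAptaNet` uses `>=` or `>` internally is not confirmed here.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (assumed >= rule)."""
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.10, 0.45, 0.50, 0.80]
print(apply_threshold(probs, threshold=0.5))  # [0, 0, 1, 1]
print(apply_threshold(probs, threshold=0.7))  # [0, 0, 0, 1]
```

Raising the threshold makes positive predictions rarer, trading recall for precision.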

## Model Training and Prediction

With the AptaNet pipeline defined, we train it on the aptamer-protein pairs and then use it to predict labels for the same data.

In [46]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

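Output (error raised by the cell above):

ValueError: n_splits=5 cannot be greater than the number of members in each class.

This message comes from scikit-learn's stratified 5-fold splitting, which needs at least 5 samples in every class. With this small, randomly labelled dummy dataset, one class evidently has fewer than 5 members; using more pairs per class avoids the error.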

In [None]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))
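
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below recomputes it by hand with a hypothetical helper, `accuracy`, on made-up labels; with the random dummy labels used in this notebook, the resulting number carries no information about real binding.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Made-up labels: 4 of the 5 predictions agree with the truth
print(accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]))  # 0.8
```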