# Binding prediction using AptaNet
Step-by-step guide to using AptaNet for binary aptamer–protein binding prediction.

## Overview
- **pairs_to_features**: converts `(aptamer_seq, protein_seq)` pairs into feature vectors using k-mer + PSeAAC.
- **FeatureSelector**: a Random Forest-based transformer that selects important features.
- **SkorchAptaNet**: a PyTorch MLP wrapped in Skorch for binary classification with a configurable threshold.
- **load_1gnh_structure**: helper to load the 1GNH protein structure from a PDB file.
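
To make the k-mer part of the feature vector concrete, here is a minimal sketch of normalized k-mer counting for an RNA aptamer. The function name `kmer_frequencies`, the `ACGU` alphabet, and the normalization by total count are assumptions made for illustration; this is not the code inside `pairs_to_features`, which may order or scale its features differently.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return normalized k-mer counts for seq in a fixed k-mer order.

    Hypothetical helper for illustration only.
    """
    # Enumerate every possible k-mer so the output length is always len(alphabet) ** k
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)

    # Slide a window of width k over the sequence
    for i in range(len(seq) - k + 1):
        window = seq[i : i + k]
        if window in counts:  # ignore windows containing unexpected characters
            counts[window] += 1

    total = sum(counts.values())
    return [counts[km] / total if total else 0.0 for km in kmers]


features = kmer_frequencies("GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA", k=2)
print(len(features))  # 16 entries, one per dinucleotide
```

Using a fixed enumeration of all k-mers keeps the vector length identical for every sequence, which is what a downstream scikit-learn estimator needs.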

## Imports
Import the core functions and classes.

In [14]:
# General imports
import torch  # noqa: I001
import torch.optim as optim

# Data imports
from pyaptamer.datasets.loader import load_1gnh_structure
from pyaptamer.utils.struct_to_aaseq import struct_to_aaseq

# If you want to use the aptamer pipeline, you should use the following imports

# If you want to build your own aptamer pipeline, use the following imports
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from pyaptamer.aptanet import FeatureSelector, SkorchAptaNet
from pyaptamer.utils._aptanet_utils import pairs_to_features

## Data preparation
To train the skorch network, the notebook uses:
* As `X`:
    * 5 random aptamer sequences, each longer than 30 nucleotides (to satisfy the default lambda value of 30 set in the PSeAAC algorithm).
    * Amino-acid sequences extracted from the 1GNH protein structure.
* As `y`:
    * One random binary label (0/1) per `(aptamer_seq, protein_seq)` pair, as dummy data.
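
The relationship between sequences, pairs, and labels can be sketched with the standard library alone. The protein strings below are placeholders rather than chains from 1GNH, and `MIN_LENGTH` is an assumed constant that mirrors the lambda value of 30 mentioned above.

```python
import random
from itertools import product

MIN_LENGTH = 30  # assumption: mirrors the default PSeAAC lambda described above

aptamers = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
]
# Placeholder protein chains (hypothetical, not extracted from 1GNH)
proteins = ["MKT" * 12, "GAVLI" * 8]

# Reject sequences that are too short before building any pairs
for seq in aptamers + proteins:
    if len(seq) <= MIN_LENGTH:
        raise ValueError(f"sequence of length {len(seq)} is not longer than {MIN_LENGTH}")

# Every aptamer is paired with every protein chain
pairs = list(product(aptamers, proteins))

# One dummy 0/1 label per pair; the seed only makes the sketch repeatable
rng = random.Random(0)
labels = [rng.randint(0, 1) for _ in pairs]

print(len(pairs), len(labels))  # 4 4
```

The number of labels always equals the number of aptamers times the number of protein chains, which is the shape the pipeline expects for `X` and `y`.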

In [15]:
aptamer_sequences = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

gnh = load_1gnh_structure()
protein_sequence = struct_to_aaseq(gnh)

unique_proteins = list(set(protein_sequence))
unique_aptamers = list(set(aptamer_sequences))

# Build all combinations (aptamer, protein)
X = [(a, p) for a in unique_aptamers for p in unique_proteins]

# Dummy binary labels for the pairs
y = torch.randint(0, 2, (len(X),), dtype=torch.float32)

## Build your own pipeline
To build a scikit-learn pipeline, follow these steps:
1. Convert the input to the desired `(aptamer_sequence, protein_sequence)` format.
    * OPTIONAL: As mentioned in the paper, perform under-sampling using the Neighborhood Cleaning Rule to balance the classes.
2. Compute the k-mer + PSeAAC feature vectors for the converted input (using `pairs_to_features`).
3. Select the most important features from the feature vector (using `FeatureSelector`).
4. Define the skorch neural network (using `SkorchAptaNet`).
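
The feature-selection step can be illustrated without any dependencies. The sketch below keeps the columns whose importance score is at least the mean importance. Both the helper `select_by_importance` and the mean-importance cutoff are assumptions for illustration; the scores are made up, and `FeatureSelector` may apply a different rule to the importances of its Random Forest.

```python
def select_by_importance(rows, importances):
    """Keep the columns whose importance is at least the mean importance.

    Hypothetical helper: `importances` stands in for the per-feature scores
    a fitted random forest would provide.
    """
    threshold = sum(importances) / len(importances)
    keep = [j for j, weight in enumerate(importances) if weight >= threshold]
    reduced = [[row[j] for j in keep] for row in rows]
    return reduced, keep


rows = [
    [0.1, 0.5, 0.0, 0.9],
    [0.2, 0.4, 0.1, 0.8],
]
importances = [0.05, 0.40, 0.05, 0.50]  # made-up scores, mean is 0.25

selected, kept = select_by_importance(rows, importances)
print(kept)      # [1, 3]
print(selected)  # [[0.5, 0.9], [0.4, 0.8]]
```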

In [16]:
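# OPTIONAL: If you want to use the Neighborhood Cleaning Rule for under-sampling
# %pip install imbalanced-learn
# from imblearn.under_sampling import NeighbourhoodCleaningRule

# Note: fit_resample expects numeric features, so apply it to feature
# vectors (e.g. the output of pairs_to_features), not to raw sequence pairs.
# ncr = NeighbourhoodCleaningRule()
# X, y = ncr.fit_resample(X, y)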

In [17]:
# Define the classifier
net = SkorchAptaNet(
    module__hidden_dim=128,
    module__n_hidden=7,
    module__dropout=0.3,
    max_epochs=200,
    lr=1.4e-4,
    batch_size=310,
    optimizer=optim.RMSprop,
    device="cuda" if torch.cuda.is_available() else "cpu",
    threshold=0.5,
    train_split=None,
    verbose=1,
)

feature_transformer = FunctionTransformer(
    func=pairs_to_features,
    validate=False,
    # Optional arguments for pairs_to_features
    # example: kw_args={'k': 4, 'pseaac_kwargs': {'lambda_value': 30}}
    kw_args={},
)

# Option 1: build a new pipeline
pipeline = Pipeline(
    [
        ("features", feature_transformer),
        ("selector", FeatureSelector()),
        ("clf", net),
    ]
)

# Option 2: import the pre-defined pipeline (which does the same)
# pipeline = pipe
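
The `threshold=0.5` argument passed to `SkorchAptaNet` controls how predicted probabilities become class labels. A minimal sketch of that decision rule follows; the helper `apply_threshold` is hypothetical, and whether the classifier uses `>=` or `>` at the boundary is an assumption.

```python
def apply_threshold(probabilities, threshold=0.5):
    """Map positive-class probabilities to 0/1 labels (hypothetical helper)."""
    # Assumption: a probability exactly equal to the threshold counts as positive
    return [1 if p >= threshold else 0 for p in probabilities]


probs = [0.12, 0.50, 0.73, 0.49]
print(apply_threshold(probs))       # [0, 1, 1, 0]
print(apply_threshold(probs, 0.7))  # [0, 0, 1, 0]
```

Raising the threshold trades recall for precision: fewer pairs are predicted as binders, but those predictions carry higher model confidence.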

## Model Training and Prediction

Now that we've defined our AptaNet pipeline, we train the model on our aptamer-protein dataset and use it to predict labels.

In [18]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

  epoch    train_loss     dur
-------  ------------  ------
      1        0.8461  0.1419
      2        0.6572  0.0161
      3        0.7241  0.0000
      4        0.6856  0.0183
      5        0.5405  0.0114
      6        0.9785  0.0000
      7        0.8498  0.0159
      8        0.8247  0.0000
      9        0.7223  0.0157
     10        0.6711  0.0000
     11        0.6652  0.0158
     12        0.6684  0.0000
     13        0.7197  0.0158
     14        0.6808  0.0000
     15        0.5502  0.0158
     16        0.7807  0.0000
     17        0.6591  0.0159
     18        0.6840  0.0157
     19        0.6722  0.0000
     20        0.5827  0.0159
     21        0.8277  0.0090
     22        0.6763  0.0000
     23        0.7158  0.0162
     24        0.7339  0.0000
     25        0.7955  0.0152
     26        0.8719  0.0000
     27        0.6885  0.0157
     28        0.6226  0.0000
     29        0.6545  0.0150
     30        0.6366  0.0000
     31      

In [19]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))

Training Accuracy: 0.6
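
For reference, accuracy is the fraction of predictions that match the true labels. The sketch below computes it by hand on made-up labels (not the notebook's data); the helper `accuracy` is hypothetical and only illustrates what `accuracy_score` reports in its default mode. Because the labels in this notebook are random dummy values, the training accuracy above says nothing about real binding prediction quality.

```python
def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(int(t) == int(p) for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


y_true = [1, 0, 1, 1, 0]  # made-up labels for illustration
y_hat = [1, 0, 0, 1, 1]   # made-up predictions

print(accuracy(y_true, y_hat))  # 0.6
```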
