# Binding prediction using AptaNet
Step-by-step guide to using AptaNet for binary aptamer–protein binding prediction.

## Overview
- **pairs_to_features**: converts `(aptamer_seq, protein_seq)` pairs into feature vectors using k-mer + PSeAAC.
- **FeatureSelector**: a Random Forest-based transformer that selects important features.
- **SkorchAptaNet**: a PyTorch MLP wrapped in Skorch for binary classification with a configurable threshold.
- **load_1gnh_structure**: helper to load the 1GNH protein structure from a PDB file.
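
To make the k-mer part of the feature vector concrete, here is a minimal sketch of normalized k-mer frequencies for a single aptamer. It is not the `pairs_to_features` implementation; the function name `kmer_frequencies`, the RNA alphabet `ACGU`, and the choice `k=2` are assumptions made for illustration.

```python
from itertools import product


def kmer_frequencies(seq, k=2, alphabet="ACGU"):
    """Return normalized counts of every possible k-mer over the alphabet.

    Hypothetical helper for illustration; pyaptamer's own feature
    extraction may differ in ordering, normalization, and alphabet.
    """
    kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    # Slide a window of width k over the sequence and count each k-mer.
    for i in range(len(seq) - k + 1):
        window = seq[i : i + k]
        if window in counts:  # skip windows with unexpected characters
            counts[window] += 1
    total = sum(counts.values())
    return [counts[m] / total if total else 0.0 for m in kmers]


vector = kmer_frequencies("GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA", k=2)
```

With a 4-letter alphabet the vector has `4**k` entries, so `k=2` gives 16 features per aptamer.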

## Imports
Import the core functions and classes.

In [None]:
# General imports
import torch  # noqa: I001

# Data imports
from pyaptamer.datasets import load_1gnh_structure
from pyaptamer.utils.struct_to_aaseq import struct_to_aaseq

# If you want to use the AptaNet pipeline, you can import it directly
from pyaptamer.aptanet import AptaNetPipeline

# If you want to build your own aptamer pipeline, you should use the following imports
from skorch import NeuralNetBinaryClassifier
from sklearn.preprocessing import FunctionTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from pyaptamer.utils._aptanet_utils import pairs_to_features
from pyaptamer.aptanet._aptanet_nn import AptaNetMLP

## Data preparation
To train the skorch network the notebook uses:
* As `X`:
    * 5 random aptamer sequences of length>30 (to satisfy the default lambda value of 30 set in the PSeAAC algorithm).
    * Amino-acid sequences extracted from the 1GNH protein molecule.
* As `y`:
    * One random binary label (0/1) per `(aptamer_seq, protein_seq)` pair, used as dummy data.

In [2]:
aptamer_sequence = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

gnh = load_1gnh_structure()
protein_sequence = struct_to_aaseq(gnh)

# Build all combinations (aptamer, protein)
X = [(a, p) for a in aptamer_sequence for p in protein_sequence]

# Dummy binary labels for the pairs
y = torch.randint(0, 2, (len(X),), dtype=torch.float32)
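
The length requirement and the pairing step can be checked with plain Python before any features are computed. The sketch below is an illustration only: `check_lengths` and `LAMBDA_VALUE` are hypothetical names, the value 30 is the default quoted in this notebook, and the protein strings are placeholders rather than real 1GNH chains.

```python
LAMBDA_VALUE = 30  # default PSeAAC lambda quoted in this notebook


def check_lengths(sequences, lambda_value=LAMBDA_VALUE):
    """Return the sequences that are NOT longer than lambda_value."""
    return [s for s in sequences if len(s) <= lambda_value]


aptamers = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
]
# Placeholder protein chains (assumption: real chains come from struct_to_aaseq)
proteins = ["CHAIN_A_PLACEHOLDER", "CHAIN_B_PLACEHOLDER"]

too_short = check_lengths(aptamers)

# Every aptamer is paired with every protein chain, so the number of
# pairs (and therefore of labels) is len(aptamers) * len(proteins).
pairs = [(a, p) for a in aptamers for p in proteins]
```

An empty `too_short` list means every aptamer satisfies the length constraint.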

## Build your own pipeline
To build a scikit-learn pipeline, follow these steps:
1. Convert the input to the desired (aptamer_sequence, protein_sequence) format.
    * OPTIONAL: As mentioned in the paper, perform under-sampling using the
    Neighborhood Cleaning Rule to balance the classes.
2. Compute the k-mer + PSeAAC feature vectors for your converted input (using `pairs_to_features`).
3. Select the most important features from the feature vector (using `FeatureSelector`).
4. Define the skorch neural network (using `AptaNetMLP`).

## Different workflows
This notebook covers two workflows, in ascending order of customizability:

1. A minimal workflow with no dataset balancing, using the built-in pipeline.
2. Dataset balancing with under-sampling, using your own pipeline.

### First workflow
A minimal workflow with no dataset balancing, while using the in-built pipeline.

In [3]:
pipeline = AptaNetPipeline()

### Second workflow
Dataset balancing using under-sampling, while using your own pipeline.

In [None]:
# OPTIONAL: If you want to use the Neighborhood Cleaning Rule for under-sampling
# %pip install imbalanced-learn
from imblearn.under_sampling import NeighbourhoodCleaningRule  # noqa: E402

In [None]:
# NOTE: If you want to use the under-sampling, you need to install imbalanced-learn
# and use the Pipeline from imblearn
from imblearn.pipeline import Pipeline

feature_transformer = FunctionTransformer(
    func=pairs_to_features,
    validate=False,
    # Optional arguments for pairs_to_features
    # example: kw_args={'k': 4, 'pseaac_kwargs': {'lambda_value': 30}}
    kw_args={},
)

selector = SelectFromModel(
    estimator=RandomForestClassifier(
        n_estimators=300,
        max_depth=9,
        random_state=None,
    ),
    threshold="mean",
)

# Define the classifier
net = NeuralNetBinaryClassifier(
    module=AptaNetMLP,
    module__input_dim=None,
    module__hidden_dim=128,
    module__n_hidden=7,
    module__dropout=0.3,
    module__output_dim=1,
    module__use_lazy=True,
    criterion=torch.nn.BCEWithLogitsLoss,
    max_epochs=200,
    lr=0.00014,
    optimizer=torch.optim.RMSprop,
    optimizer__alpha=0.9,
    optimizer__eps=1e-08,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

pipeline = Pipeline(
    [
        ("features", feature_transformer),
        # Optional under-sampling, use sklearn's Pipeline if you do not need it
        ("ncr", NeighbourhoodCleaningRule()),
        ("selector", selector),
        ("clf", net),
    ]
)
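
The `threshold="mean"` setting in `SelectFromModel` keeps the features whose importance is at least the mean importance. The sketch below reproduces that rule in plain Python on made-up importance values; `select_by_mean_importance` is a hypothetical helper, not part of scikit-learn or pyaptamer.

```python
def select_by_mean_importance(importances):
    """Return indices of features whose importance is >= the mean importance."""
    mean_importance = sum(importances) / len(importances)
    return [i for i, value in enumerate(importances) if value >= mean_importance]


# Made-up importances for five features (illustrative values only)
importances = [0.05, 0.40, 0.10, 0.30, 0.15]
kept = select_by_mean_importance(importances)
```

Here the mean is 0.2, so only the features at indices 1 and 3 survive. In the real pipeline the importances come from the fitted Random Forest.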

## Model Training and Prediction

With the pipeline defined, fit it on the aptamer-protein pairs and then predict labels for the same pairs.

In [6]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.7270       0.2857        0.7139  0.1351
      2        0.7593       0.2857        0.7133  0.0065
      3        0.6690       0.2857        0.7128  0.0060
      4        0.7504       0.2857        0.7124  0.0070
      5        0.7646       0.2857        0.7121  0.0065
      6        0.7945       0.2857        0.7117  0.0075
      7        0.6541       0.2857        0.7114  0.0052
      8        0.7842       0.2857        0.7111  0.0075
      9        0.6305       0.2857        0.7107  0.0065
     10        0.7223       0.2857        0.7104  0.0075
     11        0.6717       0.2857        0.7101  0.0075
     12        0.7168       0.2857        0.7098  0.0085
     13        0.7049       0.2857        0.

In [7]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))

Training Accuracy: 0.48
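
Since the labels are random, an accuracy near 0.5 is what one would expect. To show how a configurable threshold turns network outputs into labels and how accuracy is then computed, here is a minimal sketch in plain Python. It assumes the network emits raw logits (as the `BCEWithLogitsLoss` criterion suggests); `predict_labels` and `accuracy` are hypothetical helpers and the logits are made up.

```python
import math


def predict_labels(logits, threshold=0.5):
    """Apply a sigmoid to each logit and label it 1 if the probability exceeds threshold."""
    return [1 if 1.0 / (1.0 + math.exp(-z)) > threshold else 0 for z in logits]


def accuracy(y_true, y_pred):
    """Fraction of positions where the predicted label equals the true label."""
    matches = sum(int(t == p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


logits = [-1.2, 0.3, 2.0, -0.1]  # made-up network outputs
y_true = [0, 1, 0, 0]            # made-up labels

y_hat = predict_labels(logits)                  # default threshold of 0.5
acc = accuracy(y_true, y_hat)
y_hat_strict = predict_labels(logits, threshold=0.9)
```

Raising the threshold makes positive predictions rarer: with 0.9, even the logit of 2.0 (probability about 0.88) is labeled 0.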
