# Binding prediction using AptaNet
Step-by-step guide for using AptaNet for binary aptamer–protein binding prediction.

## Overview
This notebook showcases AptaNet, a deep learning method that predicts whether an aptamer and a protein interact (binary classification: the aptamer either binds or does not bind the target protein). It combines sequence-derived features (k-mer + PSeAAC), RandomForest-based feature selection, and a multi-layer perceptron. The notebook uses the following classes and helper functions:

- **pairs_to_features**: helper that converts `(aptamer_sequence, protein_sequence)` pairs into feature vectors using k-mer + PSeAAC.
- **SkorchAptaNet**: a PyTorch MLP wrapped in Skorch for binary classification.
- **load_1gnh**: helper to load the 1GNH molecule structure from a PDB file into our in-memory `MoleculeLoader` format.
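
To make the k-mer part of the features concrete, here is a minimal sketch of a k-mer frequency vector for a single sequence. It is an illustration only, not the code behind `pairs_to_features`: the function name `kmer_frequencies`, the `ACGU` alphabet, and the normalization by the total count are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k=2, alphabet="ACGU"):
    """Return normalized counts of every possible k-mer, in a fixed order."""
    # Enumerate all len(alphabet)**k k-mers so the vector length is fixed.
    all_kmers = ["".join(p) for p in product(alphabet, repeat=k)]
    # Slide a window of width k over the sequence and count each k-mer.
    counts = Counter(sequence[i : i + k] for i in range(len(sequence) - k + 1))
    total = sum(counts[kmer] for kmer in all_kmers)
    return [counts[kmer] / total if total else 0.0 for kmer in all_kmers]


features = kmer_frequencies("GGGAGGACGAAGACGACUCG", k=2)
print(len(features))  # 16 possible 2-mers over a 4-letter alphabet
```

Because every possible k-mer gets a slot, sequences of different lengths map to vectors of the same size, which is what a downstream classifier needs.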

## Data preparation
To train the `AptaNetMLP` the notebook uses:
* For `X`:
    * 5 random aptamer sequences of length > 30 (to satisfy the default lambda value of 30 set in the PSeAAC algorithm).
    * Amino acid sequences extracted from the 1GNH protein molecule.
    
    Each aptamer sequence is paired with each amino acid sequence, and the resulting `(aptamer_sequence, protein_sequence)` tuples make up `X`.
* For `y`:
    * A random binary value (to indicate if the aptamer binds or not) as dummy data.
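
The length requirement above can be checked before building `X`. The sketch below assumes the constraint is simply "sequence length strictly greater than lambda"; the helper `long_enough` and the short second sequence are hypothetical and are not part of pyaptamer.

```python
def long_enough(sequence, lambda_value=30):
    """True if the sequence is strictly longer than lambda_value."""
    return len(sequence) > lambda_value


candidates = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",  # first aptamer used in this notebook
    "GGGAGGACGA",  # hypothetical short sequence, rejected by the check
]
usable = [seq for seq in candidates if long_enough(seq)]
print(len(usable))
```

Filtering early gives a clear message about which inputs are too short, instead of a failure deep inside feature extraction.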

In [23]:
# Data imports
import torch

from pyaptamer.datasets import load_1gnh

In [24]:
aptamer_sequence = [
    "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA",
    "AAGCGUCGGAUCUACACGUGCGAUAGCUCAGUACGCGGU",
    "CGGUAUCGAGUACAGGAGUCCGACGGAUAGUCCGGAGC",
    "UAGCUAGCGAACUAGGCGUAGCUUCGAGUAGCUACGGAA",
    "GCUAGGACGAUCGCACGUGACCGUCAGUAGCGUAGGAGA",
]

gnh = load_1gnh()
protein_sequence = gnh.to_df_seq()["sequence"].tolist()

# Build all combinations (aptamer, protein), duplicated to increase dataset size
X = [(a, p) for a in aptamer_sequence for p in protein_sequence] * 5

# Dummy binary labels for the pairs
y = torch.randint(0, 2, (len(X),), dtype=torch.float32)

## Building the pipeline
### Dataset balancing using the Neighbourhood cleaning rule
In the `AptaNet` paper, the authors balance their dataset with the `NeighbourhoodCleaningRule` from `imblearn`, because it contained more negative (0) labels than positive (1) labels.
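
To give an intuition for this kind of under-sampling, the sketch below implements only the edited-nearest-neighbours idea that the Neighbourhood Cleaning Rule builds on: a majority-class sample is dropped when most of its nearest neighbours carry a different label. It is a simplified, pure-Python illustration on made-up 1-D points, not the `imblearn` implementation, which applies additional cleaning steps.

```python
from collections import Counter


def edited_nearest_neighbours(points, labels, majority_label, k=3):
    """Drop majority-class points whose k nearest neighbours mostly disagree."""
    keep = []
    for i, (p, label) in enumerate(zip(points, labels)):
        if label != majority_label:
            keep.append(i)  # minority samples are always kept
            continue
        # Squared Euclidean distance to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points)
            if j != i
        )
        neighbour_labels = [labels[j] for _, j in dists[:k]]
        vote = Counter(neighbour_labels).most_common(1)[0][0]
        if vote == label:
            keep.append(i)
    return [points[i] for i in keep], [labels[i] for i in keep]


# Toy data: five negatives (0), one of which sits among the three positives (1).
points = [(0.0,), (0.1,), (0.2,), (0.3,), (5.1,), (5.0,), (5.2,), (5.3,)]
labels = [0, 0, 0, 0, 0, 1, 1, 1]
new_points, new_labels = edited_nearest_neighbours(points, labels, majority_label=0)
print(Counter(new_labels))
```

Only the negative sample at `5.1` is removed, because all three of its nearest neighbours are positive. The class counts move from 5 vs 3 to 4 vs 3.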

To build a scikit-learn pipeline, follow these steps:
1. Convert the input to the desired `(aptamer_sequence, protein_sequence)` format.
    * OPTIONAL: As mentioned in the paper, perform under-sampling using the Neighbourhood Cleaning Rule to balance the classes.
2. Compute the k-mer + PSeAAC feature vectors for the converted input (using `pairs_to_features`).
3. Select the most informative features from the feature vector (using `RandomForestClassifier`).
4. Define the skorch neural network (using `AptaNetMLP`).

### Different workflows
In this first half of the notebook we will cover 3 different workflows one can follow, in ascending order of customizability:

1. A minimal workflow with no dataset balancing, while using the in-built pipeline.
2. Using your own custom pipeline.
3. Dataset balancing, while using your own pipeline.

### First workflow
A minimal workflow with no dataset balancing, while using the in-built pipeline.

In [25]:
# Suppress warnings for cleaner output
import warnings

warnings.filterwarnings("ignore")

In [26]:
# If you want to use the AptaNet pipeline, you can import it directly
from pyaptamer.aptanet import AptaNetPipeline

In [27]:
pipeline = AptaNetPipeline()

In [28]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

In [29]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))

Training Accuracy: 0.64


### Second workflow

Your own custom-built pipeline.

In [30]:
# If you want to build your own aptamer pipeline, you should use the following imports
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer
from skorch import NeuralNetBinaryClassifier

from pyaptamer.aptanet._aptanet_nn import AptaNetMLP
from pyaptamer.utils._aptanet_utils import pairs_to_features

In [31]:
feature_transformer = FunctionTransformer(
    func=pairs_to_features,
    validate=False,
    # Optional arguments for pairs_to_features
    # example: kw_args={'k': 4, 'pseaac_kwargs': {'lambda_value': 30}}
    kw_args={},
)

selector = SelectFromModel(
    estimator=RandomForestClassifier(
        n_estimators=300,
        max_depth=9,
        random_state=None,
    ),
    threshold="mean",
)

# Define the classifier
net = NeuralNetBinaryClassifier(
    module=AptaNetMLP,
    module__input_dim=None,
    module__hidden_dim=128,
    module__n_hidden=7,
    module__dropout=0.3,
    module__output_dim=1,
    module__use_lazy=True,
    criterion=torch.nn.BCEWithLogitsLoss,
    max_epochs=200,
    lr=0.00014,
    optimizer=torch.optim.RMSprop,
    optimizer__alpha=0.9,
    optimizer__eps=1e-08,
    device="cuda" if torch.cuda.is_available() else "cpu",
)

pipeline = Pipeline(
    [
        ("features", feature_transformer),
        ("selector", selector),
        ("clf", net),
    ]
)
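
The `selector` step above uses `threshold="mean"`, which in scikit-learn's `SelectFromModel` keeps the features whose importance is greater than or equal to the mean importance. The sketch below reproduces that rule in plain Python; the importance values are hypothetical and chosen only for illustration.

```python
def select_by_mean_importance(importances):
    """Return indices of features whose importance is at least the mean."""
    threshold = sum(importances) / len(importances)
    return [i for i, value in enumerate(importances) if value >= threshold]


# Hypothetical importances for six features.
importances = [0.40, 0.05, 0.25, 0.05, 0.20, 0.05]
kept = select_by_mean_importance(importances)
print(kept)  # [0, 2, 4]
```

The number of selected features is therefore not fixed in advance. It depends on how the random forest spreads importance across the k-mer and PSeAAC features.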

In [32]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

  epoch    train_loss    valid_acc    valid_loss     dur
-------  ------------  -----------  ------------  ------
      1        0.7558       0.2000        0.7114  0.0068
      2        0.6823       0.2000        0.7110  0.0078
      3        0.7409       0.2000        0.7105  0.0083
      4        0.7738       0.2000        0.7101  0.0074
      5        0.7082       0.2000        0.7098  0.0070
      6        0.6472       0.2000        0.7095  0.0068
      7        0.6722       0.2000        0.7092  0.0079
      8        0.7144       0.2000        0.7089  0.0056
      9        0.6902       0.2000        0.7087  0.0072
     10        0.6749       0.2000        0.7085  0.0082
     11        0.6673       0.2000        0.7083  0.0065
     12        0.7027       0.2000        0.7081  0.0073
     13        0.7487       0.2000        0.7079

In [33]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))

Training Accuracy: 0.64


### Third workflow
Dataset balancing using under-sampling, while using your own pipeline.

In [34]:
# If you want to build your own aptamer pipeline, you should use the following imports
from sklearn.preprocessing import FunctionTransformer

from pyaptamer.aptanet import AptaNetClassifier, AptaNetPipeline
from pyaptamer.utils._aptanet_utils import pairs_to_features

In [35]:
# OPTIONAL: If you want to use the Neighborhood Cleaning Rule for under-sampling
# NOTE: If you want to use under-sampling, you need to install imbalanced-learn
# and use the Pipeline from imblearn
# %pip install imbalanced-learn
from imblearn.pipeline import Pipeline
from imblearn.under_sampling import NeighbourhoodCleaningRule

In [36]:
feature_transformer = FunctionTransformer(
    func=pairs_to_features,
    validate=False,
    # Optional arguments for pairs_to_features
    # example: kw_args={'k': 4, 'pseaac_kwargs': {'lambda_value': 30}}
    kw_args={},
)

# AptaNetClassifier encapsulates the "selector" and "net" mentioned in Workflow 2
clf = AptaNetClassifier()

model = Pipeline(
    [
        ("ncr", NeighbourhoodCleaningRule()),
        ("clf", clf),
    ]
)

pipeline = AptaNetPipeline(estimator=model)

In [37]:
# Fit the pipeline on the aptamer-protein pairs
pipeline.fit(X, y)

# Predict the labels for the training data
y_pred = pipeline.predict(X)

In [38]:
# Optional: Evaluate training accuracy
from sklearn.metrics import accuracy_score

print("Training Accuracy:", accuracy_score(y, y_pred))

Training Accuracy: 0.36


# Implementing `AptaNet` for real-world use cases

## Data preparation
To train the `AptaNetMLP`, the notebook uses the dataset that was used to train the `AptaTrans` algorithm. It can be found in `pyaptamer/datasets/data/train_li2014`.

In [39]:
# Data imports

from pyaptamer.datasets import load_li2014

In [40]:
X, y = load_li2014(split="train")

In [41]:
# Display first few rows of X which contains aptamer-protein pairs
X.head()

| | aptamer | protein |
|---|---|---|
| 0 | UCGGAGGUGGUUCAGCUCUGCAUCGACAGUUGGC | MDSTNYISKLFEYAQRQGQISDIKFEEVGTDGPDHLKTFTLRVVIK... |
| 1 | CGGGGTGTTGTCCTGTGCTCTCCGGAGAGCACAGGACAACACCCCG | MDELFPLIFPAEPAQASGPYVEIIEQPKQRGMRFRYKCEGRSAGSI... |
| 2 | GCTGCAGTTGCACTGAATTCGCCTCTCGCCTCCGTACACTTAGTCG... | MDEVGAQVAAPMFIHQSLGRKRDLYYPMSNRLVQSQPQRRDEWNSK... |
| 3 | TATTTGCCCTTGCAGGCCGCAGGAGTGCTAGCAGT | PISPIDTVPVKLKPGMDGPKVKQWPLTEEKIKALTEICNEMEKEGK... |
| 4 | UCUCUGGGCUCUUAGGAGAACGGAUAGGAGUGUGCUCGCU | LQENFLPQPRQQHHGTLVLHYRPHREEAGMQHPCLWPGSSHCSDDS... |


In [42]:
# Display first few rows of y which contains binary binding labels
y.head()

| | label |
|---|---|
| 0 | positive |
| 1 | positive |
| 2 | positive |
| 3 | positive |
| 4 | positive |


## Different real-world examples
In the second half of this notebook we will explore 3 real-world use cases of the algorithm. These include:
1. Training on the entire AptaTrans dataset and predicting the binding probability (`predict_proba`) between a protein and an aptamer sequence.
2. TODO: The same, using DNA sequences from a `fasta` file.
3. Using MCTS combined with a trained `AptaNet` to propose new aptamers for a new PDB file, i.e., a form of in silico SELEX.

In [43]:
# Fit the pipeline on the aptamer-protein pairs from the `AptaTrans` dataset
pipeline = AptaNetPipeline()
pipeline.fit(X, y)

### First workflow

Predicting the binding probability (`predict_proba`) between a protein (1GNH) and an aptamer sequence.

In [44]:
aptamer_sequence = "GGGAGGACGAAGACGACUCGAGACAGGCUAGGGAGGGA"

X = [(aptamer_sequence, p) for p in protein_sequence]

pipeline.predict_proba(X)

array([[0.9648417 , 0.03515826]], dtype=float32)
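
The output above has one row per `(aptamer, protein)` pair and two columns that sum to 1. Assuming the usual scikit-learn convention, column 0 is the probability of class 0 (does not bind) and column 1 the probability of class 1 (binds). The sketch below shows how such a row can arise from a single raw network output (logit) passed through a sigmoid, as is typical for a model trained with `BCEWithLogitsLoss`. The logit value is hypothetical, and the function is an illustration rather than the library's code.

```python
import math


def proba_from_logit(logit):
    """Turn one raw network output (logit) into [P(class 0), P(class 1)]."""
    p_positive = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
    return [1.0 - p_positive, p_positive]


# Hypothetical logit, chosen only for illustration.
row = proba_from_logit(-3.3)
predicted_label = int(row[1] >= 0.5)
print(row, predicted_label)
```

A strongly negative logit gives a small probability in the second column, so the pair would be labelled as non-binding at the default 0.5 threshold.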

### Second workflow (TODO)

Predicting binding probability (`predict_proba`) between a protein (1GNH) and DNA sequences from a `fasta` file.

### Third workflow

Using MCTS combined with a trained `AptaNet` to propose new aptamers for a protein (1GNH), i.e., a form of in silico SELEX.
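
For readers new to Monte Carlo tree search, the sketch below shows the UCB1-style selection score that many MCTS variants use to decide which candidate extension to explore next. It is a generic illustration under that assumption: pyaptamer's `MCTS` may use a different rule, and the function name, the exploration constant, and the visit statistics are all hypothetical.

```python
import math


def uct_score(total_reward, visits, parent_visits, c=math.sqrt(2)):
    """UCB1-style score: average reward plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are tried first
    exploitation = total_reward / visits
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return exploitation + exploration


# Hypothetical statistics for three candidate nucleotide extensions.
children = {
    "A": {"total_reward": 3.0, "visits": 10},
    "C": {"total_reward": 2.5, "visits": 5},
    "G": {"total_reward": 0.0, "visits": 0},
}
parent_visits = 15
best = max(
    children,
    key=lambda n: uct_score(parent_visits=parent_visits, **children[n]),
)
print(best)  # "G": it has never been visited
```

In this setting the reward would come from the trained model's predicted binding score for a candidate sequence, so the search spends more simulations on extensions that the model rates highly while still sampling less-visited ones.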

In [45]:
from pyaptamer.experiments import AptamerEvalAptaNet
from pyaptamer.mcts import MCTS

# There can only be one target sequence
# (named `experiment` to avoid shadowing Python's built-in `eval`)
experiment = AptamerEvalAptaNet(pipeline=pipeline, target=protein_sequence[0])

mcts = MCTS(experiment=experiment)

output = mcts.run(verbose=True)


 ----- Round: 1 -----
##################################################
Best subsequence: _CC_U_
Depth: 3
##################################################

 ----- Round: 2 -----
##################################################
Best subsequence: _CC_U_C__GC_
Depth: 6
##################################################

 ----- Round: 3 -----
##################################################
Best subsequence: _CC_U_C__GC__CA__C
Depth: 9
##################################################

 ----- Round: 4 -----
##################################################
Best subsequence: _CC_U_C__GC__CA__CC_C_U_
Depth: 12
##################################################

 ----- Round: 5 -----
##################################################
Best subsequence: _CC_U_C__GC__CA__CC_C_U_C__U_U
Depth: 15
##################################################

 ----- Round: 6 -----
##################################################
Best subsequence: _CC_U_C__GC__CA__CC_C_U_C__U_UC_A_U_
Depth: 18
####

In [46]:
# Get the best sequence
output["candidate"]

'GUACCUCCACCUCCGCCUUC'