# Peptide classification with the subsequence string kernel

This notebook shows how to use scikit-learn to search for the best Support Vector Machine (SVM) model for classifying peptide sequences with the subsequence string kernel.

## 1. Dataset preparation

This example addresses the classification of antimicrobial peptides (AMPs). We used the data and experimental methodology of the research conducted by P. Bhadra and collaborators.

The dataset consists of AMP (positive) and non-AMP (negative) peptide sequences in a 1:3 positive-to-negative ratio. It is freely available at https://sourceforge.net/projects/axpep/files/.

The original work trains a Random Forest model with 10-fold cross-validation and obtains a Matthews correlation coefficient (MCC) of 0.90.

**Reference**: P. Bhadra, J. Yan, J. Li, S. Fong, and S. W. Siu. AmPEP: Sequence-based prediction of antimicrobial peptides using distribution patterns of amino acid properties and random forest. Scientific Reports, vol. 8, no. 1, pp. 1–10, 2018.

Loading required packages.

In [1]:
from sys import path as sys_path
sys_path.append('..')

from os import path
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import (make_scorer, 
                             matthews_corrcoef,
                             accuracy_score,
                             recall_score,
                             confusion_matrix,
                             roc_auc_score)

from strkernels import SubsequenceStringKernel

The first step involves loading the dataset and creating a dataframe.

Defining a function to create a dataframe from a FASTA file.

In [2]:
def read_fasta_to_dataframe(fasta_file):
    """Read a FASTA file into a dataframe with 'seqid' and 'sequence' columns."""
    data = []
    with open(fasta_file, 'r') as file:
        sequence_id = None
        sequence = []
        
        for line in file:
            line = line.strip()
            if line.startswith('>'):
                if sequence_id is not None:
                    data.append([sequence_id, ''.join(sequence)])
                sequence_id = line[1:]
                sequence = []
            else:
                sequence.append(line)
        
        if sequence_id is not None:
            data.append([sequence_id, ''.join(sequence)])
    
    df = pd.DataFrame(data, columns=['seqid', 'sequence'])
    return df
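
To illustrate the parsing rules used above, here is a minimal pandas-free sketch that applies the same logic to an in-memory FASTA snippet. The helper name `parse_fasta_lines` and the two records are hypothetical and exist only for illustration; they are not entries from the dataset.

```python
from io import StringIO


def parse_fasta_lines(lines):
    """Hypothetical helper: yield (seqid, sequence) pairs from FASTA lines."""
    sequence_id, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # blank lines contribute nothing to a sequence
        if line.startswith('>'):
            # a new header closes the previous record, if there was one
            if sequence_id is not None:
                yield sequence_id, ''.join(chunks)
            sequence_id, chunks = line[1:], []
        else:
            chunks.append(line)
    # emit the last record, which has no header after it
    if sequence_id is not None:
        yield sequence_id, ''.join(chunks)


# illustrative records (made up), the first one wrapped over two lines
fasta_text = """>pep1
GLFDIV
KKVVGA
>pep2
ACDEFGHIK
"""

records = list(parse_fasta_lines(StringIO(fasta_text)))
print(records)
```

The point to notice is that a sequence wrapped over several lines is joined into a single string, and that the final record must be emitted after the loop ends.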

Loading the positive sequences.

In [3]:
amp_seqs_file_path = path.join('data', 'Bhadra-et-al-2018', 'train_AMP_3268.fasta')
amp_df = read_fasta_to_dataframe(amp_seqs_file_path)
amp_df['label'] = 1
amp_df

Loading the negative sequences.

In [4]:
non_amp_seqs_file_path = path.join('data', 'Bhadra-et-al-2018', 'train_nonAMP_9777.fasta')
non_amp_df = read_fasta_to_dataframe(non_amp_seqs_file_path)
non_amp_df['label'] = -1
non_amp_df

Creating a single dataframe with the positive and negative sequences.

In [5]:
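# ignore_index=True gives every row a unique index (both source dataframes start at 0)
data_df = pd.concat([amp_df, non_amp_df], ignore_index=True)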
data_df

Separating 25% of the sequences for testing.

In [6]:
random_seed = 1708  # fixed seed for reproducibility

# test_size=0.25 is the scikit-learn default, written out here for clarity
X_train, X_test, y_train, y_test = train_test_split(data_df['sequence'],
                                                    data_df['label'],
                                                    test_size=0.25,
                                                    stratify=data_df['label'],
                                                    random_state=random_seed)
print('Number of train sequences:', len(X_train))
print('Number of test sequences:', len(X_test))

## 2. Hyperparameters selection

Next, we search for the best values of the subsequence string kernel hyperparameters (maximum subsequence length and decay factor) and of the SVM hyperparameter C for this dataset.
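
To give an intuition for the two kernel hyperparameters, the sketch below computes the textbook gap-weighted subsequence kernel by brute force for a fixed subsequence length `n` and decay `lam`. This is an illustration under assumptions, not the `strkernels` implementation: the function names are hypothetical, and how the library combines lengths up to `maxlen`, normalizes the result, or interprets `ssk_lambda` may differ.

```python
from itertools import combinations
from math import sqrt


def ssk(s, t, n, lam):
    """Brute-force gap-weighted subsequence kernel for length n (n >= 1).

    Every pair of index tuples that spells the same subsequence in s and t
    contributes lam ** (span in s + span in t), where the span is the
    distance from the first to the last matched position, gaps included.
    """
    total = 0.0
    for i in combinations(range(len(s)), n):
        u = [s[k] for k in i]
        span_s = i[-1] - i[0] + 1
        for j in combinations(range(len(t)), n):
            if [t[k] for k in j] == u:
                span_t = j[-1] - j[0] + 1
                total += lam ** (span_s + span_t)
    return total


def normalized_ssk(s, t, n, lam):
    """Kernel value rescaled so that a string compared with itself gives 1."""
    return ssk(s, t, n, lam) / sqrt(ssk(s, s, n, lam) * ssk(t, t, n, lam))


lam = 0.5
k_cat_car = ssk("cat", "car", n=2, lam=lam)        # only "ca" is shared
k_norm = normalized_ssk("cat", "car", n=2, lam=lam)
print(k_cat_car, k_norm)
```

In this textbook form, a decay below 1 down-weights matches that are spread out over gaps, while a decay above 1 weights them more heavily. The brute-force enumeration is exponential in `n` and only suitable for very short strings.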

To keep the search computationally affordable, we use only about 10% of the training samples for hyperparameter selection, drawn as a class-balanced subsample.

In [7]:
train_df = pd.concat([X_train, y_train], axis=1)
pos_train_df = train_df[train_df['label'] == 1]
neg_train_df = train_df[train_df['label'] == -1]

sampled_pos_train_df = pos_train_df.sample(n=len(pos_train_df) // 5, random_state=random_seed)
sampled_pos_train_df

In [8]:
sampled_balanced_neg_train_df = neg_train_df.sample(n=len(sampled_pos_train_df), random_state=random_seed)
sampled_balanced_neg_train_df

In [9]:
sampled_train_df = pd.concat([sampled_pos_train_df, sampled_balanced_neg_train_df])
sampled_train_df
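
The sampling above takes one fifth of the training positives and an equal number of negatives. The back-of-the-envelope calculation below shows why that amounts to roughly 10% of the training set. It assumes that the sequence counts match the numbers in the FASTA file names and that 25% of the data is held out; the per-class count in the stratified split is approximated by rounding.

```python
from math import ceil

# assumption: counts as suggested by the FASTA file names
n_pos, n_neg = 3268, 9777
n_total = n_pos + n_neg

# assumption: 25% hold-out, with the test size rounded up
n_test = ceil(0.25 * n_total)
n_train = n_total - n_test

# approximate number of positives in the stratified training split
n_pos_train = round(n_train * n_pos / n_total)

n_sampled_pos = n_pos_train // 5   # one fifth of the training positives
n_sampled = 2 * n_sampled_pos      # plus the same number of negatives

fraction = n_sampled / n_train
print(n_train, n_sampled, round(100 * fraction, 1))
```

Because positives make up about a quarter of the data, a fifth of them is about 5% of the training set, and the matching negatives bring the subsample to about 10%.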

Creating the subsequence string kernel instance.

In [10]:
subsequence_kernel = SubsequenceStringKernel(maxlen=1, ssk_lambda=1)

Creating a support vector classifier with the kernel.

In [11]:
clf = SVC(kernel=subsequence_kernel)

Running a grid search with 10-fold cross-validation to find the best subsequence string kernel hyperparameters.

In [1]:
# set parameters for grid search
param_grid = {
    'kernel__maxlen': [4, 5, 6],
    'kernel__ssk_lambda': [0.9, 1.0, 1.1, 1.2, 1.3],
}

# create the evaluation metric
mcc_scorer = make_scorer(matthews_corrcoef)

# create the GridSearchCV object
grid_search = GridSearchCV(estimator=clf, 
                           param_grid=param_grid, 
                           scoring=mcc_scorer, 
                           cv=10,
                           n_jobs=-1, 
                           verbose=3)

# fit the model to the training data
grid_search.fit(sampled_train_df['sequence'], sampled_train_df['label'])

# show the best parameters
best_params = grid_search.best_params_
print("\nBest parameters:", best_params)

# show the best mean validation score
best_score = grid_search.best_score_
print(f"Best mean validation score: {best_score}")
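
The keys `kernel__maxlen` and `kernel__ssk_lambda` use scikit-learn's nested-parameter syntax: the part before the double underscore names a parameter of the estimator (here `kernel`), and the part after it names a parameter of the object stored there. The sketch below enumerates the same grid in plain Python to show how much work the search represents; it is only an illustration of the bookkeeping, not how `GridSearchCV` is implemented.

```python
from itertools import product

param_grid = {
    'kernel__maxlen': [4, 5, 6],
    'kernel__ssk_lambda': [0.9, 1.0, 1.1, 1.2, 1.3],
}
n_folds = 10

# one candidate per combination of values (Cartesian product of the lists)
names = list(param_grid)
candidates = [dict(zip(names, values))
              for values in product(*param_grid.values())]

# each candidate is trained and scored once per cross-validation fold
n_fits = len(candidates) * n_folds
print(len(candidates), n_fits)
```

With the default `refit=True`, `GridSearchCV` additionally trains the best candidate once more on all the data passed to `fit`. Since every fit requires computing kernel values between sequences, this cost is the reason for searching on a subsample.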

Searching for the best value of the SVM hyperparameter C, with the kernel hyperparameters held fixed.

In [None]:
# set parameters for grid search
param_grid = {
    'kernel__maxlen': [5],
    'kernel__ssk_lambda': [1.1],
    'C': [0.1, 1.0, 10.0]
}

# create the evaluation metric
mcc_scorer = make_scorer(matthews_corrcoef)

# create the GridSearchCV object
grid_search = GridSearchCV(estimator=clf, 
                           param_grid=param_grid, 
                           scoring=mcc_scorer, 
                           cv=10,
                           n_jobs=-1, 
                           verbose=3)

# fit the model to the training data
grid_search.fit(sampled_train_df['sequence'], sampled_train_df['label'])

# show the best parameters
best_params = grid_search.best_params_
print("\nBest parameters:", best_params)

# show the best mean validation score
best_score = grid_search.best_score_
print(f"Best mean validation score: {best_score}")

## 3. Best model evaluation

Training an SVM model with the best hyperparameters found, using the full training dataset.

In [None]:
# create the kernel
subsequence_kernel = SubsequenceStringKernel(maxlen=5, ssk_lambda=1.1)

# create a support vector classifier with the kernel
clf = SVC(C=1.0, kernel=subsequence_kernel)

# train the classifier
clf.fit(X_train, y_train)

Classifying the test dataset and obtaining the decision scores computed by the model.

In [15]:
pred_scores = clf.decision_function(X_test)

Deriving class labels from the scores: a positive score maps to AMP (1), any other score to non-AMP (-1).

In [16]:
pred_labels = np.where(pred_scores > 0, 1, -1)

Calculating and showing evaluation metrics.

In [None]:
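MCC = round(matthews_corrcoef(y_test, pred_labels), 4)
accuracy = round(accuracy_score(y_test, pred_labels)*100, 2)
# recall of the positive class (label 1, the default pos_label) is the sensitivity
sensitivity = round(recall_score(y_test, pred_labels)*100, 2)
# with labels sorted as [-1, 1], the flattened matrix is TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_test, pred_labels).ravel()
specificity = round(TN / (TN + FP)*100, 2)
# AUROC is computed from the continuous scores, not the thresholded labels
AUROC = round(roc_auc_score(y_test, pred_scores), 4)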

print("MCC:", MCC)
print("Accuracy:", accuracy)
print("Sensitivity:", sensitivity)
print("Specificity:", specificity)
print("AUROC:", AUROC)
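
All of the metrics above except AUROC follow directly from the four confusion-matrix counts. The sketch below spells out the formulas on hypothetical counts chosen to mimic a 1:3 class ratio; the numbers are made up for illustration and are not results from this notebook.

```python
from math import sqrt

# hypothetical counts, for illustration only (100 positives, 300 negatives)
TP, FN, FP, TN = 90, 10, 20, 280

sensitivity = TP / (TP + FN)                  # fraction of AMPs recovered
specificity = TN / (TN + FP)                  # fraction of non-AMPs rejected
accuracy = (TP + TN) / (TP + TN + FP + FN)

# MCC uses all four counts, so it stays informative with imbalanced classes
mcc = (TP * TN - FP * FN) / sqrt(
    (TP + FP) * (TP + FN) * (TN + FP) * (TN + FN)
)

print(round(sensitivity, 4), round(specificity, 4),
      round(accuracy, 4), round(mcc, 4))
```

MCC ranges from -1 to 1, with 0 corresponding to chance-level predictions, which makes it a stricter summary than accuracy when negatives outnumber positives.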

We observe that the best SVM model achieved an MCC equal to that reported for the original tool (0.90), indicating good performance on this problem.