# DNA classification with Fixed Degree string kernel

This notebook uses scikit-learn to search for the best Support Vector Machine (SVM) model for classifying DNA sequences with the Fixed Degree string kernel.

## 1. Dataset preparation

The dataset used in this notebook comprises 2,000 artificial DNA sequences, each 50 bases long. Each sequence carries a label indicating the presence (label 1) or absence (label 0) of a motif (CGACCGAACTCC) that hypothetically enables binding to a protein. The dataset comes from the tutorial linked to the article "A Primer on Deep Learning in Genomics" (Nature Genetics, 2019) by James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani & Amalio Telenti. It contains 987 positive sequences (label 1) and 1,013 negative sequences (label 0).

The goal is to train a model on this dataset that classifies unseen sequences as able or unable to bind to the protein.

The first step is to load the dataset into a dataframe.

In [1]:
import pandas as pd

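def read_fasta_to_dataframe(fasta_file):
    """Read a FASTA file into a DataFrame with 'seqid' and 'sequence' columns."""
    data = []
    with open(fasta_file, 'r') as file:
        sequence_id = None
        sequence = []

        for line in file:
            line = line.strip()
            if line.startswith('>'):
                # a header line closes the previous record, if any
                if sequence_id is not None:
                    data.append([sequence_id, ''.join(sequence)])
                sequence_id = line[1:]
                sequence = []
            else:
                # sequence lines may wrap, so collect them until the next header
                sequence.append(line)

        # store the last record in the file
        if sequence_id is not None:
            data.append([sequence_id, ''.join(sequence)])

    df = pd.DataFrame(data, columns=['seqid', 'sequence'])
    return df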
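
To check how this parsing logic treats records whose sequence wraps over several lines, here is a minimal standard-library sketch. The helper `iter_fasta` and the two toy records are hypothetical and not part of the notebook; the sketch mirrors the loop above but yields tuples instead of building a dataframe.

```python
import io

def iter_fasta(handle):
    """Yield (sequence_id, sequence) pairs from an open FASTA handle."""
    sequence_id, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        if line.startswith('>'):
            # a new header closes the previous record, if any
            if sequence_id is not None:
                yield sequence_id, ''.join(chunks)
            sequence_id, chunks = line[1:], []
        else:
            chunks.append(line)
    # emit the last record
    if sequence_id is not None:
        yield sequence_id, ''.join(chunks)

# toy input: the first record is wrapped over two lines
toy_fasta = io.StringIO(">pep_1\nAAGMG\nFFGAR\n>pep_2\nGILDALTGIL\n")
records = list(iter_fasta(toy_fasta))
print(records)
```

The wrapped lines of `pep_1` are joined into one sequence, which is the behaviour the dataframe version relies on.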

In [2]:
from os import path
amp_seqs_file_path = path.join('data', 'Bhadra-et-al-2018', 'train_AMP_3268.fasta')
amp_df = read_fasta_to_dataframe(amp_seqs_file_path)
amp_df['label'] = 1
amp_df

,seqid,sequence,label
0,AMP_1,AACSDRAHGHICESFKSFCKDSGRNGVKLRANCKKTCGLC,1
1,AMP_2,AAEFPDFYDSEEQMGPHQEAEDEKDRADQRVLTEEEKKELENLAAM...,1
2,AMP_3,AAFFAQQKGLPTQQQNQVSPKAVSMIVNLEGCVRNPYKCPADVWTN...,1
3,AMP_4,AAFRGCWTKNYSPKPCL,1
4,AMP_5,AAGMGFFGAR,1
...,...,...,...
3263,AMP_3264,YIRDFITRRPPFGNI,1
3264,AMP_3265,GILDALTGIL,1
3265,AMP_3266,IFKAIWSGIKRLC,1
3266,AMP_3267,ILGKFCDEIKRIV,1


In [3]:
random_seed = 1708  # fixed seed so sampling and splitting are reproducible

sampled_amp_df = amp_df.sample(n=len(amp_df) // 15, random_state=random_seed)
sampled_amp_df

,seqid,sequence,label
2969,AMP_2970,RECKTESNTFPGICITKPPCRKACISEKFTDGHCSKILRRCLCTKPC,1
2415,AMP_2416,RMRRSKSGKGSGGSKGSGSKGSKGSKGSGSKGSGSKGGSRPGGGSS...,1
1121,AMP_1122,GLFDIIKKIAESI,1
204,AMP_205,AVDFSSCARMDVPGLSKVAQGLCISSCKFQNCGTGHCEKRGGRPTC...,1
2680,AMP_2681,TLYRRFLCKKMKGRCETACLSFEKKIGTCRADLTPLCCKEKKKH,1
...,...,...,...
2315,AMP_2316,QLPFVAGVACEMCQCVYCAASKKC,1
3010,AMP_3011,KLCQRPSGTWSGVCGNNNACKNQCINLEKARHGSCNYVFPAHKCIC...,1
10,AMP_11,AALKGCWTKSIPPKPCSGKR,1
981,AMP_982,GIGGKPVQTAFVDNDGIYD,1


In [4]:
non_amp_seqs_file_path = path.join('data', 'Bhadra-et-al-2018', 'train_nonAMP_9777.fasta')
non_amp_df = read_fasta_to_dataframe(non_amp_seqs_file_path)
non_amp_df['label'] = -1
non_amp_df

,seqid,sequence,label
0,nonamp_1,MNNNTTAPTYTLRGLQLIGWRDMQHALDYLFADGHLKQGTLVAINA...,-1
1,nonamp_2,MKSLLPLAILAALAVAALCYESHESMESYEVSPFTTRRNANTFISP...,-1
2,nonamp_3,MASVTDGKTGIKDASDQNFDYMFKLLIIGNSSVGKTSFLFRYADDT...,-1
3,nonamp_4,MASFQDRAQHTIAQLDKELSKYPVLNNLERQTSVPKVYVILGLVGI...,-1
4,nonamp_5,MRHRSGLRKLNRTSSHRQAMFRNMANSLLRHEVIKTTLPKAKELRR...,-1
...,...,...,...
9772,nonamp_9773,MDNEMTLTFLALSENEALARVAVTGFIAQLDPTIDELSEFKTVVSE...,-1
9773,nonamp_9774,MSKTVVRKNESLDDALRRFKRSVSKAGTLQESRKREFYEKPSVKRK...,-1
9774,nonamp_9775,MRHLVLIGFMGSGKSSLAQELGLALKLEVLDTDMIISERVGLSVRE...,-1
9775,nonamp_9776,MRDLKTYLSVAPVLSTLWFGSLAGLLIEINRFFPDALTFPFFLIRV...,-1


In [5]:
sampled_non_amp_df = non_amp_df.sample(n=len(non_amp_df) // 15, random_state=random_seed)
sampled_non_amp_df

,seqid,sequence,label
8753,nonamp_8754,MIHKLTSEERKTRLEGLPHWTAVPGRDAIQRSLRFADFNEAFGFMT...,-1
8257,nonamp_8258,MFLNTIKPGEGAKHAKRRVGRGIGSGLGKTAGRGHKGQKSRSGGFH...,-1
5885,nonamp_5886,MYQPDFPPVPFRLGLYPVVDSVQWIERLLDAGVRTLQLRIKDRRDE...,-1
369,nonamp_370,MSFKNPVLGLCQQAAFMLSAARVDQCPADDGLEVAFAGRSNAGKSS...,-1
6825,nonamp_6826,MRVKATLINFKSKLSKSCNRFVSLFRFRVKRPVFIRPLRARHGNVK...,-1
...,...,...,...
5590,nonamp_5591,LGKSVTN,-1
8446,nonamp_8447,MDKKKNISMAVIRRLPKYHRYLYELLKNDVDRISSKELSEKIGFTA...,-1
5740,nonamp_5741,MYHDLIRSELNEAADTLANFLKDDSNIDAIQRAAILLADSFKAGGK...,-1
5986,nonamp_5987,MLYRSISCPKGTFFMTTPPAAAEIFGDNLEKAIAYHESLATDGSVR...,-1


In [6]:
data_df = pd.concat([amp_df, non_amp_df], axis=0)
data_df

,seqid,sequence,label
0,AMP_1,AACSDRAHGHICESFKSFCKDSGRNGVKLRANCKKTCGLC,1
1,AMP_2,AAEFPDFYDSEEQMGPHQEAEDEKDRADQRVLTEEEKKELENLAAM...,1
2,AMP_3,AAFFAQQKGLPTQQQNQVSPKAVSMIVNLEGCVRNPYKCPADVWTN...,1
3,AMP_4,AAFRGCWTKNYSPKPCL,1
4,AMP_5,AAGMGFFGAR,1
...,...,...,...
9772,nonamp_9773,MDNEMTLTFLALSENEALARVAVTGFIAQLDPTIDELSEFKTVVSE...,-1
9773,nonamp_9774,MSKTVVRKNESLDDALRRFKRSVSKAGTLQESRKREFYEKPSVKRK...,-1
9774,nonamp_9775,MRHLVLIGFMGSGKSSLAQELGLALKLEVLDTDMIISERVGLSVRE...,-1
9775,nonamp_9776,MRDLKTYLSVAPVLSTLWFGSLAGLLIEINRFFPDALTFPFFLIRV...,-1


In [7]:
sampled_data_df = pd.concat([sampled_amp_df, sampled_non_amp_df], axis=0)
sampled_data_df

,seqid,sequence,label
2969,AMP_2970,RECKTESNTFPGICITKPPCRKACISEKFTDGHCSKILRRCLCTKPC,1
2415,AMP_2416,RMRRSKSGKGSGGSKGSGSKGSKGSKGSGSKGSGSKGGSRPGGGSS...,1
1121,AMP_1122,GLFDIIKKIAESI,1
204,AMP_205,AVDFSSCARMDVPGLSKVAQGLCISSCKFQNCGTGHCEKRGGRPTC...,1
2680,AMP_2681,TLYRRFLCKKMKGRCETACLSFEKKIGTCRADLTPLCCKEKKKH,1
...,...,...,...
5590,nonamp_5591,LGKSVTN,-1
8446,nonamp_8447,MDKKKNISMAVIRRLPKYHRYLYELLKNDVDRISSKELSEKIGFTA...,-1
5740,nonamp_5741,MYHDLIRSELNEAADTLANFLKDDSNIDAIQRAAILLADSFKAGGK...,-1
5986,nonamp_5987,MLYRSISCPKGTFFMTTPPAAAEIFGDNLEKAIAYHESLATDGSVR...,-1


In [8]:
data_df = sampled_data_df

In [9]:
from sklearn.model_selection import train_test_split


X_train, X_test, y_train, y_test = train_test_split(data_df['sequence'].values, 
                                                    data_df['label'].values, 
                                                    stratify=data_df['label'], 
                                                    random_state=random_seed)
print('Train sequences:',len(X_train))
print('Test sequences:',len(X_test))

Train sequences: 488
Test sequences: 163
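
The split above passes `stratify`, so each class keeps roughly the same share in the training and test parts, and it relies on the default of `train_test_split`, which holds out 25% of the samples. The sketch below illustrates the idea with the standard library only. It is not scikit-learn's implementation; the function `stratified_split` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx) so that each class keeps its share in both parts."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)  # shuffle within the class only
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx

labels = [1] * 20 + [-1] * 60  # toy data with a 1:3 class imbalance
train_idx, test_idx = stratified_split(labels)
train_counts = Counter(labels[i] for i in train_idx)
test_counts = Counter(labels[i] for i in test_idx)
print('train:', dict(train_counts))
print('test: ', dict(test_counts))
```

Both parts keep the 1:3 ratio of the toy labels, which matters when one class is much smaller than the other.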


In [10]:
# kernel class import
from sys import path as sys_path
sys_path.append('..')
from strkernels import SubsequenceStringKernel

# create a kernel
subsequence_kernel = SubsequenceStringKernel(maxlen=1, ssk_lambda=1)
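
To give an intuition for `maxlen` and `ssk_lambda`, the following brute-force sketch computes a gap-weighted subsequence kernel for one fixed subsequence length, in the textbook formulation of Lodhi et al. (2002). It assumes that `maxlen` corresponds to the subsequence length `n` and `ssk_lambda` to the decay factor; whether `strkernels` normalizes the value or combines several lengths is not confirmed here. The functions `subsequence_features` and `ssk` are hypothetical and only practical for short strings.

```python
from collections import defaultdict
from itertools import combinations

def subsequence_features(s, n, decay):
    """Map each length-n subsequence of s to the sum of decay**span over its occurrences."""
    features = defaultdict(float)
    for positions in combinations(range(len(s)), n):
        u = ''.join(s[i] for i in positions)
        span = positions[-1] - positions[0] + 1  # window covered in s, gaps included
        features[u] += decay ** span
    return features

def ssk(s, t, n, decay):
    """Unnormalized kernel value: dot product of the two feature maps."""
    fs = subsequence_features(s, n, decay)
    ft = subsequence_features(t, n, decay)
    return sum(weight * ft[u] for u, weight in fs.items() if u in ft)

# 'cat' and 'car' share only the length-2 subsequence 'ca' (span 2 in both strings)
k2 = ssk('cat', 'car', n=2, decay=0.5)
# with n=1 and decay=1 the kernel just counts matching character pairs ('c' and 'a')
k1 = ssk('cat', 'car', n=1, decay=1.0)
print(k2, k1)
```

Under these assumptions, the setting `maxlen=1, ssk_lambda=1` compares sequences by their character composition alone, while larger lengths and decay values below 1 reward shared subsequences that occur with few gaps.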

In [11]:
# create a support vector classifier with the kernel
from sklearn.svm import SVC
clf = SVC(kernel=subsequence_kernel)

# train the classifier
clf.fit(X_train, y_train)

In [12]:
# make predictions using the classifier
predictions = clf.predict(X_test)

# calculate accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)

# print the accuracy
print("Accuracy of classification:", accuracy)

Accuracy of classification: 0.9079754601226994
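
Accuracy can look high on imbalanced data even when a classifier ignores the minority class, which is one reason to also report the Matthews correlation coefficient (MCC). The sketch below computes both from the confusion counts with the standard library only. The toy labels are hypothetical, and returning 0 when the denominator vanishes is a convention chosen for this sketch.

```python
from math import sqrt

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mcc(y_true, y_pred, positive=1):
    """Matthews correlation coefficient for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # convention for this sketch: MCC is 0 when a row or column of the confusion matrix is empty
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

y_true = [1] * 25 + [-1] * 75   # toy labels with a 1:3 imbalance
always_negative = [-1] * 100    # a classifier that never predicts the positive class

acc_trivial = accuracy(y_true, always_negative)
mcc_trivial = mcc(y_true, always_negative)
mcc_perfect = mcc(y_true, y_true)
print(acc_trivial, mcc_trivial, mcc_perfect)
```

The trivial classifier reaches 75% accuracy on the toy labels but an MCC of 0, whereas a perfect classifier reaches an MCC of 1.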


In [13]:
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer, matthews_corrcoef

# create a support vector classifier with the kernel
clf = SVC(kernel=subsequence_kernel)

# set parameters for grid search
param_grid = {
    'kernel__maxlen': [4],
    'kernel__ssk_lambda': [0.9, 0.95, 1.0],
}

mcc_scorer = make_scorer(matthews_corrcoef)

# create the GridSearchCV object
grid_search = GridSearchCV(estimator=clf, 
                           param_grid=param_grid, 
                           scoring=mcc_scorer, 
                           cv=10,
                           n_jobs=-1, 
                           verbose=3)

# fit the model to the training data
grid_search.fit(X_train, y_train)

# get the best parameters
best_params = grid_search.best_params_

# get the best trained model
best_model = grid_search.best_estimator_

# make predictions using the best model
predictions = best_model.predict(X_test)

# calculate MCC score
MCC_score = matthews_corrcoef(y_test, predictions)

# print the results
print("\nBest parameters:", best_params)
print("MCC score of the best model:", MCC_score)

Fitting 10 folds for each of 1 candidates, totalling 10 fits
[CV 1/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.588 total time= 4.4min
[CV 2/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.890 total time= 4.3min
[CV 3/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.840 total time= 4.2min
[CV 4/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.733 total time= 4.3min
[CV 5/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.649 total time= 4.3min
[CV 6/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.945 total time= 4.3min
[CV 7/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.905 total time= 4.2min
[CV 8/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.791 total time= 4.1min
[CV 9/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.778 total time= 4.2min
[CV 10/10] END kernel__maxlen=9, kernel__ssk_lambda=0.9;, score=0.830 total time= 4.5min

Best parameters: {'kernel__maxlen': 9, 'kernel__ssk_lambd