# DNA classification with Fixed Degree string kernel

This notebook shows how to use Scikit-learn to search for the best Support Vector Machine (SVM) model for classifying DNA sequences with the Fixed Degree string kernel.

## 1. Dataset preparation

The dataset used in this notebook comprises 2,000 artificial DNA sequences, each 50 bases long. Each sequence has a label indicating the presence (label 1) or absence (label 0) of a motif (CGACCGAACTCC) that hypothetically enables binding to a protein. The dataset comes from the tutorial linked to the article "A Primer on Deep Learning in Genomics" (Nature Genetics, 2019) by James Zou, Mikael Huss, Abubakar Abid, Pejman Mohammadi, Ali Torkamani & Amalio Telenti. It contains 987 positive sequences (label 1) and 1,013 negative sequences (label 0).

We aim to train a model on this dataset that can classify unseen sequences as able or unable to bind to the protein.

The first step involves loading the dataset and creating a dataframe.

In [1]:
from os import path
sequences_file_path = path.join('data', 'Zou-et-al-2019', 'sequences.txt')
labels_file_path = path.join('data', 'Zou-et-al-2019', 'labels.txt')

import pandas as pd
seqs_df = pd.read_csv(sequences_file_path, header=None, names=['sequence'])
labels_df = pd.read_csv(labels_file_path, header=None, names=['label'])
data_df = pd.concat([seqs_df, labels_df], axis=1)

print('Sequences length:', len(data_df['sequence'][0]))
print('Positive sequences:', len(data_df[data_df['label'] == 1]))
print('Negative sequences:', len(data_df[data_df['label'] == 0]))
print('\nDataset dataframe:')
data_df

Sequences length: 50
Positive sequences: 987
Negative sequences: 1013

Dataset dataframe:


| | sequence | label |
|---|---|---|
| 0 | CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA... | 0 |
| 1 | GAGTTTATATGGCGCGAGCCTAGTGGTTTTTGTACTTGTTTGTCGC... | 0 |
| 2 | GATCAGTAGGGAAACAAACAGAGGGCCCAGCCACATCTAGCAGGTA... | 0 |
| 3 | GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC... | 1 |
| 4 | GGCGACCGAACTCCAACTAGAACCTGCATAACTGGCCTGGGAGATA... | 1 |
| ... | ... | ... |
| 1995 | GTCGCGCGGGTGCGGAGGATGAGTCGCAGACGCATTTATGTCGCCC... | 0 |
| 1996 | GTTCGCAGCGTATTGAGTAATGTTTGACTCGACCGAACTCCATATT... | 1 |
| 1997 | ACTCGCTGTCCACGTCTATTCCTAGGGGTTTTATTTCGCAAGGTGA... | 0 |
| 1998 | TGCAAAGGGGCGACCGAACTCCCTTTACCGCGGAGTTATTCATAAT... | 1 |
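Because the label is described as encoding the presence of the motif, a quick sanity check is to test for the motif directly. The snippet below is a minimal sketch: `has_motif` is a hypothetical helper, and the two strings are the truncated 46-base prefixes of rows 0 and 3 as displayed above, not the full 50-base sequences.

```python
MOTIF = "CGACCGAACTCC"  # motif named in the dataset description


def has_motif(sequence, motif=MOTIF):
    """Return True if the motif occurs anywhere in the sequence."""
    return motif in sequence


# Truncated prefixes of rows 0 and 3 as shown in the dataframe output,
# paired with the labels displayed next to them.
examples = {
    "CCGAGGGCTATGGTTTGGAAGTTAGAACCCTGGGGCTTCTCGCGGA": 0,
    "GTCCACGACCGAACTCCCACCTTGACCGCAGAGGTACCACCAGAGC": 1,
}

for seq, label in examples.items():
    print(seq[:20] + "...", "motif found:", has_motif(seq), "| label:", label)
```

On the full dataframe, the same check could presumably be run column-wise, for example with the pandas string method `str.contains`.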


The subsequent step is to partition the dataset into training and testing sets using Scikit-Learn.

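With the default test size of 25%, this allocates 1,500 sequences for training and 500 for testing. The `stratify` argument keeps the class proportions similar in both sets.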

In [2]:
from sklearn.model_selection import train_test_split
random_seed = 1708  # fixed seed so the split is reproducible

X_train, X_test, y_train, y_test = train_test_split(data_df['sequence'].values, 
                                                    data_df['label'].values, 
                                                    stratify=data_df['label'], 
                                                    random_state=random_seed)
print('Train sequences:',len(X_train))
print('Test sequences:',len(X_test))

Train sequences: 1500
Test sequences: 500
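To illustrate what stratification means, here is a minimal pure-Python sketch that splits each class separately so both sets keep roughly the same class ratio. It is not Scikit-learn's implementation; `stratified_split` and the toy labels are hypothetical and serve only as an illustration.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), sampling each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return train_idx, test_idx


# Hypothetical toy labels: 8 negatives and 12 positives.
toy_labels = [0] * 8 + [1] * 12
train_idx, test_idx = stratified_split(toy_labels)

train_counts = Counter(toy_labels[i] for i in train_idx)
test_counts = Counter(toy_labels[i] for i in test_idx)
print("train:", dict(train_counts))
print("test: ", dict(test_counts))
```

In this toy case, a quarter of each class ends up in the test set, so the 2:3 ratio between the classes is the same in both parts.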


## 2. String kernel example

String kernels are functions that take two strings as input and return a real number, quantifying their similarity.

In this notebook, we employ the Fixed Degree string kernel, which counts the positions at which the two input strings share an identical substring of a fixed length. The hyperparameter *degree* sets the length of these substrings.
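The definition can be sketched in a few lines of plain Python. This follows the description above and is not necessarily how the `strkernels` package computes it; `fixed_degree_count` and the two toy strings are hypothetical, and no normalization is applied.

```python
def fixed_degree_count(s, t, degree):
    """Count positions i where s[i:i+degree] equals t[i:i+degree]."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(
        s[i:i + degree] == t[i:i + degree]
        for i in range(len(s) - degree + 1)
    )


# Two toy strings that differ only at position 3.
a = "ACGTAC"
b = "ACGAAC"

for degree in (1, 2, 3):
    print("degree", degree, "->", fixed_degree_count(a, b, degree))
# degree 1 -> 5
# degree 2 -> 3
# degree 3 -> 1
```

A single mismatch breaks every window that covers it, so the count drops quickly as *degree* grows.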



In [3]:
# kernel class import
from sys import path as sys_path
sys_path.append('..')
from strkernels import FixedDegreeStringKernel

# create a kernel with degree 1
fixed_degree_kernel = FixedDegreeStringKernel(degree=1)

Using a kernel function and a dataset, we can construct a kernel matrix containing the results of the function applied to all pairs of samples in the dataset.

The kernel matrix serves as the data representation input for kernel methods, with SVM being the most popular in classification problems.

Typically, the values in the kernel matrix are normalized so that each sample has a similarity of 1 with itself, which makes 1 the maximum value.

In [4]:
train_kernel_matrix = fixed_degree_kernel(X_train, X_train)

print('Train kernel matrix shape:', train_kernel_matrix.shape)

print('\nTrain kernel matrix:')
train_kernel_matrix

Train kernel matrix shape: (1500, 1500)

Train kernel matrix:


array([[1.  , 0.3 , 0.34, ..., 0.22, 0.18, 0.22],
       [0.3 , 1.  , 0.2 , ..., 0.24, 0.18, 0.22],
       [0.34, 0.2 , 1.  , ..., 0.2 , 0.3 , 0.26],
       ...,
       [0.22, 0.24, 0.2 , ..., 1.  , 0.16, 0.24],
       [0.18, 0.18, 0.3 , ..., 0.16, 1.  , 0.26],
       [0.22, 0.22, 0.26, ..., 0.24, 0.26, 1.  ]])
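One common way to obtain a matrix with ones on the diagonal is cosine normalization, which divides each kernel value by the square root of the product of the two self-similarities. The sketch below assumes this scheme; whether `FixedDegreeStringKernel` normalizes in exactly this way is an assumption, and the helper names and toy sequences are hypothetical.

```python
import math


def fixed_degree_count(s, t, degree):
    """Count positions where s and t share the same substring of length `degree`."""
    return sum(
        s[i:i + degree] == t[i:i + degree]
        for i in range(len(s) - degree + 1)
    )


def normalized_kernel_matrix(seqs_a, seqs_b, degree):
    """Cosine-normalized kernel matrix: k(s, t) / sqrt(k(s, s) * k(t, t))."""
    matrix = []
    for s in seqs_a:
        k_ss = fixed_degree_count(s, s, degree)
        row = []
        for t in seqs_b:
            k_tt = fixed_degree_count(t, t, degree)
            row.append(fixed_degree_count(s, t, degree) / math.sqrt(k_ss * k_tt))
        matrix.append(row)
    return matrix


toy_seqs = ["ACGTAC", "ACGAAC", "TTTTTT"]
K = normalized_kernel_matrix(toy_seqs, toy_seqs, degree=1)
for row in K:
    print([round(value, 2) for value in row])
```

With *degree* 1 and equal-length sequences, this reduces to the fraction of matching positions, which is consistent with values such as 0.3 (15 of 50 positions) in the matrix above.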

## 3. Scikit-learn basic integration

Now, using the SVM classifier available in Scikit-learn, we will train a model using the training data and the Fixed Degree string kernel. Next, we will classify the sequences in the test dataset and check the accuracy of the predictions.

In [5]:
# create a kernel with degree 1
fixed_degree_kernel = FixedDegreeStringKernel(degree=1)

# create a support vector classifier with the kernel
from sklearn.svm import SVC
clf = SVC(kernel=fixed_degree_kernel)

# train the classifier
clf.fit(X_train, y_train)

In [6]:
# make predictions using the classifier
predictions = clf.predict(X_test)

# calculate accuracy
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, predictions)

# print the accuracy
print("Accuracy of classification:", accuracy)

Accuracy of classification: 0.704
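An alternative to passing a kernel object to `SVC` is to compute the kernel matrices yourself and use `kernel='precomputed'`. The sketch below assumes NumPy and Scikit-learn are installed; `fixed_degree_gram` is a hypothetical, unnormalized stand-in for the library kernel, and the toy sequences and labels are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC


def fixed_degree_gram(seqs_a, seqs_b, degree=1):
    """Unnormalized matrix of position-wise substring match counts."""
    gram = np.zeros((len(seqs_a), len(seqs_b)))
    for i, s in enumerate(seqs_a):
        for j, t in enumerate(seqs_b):
            gram[i, j] = sum(
                s[p:p + degree] == t[p:p + degree]
                for p in range(len(s) - degree + 1)
            )
    return gram


# Hypothetical toy data: positives start with "ACGT".
train_seqs = ["ACGTAC", "ACGTTT", "ACGTGG", "TTTTAC", "GGGTAC", "CCCCAC"]
train_labels = [1, 1, 1, 0, 0, 0]
test_seqs = ["ACGTCC", "TTGGAC"]

# Training uses the (n_train, n_train) matrix.
K_train = fixed_degree_gram(train_seqs, train_seqs, degree=3)
toy_clf = SVC(kernel="precomputed")
toy_clf.fit(K_train, train_labels)

# Prediction uses the (n_test, n_train) matrix: test rows, training columns.
K_test = fixed_degree_gram(test_seqs, train_seqs, degree=3)
toy_predictions = toy_clf.predict(K_test)
print(toy_predictions)
```

The point to remember is the shape convention: at prediction time, the rows correspond to the test samples and the columns to the training samples.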


## 4. Scikit-learn grid search integration

Now, we will search for the best value for the hyperparameter *degree* of the Fixed Degree string kernel for this dataset. 

We test all values for *degree* from 1 to 15, using the grid search cross-validation class of Scikit-learn.
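Conceptually, a grid search with cross-validation scores every candidate value on held-out folds and keeps the value with the best mean score. The sketch below illustrates that loop in plain Python under simplifying assumptions: a nearest-neighbor rule stands in for the SVM, the folds are interleaved rather than stratified, and all names and toy data are hypothetical.

```python
def fixed_degree_count(s, t, degree):
    """Count positions where s and t share the same substring of length `degree`."""
    return sum(
        s[i:i + degree] == t[i:i + degree]
        for i in range(len(s) - degree + 1)
    )


def predict_nearest(train_seqs, train_labels, query, degree):
    """Label of the most similar training sequence (a stand-in for the SVM)."""
    best = max(
        range(len(train_seqs)),
        key=lambda i: fixed_degree_count(train_seqs[i], query, degree),
    )
    return train_labels[best]


def cv_accuracy(seqs, labels, degree, n_folds):
    """Mean validation accuracy over `n_folds` interleaved folds."""
    fold_scores = []
    for fold in range(n_folds):
        val_idx = [i for i in range(len(seqs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(seqs)) if i % n_folds != fold]
        fold_train_seqs = [seqs[i] for i in train_idx]
        fold_train_labels = [labels[i] for i in train_idx]
        hits = sum(
            predict_nearest(fold_train_seqs, fold_train_labels, seqs[i], degree) == labels[i]
            for i in val_idx
        )
        fold_scores.append(hits / len(val_idx))
    return sum(fold_scores) / n_folds


# Hypothetical toy data: label 1 if the sequence starts with "ACGT".
toy_seqs = ["ACGTAA", "ACGTCC", "ACGTGG", "ACGTTT",
            "TTGTAA", "GGCACC", "CCATGG", "TAGCTT"]
toy_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Score every candidate degree and keep the one with the best mean accuracy.
scores = {d: cv_accuracy(toy_seqs, toy_labels, d, n_folds=2) for d in range(1, 5)}
best_degree = max(scores, key=scores.get)
print("mean accuracy per degree:", scores)
print("best degree:", best_degree)
```

`GridSearchCV` automates this loop and, after the search, refits the best configuration on the whole training set.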

In [7]:
from sklearn.model_selection import GridSearchCV

# create a kernel
fixed_degree_kernel = FixedDegreeStringKernel()

# create a support vector classifier with the kernel
clf = SVC(kernel=fixed_degree_kernel)

# set parameters for grid search
param_grid = {
    'kernel__degree': list(range(1, 16)),
}

# create the GridSearchCV object
grid_search = GridSearchCV(estimator=clf, 
                           param_grid=param_grid, 
                           scoring='accuracy', 
                           cv=5, 
                           verbose=3)

# fit the model to the training data
grid_search.fit(X_train, y_train)

# get the best parameters
best_params = grid_search.best_params_

# get the best trained model
best_model = grid_search.best_estimator_

# make predictions using the best model
predictions = best_model.predict(X_test)

# calculate accuracy
accuracy = accuracy_score(y_test, predictions)

# print the results
print("\nBest parameters:", best_params)
print("Accuracy of the best model:", accuracy)

Fitting 5 folds for each of 15 candidates, totalling 75 fits
[CV 1/5] END ..................kernel__degree=1;, score=0.683 total time=   0.5s
[CV 2/5] END ..................kernel__degree=1;, score=0.710 total time=   0.5s
[CV 3/5] END ..................kernel__degree=1;, score=0.663 total time=   0.6s
[CV 4/5] END ..................kernel__degree=1;, score=0.730 total time=   0.6s
[CV 5/5] END ..................kernel__degree=1;, score=0.683 total time=   0.5s
[CV 1/5] END ..................kernel__degree=2;, score=0.813 total time=   0.5s
[CV 2/5] END ..................kernel__degree=2;, score=0.823 total time=   0.5s
[CV 3/5] END ..................kernel__degree=2;, score=0.837 total time=   0.5s
[CV 4/5] END ..................kernel__degree=2;, score=0.820 total time=   0.5s
[CV 5/5] END ..................kernel__degree=2;, score=0.800 total time=   0.5s
[CV 1/5] END ..................kernel__degree=3;, score=0.913 total time=   0.5s
[CV 2/5] END ..................kernel__degree=3;

We observe that, for *degree* values between 7 and 11, the model classified all sequences correctly, indicating good performance on the proposed problem.