# Kernel Methods challenge

Importing base libraries...

In [1]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

## Debugging requirements
import pdb

## Performance metrics requirements
import time

## Kernel SVM requirements
from cvxopt import matrix
from cvxopt import solvers
import mosek

from scipy.spatial.distance import cdist
from numpy.core.defchararray import not_equal

## 1. Loading the data + sanity checks

In [2]:
%run data_handler.py

## Loading training data
tr0 = load_data(0, 'tr')
tr1 = load_data(1, 'tr')
tr2 = load_data(2, 'tr')

## Loading test data
te0 = load_data(0, 'te')
te1 = load_data(1, 'te')
te2 = load_data(2, 'te')

Some sanity checks...

<b>Training set 0</b>

In [None]:
tr0['Bound'].describe()

In [None]:
tr0.head(5)

In [None]:
tr0.tail(5)

<b>Training set 1</b>

In [None]:
tr1['Bound'].describe()

In [None]:
tr1.head(5)

In [None]:
tr1.tail(5)

<b>Training set 2</b>

In [None]:
tr2['Bound'].describe()

In [None]:
tr2.head(5)

In [None]:
tr2.tail(5)

<b>Test set 0</b>

In [None]:
te0['Sequence'].describe()

In [None]:
te0.head(5)

In [None]:
te0.tail(5)

<b>Test set 1</b>

In [None]:
te1['Sequence'].describe()

In [None]:
te1.head(5)

In [None]:
te1.tail(5)

<b>Test set 2</b>

In [None]:
te2['Sequence'].describe()

In [None]:
te2.head(5)

In [None]:
te2.tail(5)

First idea: use some distance on the strings as a kernel.
However, note that some distances (e.g. Hamming) are only defined for sequences of the same length.
What are the minimum and maximum lengths of the DNA sequences in this first training set?

In [None]:
min_length = tr0['Sequence'].str.len().min(0)
max_length = tr0['Sequence'].str.len().max(0)
print('Min sequence length: {}'.format(min_length))
print('Max sequence length: {}'.format(max_length))
print('Length amplitude: {}'.format(max_length-min_length))

## 2. Defining first kernels + running simple classification model

### First kernels

OK, so here all sequences have the same length. That means that we can start with something simple like Hamming. However, we may want something that extends seamlessly to DNA sequences of different lengths...
Here I will test both the Hamming and the Levenshtein distances as the basis for kernels on DNA sequences:

In [20]:
%run kernels.py
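The contents of `kernels.py` are not shown here. As a rough sketch of the two distances mentioned above (the function names `hamming` and `levenshtein` are illustrative and not necessarily those defined in that file), they could be written in plain Python as follows:

```python
def hamming(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(a != b for a, b in zip(s, t))


def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions turning s into t."""
    # At the start of iteration i, prev[j] holds the distance between the
    # first i-1 characters of s and the first j characters of t.
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,              # delete a
                            curr[j - 1] + 1,          # insert b
                            prev[j - 1] + (a != b)))  # substitute (or match)
        prev = curr
    return prev[-1]
```

Unlike Hamming, Levenshtein accepts sequences of different lengths, but each pair costs a number of operations proportional to the product of the two lengths, which is why it is much slower when filling a full kernel matrix.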

Testing kernel computation speed (debugging only):

In [None]:
t0 = time.time()
Ktr0 = build_kernel(tr0['Sequence'], tr0['Sequence'], kernel_fct = hamming_distance)
t1 = time.time()
Ktr1 = build_kernel(tr1['Sequence'], tr1['Sequence'], kernel_fct = hamming_distance)
t2 = time.time()
Ktr2 = build_kernel(tr2['Sequence'], tr2['Sequence'], kernel_fct = hamming_distance)
t3 = time.time()

In [None]:
print('Preparing kernel matrix for training dataset 0 took {0:d}min {1:d}s with this method'.format(int((t1-t0)/60),int(t1-t0)%60))
print('Preparing kernel matrix for training dataset 1 took {0:d}min {1:d}s with this method'.format(int((t2-t1)/60),int(t2-t1)%60))
print('Preparing kernel matrix for training dataset 2 took {0:d}min {1:d}s with this method'.format(int((t3-t2)/60),int(t3-t2)%60))

### Tools

Defining a couple of loss functions that will be useful:

In [4]:
%run metrics.py
        
m_binary = Metric('Match rate', lambda preds,labels: 1 - ls_binary(preds,labels))
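`metrics.py` is not reproduced in this notebook. To make the idea concrete, here is a minimal sketch of what a 0-1 loss and a named metric wrapper might look like. `SimpleMetric` and `binary_loss` are hypothetical stand-ins, and the real `Metric` and `ls_binary` may well differ:

```python
class SimpleMetric:
    """Pairs a display name with a scoring function (hypothetical helper)."""

    def __init__(self, name, fn):
        self.name = name
        self.fn = fn

    def __call__(self, preds, labels):
        return self.fn(preds, labels)


def binary_loss(preds, labels):
    """Fraction of predictions that differ from their label (0-1 loss)."""
    if len(preds) != len(labels):
        raise ValueError("preds and labels must have the same length")
    return sum(p != l for p, l in zip(preds, labels)) / len(labels)


# The match rate is simply one minus the 0-1 loss.
match_rate = SimpleMetric("Match rate", lambda p, l: 1 - binary_loss(p, l))
```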

### Kernel method parent & kernel SVM

Throughout the challenge we will need to use different kernel methods, which will share some attributes and methods. I will thus create an "abstract" class kernelMethod, and derive a kernelSVM class from it.

The first thing I will try out is a kernel SVM method:

In [21]:
%run kernel_methods.py
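Since `kernel_methods.py` is not shown, the following is only a sketch of how the quadratic program behind a kernel SVM can be assembled. It assumes one common dual formulation, maximise 2αᵀy − αᵀKα subject to 0 ≤ yᵢαᵢ ≤ 1/(2λn) with labels in {-1, +1}, which may not be exactly the one used in `kernelSVM`. The helper name `svm_dual_qp` is hypothetical.

```python
import numpy as np


def svm_dual_qp(K, y, lbda):
    """Assemble (P, q, G, h) for  min 1/2 a'Pa + q'a  s.t.  Ga <= h.

    Encodes the dual  max 2 a'y - a'Ka  s.t.  0 <= y_i a_i <= 1/(2*lbda*n),
    assuming labels y in {-1, +1}.
    """
    K = np.asarray(K, dtype=float)
    y = np.asarray(y, dtype=float)
    n = y.shape[0]
    P = 2.0 * K                       # quadratic term: a'Ka = 1/2 a'(2K)a
    q = -2.0 * y                      # linear term: -2 a'y
    G = np.vstack([np.diag(y),        #  y_i a_i <= 1/(2*lbda*n)
                   -np.diag(y)])      # -y_i a_i <= 0
    h = np.concatenate([np.full(n, 1.0 / (2.0 * lbda * n)),
                        np.zeros(n)])
    return P, q, G, h


# Tiny example with a linear kernel on three 2-D points.
X = np.array([[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]])
y = np.array([1, 1, -1])
P, q, G, h = svm_dual_qp(X @ X.T, y, lbda=0.5)
```

These arrays can then be wrapped with `cvxopt.matrix` and passed to `solvers.qp(P, q, G, h)`. Given the solution α, the decision function is f(x) = Σᵢ αᵢ K(xᵢ, x).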

## Testing SVM implementation with a linear SVM on iris dataset

First, let's test the kernelSVM class that we've built on simple data from the Iris dataset...

In [10]:
iris_file  = 'misc_data/Iris.csv'
iris = pd.read_csv(iris_file)

In [11]:
iris = iris.assign(label=(iris['Species']=='Iris-setosa'))
_ = iris.pop('Species')

In [12]:
lbda2 = 0.5
lSVM = kernelSVM(lbda2)
iris_X = iris.drop(['Id','label'], axis=1).values  # .as_matrix() was removed in pandas 1.0
iris_Y = iris['label'].values
lSVM.run(iris_X, iris_Y, kernel_fct=None, stringsData=False)

Building kernel matrix from 150 samples...
...done in 0.73s
     pcost       dcost       gap    pres   dres
 0: -2.6452e+01 -2.3720e+00  1e+03  3e+01  1e-13
 1: -9.4290e-01 -2.3555e+00  2e+01  5e-01  1e-13
 2: -5.2683e-01 -1.9421e+00  2e+00  2e-02  6e-15
 3: -5.3956e-01 -7.4080e-01  2e-01  2e-03  2e-15
 4: -5.9536e-01 -6.5119e-01  6e-02  3e-04  2e-15
 5: -6.1134e-01 -6.2889e-01  2e-02  7e-05  2e-15
 6: -6.1649e-01 -6.2179e-01  5e-03  2e-05  2e-15
 7: -6.1876e-01 -6.1895e-01  2e-04  2e-07  3e-15
 8: -6.1885e-01 -6.1885e-01  6e-06  4e-09  3e-15
 9: -6.1885e-01 -6.1885e-01  6e-08  4e-11  3e-15
Optimal solution found.


In [13]:
iris_preds = np.ravel(format_preds(lSVM.predict(iris_X, stringsData=False)))

In [None]:
iris_Y

In [14]:
iris_preds.astype(bool)

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True,  True,  True, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False, False, False, False, False, False, False, False,
       False, False,

Great! It looks like the kernelSVM class is fully functional on a linear kernel with vector data.

## KernelSVM for predicting transcription factor binding
Now let's try our kernelSVM with some basic kernels:
- based on Hamming distance (acceptable in terms of computation time for our purpose)
- based on Levenshtein distance? (would seem more relevant to the problem, but computational issues abound)

In [22]:
## Method definition
lbda = 0.005
kSVM = kernelSVM(lbda)

In [23]:
## Training SVM + performance assessment on training data
kSVM.run(tr0['Sequence'], tr0['Bound'], hamming_kernel)
preds_kSVM_tr0, perf_kSVM_tr0 = kSVM.assess(tr0['Sequence'], tr0['Bound'], metrics=[m_binary])
print('Training dataset {0:d}: {1:s}: {2:.1f}%'.format(0, list(perf_kSVM_tr0.keys())[0], 100*list(perf_kSVM_tr0.values())[0]))
kSVM_te0_raw = np.sign(kSVM.predict(te0['Sequence'])).astype(int)

Building kernel matrix from 2000 samples...


...done in 65.67s
     pcost       dcost       gap    pres   dres
 0: -3.5025e+03 -4.2278e+02  3e+04  8e+01  1e-14
 1: -3.9363e+02 -3.9398e+02  2e+03  4e+00  9e-15
 2: -1.6487e+02 -3.3071e+02  2e+02  6e-16  2e-15
 3: -1.8903e+02 -1.9917e+02  1e+01  2e-16  1e-15
 4: -1.9590e+02 -1.9606e+02  2e-01  2e-16  1e-15
 5: -1.9600e+02 -1.9601e+02  2e-03  2e-16  9e-16
 6: -1.9601e+02 -1.9601e+02  2e-05  2e-16  1e-15
Optimal solution found.


Training dataset 0: Match rate: 65.0%


In [24]:
## Training SVM + performance assessment on training data
kSVM.run(tr1['Sequence'], tr1['Bound'], hamming_kernel)
preds_kSVM_tr1, perf_kSVM_tr1 = kSVM.assess(tr1['Sequence'], tr1['Bound'], metrics=[m_binary])
print('Training dataset {0:d}: {1:s}: {2:.1f}%'.format(1, list(perf_kSVM_tr1.keys())[0], 100*list(perf_kSVM_tr1.values())[0]))
kSVM_te1_raw = np.sign(kSVM.predict(te1['Sequence'])).astype(int)

Building kernel matrix from 2000 samples...


...done in 70.07s
     pcost       dcost       gap    pres   dres
 0: -3.2124e+03 -4.7333e+02  4e+04  9e+01  1e-14
 1: -3.4629e+02 -4.4677e+02  2e+03  5e+00  1e-14
 2: -1.5221e+02 -3.6977e+02  2e+02  7e-16  1e-15
 3: -1.7446e+02 -2.0061e+02  3e+01  2e-16  1e-15
 4: -1.8635e+02 -1.8775e+02  1e+00  2e-16  1e-15
 5: -1.8708e+02 -1.8715e+02  7e-02  2e-16  9e-16
 6: -1.8712e+02 -1.8713e+02  2e-03  2e-16  9e-16
 7: -1.8713e+02 -1.8713e+02  2e-05  2e-16  9e-16
Optimal solution found.


Training dataset 1: Match rate: 66.2%


In [25]:
## Training SVM + performance assessment on training data
kSVM.run(tr2['Sequence'], tr2['Bound'], hamming_kernel)
preds_kSVM_tr2, perf_kSVM_tr2 = kSVM.assess(tr2['Sequence'], tr2['Bound'], metrics=[m_binary])
print('Training dataset {0:d}: {1:s}: {2:.1f}%'.format(2, list(perf_kSVM_tr2.keys())[0], 100*list(perf_kSVM_tr2.values())[0]))
kSVM_te2_raw = np.sign(kSVM.predict(te2['Sequence'])).astype(int)

Building kernel matrix from 2000 samples...


...done in 69.18s
     pcost       dcost       gap    pres   dres
 0: -3.5386e+03 -4.2603e+02  3e+04  8e+01  8e-15
 1: -3.8617e+02 -4.0054e+02  2e+03  5e+00  7e-15
 2: -1.6318e+02 -3.3638e+02  2e+02  7e-16  2e-15
 3: -1.8727e+02 -2.0100e+02  1e+01  2e-16  1e-15
 4: -1.9526e+02 -1.9550e+02  2e-01  2e-16  9e-16
 5: -1.9540e+02 -1.9541e+02  2e-03  2e-16  1e-15
 6: -1.9540e+02 -1.9540e+02  2e-05  2e-16  9e-16
Optimal solution found.


Training dataset 2: Match rate: 62.7%


<b>Current performance rate</b>: ~65% on training set, ?? on test set

<b>Next steps - Results</b>:
- What is the reason for such a poor performance rate, even on the training data?
- If this is due to Hamming being mostly irrelevant, implement the Levenshtein distance and retry with this new kernel

<b>Next steps - Computing speed</b>:
- Find a way to vectorize the kernel matrix computation
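
One possible route to vectorization, sketched here under the assumption that all sequences have the same length and only contain the letters A, C, G and T: one-hot encode every sequence, so that a single matrix product counts matching positions for all pairs at once. The helper names `one_hot` and `match_kernel` are illustrative and are not part of `kernels.py`.

```python
import numpy as np


def one_hot(seqs, alphabet="ACGT"):
    """Encode equal-length strings as an (n, L * len(alphabet)) 0/1 matrix."""
    # alphabet.index raises ValueError on any character outside the alphabet.
    codes = np.array([[alphabet.index(c) for c in s] for s in seqs])  # (n, L)
    eye = np.eye(len(alphabet))
    return eye[codes].reshape(len(seqs), -1)


def match_kernel(seqs_a, seqs_b, alphabet="ACGT"):
    """K[i, j] = number of positions where seqs_a[i] and seqs_b[j] agree."""
    return one_hot(seqs_a, alphabet) @ one_hot(seqs_b, alphabet).T


seqs = ["ACGT", "ACGA", "TTTT"]
K = match_kernel(seqs, seqs)   # Hamming distance is then len(seq) - K
```

Because `K` is a Gram matrix of explicit feature vectors, it is positive semi-definite by construction, which is not guaranteed when a raw distance is plugged in as a kernel.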

In [32]:
## Predictions on test data
kSVM_te0 = pd.DataFrame(
    data = format_preds(kSVM_te0_raw),
    columns = ['Bound'])

kSVM_te1 = pd.DataFrame(
    data = format_preds(kSVM_te1_raw),
    columns = ['Bound'])
kSVM_te1.index = kSVM_te1.index + 1000

kSVM_te2 = pd.DataFrame(
    data = format_preds(kSVM_te2_raw),
    columns = ['Bound'])
kSVM_te2.index = kSVM_te2.index + 2000

frames = [kSVM_te0, kSVM_te1, kSVM_te2]
kSVM_te = pd.concat(frames)
kSVM_te.index = kSVM_te.index.set_names(['Id'])

kSVM_te.to_csv('predictions/kSVM_te.csv')