# Kernel Methods challenge

Importing base libraries...

In [2]:
import numpy as np
import pandas as pd
import seaborn as sn
import matplotlib.pyplot as plt

## Debugging requirements
import pdb

## Performance metrics requirements
import time

## Kernel SVM requirements
from cvxopt import matrix
from cvxopt import solvers
import mosek

from scipy.spatial.distance import cdist
from numpy.core.defchararray import not_equal

## 1. Loading the data + sanity checks

In [3]:
%run data_handler.py

## Loading training data
tr0 = load_data(0, 'tr')
tr1 = load_data(1, 'tr')
tr2 = load_data(2, 'tr')

## Loading test data
te0 = load_data(0, 'te')
te1 = load_data(1, 'te')
te2 = load_data(2, 'te')

Some sanity checks...

<b>Training set 0</b>

In [None]:
tr0['Bound'].describe()

In [None]:
tr0.head(5)

In [None]:
tr0.tail(5)

<b>Training set 1</b>

In [None]:
tr1['Bound'].describe()

In [None]:
tr1.head(5)

In [None]:
tr1.tail(5)

<b>Training set 2</b>

In [None]:
tr2['Bound'].describe()

In [None]:
tr2.head(5)

In [None]:
tr2.tail(5)

<b>Test set 0</b>

In [None]:
te0['Sequence'].describe()

In [None]:
te0.head(5)

In [None]:
te0.tail(5)

<b>Test set 1</b>

In [None]:
te1['Sequence'].describe()

In [None]:
te1.head(5)

In [None]:
te1.tail(5)

<b>Test set 2</b>

In [None]:
te2['Sequence'].describe()

In [None]:
te2.head(5)

In [None]:
te2.tail(5)

First idea: use some distance on the strings as a kernel.
However, note that some distances (e.g. Hamming) are only defined for sequences of the same length.
What are the minimum and maximum lengths of the DNA sequences in this first training set?

In [None]:
min_length = tr0['Sequence'].str.len().min(0)
max_length = tr0['Sequence'].str.len().max(0)
print('Min sequence length: {}'.format(min_length))
print('Max sequence length: {}'.format(max_length))
print('Length amplitude: {}'.format(max_length-min_length))

## 2. Defining first kernels + running simple classification model

### First kernels

OK, so all sequences here have the same length, which means we can start with something simple like the Hamming distance. However, we may want something that extends seamlessly to DNA sequences of different lengths.
Here I will test both the Hamming and the Levenshtein distances as the basis of kernels on DNA sequences:

In [None]:
%run kernels.py
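The contents of `kernels.py` are not shown in this notebook, so the following is only a minimal sketch of the two distances discussed above, under the assumption that sequences are plain Python strings. The function names `hamming_distance_sketch` and `levenshtein_distance_sketch` are hypothetical and are not the ones defined in `kernels.py`.

```python
def hamming_distance_sketch(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(ca != cb for ca, cb in zip(a, b))


def levenshtein_distance_sketch(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]
```

The Levenshtein loop costs on the order of `len(a) * len(b)` operations per pair of sequences, which is why it is much slower than Hamming when filling a full kernel matrix.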

Testing kernel computation speed (debugging only):

In [None]:
t0 = time.time()
Ktr0 = build_kernel(tr0['Sequence'], tr0['Sequence'], kernel_fct = hamming_distance)
t1 = time.time()
Ktr1 = build_kernel(tr1['Sequence'], tr1['Sequence'], kernel_fct = hamming_distance)
t2 = time.time()
Ktr2 = build_kernel(tr2['Sequence'], tr2['Sequence'], kernel_fct = hamming_distance)
t3 = time.time()

In [None]:
print('Preparing kernel matrix for training dataset 0 took {0:d}min {1:d}s with this method'.format(int((t1-t0)/60),int(t1-t0)%60))
print('Preparing kernel matrix for training dataset 1 took {0:d}min {1:d}s with this method'.format(int((t2-t1)/60),int(t2-t1)%60))
print('Preparing kernel matrix for training dataset 2 took {0:d}min {1:d}s with this method'.format(int((t3-t2)/60),int(t3-t2)%60))

### Tools

Defining a couple of loss functions that will be useful:

In [None]:
%run metrics.py
        
m_binary = Metric('Match rate', lambda preds,labels: 1 - ls_binary(preds,labels), quantized=True)
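Since `metrics.py` is not shown, here is a hedged sketch of what a binary (0-1) loss and the corresponding match rate could look like. The names `ls_binary_sketch` and `match_rate_sketch` are hypothetical, and the real `ls_binary` and `Metric` may behave differently.

```python
import numpy as np


def ls_binary_sketch(preds, labels):
    """Fraction of predictions that differ from the labels (0-1 loss)."""
    preds = np.ravel(preds)
    labels = np.ravel(labels)
    return float(np.mean(preds != labels))


def match_rate_sketch(preds, labels):
    """Fraction of predictions equal to the labels, i.e. 1 - binary loss."""
    return 1.0 - ls_binary_sketch(preds, labels)
```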

### Kernel method parent & kernel SVM

Throughout the challenge we will need to use different kernel methods, which will share some attributes and methods. I will thus create an "abstract" class kernelMethod, and derive a kernelSVM class from it.

The first thing I will try out is a kernel SVM method:

In [None]:
%run kernel_methods.py
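The file `kernel_methods.py` is not reproduced here. As an illustration only, the sketch below builds the matrices of one common dual formulation of the kernel SVM: minimise `1/2 a^T K a - y^T a` subject to `0 <= y_i a_i <= 1/(2 * lbda * n)`, with labels in {-1, +1}. Whether `kernelSVM` uses exactly this formulation is an assumption, and the function names are hypothetical.

```python
import numpy as np


def svm_dual_qp_sketch(K, y, lbda):
    """Return (P, q, G, h) for the QP  min 1/2 a^T P a + q^T a  s.t.  G a <= h."""
    y = np.asarray(y, dtype=float)        # labels assumed to be -1 or +1
    n = y.shape[0]
    P = np.asarray(K, dtype=float)        # quadratic term: the Gram matrix
    q = -y                                # linear term
    # Box constraints 0 <= y_i a_i <= C written as two stacked inequalities
    G = np.vstack([np.diag(y), -np.diag(y)])
    C = 1.0 / (2.0 * lbda * n)
    h = np.concatenate([np.full(n, C), np.zeros(n)])
    return P, q, G, h


def decision_values_sketch(K_test_train, alpha):
    """Decision function f(x) = sum_i alpha_i K(x_i, x) for each test point."""
    return K_test_train @ alpha
```

These arrays are in the form expected by `cvxopt.solvers.qp(P, q, G, h)` once each one is wrapped with `cvxopt.matrix`.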

## Testing SVM implementation with a linear SVM on iris dataset

First let's test the kernelSVM class that we've built on simple data, coming from the Iris dataset...

In [None]:
iris_file  = 'misc_data/Iris.csv'
iris = pd.read_csv(iris_file)

In [None]:
iris = iris.assign(label=(iris['Species']=='Iris-setosa'))
_ = iris.pop('Species')

In [None]:
lbda2 = 0.01
lSVM = kernelSVM(lbda2)
## .values is used instead of .as_matrix(), which was removed in pandas 1.0
iris_X = iris.drop(['Id','label'], axis=1).values
iris_X_res = iris_X[:,:2]
iris_Y = iris['label'].values

First let's test training a linear SVM on the whole dataset

In [None]:
lSVM.train(iris_X_res, iris_Y, kernel_fct=None, stringsData=False)

In [None]:
raw_preds = lSVM.predict(iris_X_res, stringsData=False)
iris_preds = np.ravel(lSVM.classify(raw_preds))

In [None]:
plt.scatter(iris_X_res[:,0],iris_X_res[:,1],c=iris_Y)
plt.show()

In [None]:
plt.scatter(iris_X_res[:,0],iris_X_res[:,1],c=iris_preds)
plt.show()

Then let's try a cross-validation with 5 folds:

In [None]:
lSVM.grid_search(iris_X_res, iris_Y, 0.0000001, 100, 10, n_folds=5, scale='log')
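The call above presumably scans regularisation values between `1e-7` and `100` on a logarithmic scale, but the meaning of its positional arguments (minimum, maximum, number of values) is an assumption, as `grid_search` is defined in a file not shown here. A minimal sketch of such a grid and of a 5-fold index split, with hypothetical helper names:

```python
import numpy as np


def log_grid_sketch(low, high, n_values):
    """n_values candidates evenly spaced on a log10 scale between low and high."""
    return np.logspace(np.log10(low), np.log10(high), n_values)


def kfold_indices_sketch(n_samples, n_folds, seed=0):
    """Shuffle sample indices and cut them into n_folds validation folds."""
    rng = np.random.RandomState(seed)
    permutation = rng.permutation(n_samples)
    return np.array_split(permutation, n_folds)
```

Each candidate value would then be scored by training on four folds and validating on the fifth, averaging over the five possible choices of validation fold.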

In [None]:
lSVM.lbda = 0.01
_ = lSVM.assess(iris_X_res, iris_Y, n_folds=5, stringsData=False)

Great! It looks like the kernelSVM class is fully functional on a linear kernel with vector data.

## KernelSVM for predicting transcription factor binding
Now let's try our kernelSVM with some basic kernels:
- based on the Hamming distance (acceptable in terms of computation time for our purpose)
- based on the Levenshtein distance? (seems more relevant to the problem, but computational issues abound)

In [None]:
## Method definition
lbda = 0.5
kSVM = kernelSVM(lbda)

## We'll try out the spectrum kernel first
substring_length = 3
dict0 = create_dictionary(tr0['Sequence'], substring_length)
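For reference, the k-spectrum kernel between two sequences is the inner product of their k-mer count vectors. The sketch below is a standalone illustration with hypothetical names. It does not use a precomputed dictionary, so it is not the `spectrum_kernel` / `create_dictionary` pair from `kernels.py`.

```python
from collections import Counter


def kmer_counts_sketch(sequence, k):
    """Count every contiguous substring of length k in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


def spectrum_kernel_sketch(x, y, k):
    """Inner product of the k-mer count vectors of x and y."""
    counts_x = kmer_counts_sketch(x, k)
    counts_y = kmer_counts_sketch(y, k)
    # Only k-mers present in both sequences contribute to the sum
    return sum(count * counts_y[kmer] for kmer, count in counts_x.items())
```

With `k = 3` on the DNA alphabet there are at most 64 distinct k-mers, so each sequence is summarised by a short count vector regardless of its length.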

In [None]:
## Training SVM + performance assessment on training data
spectrum_fct = lambda x,y: spectrum_kernel(x, y, substring_length, dict0)
kSVM_tr0_score = kSVM.assess(tr0['Sequence'].values, tr0['Bound'].values, kernel_fct = spectrum_fct, n_folds = 5, metric=m_binary)
kSVM.train(tr0['Sequence'].values, tr0['Bound'].values, kernel_fct = spectrum_fct)
kSVM_te0_raw = kSVM.classify(kSVM.predict(te0['Sequence'].values))

In [None]:
## Training SVM + performance assessment on training data
kSVM_tr1_score = kSVM.assess(tr1['Sequence'].values, tr1['Bound'].values, kernel_fct = hamming_kernel, n_folds = 5, metric=m_binary)
kSVM.train(tr1['Sequence'].values, tr1['Bound'].values, hamming_kernel)
kSVM_te1_raw = np.sign(kSVM.predict(te1['Sequence'])).astype(int)

In [None]:
## Training SVM + performance assessment on training data
kSVM_tr2_score = kSVM.assess(tr2['Sequence'].values, tr2['Bound'].values, kernel_fct = hamming_kernel, n_folds = 5, metric=m_binary)
kSVM.train(tr2['Sequence'].values, tr2['Bound'].values, hamming_kernel)
kSVM_te2_raw = np.sign(kSVM.predict(te2['Sequence'])).astype(int)

<b>Next steps - Results</b>:
- What is the reason for such a poor performance rate, even on the training data?
- If this is due to Hamming being mostly irrelevant, implement the Levenshtein distance and retry with this new kernel

<b>Next steps - Computing speed</b>:
- Find a way to vectorize the kernel matrix computation
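One possible way to vectorize the Hamming-based kernel matrix, shown here as a sketch under the assumption that all sequences have the same length: encode each sequence as a row of integer character codes and count matching positions with NumPy broadcasting. The function names are hypothetical, and the similarity used (number of matching positions) may differ from the `hamming_kernel` defined in `kernels.py`.

```python
import numpy as np


def encode_sequences_sketch(sequences):
    """Turn equal-length strings into a 2-D integer array of character codes."""
    return np.array([[ord(c) for c in seq] for seq in sequences])


def hamming_match_kernel_sketch(seqs_a, seqs_b):
    """K[i, j] = number of positions where seqs_a[i] and seqs_b[j] agree."""
    A = encode_sequences_sketch(seqs_a)
    B = encode_sequences_sketch(seqs_b)
    # Broadcast to shape (len(A), len(B), length) and sum over positions
    return (A[:, None, :] == B[None, :, :]).sum(axis=2)
```

The intermediate boolean array has `len(seqs_a) * len(seqs_b) * length` entries, so for large datasets it may be necessary to process `seqs_a` in row blocks. The Hamming distance itself is recovered as `length - K`.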

In [None]:
## Predictions on test data
kSVM_te0 = pd.DataFrame(
    data = format_preds(kSVM_te0_raw),
    columns = ['Bound'])

kSVM_te1 = pd.DataFrame(
    data = format_preds(kSVM_te1_raw),
    columns = ['Bound'])
kSVM_te1.index = kSVM_te1.index + 1000

kSVM_te2 = pd.DataFrame(
    data = format_preds(kSVM_te2_raw),
    columns = ['Bound'])
kSVM_te2.index = kSVM_te2.index + 2000

frames = [kSVM_te0, kSVM_te1, kSVM_te2]
kSVM_te = pd.concat(frames)
kSVM_te.index = kSVM_te.index.set_names(['Id'])

kSVM_te.to_csv('predictions/kSVM_te.csv')

### Trying out kNN with Hamming (debugging only)

In [None]:
k = 10
kNN = kernelKNN(k)
kNN.train(tr0['Sequence'], tr0['Bound'])
kNN.predict(te0['Sequence'], hamming_kernel)
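The `kernelKNN` class comes from a file not shown here. As a rough sketch of how a kernel can drive nearest-neighbour classification, the squared distance in feature space is `K(x, x) + K(y, y) - 2 K(x, y)`, and the prediction is a majority vote among the `k` closest training points. The function name is hypothetical, and labels are assumed to be 0 or 1.

```python
import numpy as np


def kernel_knn_predict_sketch(K_test_train, K_train_diag, K_test_diag, train_labels, k):
    """Majority vote among the k nearest training points in feature space."""
    # Squared feature-space distance between every test and training point
    sq_dist = K_test_diag[:, None] + K_train_diag[None, :] - 2.0 * K_test_train
    nearest = np.argsort(sq_dist, axis=1)[:, :k]
    votes = np.asarray(train_labels)[nearest]
    # Predict 1 when strictly more than half of the neighbours are labelled 1
    return (votes.mean(axis=1) > 0.5).astype(int)
```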

# Neural Networks for classification

In [14]:
from tqdm import tqdm

import keras
from keras import regularizers
from keras.layers import Activation, Conv2D, Dense, Dropout, Embedding, Flatten, Input, LSTM, MaxPooling2D
from keras.models import Model, Sequential
from keras.preprocessing.sequence import pad_sequences
from keras.utils import plot_model, np_utils
from keras.utils.vis_utils import model_to_dot
from keras.wrappers.scikit_learn import KerasRegressor

from sklearn.model_selection import KFold, cross_val_score

seq_len = 101
nucl_map={'A':[1,0,0,0], 'C':[0,1,0,0], 'G':[0,0,1,0], 'T':[0,0,0,1]}
nb_conv = 15

seed = 9
np.random.seed(seed)

In [5]:
## Preprocessing data for NN

# 1 - Map each nucleotide to a one-hot vector via nucl_map
#     (custom encoding, used instead of keras.preprocessing.text.one_hot: https://keras.io/preprocessing/text/)

def one_hot_batch(sequences, sep=" "):
    oh_seqs = []
    for seq in sequences:
        split_seq = seq.split(sep)
        oh_seq = [nucl_map[nucl] for nucl in split_seq]
        oh_seqs.append(oh_seq)
    return np.array(oh_seqs)

def split_seqs(sequences, split=" "):
    return [split.join(seq) for seq in sequences]

tr0_split = split_seqs(tr0['Sequence'].tolist())
tr0_oh = one_hot_batch(tr0_split)

tr1_split = split_seqs(tr1['Sequence'].tolist())
tr1_oh = one_hot_batch(tr1_split)

tr2_split = split_seqs(tr2['Sequence'].tolist())
tr2_oh = one_hot_batch(tr2_split)


te0_split = split_seqs(te0['Sequence'].tolist())
te0_oh = one_hot_batch(te0_split)

te1_split = split_seqs(te1['Sequence'].tolist())
te1_oh = one_hot_batch(te1_split)

te2_split = split_seqs(te2['Sequence'].tolist())
te2_oh = one_hot_batch(te2_split)
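The encoding above goes through an intermediate space-separated string. An alternative sketch, assuming sequences contain only the letters A, C, G and T and share one length, indexes an identity matrix directly. The name `one_hot_sequences_sketch` is hypothetical.

```python
import numpy as np

NUCLEOTIDES = "ACGT"  # same column order as nucl_map: A, C, G, T


def one_hot_sequences_sketch(sequences):
    """Return an array of shape (n_sequences, seq_len, 4) of 0/1 indicators."""
    index = {nucl: i for i, nucl in enumerate(NUCLEOTIDES)}
    codes = np.array([[index[nucl] for nucl in seq] for seq in sequences])
    # Row i of the identity matrix is the one-hot vector of nucleotide i
    return np.eye(len(NUCLEOTIDES), dtype=int)[codes]
```

Appending a trailing axis with `encoded[..., np.newaxis]` gives the `(n, seq_len, 4, 1)` shape that a `Conv2D` layer with `input_shape=(seq_len, 4, 1)` expects, without hard-coding the number of sequences.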

In [6]:
## One-hot encoding of labels
tr0_oh_labels = np_utils.to_categorical(tr0['Bound'].values, 2)
tr1_oh_labels = np_utils.to_categorical(tr1['Bound'].values, 2)
tr2_oh_labels = np_utils.to_categorical(tr2['Bound'].values, 2)

In [16]:
s_conv = 5

shapeconv = Sequential()
shapeconv.add(Conv2D(10, (s_conv,4), activation='relu',
                input_shape=(seq_len, 4, 1)))
shapeconv.add(Flatten())

shapeconv.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l2(1.0/np.power(10,4))))
        
shapeconv.add(Dense(2, activation='softmax'))

## Optimization
shapeconv.compile(loss = 'mean_squared_error',
                 optimizer = 'SGD',
                 metrics = ['accuracy'])

## Fitting
shapeconv.fit(tr0_oh.reshape((2000,101,4,1)), tr0_oh_labels, batch_size=32, epochs=200, verbose=1)
    
## Evaluation
scores = shapeconv.evaluate(tr0_oh.reshape((2000,101,4,1)), tr0_oh_labels, verbose=1)

## Plotting selected model
# display(SVG(model_to_dot(shapeconv).create(prog='dot', format='svg')))

Training output (per-epoch metrics not preserved): epochs 1 through 23 of 200 ran before the run was interrupted.

KeyboardInterrupt: 

In [None]:
print(scores)

In [18]:
kfold = KFold(n_splits=6, shuffle=True, random_state=seed)
cv_scores = []
count = 0
regu_dense=[0.01, 0.1]
regu_conv = [0, 0]

X = tr0_oh.reshape((2000,101,4,1))
Y = tr0_oh_labels

for train, val in kfold.split(X, Y):

    s1 = 1 + count%3
    s2 = int(count/3)

    shapeconv = Sequential()
    shapeconv.add(Conv2D(10, (6,4), activation='relu',
                input_shape=(seq_len,4, 1), kernel_regularizer=regularizers.l2(regu_conv[s2]), kernel_initializer='glorot_uniform'))
    # shapeconv.add(MaxPooling1D(pool_size=1+count))
    shapeconv.add(Flatten())

    shapeconv.add(Dense(128, activation='relu', kernel_regularizer=regularizers.l2(regu_dense[s2])))
        
    shapeconv.add(Dense(2, activation='softmax'))

    ## Optimization
    shapeconv.compile(loss = 'mean_squared_error',
                 optimizer = 'SGD',
                 metrics = ['accuracy'])

    ## Fitting
    shapeconv.fit(X[train], Y[train], batch_size=32, epochs=50*s1, verbose=0)
    
    ## Evaluation
    scores_train = shapeconv.evaluate(X[train], Y[train], verbose=0)
    scores_val = shapeconv.evaluate(X[val], Y[val], verbose=0)

    print("Test {0:d}: Accuracy on val: {1:.4f} - Accuracy on train: {2:.4f}".format(1+count, scores_val[1], scores_train[1]))
    cv_scores.append(scores_val[1])  # validation accuracy of the current fold
    count = count+1

Test 1: Accuracy on val: 0.5000 - Accuracy on train: 0.6981
Test 2: Accuracy on val: 0.5030 - Accuracy on train: 0.7833
Test 3: Accuracy on val: 0.4925 - Accuracy on train: 0.6047
Test 4: Accuracy on val: 0.4745 - Accuracy on train: 0.6233
Test 5: Accuracy on val: 0.5495 - Accuracy on train: 0.6287
Test 6: Accuracy on val: 0.5225 - Accuracy on train: 0.6911
