# Challenge Scratchbook

* This notebook explores methods for the Kernel Methods for Machine Learning Kaggle [challenge](https://www.kaggle.com/c/kernel-methods-for-machine-learning-2018-2019/data).

* Note that this is a binary classification challenge.

Our first goal is to implement two baseline methods:
1. Random classification
2. All instances are 0s (Doing so we get an idea of the proportion of 0's in the public test set)
3. Implement the Simple Pattern Recognition Algorithm (SPR) from Learning with Kernels 

Before that, we have to implement some data loaders

## Imports

In [1]:
import csv
import os
import numpy as np
from scipy import optimize
from itertools import product
import matplotlib.pyplot as plt
from tqdm import tqdm_notebook
import scipy as sp
from time import time
from utils.data import load_data, save_results
from utils.models import SVM, SPR, PCA
from utils.kernels import GaussianKernel

## Paths and Globals

In [2]:
CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, "data")
RESULT_DIR = os.path.join(CWD, "results")

FILES = {0: {"train_mat": "Xtr0_mat100.csv",
             "train": "Xtr0.csv",
             "test_mat": "Xte0_mat100.csv",
             "test": "Xte0.csv",
             "label": "Ytr0.csv"},
         1: {"train_mat": "Xtr1_mat100.csv",
             "train": "Xtr1.csv",
             "test_mat": "Xte1_mat100.csv",
             "test": "Xte1.csv",
             "label": "Ytr1.csv"},
         2: {"train_mat": "Xtr2_mat100.csv",
             "train": "Xtr2.csv",
             "test_mat": "Xte2_mat100.csv",
             "test": "Xte2.csv",
             "label": "Ytr2.csv"}}

## 0 entries

In [3]:
#with open(os.path.join(RESULT_DIR, "results.csv"), 'w', newline='') as csvfile:
 #   writer = csv.writer(csvfile, delimiter=',')
    
  #  writer.writerow(["Id", "Bound"])
   # for i in range(3000):
    #    writer.writerow([i, 0])

**Comment:**

* We get 0.51266 which means that the dataset is pretty balanced.

## SPR: A Simple Pattern Recognition Algorithm

In [4]:
γ = 284.7
kernel = GaussianKernel(γ)

results = np.zeros(3000)
len_files = len(FILES)
for i in range(len_files):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SPR(kernel)
    clf.fit(X_train, Y_train)
    y_pred_train =clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)
    #results[i*1000:i*1000 + 1000] = y_pred_test
    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (γ: {γ})")

Accuracy on train set / val set 0 : 0.963125 / 0.58 (γ: 284.7)
Accuracy on train set / val set 1 : 1.0 / 0.755 (γ: 284.7)
Accuracy on train set / val set 2 : 0.99625 / 0.6025 (γ: 284.7)


## Define Kernels (to delete)

In [6]:
def pol_kernel(x,y,c): #c=0
    return (x.dot(y) + c)**2

def gaussian_kernel(x,y, gamma): #c=100
    return np.exp(-gamma*np.linalg.norm(x-y)**2)

def linear_kernel(x,y,c): 
    return x.dot(y)

def laplace_kernel(x,y,gamma):
    return np.exp(-gamma*np.linalg.norm(x-y,1))

## SVM

In [5]:
γ = 275
λ = 6e-6
kernel = GaussianKernel(γ)

results = np.zeros(3000)
len_files = len(FILES)
for i in range(len_files):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SVM(_lambda=λ, kernel=kernel)
    clf.fit(X_train, Y_train)
    y_pred_train =clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)
    #results[i*1000:i*1000 + 1000] = y_pred_test
    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 0 : 1.0 / 0.56 (λ: 6e-06,γ: 275)
Accuracy on train set / val set 1 : 1.0 / 0.75 (λ: 6e-06,γ: 275)
Accuracy on train set / val set 2 : 1.0 / 0.6325 (λ: 6e-06,γ: 275)


## Tuning SVM

In [23]:
γ = 350
λ = 1e-5
gamma_list = np.linspace(50,γ,5, endpoint = True)
lambda_list = np.linspace(5e-7, λ, 5, endpoint = True)
settings = list(product(gamma_list,lambda_list))
best_score = {i: 0 for i in range(3)}
best_lambda = {i: 0 for i in range(3)}
best_gamma = {i: 0 for i in range(3)}

for k, tup in enumerate(settings):
    
    γ, λ = tup
    kernel = GaussianKernel(γ)

    results = np.zeros(3000)
    len_files = len(FILES)
    for i in range(len_files):
        X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
        X_val = X_train[1600:]
        Y_val = Y_train[1600:]
        X_train = X_train[:1600]
        Y_train = Y_train[:1600]
        clf = SVM(_lambda=λ, kernel=kernel)
        clf.fit(X_train, Y_train)
        y_pred_train =clf.predict(X_train)
        y_pred_val = clf.predict(X_val)
        score_train = clf.score(y_pred_train, Y_train)
        score_val = clf.score(y_pred_val, Y_val)
        #results[i*1000:i*1000 + 1000] = y_pred_test
        print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")
        
        if score_val > best_score[i]:
            best_score[i] = score_val
            best_lambda[i] = λ
            best_gamma[i] = γ
        
    print('\n')

Accuracy on train set / val set 0 : 1.0 / 0.5825 (λ: 5e-07,γ: 50.0)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 5e-07,γ: 50.0)
Accuracy on train set / val set 2 : 1.0 / 0.6175 (λ: 5e-07,γ: 50.0)


Accuracy on train set / val set 0 : 1.0 / 0.5825 (λ: 2.875e-06,γ: 50.0)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 2.875e-06,γ: 50.0)
Accuracy on train set / val set 2 : 1.0 / 0.6175 (λ: 2.875e-06,γ: 50.0)


Accuracy on train set / val set 0 : 1.0 / 0.5825 (λ: 5.2500000000000006e-06,γ: 50.0)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 5.2500000000000006e-06,γ: 50.0)
Accuracy on train set / val set 2 : 1.0 / 0.6175 (λ: 5.2500000000000006e-06,γ: 50.0)


Accuracy on train set / val set 0 : 1.0 / 0.5825 (λ: 7.625000000000001e-06,γ: 50.0)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 7.625000000000001e-06,γ: 50.0)
Accuracy on train set / val set 2 : 1.0 / 0.6175 (λ: 7.625000000000001e-06,γ: 50.0)


Accuracy on train set / val set 0 : 1.0 / 0.5825 (λ: 1e-05,γ: 50.0)
A

### Notebook Result:

**SVM:**
- score : {0: 0.6075, 1: 0.73, 2: 0.66}
- lambda : {0: 1e-05, 1: 1e-05, 2: 1e-05}
- gamma : {0: 350.0, 1: 400.0, 2: 500.0}
___
- score : {0: 0.5775, 1: 0.745, 2: 0.6375}
- lambda : {0: 1e-05, 1: 1e-05, 2: 1e-05}
- gamma : {0: 400.0, 1: 300.0, 2: 500.0}
___
- Accuracy on train set / val set 0 : 1.0 / 0.56 (λ: 6e-06,γ: 275)
- Accuracy on train set / val set 1 : 1.0 / 0.75 (λ: 6e-06,γ: 275)
- Accuracy on train set / val set 2 : 1.0 / 0.6325 (λ: 6e-06,γ: 275)


## Biological Sequence Modeling with Convolutional Kernel Networks

**Define function to encode k-mer of x centered at position i**

In [5]:
ENCODING = {'A': [1.,0.,0.,0.],
                'C': [0.,1.,0.,0.],
                'G': [0.,0.,1.,0.],
                'T': [0.,0.,0.,1.]
               }

def P(i, x, k):
    
    not_in = True # True when the k-mers computed is at the edge of the sequence x
    if i-(k+1)//2 + 1 < 0:
        k_mer_i = x[len(x) + i-(k+1)//2 + 1:] + x[0 :  i + (k+2)//2]
    elif i + (k+2)//2 > len(x):
        k_mer_i = x[i-(k+1)//2 + 1 : ] +  x[:i + (k+2)//2 - len(x)]
    else:
        k_mer_i = x[i-(k+1)//2 + 1 :  i + (k+2)//2]
        not_in = False
        
    # concatenate one hot encoding
    L = []
    for c in k_mer_i:
        L += ENCODING[c]
    
    return np.array(L), not_in
    

In [6]:
ENCODING = {'A': [1.,0.,0.,0.],
            'C': [0.,1.,0.,0.],
            'G': [0.,0.,1.,0.],
            'T': [0.,0.,0.,1.],
            'Z': [0.,0.,0.,0.]} # used in zero-padding

def P(i, seq, k, zero_padding=True):
    """
    Compute the a k_mers at a given position in a nucleotides sequence
    
    Parameters
    -----------
    - i : int
        Position in the sequence
        
    - k : int
        Size of k-mer to be returned
        
    - seq : str
        Sequence of nucleotides
        
    - zero_padding : boolean (optional)
        Whether to use zero-padding on the sequence edges
        Default: True
        
    Returns
    -----------
    - L : numpy.array
        One-hot encoding of the string sequence
        
    - not_in : boolean
        Whether the k-mer was computed on the sequence edges
        Always set to False when using zero-padding
    """
    # Setup
    not_in = True
    if zero_padding:
        not_in = False
    
    # lower edge
    if i-(k+1)//2 + 1 < 0:
        # Use heading zero padding here
        n_zeros = abs(i - (k+1) // 2 + 1)
        k_mer_i = 'Z'*n_zeros + seq[:  i + (k+2)//2]
    # upper edge
    elif i + (k+2)//2 > len(seq):
        # Use trailing zero padding here
        n_zeros = i + (k+2) // 2 - len(seq)
        k_mer_i = seq[i - (k+1)//2 + 1:] + 'Z'*n_zeros
    # in the middle
    else:
        k_mer_i = seq[i-(k+1)//2 + 1 :  i + (k+2)//2]
        not_in = False
        
    # concatenate one hot encoding
    L = []
    for c in k_mer_i:
        L += ENCODING[c]
    
    # Sanity check
    assert len(L) == 4 * k
    
    # Convert to array and return
    return np.array(L), not_in

**Define Kernels**

In [7]:
def κ(u, σ):
    return np.exp((u-1)/σ**2)

# define the kernel on k-mers with the norm
def K0(z1, z2, σ):
    z1_norm = np.linalg.norm(z1)
    z2_norm = np.linalg.norm(z2)
    z1z2_norm = z1_norm*z2_norm
    return z1z2_norm*np.exp(-(1/(2*z1z2_norm*(σ**2)))*np.linalg.norm(z1-z2)**2)

# same kernel but with the scalar product
def K1(z1, z2, σ):
    z1_norm = np.linalg.norm(z1)
    z2_norm = np.linalg.norm(z2)
    z1z2_norm = z1_norm*z2_norm
    u = z1.dot(z2)/z1z2_norm
    return z1z2_norm*κ(u,σ)

# define the kernel on sequences of k-mers
def conv_kernel(x,y,k, σ):
    mx = len(x)
    my = len(y)    
    Px = np.array([P(i,x,k)[0] for i in range(mx)])
    Py = np.array([P(i,y,k)[0] for i in range(mx)])
    PxPyt = Px.dot(Py.T)/k
    s = k*np.exp((1/(σ**2))*(PxPyt-1))
        
    return np.sum(s)/(mx*my)

## Unsupervised learning of the anchor points

**Choose parameters k of k-mer here**

In [8]:
k = 12

**Load all the k-mers**

In [9]:
def compute_kmers_list(idx, k):
    
    
    """This function compute all the k-mers of a list of sequences
    
    Parameters
    ------------
    - idx : int
        index of the dataset (0,1, or 2)
    - k : int
        length of the k-mers
    """
    
    X_train, Y_train, X_test = load_data(idx, data_dir=DATA_DIR, files_dict=FILES, mat = False)
    n = len(X_train)
    m = len(X_train[0])
    
    kmers = []
    for x in X_train:
        for i in range(m):
            p = P(i,x,k)
            if p[1] == False:
                kmers.append(p[0])
            
    kmers = np.array(kmers)
    
    return kmers

idx= 0
kmers = compute_kmers_list(idx, k)

print(f"Number of k-mers : {len(kmers)}")

Number of k-mers : 202000


**Test some $\sigma$**

In [10]:
σ0 = 0.3

In [11]:
X_train, Y_train, X_test = load_data(0, data_dir=DATA_DIR, files_dict=FILES, mat = False)


print("Check the kernel on k-mers:")

print('- Value of the kernel with two identical k-mer as input')
print(K0(kmers[0],kmers[0], σ0 ), K0(kmers[1000],kmers[1000],σ0), K0(kmers[2000],kmers[2000],σ0), 
      K0(kmers[3000],kmers[3000],σ0))

print('- Value of the kernel with two random different k-mer as input')
print(K0(kmers[0],kmers[1],σ0), K0(kmers[1000],kmers[1],σ0), K0(kmers[2000],kmers[1],σ0), K0(kmers[3000],kmers[1],σ0))

print("\n Check the kernel on sequences:")
print('- Value of the kernel with two identical sequences as input')
print(conv_kernel(X_train[10],X_train[10],k, σ0))
print(conv_kernel(X_train[100],X_train[100],k, σ0))
print(conv_kernel(X_train[1000],X_train[1000],k, σ0))
print(conv_kernel(X_train[1100],X_train[1100],k, σ0))
print(conv_kernel(X_train[1200],X_train[1200],k, σ0))

print('- Value of the kernel with two random different sequences as input')
print(conv_kernel(X_train[0],X_train[10],k, σ0))
print(conv_kernel(X_train[10],X_train[100],k, σ0))
print(conv_kernel(X_train[100],X_train[1000],k, σ0))
print(conv_kernel(X_train[1000],X_train[1100],k, σ0))
print(conv_kernel(X_train[1100],X_train[1200],k, σ0))

Check the kernel on k-mers:
- Value of the kernel with two identical k-mer as input
7.000000000000001 11.999999999999998 11.999999999999998 11.999999999999998
- Value of the kernel with two random different k-mer as input
0.1828140344788838 0.003496519940911987 0.00011644911110618748 0.0011249573706532483

 Check the kernel on sequences:
- Value of the kernel with two identical sequences as input
0.11578706882400058
0.11585193296769444
0.11546459039155516
0.11424920032035216
0.11645633856840333
- Value of the kernel with two random different sequences as input
0.008587979925110187
0.007509485644995965
0.007609519990360209
0.008049308925806652
0.007208346204913559


## K-means

In [14]:
def Kmeans(X, K, max_iter, tol=1e-30):
    
    #step 0 (initialize centroids mu)
    idx = np.random.randint(len(X), size=K)
    mu = X[idx]    
    
    n_iter = 0
    stop = False
    d = 1e6
    while stop != True:
        
        #create clusters
        clustering = np.zeros(len(X))
            
        #step 1 (minimizing by assigning a cluster to each point)
        for i in range(len(X)):
            clustering[i] = np.argmin(np.linalg.norm(X[i]-mu, axis=1))

        
        #step 2 (minimizing w.r.t mu)
        for k in range(K):
            if np.sum([clustering==k]) != 0:
                mu[k] = np.mean(X[clustering==k], axis=0)

        
        d_new = distortion(X, mu, clustering)
        #print(d_new)
        
        if np.abs(d_new-d) < tol or n_iter > max_iter:
            stop = True
            
        d = d_new
        
        n_iter +=1
        
        
        
    return mu, clustering


def distortion(X, mu, clustering):
    dis = 0
    for k in range(len(mu)):
        dis = dis + np.linalg.norm(X[clustering==k] - mu[k])**2
    return dis

#We try several random initializations and keep the partition which minimize the distorsion.
def Kmeans_try(X, n_try, n_cluster, max_iter):

    for i in range(n_try):
        
        mu, cl = Kmeans(X, n_cluster, max_iter)
    
        if i == 0:
            dist_min = distortion(X, mu, cl)
            mu_min, cl_min = mu, cl
        else:
            if distortion(X, mu, cl) < dist_min:
                dist_min = distortion(X, mu, cl)
                mu_min, cl_min = mu, cl
                
                
    return mu_min, cl_min, dist_min

## Spectral Clustering

In [170]:
def spec_cl(n_cl, kmers, σ):
    
    
    t = time()

    # compute Gram matrix
    k = int(len(kmers[0])/4)
    kernel = GaussianKernel(1/(2*(σ**2)*k))
    K = k*kernel.compute_gram_matrix(kmers)
    
    print(f"Time to compute K : {time()-t}")
    t= time()
    
    # Compute the n_cl first eigenvectors (ui, ∆i)
    λ, v = sp.linalg.eigh(K)
    
    print(f"Time to compute eigenvalues : {time()-t}")
    t=time()
    
    # compute the maximum entry of a row
    #cluster_idx = np.argmax(v[:,:n_cl], axis = 1)
    # OR Kmeans on the rows
    n_try = 1
    max_iter = 20
    Z = v[:,:n_cl]/np.linalg.norm(v[:,:n_cl],axis=1).reshape(-1,1) # normalize v
    
    mu_rows, cluster_idx, dist = Kmeans_try(Z, n_try, n_cl, max_iter)
    print(f"Time to compute kmeans {time() - t}")
    
    
    # compute the barycenter
    mu = []
    for i in range(n_cl):
        if np.sum([cluster_idx==i]) != 0:
            bary = np.mean(kmers[cluster_idx==i], axis = 0)
            mu.append(bary)
    
    return mu, v

Object `sp.eigh` not found.


**Choose parameter $\sigma$ here**

In [12]:
σc = 0.3

In [172]:
i = 0
kmers = compute_kmers_list(i, k)

n_cl = 200
n_kmers = 10000
# choose random kmers to do clustering
idx = np.random.choice(range(len(kmers)), size = n_kmers, replace = False)
# do spectral clustering (with gaussian kernel with \sigmac)
mu, v = spec_cl(n_cl, kmers[idx], σc)

Time to compute K : 18.868634939193726
Time to compute eigenvalues : 232.59353160858154
Time to compute kmeans 60.57641625404358


**Random anchors**

In [13]:
i = 2
kmers = compute_kmers_list(i, k)


n_kmers = 6000

index = np.random.choice(range(len(kmers)), replace=False, size = n_kmers)
anchors = kmers[index]

In [14]:
mu = anchors

**Compute the mapping approximation $\psi$**

In [15]:
σ = σc

In [16]:
σ = 0.4

In [17]:
# compute K_ZZ
Z = mu
p = len(mu)
K_zz = np.zeros((p,p))
for j in range(p):
    for i in range(j+1):
        K_zz[i,j] = K1(Z[i],Z[j], σ)
K_zz =  K_zz + K_zz.T
np.fill_diagonal(K_zz, np.diagonal(K_zz)/2)

# compute K_ZZ inv**0.5
β = 1e-3
K_ZZ_inv_sqr = sp.linalg.sqrtm(sp.linalg.inv(K_zz + β*np.eye(np.shape(K_zz)[0])))
#K_ZZ_inv_sqr = sp.linalg.inv(np.real(sp.linalg.sqrtm(K_zz)))



# define ψ_0
def ψ_0(z,Z_anchor, k , σ):
    return(K_ZZ_inv_sqr.dot(np.array([K1(z,z_a, σ) for z_a in Z_anchor])))


# define ψ
def ψ(x, Z_anchor, k , σ):
    P_x = [P(i,x,k)[0] for i in range(len(x)) if P(i,x,k)[1] == False]
    L = np.array([ψ_0(z, Z_anchor, k, σ) for z in P_x])
    return np.sum(L, axis=0)/len(L)


def ψ_optim(x, Z_anchor, k , σ):
    P_x = np.array([P(i,x,k)[0] for i in range(len(x)) if P(i,x,k)[1] == False])
    Z = np.array(Z_anchor)
    Z = Z/np.linalg.norm(Z,axis=1).reshape(-1,1) # normalize Z rows
    P_x_norm = P_x/np.linalg.norm(P_x,axis=1).reshape(-1,1) # normalize P_x rows
    S = Z.dot(P_x_norm.T)
    S = np.einsum('i, ij -> ij',np.linalg.norm(Z,axis=1), np.sqrt(k)*np.exp((S - 1)/σ**2))
    b = K_ZZ_inv_sqr.dot(S)
    return np.sum(b, axis=1)/np.shape(b)[1]


# define approx kernel
def approx_K(x,y, mu_min, k, σ):
    return ψ(x, mu_min, k, σ).dot(ψ(y, mu_min, k, σ))

In [18]:
K_ZZ_inv_sqr

array([[ 3.27877098e-01, -3.12678589e-03,  1.57654432e-04, ...,
         1.87550448e-05, -1.24909455e-04, -7.44996673e-05],
       [-3.12678589e-03,  3.28156514e-01,  1.63415640e-04, ...,
         9.90871982e-05,  2.95344071e-04, -2.32709754e-04],
       [ 1.57654432e-04,  1.63415640e-04,  3.26761168e-01, ...,
         1.26522434e-04,  7.98796405e-04,  1.24076457e-03],
       ...,
       [ 1.87550448e-05,  9.90871982e-05,  1.26522434e-04, ...,
         3.41236524e-01, -1.05853106e-04,  1.29427697e-04],
       [-1.24909455e-04,  2.95344071e-04,  7.98796405e-04, ...,
        -1.05853106e-04,  3.48831858e-01,  3.49377998e-04],
       [-7.44996673e-05, -2.32709754e-04,  1.24076457e-03, ...,
         1.29427697e-04,  3.49377998e-04,  3.58307220e-01]])

**Check Kernel approximation**

In [19]:
x = np.random.choice(X_train)
P_x = [P(i,x,k)[0] for i in range(len(x))]
L = np.array([ψ_0(z, mu, k, σ) for z in P_x])


i = 0
j = 0
print(f"approximate value : {L[i].dot(L[j]).real} / true value : {K0(P_x[i],P_x[j], σ)}")

approximate value : 1.061749162166857 / true value : 7.000000000000001


In [20]:
n_mean = 10

mean_error = 0
mean_value = 0
var = 0
X_train, Y_train, X_test = load_data(0, data_dir=DATA_DIR, files_dict=FILES, mat = False)

for i in tqdm_notebook(range(n_mean)):
    

    x = np.random.choice(X_train)
    y = np.random.choice(X_train)

    true_value = conv_kernel(x,y,k, σ)
    approx_value = ψ_optim(x, mu, k, σ).dot(ψ_optim(y, mu, k, σ))
    
    mean_error += np.abs(true_value - approx_value)
    mean_value += true_value
    var += true_value**2
    
    if (i<10):
        print(true_value, approx_value)
    
mean_error = mean_error/n_mean
mean_value = mean_value/n_mean
var = var/n_mean- mean_value**2
standard_deviation = np.sqrt(var)    

print(f"% error =  {mean_error/standard_deviation*100}")
print(f"Mean Approximation Error: {mean_error} / True Kernel sd : {standard_deviation} / Mean true Kernel Value : {mean_value}")

HBox(children=(IntProgress(value=0, max=10), HTML(value='')))

0.13393066045778407 0.013097076144076954
0.13403787784960022 0.013166338141827001
0.17979925286800735 0.01625274705599097
0.15261732973080738 0.015012253988883341
0.1287442494192495 0.012598866445533482
0.1380829769727547 0.013545784657685134
0.1351660677440754 0.013216467510054576
0.1492762810704115 0.014542268424145426
0.13951834993164483 0.013625778197416422
0.15777726638826184 0.01479285302252249

% error =  896.5608154077801
Mean Approximation Error: 0.13090998788444608 / True Kernel sd : 0.014601350587121599 / Mean true Kernel Value : 0.14489503124325967


**Compute embeddings and Gram matrix**

In [21]:
i = 2
X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES, mat = False)
X_val = X_train[1600:]
Y_val = Y_train[1600:]
X_train = X_train[:1600]
Y_train = Y_train[:1600]


embed_train = []
for x in tqdm_notebook(X_train):
    embed_train.append(ψ_optim(x,mu,k,σ))
embed_val = []
for x in tqdm_notebook(X_val):
    embed_val.append(ψ_optim(x,mu,k,σ))
    
E_train = np.array(embed_train)
E_val = np.array(embed_val)

HBox(children=(IntProgress(value=0, max=1600), HTML(value='')))




HBox(children=(IntProgress(value=0, max=400), HTML(value='')))




**Run SVM!**

Dataset 0

In [312]:
γ = 747
kernel = GaussianKernel(γ)
λ = 13e-5
#kernel = linear_kernel


clf = SVM(λ, kernel)
clf.fit(E_train, Y_train)
y_pred_train =clf.predict(E_train)
y_pred_val =clf.predict(E_val)
score_train = clf.score(y_pred_train, Y_train)
score_val = clf.score(y_pred_val, Y_val)
print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 0 : 1.0 / 0.68 (λ: 0.00013,γ: 747)


Dataset 1

In [122]:
kernel = gaussian_kernel
γ = 2000
λ = 20
#kernel = linear_kernel


clf = SVM(γ, λ, kernel)
clf.fit(E_train, Y_train)
y_pred_train =clf.predict(E_train)
y_pred_val =clf.predict(E_val)
score_train = clf.score(y_pred_train, Y_train)
score_val = clf.score(y_pred_val, Y_val)
print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

HBox(children=(IntProgress(value=0, max=1600), HTML(value='')))

HBox(children=(IntProgress(value=0, max=400), HTML(value='')))

Accuracy on train set / val set 1 : 1.0 / 0.6175 (λ: 20,γ: 2000)


Dataset 2

In [69]:
γ = 300
λ = 2.11e-4
kernel = GaussianKernel(γ)


clf = SVM(λ, kernel)
clf.fit(E_train, Y_train)
y_pred_train =clf.predict(E_train)
y_pred_val =clf.predict(E_val)
score_train = clf.score(y_pred_train, Y_train)
score_val = clf.score(y_pred_val, Y_val)
print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 2 : 0.99375 / 0.67 (λ: 0.000211,γ: 300)


**Tuning**

In [22]:
γ = 10
λ = 1e-8
i = 2
gamma_list = np.linspace(500, γ, 15, endpoint = True)
lambda_list = np.linspace(1e-3, λ, 10, endpoint = True)
settings = list(product(gamma_list,lambda_list))
best_score = {i: 0 for i in range(3)}
best_lambda = {i: 0 for i in range(3)}
best_gamma = {i: 0 for i in range(3)}

for j, tup in enumerate(settings):
    
    γ, λ = tup
    
    #kernel = GaussianKernel(γ)
    clf = SVM(_lambda=λ, kernel=kernel)
    clf.fit(E_train, Y_train)    
    y_pred_train =clf.predict(E_train)
    y_pred_val =clf.predict(E_val)
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)
    
    
    
    
    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")        
    
    if score_val > best_score[i]:
        best_score[i] = score_val
        best_lambda[i] = λ
        best_gamma[i] = γ
        
    
print(f"Best score : {best_score[i]} / gamma : {best_gamma[i]} / lambda : {best_lambda[i]}")

Accuracy on train set / val set 2 : 0.691875 / 0.58 (λ: 0.001,γ: 500.0)
Accuracy on train set / val set 2 : 0.70625 / 0.575 (λ: 0.00088889,γ: 500.0)
Accuracy on train set / val set 2 : 0.730625 / 0.59 (λ: 0.00077778,γ: 500.0)
Accuracy on train set / val set 2 : 0.765625 / 0.605 (λ: 0.00066667,γ: 500.0)
Accuracy on train set / val set 2 : 0.81125 / 0.6125 (λ: 0.00055556,γ: 500.0)
Accuracy on train set / val set 2 : 0.875625 / 0.6225 (λ: 0.00044445,γ: 500.0)
Accuracy on train set / val set 2 : 0.9375 / 0.6175 (λ: 0.0003333399999999999,γ: 500.0)
Accuracy on train set / val set 2 : 0.99 / 0.6675 (λ: 0.0002222299999999999,γ: 500.0)
Accuracy on train set / val set 2 : 1.0 / 0.6375 (λ: 0.00011111999999999993,γ: 500.0)
Accuracy on train set / val set 2 : 1.0 / 0.635 (λ: 1e-08,γ: 500.0)
Accuracy on train set / val set 2 : 0.691875 / 0.58 (λ: 0.001,γ: 465.0)
Accuracy on train set / val set 2 : 0.70625 / 0.575 (λ: 0.00088889,γ: 465.0)
Accuracy on train set / val set 2 : 0.730625 / 0.59 (λ: 0.0007

Accuracy on train set / val set 2 : 0.765625 / 0.605 (λ: 0.00066667,γ: 150.0)
Accuracy on train set / val set 2 : 0.81125 / 0.6125 (λ: 0.00055556,γ: 150.0)
Accuracy on train set / val set 2 : 0.875625 / 0.6225 (λ: 0.00044445,γ: 150.0)
Accuracy on train set / val set 2 : 0.9375 / 0.6175 (λ: 0.0003333399999999999,γ: 150.0)
Accuracy on train set / val set 2 : 0.99 / 0.6675 (λ: 0.0002222299999999999,γ: 150.0)
Accuracy on train set / val set 2 : 1.0 / 0.6375 (λ: 0.00011111999999999993,γ: 150.0)
Accuracy on train set / val set 2 : 1.0 / 0.635 (λ: 1e-08,γ: 150.0)
Accuracy on train set / val set 2 : 0.691875 / 0.58 (λ: 0.001,γ: 115.0)
Accuracy on train set / val set 2 : 0.70625 / 0.575 (λ: 0.00088889,γ: 115.0)
Accuracy on train set / val set 2 : 0.730625 / 0.59 (λ: 0.00077778,γ: 115.0)
Accuracy on train set / val set 2 : 0.765625 / 0.605 (λ: 0.00066667,γ: 115.0)
Accuracy on train set / val set 2 : 0.81125 / 0.6125 (λ: 0.00055556,γ: 115.0)
Accuracy on train set / val set 2 : 0.875625 / 0.6225 (

## Save results

In [45]:
def save_results(filename, results):
    """
    Save results in a csv file
    
    Parameters
    -----------
    - filename : string
        Name of the file to be saved under the ``results`` folder
        
    - results : numpy.array
        Resulting array (0 and 1's)
    """
    
    assert filename.endswith(".csv"), "this is not a csv extension!"
    # Convert results to int
    results = results.astype("int")
    
    with open(os.path.join(RESULT_DIR, filename), 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        # Write header
        writer.writerow(["Id", "Bound"]) 
        assert len(results) == 3000, "There is not 3000 predictions"
        # Write results
        for i in range(len(results)):
            writer.writerow([i, results[i]])

In [50]:
# Test the save results function
save_results("results5.csv", results)