# Challenge Scratchbook

* This notebook explores methods for the Kernel Methods for Machine Learning Kaggle [challenge](https://www.kaggle.com/c/kernel-methods-for-machine-learning-2018-2019/data).

* Note that this is a binary classification challenge.

Our first goal is to implement three baseline methods:
1. Random classification
2. Predicting 0 for every instance (this gives an idea of the proportion of 0s in the public test set)
3. The Simple Pattern Recognition algorithm (SPR) from *Learning with Kernels*

Before that, we have to implement some data loaders.
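
The first baseline in the list above has no cell of its own in this notebook. A minimal sketch of it, assuming uniform draws from {0, 1} and a hypothetical helper name `random_baseline`, could look like this:

```python
import numpy as np

def random_baseline(n_samples, seed=0):
    """Return n_samples labels drawn uniformly at random from {0, 1}."""
    rng = np.random.default_rng(seed)
    # integers(0, 2) samples from the half-open range [0, 2), i.e. 0 or 1
    return rng.integers(0, 2, size=n_samples)

# 3000 is the number of test predictions expected by the submission file
preds = random_baseline(3000)
```

The seed is fixed only to make the baseline reproducible; any value would do.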

## Imports

In [4]:
import csv
import os
import numpy as np

## Paths and Globals

In [5]:
CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, "data")
RESULT_DIR = os.path.join(CWD, "results")

FILES = {0: {"train_mat": "Xtr0_mat100.csv",
             "train": "Xtr0.csv",
             "test_mat": "Xte0_mat100.csv",
             "test": "Xte0.csv",
             "label": "Ytr0.csv"},
         1: {"train_mat": "Xtr1_mat100.csv",
             "train": "Xtr1.csv",
             "test_mat": "Xte1_mat100.csv",
             "test": "Xte1.csv",
             "label": "Ytr1.csv"},
         2: {"train_mat": "Xtr2_mat100.csv",
             "train": "Xtr2.csv",
             "test_mat": "Xte2_mat100.csv",
             "test": "Xte2.csv",
             "label": "Ytr2.csv"}}

## 0 entries

In [6]:
# with open(os.path.join(RESULT_DIR, "results.csv"), 'w', newline='') as csvfile:
#     writer = csv.writer(csvfile, delimiter=',')
#
#     writer.writerow(["Id", "Bound"])
#     for i in range(3000):
#         writer.writerow([i, 0])

**Comment:**

* The all-zeros submission scores 0.51266, so about 51% of the public test set is labelled 0 and the dataset is fairly balanced.
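
The reasoning behind this comment is that a constant prediction of 0 is correct exactly on the samples labelled 0, so its accuracy equals the fraction of zeros. A small sketch on toy labels (hypothetical values, not the challenge data):

```python
import numpy as np

def constant_zero_accuracy(y_true):
    """Accuracy obtained by predicting 0 for every sample."""
    y_true = np.asarray(y_true)
    # A prediction of 0 is right wherever the true label is 0
    return np.mean(y_true == 0)

# Toy labels: 3 zeros out of 5 samples
toy_labels = [0, 1, 0, 0, 1]
acc = constant_zero_accuracy(toy_labels)
print(acc)  # 0.6
```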

## SPR: A Simple Pattern Recognition Algorithm

In [7]:
class SPR:
    """
    Simple Pattern Recognition (SPR) algorithm from the book Learning with Kernels.

    `kernel` is either False (plain dot product) or a callable kernel(x, y, c);
    `c` is the parameter passed to that kernel.
    """
    def __init__(self,c, kernel=False):
        self.m0 = 0
        self.m1 = 0
        self.b = 0
        self.c = c
        self.kernel = kernel


        
    def fit(self,X,y):
        """
        Fitting phase
        
        Parameters
        ------------
        - X : numpy.ndarray
            Data matrix
            
        - y : numpy.array
            Labels
        """
        
        self.X_train = X
        self.X0 = X[y == 0]
        self.X1 = X[y == 1]
        
        self.m0 = len(self.X0)
        self.m1 = len(self.X1)
        
        if not self.kernel:
            # Linear case: b = (||mean(X0)||^2 - ||mean(X1)||^2) / 2
            self.b = 1/2 * (1/(self.m0**2)*np.sum(self.X0.dot(self.X0.T)) - 1/(self.m1**2)*np.sum(self.X1.dot(self.X1.T)))
        else:
            # TODO: to be changed
            self.list0 = list(np.where(y==0)[0])
            self.list1 = list(np.where(y==1)[0])
            self.b = 1/2 * (1/(self.m0**2)*np.sum([self.kernel(self.X_train[i],self.X_train[j],self.c) for i in self.list0 for j in self.list0]) - (1/(self.m1**2))*np.sum([self.kernel(self.X_train[i],self.X_train[j],self.c) for i in self.list1 for j in self.list1]))

    
    def predict(self,X):
        """Predict a 0/1 label for each row of X."""
        y_pred = np.zeros(len(X))
        
        for i in range(len(X)):
            if not self.kernel:
                val = 1 / self.m1 * np.sum(self.X1.dot(X[i])) - 1 / self.m0 * np.sum(self.X0.dot(X[i])) + self.b
            else:
                # Use the kernel stored on the instance, not the global name `kernel`
                val = (1/self.m1*np.sum([self.kernel(self.X_train[k],X[i],self.c) for k in self.list1]) 
                       - 1/self.m0*np.sum([self.kernel(self.X_train[k],X[i],self.c) for k in self.list0])) + self.b
            # Map sign(val) from {-1, +1} to {0, 1}
            y_pred[i] = np.sign(val)/2 + 1/2
        return y_pred    
    
    
    def score(self, y, y_pred):
        return np.sum([y == y_pred]) / len(y)
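
The class above evaluates the kernel pair by pair in Python loops. The same decision rule can be written in terms of Gram matrices, which is sketched below. The function name `spr_fit_predict` and the toy data are assumptions made for illustration; a score of exactly 0 is mapped to class 0 here, whereas the class above would return 0.5.

```python
import numpy as np

def spr_fit_predict(K_train, K_test, y_train):
    """
    SPR decision rule computed from Gram matrices.

    K_train : (n, n) array, K_train[i, j] = k(x_i, x_j) on training points
    K_test  : (m, n) array, K_test[t, j] = k(z_t, x_j) for test points z_t
    y_train : (n,) array of 0/1 labels
    """
    y_train = np.asarray(y_train)
    idx0 = np.where(y_train == 0)[0]
    idx1 = np.where(y_train == 1)[0]
    # Offset: half the difference of the squared norms of the class means.
    # The mean over an (m0, m0) block equals the double sum divided by m0**2.
    b = 0.5 * (K_train[np.ix_(idx0, idx0)].mean()
               - K_train[np.ix_(idx1, idx1)].mean())
    # Mean similarity to class 1 minus mean similarity to class 0, plus offset
    val = K_test[:, idx1].mean(axis=1) - K_test[:, idx0].mean(axis=1) + b
    return (val > 0).astype(int)

# Toy example with the plain dot product as kernel
X = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 4.0], [4.0, 5.0]])
y = np.array([0, 0, 1, 1])
Z = np.array([[0.0, 0.5], [4.0, 4.5]])  # one point near each class mean
pred = spr_fit_predict(X.dot(X.T), Z.dot(X.T), y)
print(pred)  # [0 1]
```

Precomputing the Gram matrix once also means that a kernel value is never evaluated twice between fitting and predicting on the training set.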
    

## Data loading refactoring

In [8]:
def load_data(file_id, mat=True):
    
    X_train = list()
    Y_train = list()
    X_test = list()
    
    dic = FILES[file_id]
    
    if mat:
        files = [dic["train_mat"], dic["label"], dic["test_mat"]]
    else:
        files = [dic["train"], dic["label"], dic["test"]]

    for file, l in zip(files, [X_train, Y_train, X_test]):
        with open(os.path.join(DATA_DIR, file), "r", newline="") as csvfile:
            if "Y" in file:
                reader = csv.reader(csvfile, delimiter=",")
                next(reader, None) # Skip the header
                for row in reader:
                    l.append(row[1])
            else:
                reader = csv.reader(csvfile, delimiter=" ")
                for row in reader:
                    l.append(row)
                
    X_train = np.array(X_train).astype("float")
    Y_train = np.array(Y_train).astype("int")
    X_test = np.array(X_test).astype("float")
    
    return X_train, Y_train, X_test
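
For the numeric files, `np.loadtxt` can do the parsing and the type conversion in one call. The sketch below assumes the layout that `load_data` relies on: space-delimited feature matrices without a header, and comma-delimited label files with one header row and the label in the second column. The helper names and the toy files are hypothetical.

```python
import os
import tempfile
import numpy as np

def load_matrix(path):
    """Load a space-delimited numeric matrix (no header row)."""
    return np.loadtxt(path, delimiter=" ", ndmin=2)

def load_labels(path):
    """Load the second column of a comma-delimited file with one header row."""
    return np.loadtxt(path, delimiter=",", skiprows=1, usecols=1,
                      dtype=int, ndmin=1)

# Round trip on two tiny files written to a temporary directory
with tempfile.TemporaryDirectory() as tmp:
    x_path = os.path.join(tmp, "X_toy.csv")
    y_path = os.path.join(tmp, "Y_toy.csv")
    with open(x_path, "w") as f:
        f.write("0.1 0.2 0.3\n0.4 0.5 0.6\n")
    with open(y_path, "w") as f:
        f.write("Id,Bound\n0,1\n1,0\n")
    X_toy = load_matrix(x_path)
    y_toy = load_labels(y_path)
```

This only covers the numeric case; files whose rows are not numbers would still need the `csv` reader.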

## Define Kernels

In [20]:
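def quadratic_kernel(x, y, c): # used with c=0
    return (x.dot(y) + c)**2

def gaussian_kernel(x, y, gamma): # used with c=100 passed as gamma
    return np.exp(-gamma*np.linalg.norm(x-y)**2)

# Both kernels take (x, y, parameter); `kernel` selects the one in use.
# A second `def kernel` would silently replace the first, hence the two names.
kernel = gaussian_kernel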
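
Evaluating the Gaussian kernel one pair at a time is slow for a few thousand samples. A vectorised sketch that builds the whole Gram matrix from the identity ||a - b||² = ||a||² + ||b||² - 2⟨a, b⟩ is given below; the function name `gaussian_gram` is an assumption, and the result is compared against the pairwise formula on random toy data.

```python
import numpy as np

def gaussian_gram(A, B, gamma):
    """Gram matrix G[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    sq_a = np.sum(A ** 2, axis=1)[:, None]   # column of squared norms of A rows
    sq_b = np.sum(B ** 2, axis=1)[None, :]   # row of squared norms of B rows
    sq_dist = sq_a + sq_b - 2.0 * A.dot(B.T)
    # Clip tiny negative values caused by floating-point round-off
    np.maximum(sq_dist, 0.0, out=sq_dist)
    return np.exp(-gamma * sq_dist)

# Check against the pairwise definition on small random inputs
rng = np.random.default_rng(0)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(4, 3))
G = gaussian_gram(A, B, gamma=0.5)
G_loop = np.array([[np.exp(-0.5 * np.linalg.norm(a - b) ** 2) for b in B]
                   for a in A])
```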

## Train and test on the different sets

In [24]:
results = np.zeros(3000)

for i in range(len(FILES)):
    X_train, Y_train, X_test = load_data(i)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SPR(0, False)
    clf.fit(X_train, Y_train)
    y_pred_train =clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    y_pred_test = clf.predict(X_test)
    
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)
    results[i*1000:i*1000 + 1000] = y_pred_test
    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val}")

Accuracy on train set / val set 0 : 0.563125 / 0.5175
Accuracy on train set / val set 1 : 0.65125 / 0.605
Accuracy on train set / val set 2 : 0.58625 / 0.6325
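
The split above takes the first 1600 rows for training and the rest for validation. If the row order of the files carries any structure (which has not been checked here), a shuffled split may give a more representative validation set. A minimal sketch, with a hypothetical helper name and toy arrays:

```python
import numpy as np

def shuffled_split(X, y, n_train, seed=0):
    """Split (X, y) into train and validation parts after a random shuffle."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    train_idx, val_idx = perm[:n_train], perm[n_train:]
    # The same indices are applied to X and y, so rows and labels stay aligned
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# Toy data: row k is [2k, 2k+1] and its label is k % 2
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10) % 2
X_tr, y_tr, X_va, y_va = shuffled_split(X_toy, y_toy, n_train=8)
```

With the challenge data this would be called with `n_train=1600` on the arrays returned by `load_data`.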


**Dot product:**
- Accuracy on train set / val set 0 : 0.563125 / 0.5175
- Accuracy on train set / val set 1 : 0.65125 / 0.605
- Accuracy on train set / val set 2 : 0.58625 / 0.6325

**Squared kernel (<.,.>²):**


**Gaussian kernel:**

## Train and test on the whole set

In [31]:
n = 2000
len_val = 0  # no validation split, so the val score printed below is nan
len_train = n - len_val
all_X_train = np.zeros((3*len_train,100))
all_Y_train = np.zeros(3*len_train)
all_X_val = np.zeros((3*len_val,100))
all_Y_val = np.zeros(3*len_val)
all_X_test = np.zeros((3000,100))


for i in range(len(FILES)):
    X_train, Y_train, X_test = load_data(i)
    X_val = X_train[len_train:]
    Y_val = Y_train[len_train:]
    X_train = X_train[:len_train]
    Y_train = Y_train[:len_train]
    all_X_train[i*len_train:i*len_train + len_train] = X_train
    all_Y_train[i*len_train:i*len_train + len_train] = Y_train
    all_X_val[i*len_val:i*len_val + len_val] = X_val
    all_Y_val[i*len_val:i*len_val + len_val] = Y_val
    all_X_test[i*1000:i*1000 + 1000] = X_test
    
clf = SPR(50, False)
clf.fit(all_X_train, all_Y_train)
y_pred_train =clf.predict(all_X_train)
y_pred_val = clf.predict(all_X_val)
y_pred_test = clf.predict(all_X_test)
score_train = clf.score(y_pred_train, all_Y_train)
score_val = clf.score(y_pred_val, all_Y_val)
print(f"Accuracy on train set / val set {i} : {score_train} / {score_val}")

Accuracy on train set / val set 2 : 0.5735 / nan




**Dot product:**
- Accuracy on train set / val set 2 : 0.5741666666666667 / 0.5716666666666667


**Squared kernel (<.,.>²):**
- Accuracy on train set / val set 2 : 0.5741666666666667 / 0.5716666666666667 $(c=0)$

**Gaussian kernel:**
- Accuracy on train set / val set 2 : 0.6504166666666666 / 0.57 $(\gamma = 100)$
- Accuracy on train set / val set 2 : 0.5702083333333333 / 0.5691666666666667 $(\gamma = 10)$
- Accuracy on train set / val set 2 : 0.5889583333333334 / 0.5725 $(\gamma = 50)$

## Save results

In [33]:
def save_results(filename, results):
    """
    Save results in a csv file
    
    Parameters
    -----------
    - filename : string
        Name of the file to be saved under the ``results`` folder
        
    - results : numpy.array
        Resulting array (0 and 1's)
    """
    
    assert filename.endswith(".csv"), "this is not a csv extension!"
    # Convert results to int
    results = results.astype("int")
    
    with open(os.path.join(RESULT_DIR, filename), 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        # Write header
        writer.writerow(["Id", "Bound"]) 
        assert len(results) == 3000, "Expected exactly 3000 predictions"
        # Write results
        for i in range(len(results)):
            writer.writerow([i, results[i]])

In [34]:
# Test the save results function
save_results("results3.csv", results)
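
Before uploading, the written file can be read back and checked against the format that `save_results` produces: an `Id,Bound` header followed by one row per prediction, with consecutive ids and 0/1 values. The sketch below uses a hypothetical helper `check_submission` and a three-row toy file instead of the real output.

```python
import csv
import os
import tempfile

def check_submission(path, n_expected=3000):
    """Raise AssertionError if the file does not look like a valid submission."""
    with open(path, "r", newline="") as csvfile:
        rows = list(csv.reader(csvfile, delimiter=","))
    header, body = rows[0], rows[1:]
    assert header == ["Id", "Bound"], f"unexpected header: {header}"
    assert len(body) == n_expected, f"expected {n_expected} rows, got {len(body)}"
    for i, (row_id, bound) in enumerate(body):
        assert int(row_id) == i, f"row {i} has id {row_id}"
        assert bound in ("0", "1"), f"row {i} has label {bound}"
    return True

# Toy file with three predictions
with tempfile.TemporaryDirectory() as tmp:
    toy_path = os.path.join(tmp, "toy_results.csv")
    with open(toy_path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        writer.writerow(["Id", "Bound"])
        for i, label in enumerate([0, 1, 1]):
            writer.writerow([i, label])
    ok = check_submission(toy_path, n_expected=3)
```

For the real file the call would be `check_submission(os.path.join(RESULT_DIR, "results3.csv"))`, using the default of 3000 rows.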