# Challenge Scratchbook

* This notebook explores methods for the Kernel Methods for Machine Learning Kaggle [challenge](https://www.kaggle.com/c/kernel-methods-for-machine-learning-2018-2019/data).

* Note that this is a binary classification challenge.

Our first goal is to implement three baseline methods:
1. Random classification
2. All instances predicted as 0 (this gives an idea of the proportion of 0s in the public test set)
3. The Simple Pattern Recognition algorithm (SPR) from Learning with Kernels
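
The random baseline (item 1) is not implemented in the cells of this notebook. A minimal sketch follows, assuming the same submission format as the all-zeros cell (an `Id,Bound` header and 3000 rows). The helper names `random_predictions` and `write_submission` are hypothetical and are not part of `utils`.

```python
import csv

import numpy as np


def random_predictions(n_samples=3000, seed=0):
    """Draw labels uniformly from {0, 1}; the seed is only for reproducibility."""
    rng = np.random.default_rng(seed)
    return rng.integers(0, 2, size=n_samples)


def write_submission(path, predictions):
    """Write predictions as a two-column CSV with header Id,Bound (assumed format)."""
    with open(path, "w", newline="") as csvfile:
        writer = csv.writer(csvfile, delimiter=",")
        writer.writerow(["Id", "Bound"])
        for i, label in enumerate(predictions):
            writer.writerow([i, int(label)])


predictions = random_predictions()
```

On a roughly balanced test set this baseline should score close to 0.5, which gives a floor against which to read the other methods.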

Before that, we need data loaders; they live in `utils.data` (`load_data`, `save_results`).

Once the baselines are in place, the goal is to implement an SVM with a Gaussian kernel.

## Imports

In [5]:
import csv
import os
import numpy as np
from scipy import optimize
from tqdm import tqdm_notebook

from utils.data import load_data, save_results
from utils.models import SVM, SPR
from utils.kernels import GaussianKernel

## Paths and Globals

In [6]:
CWD = os.getcwd()
DATA_DIR = os.path.join(CWD, "data")
RESULT_DIR = os.path.join(CWD, "results")

FILES = {0: {"train_mat": "Xtr0_mat100.csv",
             "train": "Xtr0.csv",
             "test_mat": "Xte0_mat100.csv",
             "test": "Xte0.csv",
             "label": "Ytr0.csv"},
         1: {"train_mat": "Xtr1_mat100.csv",
             "train": "Xtr1.csv",
             "test_mat": "Xte1_mat100.csv",
             "test": "Xte1.csv",
             "label": "Ytr1.csv"},
         2: {"train_mat": "Xtr2_mat100.csv",
             "train": "Xtr2.csv",
             "test_mat": "Xte2_mat100.csv",
             "test": "Xte2.csv",
             "label": "Ytr2.csv"}}

## All-zeros baseline

In [7]:
if False:
    with open(os.path.join(RESULT_DIR, "results.csv"), 'w', newline='') as csvfile:
        writer = csv.writer(csvfile, delimiter=',')

        writer.writerow(["Id", "Bound"])
        for i in range(3000):
            writer.writerow([i, 0])

**Comment:**

* The all-zeros submission scores 0.51266 on the public leaderboard, so about 51% of the public test labels are 0: the classes are close to balanced.

## SPR

* Simple Pattern Recognition algorithm with Gaussian kernel
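
The `SPR` class in `utils.models` is not shown in this notebook, so the following is only a sketch of the algorithm as described in Learning with Kernels: a point is assigned to the class whose mean in feature space is closer, which reduces to comparing mean kernel similarities plus an offset `b`. The sketch assumes labels in {0, 1} and the parameterization `k(x, y) = exp(-γ‖x − y‖²)`; the actual `GaussianKernel(γ)` may scale γ differently. The function names are hypothetical.

```python
import numpy as np


def gaussian_gram(A, B, gamma):
    """Gram matrix K[i, j] = exp(-gamma * ||A[i] - B[j]||^2) (assumed parameterization)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


def spr_fit_predict(X_train, y_train, X_new, gamma):
    """Simple pattern recognition: compare mean similarity to each class.

    Labels are assumed to be in {0, 1}; predictions are returned in {0, 1}.
    """
    pos = X_train[y_train == 1]
    neg = X_train[y_train == 0]
    # Offset: half the difference of the squared norms of the two class means
    b = 0.5 * (
        gaussian_gram(neg, neg, gamma).mean()
        - gaussian_gram(pos, pos, gamma).mean()
    )
    # Mean similarity to positives minus mean similarity to negatives, plus offset
    scores = (
        gaussian_gram(X_new, pos, gamma).mean(axis=1)
        - gaussian_gram(X_new, neg, gamma).mean(axis=1)
        + b
    )
    return (scores > 0).astype(int)
```

There is no regularization parameter in this classifier: the only quantity to tune is the kernel parameter γ.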

In [8]:
γ = 500   # Gaussian kernel parameter
λ = 5e-5  # not passed to SPR; only printed so the log line matches the SVM runs
kernel = GaussianKernel(γ)

len_files = len(FILES)
for i in range(len_files):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SPR(kernel)
    clf.fit(X_train, Y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)

    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 0 : 0.99875 / 0.585 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 2 : 1.0 / 0.5775 (λ: 5e-05,γ: 500)


## Train and test on the different sets

In [4]:
results = np.zeros(3000)

for i in range(len(FILES)):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    # Pass the kernel explicitly, as in the validation cell above
    clf = SPR(kernel)
    clf.fit(X_train, Y_train)
    results[i*1000:i*1000 + 1000] = clf.predict(X_test)

### Save results

In [6]:
# Test the save results function
save_results("test_results.csv", results, result_dir=RESULT_DIR)

## SVM with Gaussian Kernel

### Comparison with ``scikit-learn`` implementation

In [10]:
γ = 500
λ = 5e-5
kernel = GaussianKernel(γ)

len_files = len(FILES)
for i in range(len_files):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SVM(_lambda=λ, kernel=kernel)
    clf.fit(X_train, Y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    score_train = clf.score(y_pred_train, Y_train)
    score_val = clf.score(y_pred_val, Y_val)

    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 0 : 1.0 / 0.575 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 1 : 1.0 / 0.7275 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 2 : 1.0 / 0.6375 (λ: 5e-05,γ: 500)


In [11]:
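# n is the full training-set size per dataset (1600 train + 400 validation).
# The fits above use only the first 1600 samples, for which C would be 6.25.
n = 2000
print(f"C: {1/(2 * n * λ)}")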

C: 5.0
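
The conversion used above can be sketched as follows, assuming the custom `SVM` minimizes `(1/n) Σ hinge + λ‖f‖²` (this is inferred from the formula `1/(2nλ)`, not confirmed by the notebook). Dividing that objective by `2λ` gives `C Σ hinge + ½‖f‖²` with `C = 1/(2nλ)`, which is the form scikit-learn's `SVC` uses. The helper name `lambda_to_C` is hypothetical.

```python
def lambda_to_C(n_samples, lam):
    """Map the regularization weight λ of (1/n) Σ hinge + λ‖f‖² to the C of C Σ hinge + ½‖f‖²."""
    return 1.0 / (2.0 * n_samples * lam)


# C depends on the number of samples actually used for fitting
print(f"n = 2000: C = {lambda_to_C(2000, 5e-5)}")
print(f"n = 1600: C = {lambda_to_C(1600, 5e-5)}")
```

Because `C` scales with `1/n`, comparing the two implementations on a 1600-sample split with the value computed for 2000 samples means they are not solving exactly the same problem, which may account for part of the gap in validation accuracy.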


In [13]:
from sklearn.svm import SVC

len_files = len(FILES)
for i in range(len_files):
    X_train, Y_train, X_test = load_data(i, data_dir=DATA_DIR, files_dict=FILES)
    X_val = X_train[1600:]
    Y_val = Y_train[1600:]
    X_train = X_train[:1600]
    Y_train = Y_train[:1600]
    clf = SVC(C=5.0, kernel="rbf", gamma=500)
    clf.fit(X_train, Y_train)
    y_pred_train = clf.predict(X_train)
    y_pred_val = clf.predict(X_val)
    score_train = np.mean(Y_train == y_pred_train)
    score_val = np.mean(Y_val == y_pred_val)

    print(f"Accuracy on train set / val set {i} : {score_train} / {score_val} (λ: {λ},γ: {γ})")

Accuracy on train set / val set 0 : 1.0 / 0.5725 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 1 : 1.0 / 0.705 (λ: 5e-05,γ: 500)
Accuracy on train set / val set 2 : 1.0 / 0.5875 (λ: 5e-05,γ: 500)
