# Programming Exercise 2
**Christian Steinmetz**

Due on November 27th

Pick a binary classification dataset from the LIBSVM repository:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/


In [114]:
import time
import graphviz 
import itertools
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import IFrame
import matplotlib as mpl

from sklearn import tree
from sklearn import metrics
from sklearn import preprocessing

mpl.rcParams['figure.dpi'] = 100
%config InlineBackend.figure_format = 'retina'

## Dataset
For this exercise we use the [Mushrooms dataset](https://www.kaggle.com/uciml/mushroom-classification/data), which includes features from different mushroom species and a label indicating whether each mushroom is edible or poisonous. Given a set of mushroom features, we want to predict whether the mushroom is safe to eat.

<img src='https://images.unsplash.com/photo-1512595765784-5ebad80772a3?ixlib=rb-1.2.1&ixid=eyJhcHBfaWQiOjEyMDd9&auto=format&fit=crop&w=800&q=60' width=400>


## Loading and pre-processing
First we load the data from the .csv file using pandas and encode every column as integers, since the original dataset stores all values as strings. We then split the data into instance features $X$ and labels $Y$.

In [223]:
df = pd.read_csv("./data/mushrooms.csv", dtype='category')
df

Unnamed: 0,class,cap-shape,cap-surface,cap-color,bruises,odor,gill-attachment,gill-spacing,gill-size,gill-color,...,stalk-surface-below-ring,stalk-color-above-ring,stalk-color-below-ring,veil-type,veil-color,ring-number,ring-type,spore-print-color,population,habitat
0,p,x,s,n,t,p,f,c,n,k,...,s,w,w,p,w,o,p,k,s,u
1,e,x,s,y,t,a,f,c,b,k,...,s,w,w,p,w,o,p,n,n,g
2,e,b,s,w,t,l,f,c,b,n,...,s,w,w,p,w,o,p,n,n,m
3,p,x,y,w,t,p,f,c,n,n,...,s,w,w,p,w,o,p,k,s,u
4,e,x,s,g,f,n,f,w,b,k,...,s,w,w,p,w,o,e,n,a,g
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8119,e,k,s,n,f,n,a,c,b,y,...,s,o,o,p,o,o,p,b,c,l
8120,e,x,s,n,f,n,a,c,b,y,...,s,o,o,p,n,o,p,b,v,l
8121,e,f,s,n,f,n,a,c,b,n,...,s,o,o,p,o,o,p,b,c,l
8122,p,k,y,n,f,y,f,c,n,b,...,k,w,w,p,w,o,e,w,v,l


In [224]:
df = df.apply(preprocessing.LabelEncoder().fit_transform)
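To make the encoding step concrete, here is a minimal sketch on a hypothetical toy frame (the column names mirror the dataset, but the rows are made up for illustration). It shows that a fresh fit happens per column and that labels are numbered in alphabetical order.

```python
import pandas as pd
from sklearn import preprocessing

# Hypothetical toy frame; values are illustrative, not taken from the dataset.
toy = pd.DataFrame({"class": ["p", "e", "e"], "odor": ["p", "a", "l"]})

# fit_transform is re-run on every column, so each column gets its own mapping.
# Classes are sorted alphabetically: in 'class', 'e' -> 0 and 'p' -> 1.
encoded = toy.apply(preprocessing.LabelEncoder().fit_transform)
print(encoded)
```

Integer codes impose an arbitrary order on nominal categories. Decision trees usually cope with this, but distance-based models such as an RBF-kernel SVM may behave better with a one-hot encoding.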

In [225]:
def get_data(train_split):
    # Sample `train_split` row indices without replacement for training;
    # all remaining rows form the test set.
    all_idx = np.arange(len(df))
    train_idx = np.random.choice(all_idx, size=train_split, replace=False)
    test_idx = np.setdiff1d(all_idx, train_idx)

    y_train = df.loc[train_idx, 'class']
    x_train = df.loc[train_idx, df.columns != 'class']

    y_test = df.loc[test_idx, 'class']
    x_test = df.loc[test_idx, df.columns != 'class']

    return x_train, y_train, x_test, y_test
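As an alternative to hand-rolled index sampling, scikit-learn's `train_test_split` produces the same kind of random partition and can also stratify by label. The sketch below uses a hypothetical stand-in frame (`toy`, with made-up columns `f1` to `f3`) rather than the mushroom data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the encoded frame: a binary 'class' plus three features.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(0, 3, size=(100, 4)),
                   columns=["class", "f1", "f2", "f3"])
toy["class"] = toy["class"] % 2

X = toy.drop(columns="class")
y = toy["class"]

# 80/20 split; stratify=y keeps the class proportions similar in both parts.
x_train, x_test, y_train, y_test = train_test_split(
    X, y, train_size=0.8, stratify=y, random_state=0
)
print(x_train.shape, x_test.shape)
```

Changing `random_state` yields a different partition, which is what the repeated experiment in the next section relies on.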

## 1. Decision Trees
Partition the dataset into a training and a testing set.
Run a decision tree learning algorithm using the training set. Test the
decision tree on the testing dataset and report the total classification error
(i.e. 0/1 error). Repeat the experiment with a different partition. Plot
the resulting trees. Are they very similar, or very different? Explain why.

Advice: it can be convenient to set a maximum depth for the tree.

In [267]:
def graph_tree(trained_tree, name="tree"):
    # LabelEncoder sorts labels alphabetically ('e' -> 0, 'p' -> 1), and
    # export_graphviz expects class names in ascending numerical order.
    dot_data = tree.export_graphviz(trained_tree,
                    out_file=None, feature_names=df.columns[1:23],
                    class_names=["e", "p"], rounded=True, proportion=False,
                    precision=2, filled=True)
    graph = graphviz.Source(dot_data)
    graph.format = 'svg'
    graph.render(f"./plots/{name}")
    return IFrame(f"./plots/{name}.svg", width='100%', height=800)

In [268]:
x_train, y_train, x_test, y_test = get_data(int(len(df)*0.8))
clfA = tree.DecisionTreeClassifier(random_state=0, max_depth=4)
clfA = clfA.fit(x_train, y_train)
y_hat = clfA.predict(x_test)
acc = metrics.accuracy_score(y_test, y_hat)
print(acc)

0.9766153846153847


In [269]:
x_train, y_train, x_test, y_test = get_data(int(len(df)*0.8))
clfB = tree.DecisionTreeClassifier(random_state=42, max_depth=4)
clfB = clfB.fit(x_train, y_train)
y_hat = clfB.predict(x_test)
acc = metrics.accuracy_score(y_test, y_hat)
print(acc)

0.9778461538461538


Let's compare the two trees, which were trained on different random partitions of the data.

In [270]:
graph_tree(clfA, name="treeA")

In [271]:
graph_tree(clfB, name="treeB")

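We notice that the two trees learn different splits, but they do share some similar qualities. The deviation between trees increases as the depth of the trees increases, since there are more degrees of freedom: splits near the root are chosen using almost all of the training samples, so the best feature is stable across partitions, whereas deeper nodes see fewer samples and their split choice is more sensitive to which instances were drawn. We found that when the depth was less than 3, with our dataset, both trees learned the same leaves, even across many different random samples of the data.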
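The depth-dependent stability described above can be probed empirically. The following is a rough sketch on synthetic stand-in data (an assumption; it is not the mushroom dataset, so the counts it prints will differ from what we observed). The helper `split_features` is a hypothetical name introduced here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); swap in the encoded mushroom features.
X, y = make_classification(n_samples=2000, n_features=10,
                           n_informative=4, random_state=0)

def split_features(depth, seed):
    """Train on a random 80% partition and return the split feature of every node."""
    x_tr, _, y_tr, _ = train_test_split(X, y, train_size=0.8, random_state=seed)
    clf = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(x_tr, y_tr)
    # tree_.feature stores the feature index per node; negative values mark leaves.
    return tuple(clf.tree_.feature)

for depth in (1, 2, 4, 6):
    distinct = {split_features(depth, seed) for seed in range(10)}
    print(f"max_depth={depth}: {len(distinct)} distinct structures over 10 partitions")
```

A count of 1 means every partition produced the same sequence of split features; larger counts indicate that the trees diverge.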

In [272]:
x_train, y_train, x_test, y_test = get_data(int(len(df)*0.8))
clf = tree.DecisionTreeClassifier(random_state=0, max_depth=7)
clf = clf.fit(x_train, y_train)
y_hat = clf.predict(x_test)
acc = metrics.accuracy_score(y_test, y_hat)
print(acc)

1.0


With a maximum depth of 7, the tree reaches 100% accuracy (zero 0/1 error) on the test set.
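The exercise asks for the 0/1 error, which is the complement of the accuracy printed by the cells above. A minimal sketch with hypothetical label vectors (illustrative values only):

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, for illustration only.
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

acc = metrics.accuracy_score(y_true, y_pred)
err = metrics.zero_one_loss(y_true, y_pred)  # fraction of misclassified samples

print(acc, err)
```

By the same relation, the accuracies of 0.9766 and 0.9778 reported for the two depth-4 trees correspond to 0/1 errors of roughly 0.0234 and 0.0222.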

## 2. Support Vector Machines
Run SVM to train a classifier, using radial basis as kernel function. Apply cross-validation to evaluate different
combinations of values of the model hyper-parameters (box constraint $C$
and kernel parameter $\gamma$). How sensitive is the cross-validation error to
changes in $C$ and $\gamma$? Choose the combination of $C$ and $\gamma$ that minimizes
the cross-validation error, train the SVM on the entire dataset and report
the total classification error.

Advice: use a logarithmic range for $\gamma$.
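One possible way to set up this experiment is a grid search with cross-validation. The sketch below is not a result for the mushroom data: it runs on synthetic stand-in data, and the grid ranges for $C$ and $\gamma$ are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in data (assumption); swap in the encoded mushroom features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Logarithmic grids for C and gamma; the ranges are illustrative choices.
param_grid = {
    "C": np.logspace(-2, 2, 5),
    "gamma": np.logspace(-4, 0, 5),
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)  # refit=True (default) retrains the best model on all of X, y

# Mean cross-validation error per (C, gamma) pair; rows index C, columns gamma.
cv_error = 1.0 - search.cv_results_["mean_test_score"].reshape(5, 5)

print("best params:", search.best_params_)
print("best CV error:", 1.0 - search.best_score_)
print("error on the full dataset:", 1.0 - search.score(X, y))
```

Inspecting `cv_error` (for example as a heatmap) shows how sensitive the cross-validation error is to each hyper-parameter.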

## 3. Neural Networks
Train a Multi-Layer perceptron using the cross-entropy loss with $\ell_2$ regularization (weight decay penalty). In other
words, the activation function equals the logistic function. Plot curves
of the training and validation error as a function of the penalty strength
$\alpha$. How do the curves behave? Explain why.

Advice: use a logarithmic range for the hyper-parameter $\alpha$. Experiment with
different sizes of the training/validation sets and different model parameters (network layers).
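A possible starting point for this experiment is scikit-learn's `MLPClassifier`, whose `alpha` parameter is the strength of the L2 penalty and which minimizes the cross-entropy (log-loss). The sketch below uses synthetic stand-in data, and the layer size, iteration budget, and range of $\alpha$ are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (assumption); swap in the encoded mushroom features.
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Logarithmic range for the penalty strength alpha (illustrative choice).
alphas = np.logspace(-4, 2, 7)

# One hidden layer of 16 logistic units; sizes are assumptions for this sketch.
mlp = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                    max_iter=500, random_state=0)

# Scores have shape (n_alphas, n_folds); average over the folds.
train_scores, val_scores = validation_curve(
    mlp, X, y, param_name="alpha", param_range=alphas, cv=3, scoring="accuracy"
)
train_error = 1.0 - train_scores.mean(axis=1)
val_error = 1.0 - val_scores.mean(axis=1)

for a, tr, va in zip(alphas, train_error, val_error):
    print(f"alpha={a:g}  train error={tr:.3f}  validation error={va:.3f}")
```

The two error arrays can be plotted against `alphas` on a logarithmic x-axis (for instance with `plt.semilogx`). Typically a very large $\alpha$ shrinks the weights toward zero and raises both errors, while a very small $\alpha$ leaves the network free to overfit, though the exact shape depends on the data and the network size.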