# Practical session ML: Classification

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import random
from sklearn import datasets, neighbors, metrics, tree, svm, preprocessing

# Calculates the per-class accuracy given predicted and true output labels.
def class_accs(y_pred, y_true):
    acc0 = ((y_pred == y_true) & (y_true == 0)).sum() / (y_true == 0).sum()
    acc1 = ((y_pred == y_true) & (y_true == 1)).sum() / (y_true == 1).sum()
    return acc0, acc1

# Prints a summary of performance metrics given predicted and true output labels.
def print_metrics(y_pred, y_true):
    f1 = metrics.f1_score(y_true, y_pred)
    acc = metrics.accuracy_score(y_true, y_pred)
    acc0, acc1 = class_accs(y_pred, y_true)
    print(f'\tF1 = {f1}')
    print(f'\tAccuracy = {acc}')
    print(f'\t\tclass 0: {acc0}')
    print(f'\t\tclass 1: {acc1}')
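
To make the per-class accuracy concrete, here is a minimal sketch on a hand-made example. The toy arrays `y_true` and `y_pred` are hypothetical and are not part of the session data; `class_accs` is repeated so that the snippet runs on its own.

```python
import numpy as np

# Same helper as above, repeated so that this snippet is self-contained.
def class_accs(y_pred, y_true):
    acc0 = ((y_pred == y_true) & (y_true == 0)).sum() / (y_true == 0).sum()
    acc1 = ((y_pred == y_true) & (y_true == 1)).sum() / (y_true == 1).sum()
    return acc0, acc1

# Hypothetical toy labels: three samples of class 0, two samples of class 1.
y_true = np.array([0, 0, 0, 1, 1])
y_pred = np.array([0, 1, 0, 1, 1])

# One of the three class-0 samples is misclassified; both class-1 samples are correct.
acc0, acc1 = class_accs(y_pred, y_true)
overall = (y_pred == y_true).mean()
print(f"class 0: {acc0:.3f}, class 1: {acc1:.3f}, overall: {overall:.3f}")
```

Here the overall accuracy (4 out of 5 correct) hides the fact that the two classes are not predicted equally well.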

## Part 1: Wisconsin breast cancer
The first dataset we will be using is the Wisconsin breast cancer dataset. It contains characteristics of cell nuclei measured in medical images of breast masses. The task is to classify each set of measurements as positive ("benign", label 1) or negative ("malignant", label 0). First, we load the data and explore its size.

In [2]:
# load the Wisconsin breast cancer data set
wisconsin = datasets.load_breast_cancer()
X = wisconsin["data"]
y = wisconsin["target"]
names = wisconsin['target_names']

# explore the data set
print(f"Data set contains {X.shape[0]} instances with {X.shape[1]} features.")
print(f"The different classes are {wisconsin['target_names']}.")
print(f"{names[0]} contains {(y == 0).sum()} samples, {names[1]} contains {(y == 1).sum()} samples.")

Data set contains 569 instances with 30 features.
The different classes are ['malignant' 'benign'].
malignant contains 212 samples, benign contains 357 samples.


In [3]:
wisconsin.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

We first apply a simple, linear SVM to predict the class of unseen samples:

In [4]:
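# Use the first 300 samples for training and the rest for testing (no shuffling).
X_train = X[:300]
y_train = y[:300]
X_test = X[300:]
y_test = y[300:]

# Fit a linear-kernel SVM on the training part and evaluate it on the held-out part.
sv = svm.SVC(kernel='linear')
sv.fit(X_train, y_train)
y_pred = sv.predict(X_test)
print_metrics(y_pred, y_test)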

	F1 = 0.9696969696969697
	Accuracy = 0.9553903345724907
		class 0: 0.9848484848484849
		class 1: 0.9458128078817734


**Question:** Which of the previous metrics would you use? Can you think of another metric that would reflect the classifier's accuracy on this dataset?

### Exercise 1: k-fold cross validation

A better way to evaluate a classifier on a dataset is k-fold cross validation (e.g. with k=5). We divide the dataset into k parts ("folds"), train the algorithm on k-1 of them, and test it on the remaining fold. We repeat this process k times, so that each fold serves as the test set exactly once, and average the results to get a more representative picture of the classifier's real performance.
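
As an illustration of the splitting step only, the sketch below uses scikit-learn's `StratifiedKFold` on a hypothetical toy label array (`X_toy` and `y_toy` are made up for this example). It shows that each test fold keeps the class ratio of the full set. The training and evaluation loop is left to the exercise, and writing the split by hand is an equally valid approach.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy labels: 10 samples of class 0 and 5 of class 1 (ratio 2:1).
y_toy = np.array([0] * 10 + [1] * 5)
X_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder features

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
fold_counts = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X_toy, y_toy)):
    # Count how many samples of each class end up in this test fold.
    counts = np.bincount(y_toy[test_idx], minlength=2)
    fold_counts.append(counts.tolist())
    print(f"fold {fold}: train size = {len(train_idx)}, test class counts = {counts}")
```

With these toy labels, every test fold contains two samples of class 0 and one of class 1, so the 2:1 ratio is preserved in each fold.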


**Exercise:** Implement a function for k-fold cross validation. Use stratified CV, meaning that the original class distribution is respected.

In [5]:
def cross_validation(X, y, folds, clf):
    pass  # TODO

### Exercise 2: comparison of SVM kernels
Now use the function you just wrote to compare the performance of an SVM on this dataset with a linear kernel versus an RBF kernel. If you weren't able to finish the previous exercise, you can use the given `print_metrics` function instead.

**Question:** What can you conclude from this experiment? What do you notice regarding the speed of the different algorithms?

### Exercise 3: comparison of algorithms

Compare the performance of kNN, SVM and Decision Trees on the Wisconsin data set. Vary the internal parameters (number of nearest neighbors, impurity measure, kernel, etc.). Consult the [API documentation](http://scikit-learn.org/stable/modules/classes.html) for more details.
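
As a possible starting point, the sketch below shows where the internal parameters mentioned above are set in scikit-learn. It runs on a synthetic dataset from `make_classification` (an assumption made so that the snippet is self-contained), and the parameter values are arbitrary examples, not recommendations.

```python
from sklearn import datasets, neighbors, svm, tree

# Synthetic stand-in data so that the snippet runs on its own.
X_demo, y_demo = datasets.make_classification(n_samples=200, n_features=5, random_state=0)

# One example configuration per algorithm; vary these parameters in the exercise.
classifiers = {
    "kNN (k=3)": neighbors.KNeighborsClassifier(n_neighbors=3),
    "Decision tree (entropy)": tree.DecisionTreeClassifier(criterion="entropy", random_state=0),
    "SVM (RBF kernel)": svm.SVC(kernel="rbf"),
}

scores = {}
for name, clf in classifiers.items():
    # Train on the first 150 samples and report accuracy on the remaining 50.
    clf.fit(X_demo[:150], y_demo[:150])
    scores[name] = clf.score(X_demo[150:], y_demo[150:])
    print(f"{name}: accuracy = {scores[name]:.3f}")
```

For the actual comparison, replace the synthetic data with the Wisconsin data and the single split with your cross validation function.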

**Question:** What can we conclude from this experiment?

## Part 2: XOR dataset

We now switch to a second, synthetically generated dataset.

In [6]:
np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0,
                       X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 0, 1)
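
To see how the labels are laid out, the following sketch regenerates the same data and checks that a point gets label 1 exactly when its two coordinates have the same sign. The variable `same_sign` is introduced here for illustration only.

```python
import numpy as np

# Regenerate the XOR data exactly as above.
np.random.seed(0)
X_xor = np.random.randn(200, 2)
y_xor = np.logical_xor(X_xor[:, 0] > 0, X_xor[:, 1] > 0)
y_xor = np.where(y_xor, 0, 1)

# Label 1: both coordinates positive or both non-positive (two opposite quadrants).
# Label 0: the coordinates have different signs (the two other quadrants).
same_sign = (X_xor[:, 0] > 0) == (X_xor[:, 1] > 0)
labels_match = np.array_equal(y_xor == 1, same_sign)
print(f"label 1 <=> same sign: {labels_match}")
print(f"class counts: {np.bincount(y_xor)}")
```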

**Question:** Apply the same linear SVM that we applied to the Wisconsin dataset earlier. Does it work well here? Can you explain why?

**Question:** Can you think of other methods that you could use to classify this dataset? For every method, explain why it would work better on this type of data.

## Part 3: Titanic dataset
Another very popular classification dataset is the Titanic dataset, in which we try to predict the survival of passengers on the Titanic, based on their class, age, sex, etc. In this part, we try out a few different algorithms on this dataset.

First, we load the data. It has already been converted to vector form (with categorical values converted to a dummy representation), and only needs to be normalized. If you are interested in the original dataset, see https://www.kaggle.com/c/titanic/data.

In [7]:
X = np.loadtxt("titanic.csv", delimiter=",")
y = np.loadtxt("labels.csv", delimiter=",").astype(int)  # np.int was removed in NumPy 1.24
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)
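
For reference, here is a small sketch of what `MinMaxScaler` does with its default settings: each feature (column) is rescaled independently to the range [0, 1] via `(x - min) / (max - min)`. The array `X_toy` is a hypothetical example, not the Titanic data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data: 3 samples, 2 features on very different scales.
X_toy = np.array([[1.0, 10.0],
                  [2.0, 30.0],
                  [4.0, 20.0]])

scaler = preprocessing.MinMaxScaler()
X_scaled = scaler.fit_transform(X_toy)

# The same rescaling written out by hand, column by column.
col_min = X_toy.min(axis=0)
col_max = X_toy.max(axis=0)
X_manual = (X_toy - col_min) / (col_max - col_min)

print(X_scaled)
print(np.allclose(X_scaled, X_manual))
```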

Apply a number of classification algorithms to the Titanic dataset, and see which ones work best:
- kNN (for varying k)
- Decision tree
- SVM (linear/rbf/poly kernel)