# Workshop SL02: Classification

### Learning Objectives
* Naive Bayes
* K-Nearest Neighbours
* Decision Trees
* Random Forests
* Support Vector Machines

### Dataset Anatomy

A dataset is a collection of records, also called instances (analogous to rows in a database table).
* **Record**, the data (label and features) pertaining to a single item in the dataset (i.e. one of the rows of a table); a dataset is composed of N records.

In classification tasks, each record is composed of features and a label.
* **Feature**, a property describing instances in the dataset (i.e. one of the columns of the table); each instance is described by a vector of M features.
* **Label**, the property of an instance that you want to predict (i.e. also one of the columns of the table); the label is known for instances of the training/test datasets, but is unknown for unseen data.
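
To make the terms above concrete, here is a minimal sketch using a made-up four-row table (the values are hypothetical and are not taken from the Titanic files). It shows how records, features and the label map onto a `pandas` DataFrame.

```python
import pandas as pd

# a hypothetical toy dataset: each row is one record (instance)
toy = pd.DataFrame({
    "Age": [22, 38, 26, 35],      # feature
    "Sex": [0, 1, 1, 0],          # feature, already encoded as a number
    "Survived": [0, 1, 1, 0],     # the label we want to predict
})

# the feature matrix X: N records by M features
X = toy.drop(["Survived"], axis=1)

# the label series y: one label per record
y = toy["Survived"]

print(X.shape)              # (4, 2): N=4 records, M=2 features
print(y.shape)              # (4,)
print(X.iloc[0].tolist())   # feature vector of the first record: [22, 0]
```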

### Classification and Clustering

Two major paradigms in data science are (1) classification and (2) clustering.
* **Classification**, mapping feature vectors to discrete labels (e.g. given a passenger's Age and Sex, predict whether they Survived).
* **Clustering**, grouping instances according to some similarity metric (e.g. given a dataset of people's Nationality, Age, LanguageSpoken, Ethnicity, group instances into clusters).

We're going to focus on classification in this workshop.
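
As a rough illustration of the difference between the two paradigms, the sketch below runs a classifier and a clustering algorithm over the same hypothetical 2D points. The choice of `KNeighborsClassifier` and `KMeans` is just for illustration; the key point is that the classifier is given the labels `y` and the clustering algorithm is not.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

# hypothetical toy data: two well-separated groups of 2D points
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels, only used by the classifier

# classification: learn a mapping from feature vectors to known labels
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
pred = clf.predict([[1.1, 0.9]])  # predicted label for an unseen point
print(pred)

# clustering: group the same points without ever seeing y
km = KMeans(n_clusters=2, n_init=10, random_state=0)
groups = km.fit_predict(X)
print(groups)  # cluster ids are arbitrary; only the grouping matters
```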

In [60]:
import pandas as pd

In [61]:
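# read in the training and testing datasets
train = pd.read_csv("datasets/titanic_train_clean.csv")
train.bfill(inplace=True)  # back-fill missing values (fillna(method="bfill") is deprecated in recent pandas)
# NOTE: test is loaded from the same file as train, so the scores below are measured on data the classifiers were fitted on
test = pd.read_csv("datasets/titanic_train_clean.csv")
test.bfill(inplace=True)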

#### K-Nearest Neighbours

This is the first classifier that we're going to mess around with. Watch https://www.youtube.com/watch?v=4ObVzTuFivY, then read through the worked example.

The scientific package that we're going to use is `sklearn` (scikit-learn); it ships with implementations of most of the commonly used classification (and clustering) algorithms.
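
For intuition about what a KNN classifier does, here is a minimal from-scratch sketch using `numpy`. It is not how `sklearn` implements it; the function name `knn_predict` and the toy points are made up for illustration, and it assumes Euclidean distance with a simple majority vote.

```python
from collections import Counter

import numpy as np


def knn_predict(X, y, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbours."""
    X = np.asarray(X, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training record
    dists = np.sqrt(((X - query) ** 2).sum(axis=1))
    # indices of the k closest training records
    nearest = np.argsort(dists)[:k]
    # majority vote over the labels of those records
    votes = Counter(y[i] for i in nearest)
    return votes.most_common(1)[0][0]


# hypothetical toy data: three points near (1, 1) and three near (8, 8)
toy_X = [[1, 1], [1, 2], [2, 1], [8, 8], [9, 8], [8, 9]]
toy_y = [0, 0, 0, 1, 1, 1]

print(knn_predict(toy_X, toy_y, [2, 2]))  # 0
print(knn_predict(toy_X, toy_y, [7, 9]))  # 1
```

Note that there is no real "training" step here: the classifier just stores the training records and does all of its work at prediction time.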

In [62]:
# extract the label series from the training/test data
train_y = train["Survived"]
test_y = test["Survived"]

# extract the feature matrix from the training/test data
train_X = train.drop(["PassengerId", "Survived"], inplace=False, axis=1)
test_X = test.drop(["PassengerId", "Survived"], inplace=False, axis=1)

In [63]:
# import the KNN classifier from sklearn
from sklearn.neighbors import KNeighborsClassifier

# define the classifier object
clf_knn = KNeighborsClassifier()

# train the classifier on our training data
clf_knn.fit(train_X, train_y)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [64]:
# check the accuracy of the classifier (test_X was loaded from the training file, so this is accuracy on the training data)
clf_knn.score(test_X, test_y)

0.8013468013468014
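
Because `test` is read from the same file as `train`, the score above is measured on records the classifier has already seen. One common way to get an estimate on unseen records is to hold some back before fitting. The sketch below shows the idea with `train_test_split`; it uses synthetic data from `make_classification` as a stand-in (an assumption, so that the example runs without the Titanic files), and the variable names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the Titanic features (an assumption, not the real data)
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# hold out 25% of the records; the classifier never sees them during fit
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

clf = KNeighborsClassifier()
clf.fit(X_tr, y_tr)

train_acc = clf.score(X_tr, y_tr)
heldout_acc = clf.score(X_te, y_te)
print(f"accuracy on training data: {train_acc:.3f}")
print(f"accuracy on held-out data: {heldout_acc:.3f}")
```

To try the same thing on the workshop data, pass `train_X` and `train_y` to `train_test_split` in place of the synthetic `X` and `y`.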

### Exercises

1. Read through the `sklearn` documentation for the KNN classifier, and try messing around with a few of the hyperparameters (e.g. the value of k). See what happens to test accuracy if you make k really small (1), and if you make k really large (the size of the training dataset).
2. Using what you know so far, look up and build the following classifiers over the titanic dataset: (a) Naive Bayes, (b) Decision Tree, (c) Random Forest, (d) Support Vector Machines.
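
As a starting point for exercise 1, here is one possible way to sweep over k. It is only a sketch: it uses synthetic data from `make_classification` as a stand-in for the Titanic features (an assumption), and it scores on a held-out split so that the effect of k on unseen records is visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the Titanic features (an assumption, not the real data)
X, y = make_classification(n_samples=400, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

results = {}
# from very small k (1) up to very large k (the size of the training set)
for k in [1, 5, 25, len(X_tr)]:
    clf = KNeighborsClassifier(n_neighbors=k)
    clf.fit(X_tr, y_tr)
    results[k] = (clf.score(X_tr, y_tr), clf.score(X_te, y_te))
    print(f"k={k:>3}  train={results[k][0]:.3f}  held-out={results[k][1]:.3f}")
```

With k=1 every training record is its own nearest neighbour, so training accuracy is perfect while held-out accuracy usually is not. With k equal to the size of the training set, every prediction is simply the majority class.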

In [68]:
from sklearn.naive_bayes import GaussianNB

clf_gnb = GaussianNB()

clf_gnb.fit(train_X, train_y)

clf_gnb.score(test_X, test_y)

0.7934904601571269

In [69]:
from sklearn.ensemble import RandomForestClassifier

clf_rfc = RandomForestClassifier()

clf_rfc.fit(train_X, train_y)

clf_rfc.score(test_X, test_y)

0.9753086419753086

In [70]:
from sklearn.tree import DecisionTreeClassifier

clf_dtc = DecisionTreeClassifier()

clf_dtc.fit(train_X, train_y)

clf_dtc.score(test_X, test_y)

0.9854096520763187

In [71]:
from sklearn.svm import SVC

clf_svc = SVC()

clf_svc.fit(train_X, train_y)

clf_svc.score(test_X, test_y)

0.9180695847362514
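
The scores in the cells above are all computed on records the classifiers were fitted on, which flatters models that can memorise their training data (decision trees and random forests in particular). A fairer comparison scores each classifier on records it was not fitted on. The sketch below does this with 5-fold cross-validation via `cross_val_score`; again it uses synthetic data as a stand-in (an assumption), and the dictionary names are made up.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in for the Titanic features (an assumption, not the real data)
X, y = make_classification(n_samples=500, n_features=8, random_state=2)

classifiers = {
    "KNN": KNeighborsClassifier(),
    "Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "SVM": SVC(),
}

cv_scores = {}
for name, clf in classifiers.items():
    # 5-fold cross-validation: each fold is scored on records held out from fitting
    scores = cross_val_score(clf, X, y, cv=5)
    cv_scores[name] = scores.mean()
    print(f"{name:<14} mean accuracy = {cv_scores[name]:.3f}")
```

To run this on the workshop data, swap the synthetic `X` and `y` for `train_X` and `train_y`.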