## Data Science Society Workshop 8 - Support Vector Machine (Solutions)

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import svm
from sklearn.model_selection import train_test_split

#### 1) pH Recognition
The "ph-recognition.csv" data set consists of three numerical colour columns indicating how strongly each colour features in a particular pH measurement. The "label" column gives the resulting pH value for that set of colours. There is also a "test?" column that indicates whether a row is allocated for testing: the entry is `True` if so and `False` if the row is intended for the training set. Split the data into training and test sets as specified by the "test?" column. Using a support vector machine, achieve a score of at least $0.7$ with `clf.score`. Hint: think about which arguments can be varied and experiment with them.

In [None]:
## SOLUTION CODE CELL
# Generating data frame from csv
df = pd.read_csv("ph-recognition.csv")
df

In [None]:
## SOLUTION CODE CELL
# Boolean masks selecting the rows allocated to each set
train_mask = df["test?"] == False
test_mask = df["test?"] == True

# Feature columns: everything except the target and the split flag
features = df.drop(["label", "test?"], axis=1)

# Training data
X_train = features.loc[train_mask]
y_train = df.loc[train_mask, "label"]

# Test data
X_test = features.loc[test_mask]
y_test = df.loc[test_mask, "label"]

In [None]:
## SOLUTION CODE CELL
# Selecting a support vector machine with a linear kernel as the classifier
clf = svm.SVC(gamma="auto",kernel="linear")

# Training the classifier
clf.fit(X_train,y_train)

# Scoring the classifier
clf.score(X_test,y_test)
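
The hint in the exercise suggests varying the arguments of `svm.SVC`. The following is a minimal sketch of one way to do that, looping over a few values of `kernel` and `C`. It runs on synthetic data from `make_classification` as a stand-in (an assumption, not the pH data), and the names `X_demo`, `y_demo`, `scores` and `best_params` are hypothetical.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic three-feature, three-class stand-in (assumption: NOT the workshop data)
X_demo, y_demo = make_classification(
    n_samples=300, n_features=3, n_informative=3, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0
)

# Fit one classifier per (kernel, C) pair and record its score
scores = {}
for kernel in ["linear", "rbf", "poly"]:
    for C in [0.1, 1, 10]:
        model = svm.SVC(kernel=kernel, C=C, gamma="auto")
        model.fit(X_tr, y_tr)
        scores[(kernel, C)] = model.score(X_te, y_te)

# Report the best-scoring combination
best_params = max(scores, key=scores.get)
print(best_params, scores[best_params])
```

Picking arguments by their test score, as done here for brevity, gives an optimistic estimate; in practice a separate validation split or cross-validation would normally be used for the comparison.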

#### 2) Travel Insurance Claims
The "travel-insurance.zip" file contains information on travel insurance claims. Considering only the numerical columns, use a support vector machine to build a model that predicts whether someone makes a claim on their travel insurance. Use `train_test_split` to split the data into training and test sets, then score your model. Training on this data set may take a while, but the only argument you should need in `svm.SVC` is `gamma='auto'`.

In [None]:
## SOLUTION CODE CELL
# Generating data frame 
df = pd.read_csv("travel-insurance.zip")
df

In [None]:
## SOLUTION CODE CELL
# Independent variables
X = df[["Duration","Age","Net Sales","Commision (in value)"]]

# Dependent variable
y = df["Claim"]

# Splitting into test and training data
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.2)

In [None]:
## SOLUTION CODE CELL
# Selecting a support vector machine for the classifier
clf = svm.SVC(gamma="auto")

# Training
clf.fit(X_train,y_train)

# Scoring
clf.score(X_test,y_test)

Your model should return a seemingly solid score somewhere in the high $0.9$s. However, consider why this score might not reflect your model's ability to make predictions. Hint: consider the entries in the "Claim" column.

In [None]:
## SOLUTION CODE CELL 
# Number of "Yes" claims over total number of claim entries
len(df.loc[df["Claim"] == "Yes"])/len(df)

Only a tiny fraction of the "Claim" column consists of "Yes" entries. This means that our classifier has only really been exposed to "No" claims to any significant degree, so it is only natural that it would guess "No". Since the column consists mainly of "No" entries anyway, such a guess is very likely to be correct. The high score therefore reflects the class imbalance rather than any real predictive ability, and the model tells us little about this data set.
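
To make the point above concrete, here is a minimal sketch of a "classifier" that always guesses the majority class. The label counts are hypothetical (985 "No" and 15 "Yes", chosen for illustration and not taken from the data set), as are the names `labels`, `predictions` and `recall_yes`.

```python
from collections import Counter

# Hypothetical labels: illustrative counts, NOT taken from travel-insurance.zip
labels = ["No"] * 985 + ["Yes"] * 15

# A baseline that ignores the features and always predicts the majority class
majority_label, majority_count = Counter(labels).most_common(1)[0]
predictions = [majority_label] * len(labels)

# Accuracy: fraction of all rows guessed correctly
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)

# Recall for "Yes": fraction of the actual "Yes" rows that were found
yes_total = labels.count("Yes")
yes_found = sum(p == "Yes" and t == "Yes" for p, t in zip(predictions, labels))
recall_yes = yes_found / yes_total

print(accuracy, recall_yes)
```

With these counts the baseline is correct on every "No" row and wrong on every "Yes" row, so its accuracy is high even though it never identifies a single claim. Comparing a model's score against such a baseline is one way to judge whether the model has learned anything.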

### References
[1] ph-recognition.csv: https://www.kaggle.com/robjan/ph-recognition

[2] travel-insurance.zip: https://www.kaggle.com/mhdzahier/travel-insurance