In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy import stats
import seaborn as sns; sns.set()

In [2]:
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV

## Support Vector Machines (SVM)

## Regression


Predicting the value of a continuous-valued variable

Classification: categorical label
Regression: numeric label

## Dataset

In [3]:
tips = sns.load_dataset('tips')
tips.head()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
0,16.99,1.01,Female,No,Sun,Dinner,2
1,10.34,1.66,Male,No,Sun,Dinner,3
2,21.01,3.5,Male,No,Sun,Dinner,3
3,23.68,3.31,Male,No,Sun,Dinner,2
4,24.59,3.61,Female,No,Sun,Dinner,4


## Features

## Exercise 1
Before we create and train any models, we need vectorized data, split into training and test sets.

Run the data through get_dummies to capture the vectorized data in X
Pop out the tip column (as the target variable) into y

In [4]:
X = pd.get_dummies(tips, drop_first=True)
X

Unnamed: 0,total_bill,tip,size,sex_Female,smoker_No,day_Fri,day_Sat,day_Sun,time_Dinner
0,16.99,1.01,2,1,1,0,0,1,1
1,10.34,1.66,3,0,1,0,0,1,1
2,21.01,3.50,3,0,1,0,0,1,1
3,23.68,3.31,2,0,1,0,0,1,1
4,24.59,3.61,4,1,1,0,0,1,1
...,...,...,...,...,...,...,...,...,...
239,29.03,5.92,3,0,1,0,1,0,1
240,27.18,2.00,2,1,0,0,1,0,1
241,22.67,2.00,2,0,0,0,1,0,1
242,17.82,1.75,2,0,1,0,1,0,1


In [5]:
y = X.pop('tip')
y

0      1.01
1      1.66
2      3.50
3      3.31
4      3.61
       ... 
239    5.92
240    2.00
241    2.00
242    1.75
243    3.00
Name: tip, Length: 244, dtype: float64
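
The encoding and target-extraction steps above can be illustrated on a tiny hand-made frame. This is only a sketch: the `toy` frame and its values are made up for illustration, and `dtype=int` is passed so the indicator columns show as 0/1 regardless of the pandas version.

```python
import pandas as pd

# Hypothetical three-row frame, loosely modelled on the tips columns
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "time": ["Dinner", "Lunch", "Dinner"],
})

# One indicator column per category; drop_first removes the first
# category ("Dinner" here) so the indicators are not redundant
encoded = pd.get_dummies(toy, drop_first=True, dtype=int)

# pop removes the column from the frame and returns it as a Series
target = encoded.pop("tip")

print(encoded.columns.tolist())
print(target.tolist())
```

After `pop`, the feature frame no longer contains the target, which is what keeps `tip` from leaking into X.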

In [6]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123)

## Exercise 2
We start by creating a default SVR model and checking its score

Create an SVR model with default param values
Fit the training data to this model
Score this model with the test data
How does the model perform?

In [7]:
svr = SVR()

# Fit svr
svr.fit(X_train, y_train)
# Score svr
svr.score(X_test, y_test)

0.35430198765682874
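
To put a score like the one above in context, it can help to compare against a trivial baseline. The sketch below is an assumption-laden illustration, not part of the original exercise: it uses synthetic data from `make_regression` instead of the tips data, and scikit-learn's `DummyRegressor`, which always predicts the training mean. For regressors, `score` returns the coefficient of determination, so a mean-only predictor can never score above 0 on the test set.

```python
from sklearn.datasets import make_regression
from sklearn.dummy import DummyRegressor
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the tips dataset)
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=123)

# Baseline that ignores the features and predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_tr, y_tr)

# Default SVR, as in the exercise
svr_demo = SVR().fit(X_tr, y_tr)

baseline_r2 = baseline.score(X_te, y_te)
svr_r2 = svr_demo.score(X_te, y_te)
print("baseline R^2:", baseline_r2)
print("default SVR R^2:", svr_r2)
```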

## Regression Performance Metrics


Sum of Squares Error (SSE)

$$SSE = \sum (\text{Predicted} - \text{Actual})^2$$

Mean Square Error (MSE)

$$MSE = \frac{SSE}{\text{NumPoints}}$$

Root Mean Square Error (RMSE)

$$RMSE = \sqrt{MSE}$$

Coefficient of Determination ($R^2$)

$$R^2 = 1 - \frac{SSE}{SST}$$

Total Sum of Squares Error (SST)

$$SST = \sum (\text{Actual} - \text{MeanActual})^2$$
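
The definitions above can be checked numerically. The following sketch computes each metric with NumPy on four made-up values (the `actual` and `predicted` arrays are purely illustrative) and compares the results with scikit-learn's `mean_squared_error` and `r2_score`. The coefficient of determination is also what `score` returns for scikit-learn regressors such as SVR.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up values for illustration only
actual = np.array([3.0, 2.0, 4.0, 5.0])
predicted = np.array([2.5, 2.0, 4.5, 4.0])

sse = np.sum((predicted - actual) ** 2)       # sum of squared errors
mse = sse / len(actual)                       # mean squared error
rmse = np.sqrt(mse)                           # root mean squared error
sst = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
r2 = 1 - sse / sst                            # coefficient of determination

print(sse, mse, rmse, sst, r2)

# The hand-computed values should agree with scikit-learn's helpers
print(np.isclose(mse, mean_squared_error(actual, predicted)))
print(np.isclose(r2, r2_score(actual, predicted)))
```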

## Exercise 3
Just like in the SVC exercises, we'll use GridSearchCV to find the best performing model. We start with an rbf kernel.

Create a model with an rbf kernel
Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 0.1, 1, 10, 100 |
| gamma | scale, auto |
| epsilon | 0 to 5 in steps of 0.5 |

Fit the training data to the GridSearch object to get the best estimator
Capture the best estimator determined by the GridSearch object into svr_rbf and score it with the test data.

In [9]:
gs = GridSearchCV(SVR(kernel='rbf'), dict(C = [0.1, 1, 10, 100], gamma = ['scale', 'auto'], epsilon = np.arange(0, 5, 0.5)))

In [10]:
# fit gs
gs.fit(X_train, y_train)


In [11]:
svr_rbf = gs.best_estimator_

# score svr_rbf
svr_rbf.score(X_test, y_test)

0.45846196881581225
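
Besides `best_estimator_`, a fitted GridSearchCV object exposes which parameter combination won and its mean cross-validated score. The sketch below shows this on synthetic data from `make_regression` (an assumption, chosen so the block runs without the tips data); the `_demo` names are illustrative. With default settings the search uses 5-fold cross-validation and the estimator's own `score`, which is R² for SVR.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the tips dataset)
X_demo, y_demo = make_regression(n_samples=120, n_features=4,
                                 noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=123)

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": ["scale", "auto"],
    "epsilon": np.arange(0, 5, 0.5),  # 0.0, 0.5, ..., 4.5 (stop value excluded)
}
gs_demo = GridSearchCV(SVR(kernel="rbf"), param_grid)
gs_demo.fit(X_tr, y_tr)

print(gs_demo.best_params_)   # winning parameter combination
print(gs_demo.best_score_)    # mean cross-validated R^2 of that combination
print(gs_demo.best_estimator_.score(X_te, y_te))  # R^2 on held-out test data
```

The cross-validated score and the test score are computed on different data, so they will generally differ.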

## Exercise 4
Repeat the process above for a 2-degree polynomial kernel

Create a model with a poly kernel with degree=2
Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 0.1, 1, 10, 100 |
| coef0 | 0, 1, ..., 9 |
| epsilon | 0 to 5 in steps of 0.5 |

Fit the training data to the GridSearch object to get the best estimator
Capture the best estimator determined by the GridSearch object into svr_poly and score it with the test data.

In [12]:
gs = GridSearchCV(SVR(kernel = 'poly', degree=2), dict(C=[0.1, 1, 10, 100], coef0 = range(0, 10), epsilon = np.arange(0, 5, 0.5)))

In [13]:
# fit gs
gs.fit(X_train, y_train)

In [14]:
svr_poly = gs.best_estimator_

# score svr_poly
svr_poly.score(X_test, y_test)

0.4206038472385356

## Exercise 5
Repeat the same process for a linear kernel

Create a model with a linear kernel
Create a GridSearchCV object with this model and the following grid of values to evaluate on:

| Param | Values |
|---|---|
| C | 0.1, 1, 10, 100 |
| epsilon | 0 to 5 in steps of 0.5 |

Fit the training data to the GridSearch object to get the best estimator
Capture the best estimator determined by the GridSearch object into svr_linear and score it with the test data.

In [15]:
gs = GridSearchCV(SVR(kernel='linear'), dict(C=[0.1, 1, 10, 100], epsilon=np.arange(0, 5, 0.5)))

In [16]:
# fit gs
gs.fit(X_train, y_train)

In [17]:
svr_linear = gs.best_estimator_

# score svr_linear
svr_linear.score(X_test, y_test)

0.5106136242784295
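
Exercises 3 to 5 run one search per kernel. GridSearchCV also accepts a list of dictionaries as its parameter grid, so the three kernels can be compared in a single search on the same cross-validation folds. The following is a minimal sketch under stated assumptions: synthetic data from `make_regression` with a rescaled target, and a deliberately smaller grid than the exercises use, to keep it quick.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVR

# Synthetic stand-in data (assumption: not the tips dataset)
X_demo, y_demo = make_regression(n_samples=80, n_features=4,
                                 noise=10.0, random_state=0)
y_demo = y_demo / y_demo.std()  # rescale the target so small epsilons are meaningful

# One dictionary per kernel; each is expanded into its own sub-grid
param_grid = [
    {"kernel": ["rbf"], "C": [0.1, 1, 10],
     "gamma": ["scale", "auto"], "epsilon": [0.0, 0.5]},
    {"kernel": ["poly"], "degree": [2], "C": [0.1, 1, 10],
     "coef0": [0, 1], "epsilon": [0.0, 0.5]},
    {"kernel": ["linear"], "C": [0.1, 1, 10], "epsilon": [0.0, 0.5]},
]

gs_all = GridSearchCV(SVR(), param_grid)
gs_all.fit(X_demo, y_demo)

print(gs_all.best_params_["kernel"])  # kernel of the best combination
print(gs_all.best_score_)             # its mean cross-validated R^2
```

This selects the kernel by cross-validated score on the training data rather than by test score, which keeps the test set untouched until the final evaluation.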

## Exercise 6
You might notice that a linear kernel seems to be performing best on this data. We will want to perform a final fine-tuning of parameters before locking them down for prediction.

Repeat the previous steps with a finer grid of values for C and epsilon:

| Param | Values |
|---|---|
| C | 0.05 to 0.5 in steps of 0.05 |
| epsilon | 0 to 1 in steps of 0.1 |

Is this our best performing regressor?

In [18]:
gs = GridSearchCV(SVR(kernel='linear'), dict(C=np.arange(0.05, 0.5, 0.05), epsilon=np.arange(0, 1, 0.1)))

In [19]:
# fit gs
gs.fit(X_train, y_train)

In [20]:
svr_linear = gs.best_estimator_

# score svr_linear
svr_linear.score(X_test, y_test)


0.5181822170203547

## Linear Regression

In [21]:
from sklearn.linear_model import LinearRegression

## Exercise 7
Since the data appears to be linear in nature, we'll try the regression task with a LinearRegression model this time

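Create a LinearRegression model (the solution below uses the defaults, which do not rescale the features; ordinary least squares predictions are unaffected by feature scaling)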
Fit the training data to this model
Score the model on test data
Does this perform better or worse than our best SVR linear-kernel model?

In [23]:
lr = LinearRegression()

In [24]:
# fit lr
lr.fit(X_train, y_train)

# score lr
lr.score(X_test, y_test)

0.5122133123426889
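
The exercise mentions normalizing the data. LinearRegression once accepted a `normalize` argument, but it was removed in scikit-learn 1.2; the usual replacement is a pipeline with a scaler. The sketch below, on synthetic data (an assumption, not the tips data), suggests that for ordinary least squares with an intercept the test R² is the same with or without standardization, since rescaling features only rescales the fitted coefficients.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with features on very different scales
X_demo, y_demo = make_regression(n_samples=200, n_features=5,
                                 noise=15.0, random_state=0)
X_demo = X_demo * np.array([1.0, 10.0, 100.0, 0.1, 5.0])
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.3, random_state=123)

# Plain least squares on the raw features
plain = LinearRegression().fit(X_tr, y_tr)

# Standardize features first, then fit least squares
scaled = make_pipeline(StandardScaler(), LinearRegression()).fit(X_tr, y_tr)

r2_plain = plain.score(X_te, y_te)
r2_scaled = scaled.score(X_te, y_te)
print(r2_plain, r2_scaled)
```

Scaling does matter for SVR, whose kernels and epsilon tube depend on feature and target magnitudes, so the same pipeline pattern may be worth trying there.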