# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has seen a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if the quality perceived by human tasters could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process is more controlled.

Two datasets are available: one on red wine, with 1599 varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the volatility of wine tasters.

In [100]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

plt.style.use('bmh')

In [101]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [102]:
data = data_w.assign(type = 'white')

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 | white |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 | white |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 | white |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 | white |


# Exercise 6.1

Show the frequency table of quality by type of wine.

In [103]:
# Frequency by quality ranking
frec_tab = pd.pivot_table(data, values='alcohol', index=['quality'], columns=['type'], aggfunc=len, margins=True)
frec_tab

| quality | red | white | All |
|---|---|---|---|
| 3 | 10.0 | 20.0 | 30.0 |
| 4 | 53.0 | 163.0 | 216.0 |
| 5 | 681.0 | 1457.0 | 2138.0 |
| 6 | 638.0 | 2198.0 | 2836.0 |
| 7 | 199.0 | 880.0 | 1079.0 |
| 8 | 18.0 | 175.0 | 193.0 |
| 9 | NaN | 5.0 | 5.0 |
| All | 1599.0 | 4898.0 | 6497.0 |
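
As an alternative to `pivot_table` with `aggfunc=len`, the same kind of frequency table can be sketched with `pd.crosstab`. The toy frame below is made up for illustration and only mimics the `quality` and `type` columns; it is not the wine data.

```python
import pandas as pd

# Toy stand-in for the combined wine table (values are made up for illustration).
toy = pd.DataFrame({
    "quality": [5, 6, 6, 7, 5, 6],
    "type": ["red", "red", "white", "white", "white", "white"],
})

# Count rows per (quality, type) pair; margins=True adds the "All" totals.
freq = pd.crosstab(toy["quality"], toy["type"], margins=True)
print(freq)
```

With `crosstab`, combinations that never occur are reported as 0 rather than NaN, so the counts stay integers.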


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs, one for the white wines and one for the red wines.


In [104]:
# Standardize the features (quality is left untouched)
from sklearn.preprocessing import StandardScaler
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']
# Each wine type is scaled separately, using its own means and standard deviations
data_w[features] = StandardScaler().fit_transform(data_w[features])
data_r[features] = StandardScaler().fit_transform(data_r[features])
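
As a minimal sketch of what the standardization step computes, the z-score can be written out by hand with NumPy. The small array below is made up for illustration; the column mean is subtracted and the result is divided by the population standard deviation, which is what `StandardScaler` does by default.

```python
import numpy as np

# Two made-up feature columns on very different scales.
X = np.array([[7.0, 0.27], [6.3, 0.30], [8.1, 0.28], [7.2, 0.23]])

# StandardScaler-style z-score: subtract the column mean and divide by the
# population standard deviation (ddof=0).
mean = X.mean(axis=0)
std = X.std(axis=0)
X_std = (X - mean) / std

print(X_std.mean(axis=0))  # approximately 0 for each column
print(X_std.std(axis=0))   # approximately 1 for each column
```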


In [105]:
# Create a new binary variable: good wine (quality > 6) vs. the rest
data_w['quality_bool'] = data_w['quality'] > 6
data_r['quality_bool'] = data_r['quality'] > 6
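
The threshold `quality > 6` makes the "good" class a minority in both datasets. The sketch below works this out from the counts in the frequency table of Exercise 6.1 and derives the accuracy of a trivial model that always predicts "not good". The dictionary names are illustrative, not part of the notebook.

```python
# Counts of wines with quality > 6, read off the frequency table in Exercise 6.1.
good = {"red": 199 + 18, "white": 880 + 175 + 5}
total = {"red": 1599, "white": 4898}

baseline_acc = {}
for wine in ("red", "white"):
    share_good = good[wine] / total[wine]
    # Accuracy of a trivial model that always predicts "not good".
    baseline_acc[wine] = 1 - share_good
    print(f"{wine}: share good = {share_good:.3f}, "
          f"baseline accuracy = {baseline_acc[wine]:.3f}")
```

The baselines come out at roughly 0.86 for red and 0.78 for white, computed on the full datasets rather than the training splits. Accuracy figures for the classifiers are best read against these values.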

In [7]:
# Create X and y for white wine
X_w = data_w.drop(['quality', 'quality_bool'], axis = 1)
y_w = data_w['quality_bool']

# Encode the dependent variable (False/True -> 0/1)
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y_w = labelencoder_y.fit_transform(y_w)

In [8]:
# Linear SVM for white wine
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X_w, y_w)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [9]:
# Create X and y for red wine
X_r = data_r.drop(['quality', 'quality_bool'], axis = 1)
y_r = data_r['quality_bool']

# Encode the dependent variable (False/True -> 0/1)
y_r = labelencoder_y.fit_transform(y_r)

In [10]:
# Linear SVM for red wine (note: this reuses the name clf from the white-wine model)
from sklearn.svm import SVC # "Support Vector Classifier"
clf = SVC(kernel='linear')
clf.fit(X_r, y_r)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Exercise 6.3

Test the two SVMs using the different kernels ('poly', 'rbf', 'sigmoid').


In [11]:
# Create training and test sets
# (sklearn.cross_validation was removed; train_test_split lives in sklearn.model_selection)
from sklearn.model_selection import train_test_split
X_w_train, X_w_test, y_w_train, y_w_test = train_test_split(X_w, y_w, test_size = 0.2, random_state = 0)
X_r_train, X_r_test, y_r_train, y_r_test = train_test_split(X_r, y_r, test_size = 0.2, random_state = 0)
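
The features above were standardized on each full dataset before splitting, so the scaling statistics also include the test rows. One common alternative, sketched here under assumptions, is to put the scaler inside a pipeline so it is refit on the training part of every fold. The data is synthetic (`make_classification`) and the names `X_demo`, `y_demo` and `scores` are illustrative; this is not the notebook's own procedure.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features (11 columns, imbalanced binary target).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=11, weights=[0.8, 0.2], random_state=0
)

scores = {}
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    # The scaler is refit on the training part of every fold, so the
    # held-out fold never influences the scaling statistics.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores[kernel] = cross_val_score(model, X_demo, y_demo, cv=5)
    print(f"{kernel}: mean={scores[kernel].mean():.3f}, std={scores[kernel].std():.3f}")
```

Looping over the kernel names also avoids repeating one cell per kernel and keeps the evaluation settings identical across them.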



In [12]:
# RBF SVM for white wine
clf1 = SVC(kernel='rbf')
clf1.fit(X_w_train, y_w_train)

from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = clf1, X=X_w_train, y=y_w_train, cv=10)
accuracies.mean()

0.8251882365956671

In [13]:
accuracies.std()

0.0138955722865463

In [14]:
# Polynomial SVM for white wine
clf3 = SVC(kernel='poly')
clf3.fit(X_w_train, y_w_train)

accuracies = cross_val_score(estimator = clf3, X=X_w_train, y=y_w_train, cv=10)
accuracies.mean()

0.8055250776048887

In [15]:
accuracies.std()

0.01118837079961567

In [16]:
# Sigmoid SVM for white wine
clf5 = SVC(kernel='sigmoid')
clf5.fit(X_w_train, y_w_train)

accuracies = cross_val_score(estimator = clf5, X=X_w_train, y=y_w_train, cv=10)
accuracies.mean()

0.7378425515576293

In [17]:
accuracies.std()

0.021865821838019096

For the white wine, the best SVM is the one with the radial basis function (RBF) kernel: its mean cross-validated accuracy (0.825) is higher than that of the polynomial (0.806) and sigmoid (0.738) kernels.

In [18]:
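# RBF SVM for red wine
clf2 = SVC(kernel='rbf')
clf2.fit(X_r_train, y_r_train)

# Cross-validate on the red-wine split; the score recorded below was produced
# with the white-wine split (X_w_train, y_w_train) and should be regenerated.
accuracies = cross_val_score(estimator = clf2, X=X_r_train, y=y_r_train, cv=10)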
accuracies.mean()

0.8251882365956671

In [19]:
accuracies.std()

0.0138955722865463

In [20]:
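# Polynomial SVM for red wine
clf4 = SVC(kernel='poly')
clf4.fit(X_r_train, y_r_train)

# Cross-validate on the red-wine split; the score recorded below was produced
# with the white-wine split (X_w_train, y_w_train) and should be regenerated.
accuracies = cross_val_score(estimator = clf4, X=X_r_train, y=y_r_train, cv=10)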
accuracies.mean()

0.8055250776048887

In [21]:
accuracies.std()

0.01118837079961567

In [22]:
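# Sigmoid SVM for red wine
clf6 = SVC(kernel='sigmoid')
clf6.fit(X_r_train, y_r_train)

# Cross-validate on the red-wine split; the score recorded below was produced
# with the white-wine split (X_w_train, y_w_train) and should be regenerated.
accuracies = cross_val_score(estimator = clf6, X=X_r_train, y=y_r_train, cv=10)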
accuracies.mean()

0.7378425515576293

In [23]:
accuracies.std()

0.021865821838019096

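The scores recorded for the three red-wine kernels are identical, digit for digit, to the white-wine scores, which indicates that they were computed on the white-wine training split rather than the red one. They therefore do not show which kernel is best for red wine. The RBF kernel is carried forward to Exercise 6.4 for red wine as well, but that choice still needs to be confirmed by cross-validating on the red-wine data.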

# Exercise 6.4
Using the best SVM, find the parameters that give the best performance:

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [24]:
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [0.1, 1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]

grid_search = GridSearchCV(estimator = clf1,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,)

grid_search.fit(X_w_train, y_w_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

In [25]:
best_accuracy

0.8264420622766717

In [26]:
best_parameters

{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}

In [27]:
from sklearn.model_selection import GridSearchCV
parameters = [{'C': [0.1, 1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]

grid_search = GridSearchCV(estimator = clf2,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 10,)

grid_search.fit(X_r_train, y_r_train)
best_accuracy = grid_search.best_score_
best_parameters = grid_search.best_params_

In [28]:
best_accuracy

0.8702111024237685

In [29]:
best_parameters

{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}
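
In both searches the selected values, C=1000 and gamma=0.01, are the largest candidates in their lists, which hints that the optimum may lie outside the grid. The sketch below shows one way to flag this automatically. It runs on synthetic data, and the names `X_demo`, `y_demo`, `search` and `at_edge` are illustrative assumptions, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the wine features.
X_demo, y_demo = make_classification(n_samples=200, n_features=11, random_state=0)

# Same grid as the exercise.
param_grid = {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.01, 0.001, 0.0001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=5)
search.fit(X_demo, y_demo)

# Flag any selected value that is the smallest or largest candidate for its parameter.
at_edge = {
    name: search.best_params_[name] in (min(values), max(values))
    for name, values in param_grid.items()
}
print(search.best_params_, at_edge)
```

When a parameter is flagged, extending its list in that direction (for example, larger `gamma` values) and rerunning the search is a reasonable next step.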

# Exercise 6.5

Compare the results with other methods

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [85]:
# Standardize the data
features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']
data[features] = StandardScaler().fit_transform(data[features])
data.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.166089 | -0.423183 | 0.284686 | 3.206929 | -0.314975 | 0.815565 | 0.959976 | 2.102214 | -1.359049 | -0.546178 | -1.418558 | 6 | white |
| 1 | -0.706073 | -0.240949 | 0.147046 | -0.807837 | -0.20079 | -0.931107 | 0.287618 | -0.232332 | 0.506915 | -0.277351 | -0.831615 | 6 | white |
| 2 | 0.682458 | -0.362438 | 0.559966 | 0.306208 | -0.172244 | -0.029599 | -0.33166 | 0.134525 | 0.25812 | -0.613385 | -0.328521 | 6 | white |
| 3 | -0.011808 | -0.666161 | 0.009406 | 0.642523 | 0.056126 | 0.928254 | 1.243074 | 0.301278 | -0.177272 | -0.882212 | -0.496219 | 6 | white |
| 4 | -0.011808 | -0.666161 | 0.009406 | 0.642523 | 0.056126 | 0.928254 | 1.243074 | 0.301278 | -0.177272 | -0.882212 | -0.496219 | 6 | white |


In [40]:
# Create the feature matrix and the target
X = data.drop(['type', 'quality'], axis=1)
y = data['quality']

# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [54]:
# Linear regression
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)
y_pred = linreg.predict(X_test)

In [42]:
# Examine the coefficients
print(linreg.coef_)

[ 0.0952226  -0.2160049  -0.02552556  0.21019944 -0.02190581  0.11621053
 -0.15349144 -0.15068699  0.06105857  0.11338959  0.32668911]


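Because the features are standardized, each coefficient is the expected change in quality for an increase of one standard deviation in that feature, holding the others fixed. The coefficients for volatile acidity, citric acid, chlorides, total sulfur dioxide and density are negative, so higher values of these variables are associated with lower quality. The remaining coefficients are positive. Alcohol (0.33) has the largest effect in absolute value, followed by volatile acidity (-0.22) and residual sugar (0.21).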

In [45]:
# Compute the RMSE
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7302497171614459


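An RMSE of about 0.73 means that predictions typically miss the observed score by roughly three quarters of a quality point, on a scale where most wines score between 5 and 7. For reference, the Lasso fits in Exercise 6.8 whose coefficients are all zero, and which therefore predict a constant, reach an RMSE of 0.873. The linear regression improves on a constant prediction only moderately, so it is not very accurate at predicting the response.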
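
To make the RMSE and the constant-prediction baseline concrete, here is a minimal NumPy sketch. The four scores and predictions are made up for illustration, and `rmse` is a hypothetical helper, not a function from the notebook.

```python
import numpy as np

y_true = np.array([5.0, 6.0, 7.0, 6.0])   # made-up quality scores
y_hat = np.array([5.5, 6.0, 6.0, 6.5])    # made-up model predictions

def rmse(actual, predicted):
    """Root mean squared error between two equal-length arrays."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

model_rmse = rmse(y_true, y_hat)
# Baseline: always predict the mean of the observed scores.
baseline_rmse = rmse(y_true, np.full_like(y_true, y_true.mean()))
print(model_rmse, baseline_rmse)
```

Comparing a model's RMSE with this baseline shows how much of the error the features actually remove.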

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [61]:
# Ridge regression with alpha=0.1
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)

In [62]:
# Examine the coefficients
print(ridgereg.coef_)

[ 0.04442921 -0.18888647 -0.00473048  0.12993436 -0.0373843   0.0890757
 -0.11322464 -0.08281075  0.03364932  0.09710094  0.31446033]


In [63]:
# Compute the RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7300914997347875


In [64]:
# Ridge regression with alpha=1
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)

In [65]:
# Examine the coefficients
print(ridgereg.coef_)

[ 0.00232446 -0.09359023  0.02291247  0.0291213  -0.04430926  0.03030802
 -0.03649687 -0.06493969  0.01107645  0.04543449  0.16629053]


In [66]:
# Compute the RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7631900081566193


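Compared with the linear regression (RMSE 0.7302), the ridge regression with alpha=0.1 gives practically the same RMSE (0.7301): with alpha this close to zero, the penalty on the coefficients is small and they shrink only slightly. With alpha=1 the coefficients shrink much more (the alcohol coefficient drops from 0.33 to 0.17) and the RMSE rises to 0.763, so the stronger penalty costs accuracy here.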
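
The shrinkage can be illustrated with the closed-form ridge solution on centered data. This is a sketch on synthetic data, with `ridge_coef` as a hypothetical helper. Because the notebook's `normalize=True` rescales the columns first, its alpha values are not on the same scale as the ones used here.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

# Center so the intercept drops out and only the slopes are penalized.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

def ridge_coef(alpha):
    """Closed-form ridge solution (X'X + alpha*I)^-1 X'y on centered data."""
    n_features = Xc.shape[1]
    return np.linalg.solve(Xc.T @ Xc + alpha * np.eye(n_features), Xc.T @ yc)

# Euclidean norm of the coefficient vector for increasing penalties.
norms = {alpha: float(np.linalg.norm(ridge_coef(alpha))) for alpha in (0.0, 0.1, 1.0, 100.0)}
print(norms)
```

With alpha=0 the solution is ordinary least squares; the norm of the coefficient vector then decreases as alpha grows.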

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [67]:
# Lasso regression with alpha=0.001 (the value actually fitted; the exercise lists 0.01)
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.001, normalize=True)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)

In [68]:
# Examine the coefficients
print(lassoreg.coef_)

[ 0.         -0.14038323  0.          0.         -0.          0.
 -0.         -0.          0.          0.          0.30910432]


In [69]:
# Compute the RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7480660549600331


In [70]:
# Lasso regression with alpha=0.01 (the value actually fitted; the exercise lists 0.1)
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)

In [71]:
# Examine the coefficients
print(lassoreg.coef_)

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]


In [72]:
# Compute the RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.8732538929597913


In [73]:
# Lasso regression with alpha=1
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=1, normalize=True)
lassoreg.fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)

In [74]:
# Examine the coefficients
print(lassoreg.coef_)

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]


In [75]:
# Compute the RMSE
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.8732538929597913
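
Unlike ridge, the Lasso sets coefficients exactly to zero, as the outputs above show. A minimal way to see why is the soft-thresholding operator, which gives the Lasso solution in the idealized case of an orthonormal design (an assumption that does not hold exactly for the wine features). The coefficients below are made up and `soft_threshold` is a hypothetical helper.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward zero by t, clipping at zero."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

# Made-up least-squares coefficients for an orthonormal design.
ols_coef = np.array([0.30, -0.05, 0.02, -0.20])

for t in (0.01, 0.1, 1.0):
    print(t, soft_threshold(ols_coef, t))
```

Any coefficient whose magnitude is below the threshold becomes exactly zero, and a large enough threshold zeroes all of them, which matches the all-zero fits recorded above.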


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the F1 score

In [106]:
# Create the binary target: good wine (quality > 6) vs. the rest
data['quality_b'] = data['quality'] > 6

In [114]:
# Create X and y
X = data.drop(['quality', 'quality_b', 'type'], axis = 1)
y = data['quality_b']

# Encode the dependent variable (False/True -> 0/1)
from sklearn.preprocessing import LabelEncoder
labelencoder_y = LabelEncoder()
y = labelencoder_y.fit_transform(y)

In [115]:
# Split into train and test sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [119]:
# Logistic regression (C=1e9 makes the regularization negligible)
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1e9,solver='liblinear')
logreg.fit(X_train, y_train)
y_pred_prob = logreg.predict_proba(X_test)

In [120]:
# Examine the coefficients
print(logreg.coef_)

[[ 1.81722472e-01 -3.89520099e+00 -5.44731715e-01  6.55185384e-02
  -1.28273707e+01  1.16238442e-02 -5.16513990e-03 -7.99817307e+00
   1.14422611e+00  1.74186577e+00  8.64842049e-01]]


In [None]:
# Compute the log loss
print(metrics.log_loss(y_test, y_pred_prob))
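
For reference, the binary log loss is the mean negative log-likelihood of the true labels under the predicted probabilities. The sketch below spells it out in plain Python; the labels and probabilities are made up, and `binary_log_loss` is a hypothetical helper rather than the scikit-learn function.

```python
import math

def binary_log_loss(y_true, p_pos):
    """Mean negative log-likelihood for binary labels and P(y=1) predictions."""
    total = 0.0
    for y, p in zip(y_true, p_pos):
        total += -math.log(p) if y == 1 else -math.log(1.0 - p)
    return total / len(y_true)

labels = [1, 0, 1]        # made-up labels
probs = [0.9, 0.2, 0.6]   # made-up predicted P(y=1)
loss = binary_log_loss(labels, probs)
print(loss)
```

Confident correct predictions contribute little to the loss, while confident wrong ones are penalized heavily.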

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the F1 score

In [139]:
# standardize X_train and X_test
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import f1_score

scaler = StandardScaler()
X_train = X_train.astype(float)
X_test = X_test.astype(float)
scaler.fit(X_train)
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

In [140]:
# L1 penalty; the literal 00.1 parses as 0.1, so this cell does not fit the C=0.01 the exercise asks for
logreg = LogisticRegression(C=00.1, penalty='l1',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.29521198 -0.5166364  -0.01055572  0.45783075 -0.27241609  0.14192268
  -0.23178275 -0.36771589  0.22199615  0.27234296  0.8365272 ]]


In [141]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.39421013970280827


In [145]:
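# Predict class labels with the current logistic model, then score them
y_pred = logreg.predict(X_test_scaled)
f1_score(y_test, y_pred)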
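
The F1 score needs hard class labels rather than probabilities. It is the harmonic mean of precision and recall, which the following plain-Python sketch computes by hand. The label lists are made up and `f1_from_labels` is a hypothetical helper.

```python
def f1_from_labels(y_true, y_pred):
    """F1 score for binary 0/1 labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

truth = [1, 0, 1, 1, 0, 0]       # made-up true labels
predicted = [1, 0, 0, 1, 1, 0]   # made-up predicted labels
f1 = f1_from_labels(truth, predicted)
print(f1)
```

Here there are 2 true positives, 1 false positive and 1 false negative, so precision and recall are both 2/3 and so is the F1 score.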

In [128]:
# try C=0.1 with L1 penalty
logreg = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.2946642  -0.51679608 -0.01055628  0.45704594 -0.27266539  0.14193199
  -0.23169959 -0.3664797   0.22166753  0.2722076   0.83708557]]


In [129]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.3942136457931717


In [130]:
# try C=1 with L1 penalty
logreg = LogisticRegression(C=1, penalty='l1',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.56089343 -0.53371922 -0.06463637  0.79460695 -0.25917143  0.19339027
  -0.33295877 -0.83926552  0.38025904  0.34412118  0.67650924]]


In [131]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.3926696484976661


In [132]:
# L2 penalty; the literal 00.1 parses as 0.1, so this cell does not fit the C=0.01 the exercise asks for
logreg = LogisticRegression(C=00.1, penalty='l2',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.4375043  -0.51431614 -0.04996592  0.62614408 -0.28064981  0.18325519
  -0.30025829 -0.60993186  0.30447318  0.30950057  0.73280542]]


In [133]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.3934256957128285


In [134]:
# try C=0.1 with L2 penalty
logreg = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.4375043  -0.51431614 -0.04996592  0.62614408 -0.28064981  0.18325519
  -0.30025829 -0.60993186  0.30447318  0.30950057  0.73280542]]


In [135]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.3934256957128285


In [136]:
# try C=1 with L2 penalty
logreg = LogisticRegression(C=1, penalty='l2',solver='liblinear')
logreg.fit(X_train_scaled, y_train)
print(logreg.coef_)

[[ 0.56984526 -0.53471561 -0.06863037  0.80379798 -0.26259165  0.19763252
  -0.33905249 -0.85146694  0.38523459  0.34658632  0.67127941]]


In [137]:
# generate predicted probabilities and calculate log loss
y_pred_prob = logreg.predict_proba(X_test_scaled)
print(metrics.log_loss(y_test, y_pred_prob))

0.392698671740224
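
The six cells above can be condensed into one loop that also records the F1 score the exercise asks for. This is a sketch on synthetic data, not a rerun of the notebook. The names `X_demo`, `y_demo` and `results` are illustrative, and `C=0.01` is written out explicitly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the wine features (11 columns, imbalanced binary target).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=11, weights=[0.8, 0.2], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=1, stratify=y_demo
)

# Fit the scaler on the training rows only, then apply it to both parts.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

results = []
for penalty in ("l1", "l2"):
    for C in (0.01, 0.1, 1.0):
        clf = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        clf.fit(X_tr_s, y_tr)
        f1 = f1_score(y_te, clf.predict(X_te_s), zero_division=0)
        n_zero = int((clf.coef_ == 0).sum())  # coefficients driven exactly to zero
        results.append((penalty, C, f1, n_zero))
        print(f"penalty={penalty}, C={C}: F1={f1:.3f}, zero coefficients={n_zero}")
```

Counting the zero coefficients gives a quick view of how the L1 penalty thins out the model as C decreases, while the L2 penalty typically shrinks coefficients without zeroing them.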
