# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if human tasting quality could be related to the chemical properties of wine, so that the certification, quality assessment and quality assurance processes are more controlled.

Two datasets are available: one on red wine with 1599 varieties, and one on white wine with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines: one is Quality, based on sensory data, and the rest are chemical properties such as density, acidity and alcohol content. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median of the ranks given by the tasters.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the volatility of wine tasters.

In [18]:
import pandas as pd
import numpy as np

In [4]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [5]:
data = data_w.assign(type = 'white')

# DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent call
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
2669,6.8,0.26,0.4,7.5,0.046,45.0,179.0,0.99583,3.2,0.49,9.3,5,white
5645,8.6,0.33,0.4,2.6,0.083,16.0,68.0,0.99782,3.3,0.48,9.4,5,red
2761,6.0,0.26,0.15,1.3,0.06,51.0,154.0,0.99354,3.14,0.51,8.7,5,white
2692,6.9,0.31,0.32,1.2,0.024,20.0,166.0,0.99208,3.05,0.54,9.8,6,white
366,6.0,0.18,0.27,1.5,0.089,40.0,143.0,0.9923,3.49,0.62,10.8,6,white


# Exercise 6.1

Show the frequency table of quality by type of wine.

In [7]:
tab = pd.crosstab(data["quality"],data["type"] ,rownames=['quality'], colnames=['type'],margins=True)
print(tab)

type      red  white   All
quality                   
3          10     20    30
4          53    163   216
5         681   1457  2138
6         638   2198  2836
7         199    880  1079
8          18    175   193
9           0      5     5
All      1599   4898  6497


In [11]:
# Create a new binary quality variable (1 if quality >= 6, else 0); the data is split by wine type further below
data['quality2']= np.where(data ['quality']>=6,1,0)
data.sample(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
4793,6.0,0.31,0.32,7.4,0.175,47.0,159.0,0.9952,3.19,0.5,9.4,6,white,1
843,6.9,0.19,0.35,1.7,0.036,33.0,101.0,0.99315,3.21,0.54,10.8,7,white,1
5235,7.8,0.43,0.32,2.8,0.08,29.0,58.0,0.9974,3.31,0.64,10.3,5,red,0
1853,8.3,0.27,0.39,2.4,0.058,16.0,107.0,0.9955,3.28,0.59,10.3,5,white,0
5036,7.8,0.56,0.19,2.1,0.081,15.0,105.0,0.9962,3.33,0.54,9.5,5,red,0
5576,8.3,0.78,0.1,2.6,0.081,45.0,87.0,0.9983,3.48,0.53,10.0,5,red,0
2538,5.9,0.28,0.14,8.6,0.032,30.0,142.0,0.99542,3.28,0.44,9.5,6,white,1
411,7.3,0.28,0.36,12.7,0.04,38.0,140.0,0.998,3.3,0.79,9.6,6,white,1
5657,8.8,0.42,0.21,2.5,0.092,33.0,88.0,0.99823,3.19,0.52,9.2,5,red,0
1483,6.9,0.25,0.24,3.6,0.057,13.0,85.0,0.9942,2.99,0.48,9.5,4,white,0
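
Before fitting any classifier, it helps to know how balanced the binary target is, because the majority-class share is the accuracy a trivial model would reach. The sketch below recomputes that share from the frequency table in Exercise 6.1 using plain Python. The variable names are illustrative and not part of the notebook.

```python
# Counts per quality level, copied from the frequency table in Exercise 6.1.
counts = {
    "red":   {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18, 9: 0},
    "white": {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
}

# Share of wines with quality >= 6, i.e. quality2 == 1.
positive_share = {}
for wine_type, by_quality in counts.items():
    total = sum(by_quality.values())
    positives = sum(n for q, n in by_quality.items() if q >= 6)
    positive_share[wine_type] = positives / total

for wine_type, share in positive_share.items():
    print(f"{wine_type}: {share:.4f}")
```

On the full dataset, always predicting `quality2 = 1` would be right for about 66.5% of the white wines and 53.5% of the red wines, which is a useful baseline when reading the accuracies reported later.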


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs, one for the white wines and one for the red wines.

In [213]:
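import scipy.stats as sp

# Select the 11 feature columns to standardize
Zdata = pd.DataFrame(data[data.columns[0:11]])

# sp already refers to scipy.stats, so zscore is called on it directly;
# np.asarray guarantees positional column labels 0..10 for the rename below
Zdata = pd.DataFrame(np.asarray(sp.zscore(Zdata)))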
Zdata.rename(columns={0:'Zfixed_acidity', 1:'Zvolatile_acidity', 2:'Zcitric_acid', 3:'Zresidual_sugar',
                         4:'Zchlorides', 5:'Zfree_sulfur_dioxide', 6:'Ztotal_sulfur_dioxide',
                         7:'Zdensity', 8:'ZpH', 9:'Zsulphates', 10:'Zalcohol'}, inplace=True)
Zdata['type']=data['type']
Zdata['quality']=data['quality']
Zdata['quality2']=data['quality2']

Zdata['Target']=np.where(Zdata['type']=='white', 0, 1)

Zdata.sample(10)

Unnamed: 0,Zfixed_acidity,Zvolatile_acidity,Zcitric_acid,Zresidual_sugar,Zchlorides,Zfree_sulfur_dioxide,Ztotal_sulfur_dioxide,Zdensity,ZpH,Zsulphates,Zalcohol,type,quality,quality2,Target
4127,-0.397511,-0.180205,0.147046,0.285188,0.084672,1.153631,1.756189,0.227907,0.693511,0.193097,-0.160823,white,5,0,0
2086,0.759598,-1.030629,0.009406,-0.828857,-0.172244,-0.762074,-0.260885,-0.232332,-0.488266,0.32751,-0.831615,white,5,0,0
4878,-0.783214,1.156175,-2.055193,-0.954975,-0.600437,-1.381861,-0.614758,-0.785953,0.133722,-1.218246,-0.831615,white,4,0,0
5795,-0.088949,1.520643,-2.124013,-0.660699,0.684143,-0.198632,-1.287116,0.267928,1.2533,0.32751,0.174573,red,6,1,1
200,-0.32037,-0.423183,-0.403514,2.240022,-0.20079,1.379008,1.42001,1.235097,-0.426067,-0.210144,-0.999313,white,5,0,0
367,-0.24323,-0.058716,-0.747613,-0.933956,-0.086605,0.4775,0.570716,-0.66589,-0.612663,-0.546178,-0.915464,white,6,1,0
2824,-0.011808,-0.483928,0.559966,0.18009,-0.257883,1.209975,0.995363,0.344634,-0.239471,-0.008524,-0.831615,white,6,1,0
5145,0.759598,1.581387,-1.022893,-0.660699,0.455773,-1.10014,-0.756307,0.534733,-0.115073,-0.546178,-0.999313,red,5,0,1
1789,0.451036,-0.848395,-0.541153,-0.807837,-0.857353,-0.254976,1.296155,-1.232851,-0.861459,1.40282,1.348459,white,7,1,0
6197,0.296754,7.534354,-2.192833,-0.702739,2.311277,-1.438205,-1.888699,0.021133,1.750891,-0.882212,0.342271,red,3,0,1
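
As a rough sketch of what the standardization step computes: each column is centered on its mean and divided by its population standard deviation (`ddof=0`, which is also the default of `scipy.stats.zscore`). The toy array below is invented for illustration and is not taken from the wine data.

```python
import numpy as np

# Toy feature matrix (values are invented): rows are wines, columns are features.
X = np.array([[7.0, 0.27, 8.8],
              [6.3, 0.30, 9.5],
              [8.1, 0.28, 10.1],
              [7.2, 0.23, 9.9]])

# z-score per column: subtract the column mean, divide by the column std (ddof=0).
Z = (X - X.mean(axis=0)) / X.std(axis=0)

print(Z.mean(axis=0).round(6))  # each column mean is ~0
print(Z.std(axis=0).round(6))   # each column std is ~1
```

Note that the notebook standardizes red and white wines together, using statistics from the full dataset, before any train/test split. Computing the means and standard deviations on the training rows only would keep test-set information out of the features.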


* Create two linear SVMs, one for the white wines and one for the red wines.

In [66]:
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

train_W, test_W = train_test_split(Zdata[Zdata['Target']==0], test_size=0.3)
train_R, test_R = train_test_split(Zdata[Zdata['Target']==1], test_size=0.3)
#print(len(train_W), len(test_W), len(train_R),len(test_R))

feature_cols = ['Zfixed_acidity','Zvolatile_acidity','Zcitric_acid','Zresidual_sugar','Zchlorides','Zfree_sulfur_dioxide',
                'Ztotal_sulfur_dioxide','Zdensity','ZpH','Zsulphates','Zalcohol']
XW = train_W[feature_cols]
yW = train_W['quality2']

SVM_W = SVC(kernel='linear')
SVM_W.fit(XW, yW)
test_W['quality2_pron']=SVM_W.predict(test_W[feature_cols])
print(confusion_matrix(test_W['quality2'], test_W['quality2_pron']))
print(accuracy_score(test_W['quality2'], test_W['quality2_pron']))


[[237 283]
 [ 92 858]]
0.7448979591836735


In [67]:
feature_cols = ['Zfixed_acidity','Zvolatile_acidity','Zcitric_acid','Zresidual_sugar','Zchlorides','Zfree_sulfur_dioxide',
                'Ztotal_sulfur_dioxide','Zdensity','ZpH','Zsulphates','Zalcohol']
XR = train_R[feature_cols]
yR = train_R['quality2']

SVM_R = SVC(kernel='linear')
SVM_R.fit(XR, yR)
test_R['quality2_pron']=SVM_R.predict(test_R[feature_cols])
print(confusion_matrix(test_R['quality2'], test_R['quality2_pron']))
print(accuracy_score(test_R['quality2'], test_R['quality2_pron']))

[[171  59]
 [ 67 183]]
0.7375
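
Accuracy alone hides how the errors are distributed between the two classes. The helper below is a minimal sketch (the function name `binary_metrics` is made up here) that derives accuracy, recall and specificity from the two confusion matrices printed above, assuming scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
def binary_metrics(cm):
    """Return accuracy, recall and specificity for a 2x2 confusion matrix.

    Assumes scikit-learn's layout: rows are true labels, columns are
    predicted labels, so cm = [[TN, FP], [FN, TP]].
    """
    (tn, fp), (fn, tp) = cm
    total = tn + fp + fn + tp
    return {
        "accuracy": (tn + tp) / total,
        "recall": tp / (tp + fn),        # share of good wines found
        "specificity": tn / (tn + fp),   # share of poor wines found
    }

# Confusion matrices printed above for the linear SVMs.
white = binary_metrics([[237, 283], [92, 858]])
red = binary_metrics([[171, 59], [67, 183]])

for name, m in [("white", white), ("red", red)]:
    print(name, {k: round(v, 4) for k, v in m.items()})
```

On these numbers the white-wine model finds about 90% of the good wines but only about 46% of the poor ones, while the red-wine model is more balanced (about 73% and 74%).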


# Exercise 6.3

Test the two SVMs using the other kernels ('poly', 'rbf', 'sigmoid').


### White Wine

In [78]:
SVM_W2 = SVC(kernel='poly')
SVM_W2.fit(XW, yW)
test_W['quality2_pron']=SVM_W2.predict(test_W[feature_cols])
print(confusion_matrix(test_W['quality2'], test_W['quality2_pron']))
print('poly:',accuracy_score(test_W['quality2'], test_W['quality2_pron']))

[[185 335]
 [ 49 901]]
('poly:', 0.7387755102040816)


In [143]:
SVM_W3 = SVC(kernel='rbf')
SVM_W3.fit(XW, yW)
test_W['quality2_pron']=SVM_W3.predict(test_W[feature_cols])
print(confusion_matrix(test_W['quality2'], test_W['quality2_pron']))
print('rbf:',accuracy_score(test_W['quality2'], test_W['quality2_pron']))

[[293 227]
 [ 98 852]]
('rbf:', 0.7789115646258503)


In [144]:
SVM_W4 = SVC(kernel='sigmoid')
SVM_W4.fit(XW, yW)
test_W['quality2_pron']=SVM_W4.predict(test_W[feature_cols])
print(confusion_matrix(test_W['quality2'], test_W['quality2_pron']))
print('sigmoid:',accuracy_score(test_W['quality2'], test_W['quality2_pron']))

[[217 303]
 [178 772]]
('sigmoid:', 0.6727891156462585)


### Red Wine

In [145]:
SVM_R2 = SVC(kernel='poly')
SVM_R2.fit(XR, yR)
test_R['quality2_pron']=SVM_R2.predict(test_R[feature_cols])
print(confusion_matrix(test_R['quality2'], test_R['quality2_pron']))
print('poly:',accuracy_score(test_R['quality2'], test_R['quality2_pron']))

[[181  49]
 [ 67 183]]
('poly:', 0.7583333333333333)


In [146]:
SVM_R3 = SVC(kernel='rbf')
SVM_R3.fit(XR, yR)
test_R['quality2_pron']=SVM_R3.predict(test_R[feature_cols])
print(confusion_matrix(test_R['quality2'], test_R['quality2_pron']))
print('rbf:',accuracy_score(test_R['quality2'], test_R['quality2_pron']))

[[169  61]
 [ 57 193]]
('rbf:', 0.7541666666666667)


In [147]:
SVM_R4 = SVC(kernel='sigmoid')
SVM_R4.fit(XR, yR)
test_R['quality2_pron']=SVM_R4.predict(test_R[feature_cols])
print(confusion_matrix(test_R['quality2'], test_R['quality2_pron']))
print('sigmoid:',accuracy_score(test_R['quality2'], test_R['quality2_pron']))

[[130 100]
 [ 84 166]]
('sigmoid:', 0.6166666666666667)
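
The cells above compare kernels on a single random train/test split, so differences of a point or two in accuracy may be noise. One possible alternative, sketched here on synthetic data from `make_classification` rather than on the wine data, is to score each kernel with cross-validation and keep the scaling step inside a pipeline. The names `X_demo`, `y_demo` and `kernel_scores` are assumptions for this example.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features (11 columns, binary target).
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     n_informative=5, random_state=0)

kernel_scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Scaling lives inside the pipeline, so it is refit on each training fold.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
    kernel_scores[kernel] = scores.mean()

for kernel, score in kernel_scores.items():
    print(f"{kernel}: {score:.3f}")
```

The same loop could be run with `XW`, `yW` or `XR`, `yR` in place of the synthetic arrays to get a mean accuracy per kernel that depends less on one particular split.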


# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

##### Using white wine with kernel='rbf', since it was the best SVM

In [189]:
Res = []

for gamma in [0.01,0.001,0.0001]: # gamma values
    for C in [0.1,1,10,100,1000]: # C values
        clf1 = SVC(C=C, gamma=gamma,kernel='rbf')
        clf1.fit(XW, yW)
        test_W['quality2_pron']=clf1.predict(test_W[feature_cols])
        accuracy=accuracy_score(test_W['quality2'], test_W['quality2_pron'])
        Res.append((C, gamma, accuracy))
# Pick the tuple with the highest accuracy; a bare max(Res) would compare C first, then gamma
max(Res, key=lambda r: r[2])

(1000, 0.01, 0.7836734693877551)
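
The loop above selects `C` and `gamma` by their accuracy on the test set, so the reported accuracy is likely to be somewhat optimistic. A possible alternative is scikit-learn's `GridSearchCV`, which chooses the parameters by cross-validation on the training data only. The sketch below uses the grid from the exercise statement on synthetic data; `X_demo` and `y_demo` are placeholders, not the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the wine features and binary target.
X_demo, y_demo = make_classification(n_samples=200, n_features=11,
                                     n_informative=5, random_state=0)

# Same grid as in the exercise statement.
param_grid = {"C": [0.1, 1, 10, 100, 1000], "gamma": [0.01, 0.001, 0.0001]}

# Each parameter combination is scored by 3-fold cross-validated accuracy.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=3, scoring="accuracy")
search.fit(X_demo, y_demo)

print(search.best_params_)
print(round(search.best_score_, 3))
```

With the notebook's data, `search.fit(XW, yW)` followed by `search.predict(test_W[feature_cols])` would give a test accuracy that played no part in choosing the parameters.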

# Exercise 6.5

Compare the results with other methods

###### As requested, a logistic regression model is used

In [190]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()

logreg.fit(XW, yW)
test_W['quality2_pron']=logreg.predict(test_W[feature_cols])
print(confusion_matrix(test_W['quality2'], test_W['quality2_pron']))
print(accuracy_score(test_W['quality2'], test_W['quality2_pron']))

[[241 279]
 [ 96 854]]
0.7448979591836735


# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [217]:
import warnings
from sklearn import metrics
warnings.filterwarnings("ignore")

from sklearn.linear_model import LinearRegression
train, test = train_test_split(Zdata, test_size=0.3)

linreg = LinearRegression()

X = train[feature_cols]
y = train['quality']
linreg.fit(X, y)
test['quality_pron']=linreg.predict(test[feature_cols])

# One row per feature, holding its fitted coefficient
Coef_linreg=pd.DataFrame(linreg.coef_, index=feature_cols)
print(Coef_linreg)

# The feature that contributes most to the linear regression model is alcohol (positive coefficient).
# The feature that contributes most with a negative sign is volatile_acidity.

                              0
Zfixed_acidity         0.091912
Zvolatile_acidity     -0.226299
Zcitric_acid          -0.011859
Zresidual_sugar        0.214311
Zchlorides            -0.019890
Zfree_sulfur_dioxide   0.091833
Ztotal_sulfur_dioxide -0.137716
Zdensity              -0.163103
ZpH                    0.078686
Zsulphates             0.107571
Zalcohol               0.324514


In [218]:
from math import sqrt
MSE_linreg=metrics.mean_squared_error(test['quality'], test['quality_pron'])
RMSE=sqrt(MSE_linreg)
print('RMSE:', RMSE)

('RMSE:', 0.7355630901280124)
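
For reference, the RMSE is the square root of the mean squared difference between actual and predicted values. The short sketch below works it out by hand on four invented quality scores, without scikit-learn.

```python
from math import sqrt

# Invented quality scores and predictions, for illustration only.
y_true = [5, 6, 7, 5]
y_pred = [5.5, 6.0, 6.0, 5.0]

# RMSE = sqrt(mean((y_true - y_pred)^2))
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
rmse = sqrt(sum(squared_errors) / len(squared_errors))

print(round(rmse, 4))  # 0.559
```

Because the RMSE is expressed in the units of the target, the value of about 0.74 obtained above means that predictions are typically off by roughly three quarters of a quality point.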


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [220]:
from sklearn.linear_model import Ridge
ridreg = Ridge(alpha=0.1) # alpha equals 0.1
ridreg.fit(X, y)
test['quality_pron'] = ridreg.predict(test[feature_cols])
print('RMSE Alpha 0.1:',np.sqrt(metrics.mean_squared_error(test['quality'], test['quality_pron'])))

('RMSE Alpha 0.1:', 0.7355627382438407)


In [221]:
ridreg1 = Ridge(alpha=1)
ridreg1.fit(X, y)
test['quality_pron'] = ridreg1.predict(test[feature_cols])
print('RMSE Alpha 1:',np.sqrt(metrics.mean_squared_error(test['quality'], test['quality_pron'])))

('RMSE Alpha 1:', 0.7355596055247973)
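
The RMSE is about 0.73556 for plain linear regression and for both ridge models, which suggests that with roughly 4,500 standardized training rows a penalty of 0.1 or 1 barely moves the coefficients. The exercise also asks for a coefficient comparison, so here is a minimal sketch of one way to do it on synthetic data from `make_regression`. The extra value `alpha=100` is an assumption added only to make the shrinkage visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem standing in for the wine data.
X_demo, y_demo = make_regression(n_samples=200, n_features=11,
                                 noise=10.0, random_state=0)

# Baseline: ordinary least squares, no penalty.
ols = LinearRegression().fit(X_demo, y_demo)
coef_norms = {"ols": np.linalg.norm(ols.coef_)}

for alpha in [0.1, 1, 100]:
    ridge = Ridge(alpha=alpha).fit(X_demo, y_demo)
    # The L2 norm of the coefficient vector shrinks as alpha grows.
    coef_norms[f"ridge alpha={alpha}"] = np.linalg.norm(ridge.coef_)

for name, norm in coef_norms.items():
    print(f"{name}: {norm:.3f}")
```

In the notebook itself, comparing `linreg.coef_`, `ridreg.coef_` and `ridreg1.coef_` side by side in a DataFrame indexed by `feature_cols` would answer the exercise directly.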


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [222]:
# vs lasso alpha 0.01
from sklearn.linear_model import Lasso
lasreg = Lasso(alpha=0.01)
lasreg.fit(X, y)
test['quality_pron'] = lasreg.predict(test[feature_cols])

print('RMSE Alpha 0,01:',np.sqrt(metrics.mean_squared_error(test['quality'], test['quality_pron'])))

('RMSE Alpha 0,01:', 0.7373468812894776)


In [224]:
# vs lasso alpha 0.1
from sklearn.linear_model import Lasso
lasreg = Lasso(alpha=0.1)
lasreg.fit(X, y)
test['quality_pron'] = lasreg.predict(test[feature_cols])

print('RMSE Alpha 0,1:',np.sqrt(metrics.mean_squared_error(test['quality'], test['quality_pron'])))

('RMSE Alpha 0,1:', 0.7600294267285042)


In [225]:
# vs lasso alpha 1
from sklearn.linear_model import Lasso
lasreg = Lasso(alpha=1)
lasreg.fit(X, y)
test['quality_pron'] = lasreg.predict(test[feature_cols])

print('RMSE Alpha 1:',np.sqrt(metrics.mean_squared_error(test['quality'], test['quality_pron'])))

('RMSE Alpha 1:', 0.8627257425162578)


###### As alpha increases, the RMSE grows (0.737, 0.760 and 0.863 for alpha = 0.01, 0.1 and 1); that is, the predictions move further and further from the actual values
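
One reason the lasso error can rise with alpha is that the L1 penalty sets more and more coefficients exactly to zero, eventually discarding useful features. The sketch below illustrates this on synthetic data in which only the first two of eleven features matter. The data and the names `X_demo`, `y_demo` and `nonzero` are invented for the example.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: 11 standard-normal features, only the first two matter.
X_demo = rng.standard_normal((200, 11))
true_coef = np.zeros(11)
true_coef[:2] = [3.0, -2.0]
y_demo = X_demo @ true_coef + rng.standard_normal(200)

nonzero = {}
for alpha in [0.01, 0.1, 1.0]:
    lasso = Lasso(alpha=alpha).fit(X_demo, y_demo)
    # Count the coefficients that lasso did not shrink exactly to zero.
    nonzero[alpha] = int(np.sum(lasso.coef_ != 0))

print(nonzero)
```

Printing `lasreg.coef_` after each fit in the notebook would show which wine features survive at each alpha.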

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the F1 score

In [230]:
X = train[feature_cols]
y = train['quality2']

logreg.fit(X, y)
test['quality2_pron']=logreg.predict(test[feature_cols])
print(confusion_matrix(test['quality2'], test['quality2_pron']))
print(accuracy_score(test['quality2'], test['quality2_pron']))

[[ 386  311]
 [ 197 1056]]
0.7394871794871795


In [232]:
from sklearn.metrics import f1_score
print(f1_score(test['quality2'], test['quality2_pron']))

0.8061068702290076
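
The F1 score is the harmonic mean of precision and recall for the positive class. As a quick check, it can be recomputed by hand from the confusion matrix printed above, assuming scikit-learn's layout of true labels in rows and predicted labels in columns.

```python
# Confusion matrix printed above, in scikit-learn's layout [[TN, FP], [FN, TP]].
tn, fp = 386, 311
fn, tp = 197, 1056

precision = tp / (tp + fp)   # predicted-good wines that really are good
recall = tp / (tp + fn)      # good wines that were found
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 4), round(recall, 4), round(f1, 4))
```

Precision is about 0.77 and recall about 0.84, and their harmonic mean reproduces the 0.806 reported by `f1_score`.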


# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the F1 score

In [240]:
X = train[feature_cols]
y = train['quality2']

Res = []
for C in [0.01, 0.1, 1]: # inverse regularization strength
    for penal in ['l1', 'l2']: # penalty type
        logreg = LogisticRegression(C=C, penalty=penal,solver='liblinear')
        logreg.fit(X, y)
        test['quality2_pron']=logreg.predict(test[feature_cols]) 
        f1=f1_score(test['quality2'], test['quality2_pron']) 
        Res.append((C, penal, f1))
        # coeficientes_tmp=pd.DataFrame(np.transpose(logreg.coef_))
        # coeficientes_1=np.column_stack([coeficientes_1, coeficientes_tmp])
# Display the (C, penalty, F1) results once the loops have finished
Res
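
The coefficient comparison requested in the exercise is not carried out in the cell above. One way it might be approached, shown here on synthetic data rather than the wine data, is to count how many coefficients each penalty sets exactly to zero for each value of `C`. The names `X_demo`, `y_demo` and `zero_counts` are assumptions for this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the wine features and binary target.
X_demo, y_demo = make_classification(n_samples=300, n_features=11,
                                     n_informative=4, random_state=0)

zero_counts = {}
for C in [0.01, 0.1, 1.0]:
    for penalty in ["l1", "l2"]:
        model = LogisticRegression(C=C, penalty=penalty, solver="liblinear")
        model.fit(X_demo, y_demo)
        # Number of coefficients driven exactly to zero by the penalty.
        zero_counts[(C, penalty)] = int(np.sum(model.coef_ == 0))

for key, n_zero in zero_counts.items():
    print(key, n_zero)
```

A smaller `C` means stronger regularization. The L1 penalty tends to remove features entirely, whereas the L2 penalty shrinks coefficients without usually making them exactly zero.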