# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.



In [76]:
import pandas as pd
import numpy as np

In [77]:
data_rx = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_wx = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')


In [78]:
data = data_wx.assign(type = 'white')

data = data.append(data_rx.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
4760,6.6,0.17,0.26,7.4,0.052,45.0,128.0,0.99388,3.16,0.37,10.0,6,white
3036,6.8,0.29,0.34,3.5,0.054,26.0,189.0,0.99489,3.42,0.58,10.4,5,white
5625,6.4,0.57,0.02,1.8,0.067,4.0,11.0,0.997,3.46,0.68,9.5,5,red
5444,7.5,0.55,0.24,2.0,0.078,10.0,28.0,0.9983,3.45,0.78,9.5,6,red
3909,6.2,0.39,0.24,4.8,0.037,45.0,138.0,0.99174,3.23,0.43,11.2,7,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [79]:
pd.crosstab(index=data['quality'],
            columns=data['type'], margins=True)

type,red,white,All
quality,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
3,10,20,30
4,53,163,216
5,681,1457,2138
6,638,2198,2836
7,199,880,1079
8,18,175,193
9,0,5,5
All,1599,4898,6497


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [80]:
# Filtramos los datos
data_ry = data_rx['quality']
data_wy = data_wx['quality']
del data_wx['quality']
del data_rx['quality']


# Estandarizamos los datos
column_names = data_wx.columns.values.tolist()

for col in column_names:
    data_wx[col]=((data_wx[col] - data_wx[col].mean())/ data_wx[col].std())
    
for col in column_names:
    data_rx[col]=((data_rx[col] - data_rx[col].mean())/ data_rx[col].std())  
    
#Create a binary target for each type of wine
data['type_binary']=np.where(data['type']=='white',1,0)

from sklearn.model_selection import train_test_split

rx_train,rx_test,ry_train,ry_test, = train_test_split(data_rx, data_ry, test_size=0.3)
wx_train,wx_test,wy_train,wy_test, = train_test_split(data_wx, data_wy, test_size=0.3)

#Creamos el modelo r
from sklearn.svm import SVR

algolitmo_r = SVR(gamma='auto')
algolitmo_r.fit(rx_train,ry_train)
algolitmo_r.predict(rx_test)
print('kernel rbf vino white',algolitmo_r.score(rx_test,ry_test))

#Creamos el modelo w
algolitmo_w = SVR(gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
print('kernel rbf vino white',algolitmo_w.score(wx_test,wy_test))


kernel rbf vino white 0.38634350439565496
kernel rbf vino white 0.4081917872032985


# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [74]:
#Kernel sigmoid
#r
algolitmo_r = SVR(kernel='sigmoid',gamma='auto')
algolitmo_r.fit(rx_train,ry_train)
algolitmo_r.predict(rx_test)
print('kernel sigmoid vino red',algolitmo_r.score(rx_test,ry_test))
# W
algolitmo_w = SVR(kernel='sigmoid',gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
print('kernel sigmoid vino white',algolitmo_w.score(wx_test,wy_test))

#Kernel poly
#r
algolitmo_r = SVR(kernel='poly',gamma='auto')
algolitmo_r.fit(rx_train,ry_train)
algolitmo_r.predict(rx_test)
print('kernel poly vino red',algolitmo_r.score(rx_test,ry_test))
#w
algolitmo_w = SVR(kernel='poly',gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
print('kernel poly vino white', algolitmo_w.score(wx_test,wy_test))

kernel sigmoid vino red -196.271965075645
kernel sigmoid vino white -599.3216027460525
kernel poly vino red -0.08881111080111848
kernel poly vino white 0.18407287247264337


# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [20]:
# se eecutan las diferentes combinaciones
algolitmo_w = SVR(C = 0.01,gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_0_1=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1,gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1__=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 10,gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_10_=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 100,gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_100=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1000,gamma='auto')
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1000=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 0.01,gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_0_01_gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 0.01,gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_0_01_gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 0.01,gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_0_01_gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1,gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1_gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1,gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1_gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1,gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1_gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 10,gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_10_gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 10,gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_10_gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 100,gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_10_gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 100,gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_100_gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 100,gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_100_gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 100,gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_100_gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1000,gamma=0.01)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1000_gamma_0_01=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1000,gamma=0.001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_1000_gamma_0_001=algolitmo_w.score(wx_test,wy_test)

algolitmo_w = SVR(C = 1000,gamma=0.0001)
algolitmo_w.fit(wx_train,wy_train)
algolitmo_w.predict(wx_test)
c_100_gamma_0_0001=algolitmo_w.score(wx_test,wy_test)

# Exercise 6.5

Compare the results with other methods

In [48]:
print(c_0_1)
print(c_1__)
print('el mejor modelo es C = 10 donde me esta controlando la regularidad de la funcion')
print(c_10_)
print(c_100)
print(c_1000)
print(gamma_0_01)
print(gamma_0_001)
print(gamma_0_0001)
print(c_0_01_gamma_0_01)
print(c_0_01_gamma_0_001)
print(c_0_01_gamma_0_0001)
print(c_1_gamma_0_01)
print(c_1_gamma_0_001)
print(c_1_gamma_0_0001)
print(c_10_gamma_0_01)
print(c_10_gamma_0_001)
print(c_10_gamma_0_0001)
print(c_100_gamma_0_01)
print(c_100_gamma_0_001)
print(c_100_gamma_0_0001)
print(c_1000_gamma_0_01)
print(c_1000_gamma_0_001)
print(c_100_gamma_0_0001)

0.20116787456259144
0.38862958420797833
el mejor modelo es C = 10 donde me esta controlando la regularidad de la funcion
0.3981267616153078
0.2474185031728867
-0.36101611218088325
0.33090564677773593
0.2447560980070375
0.13518665798322915
0.1238167318558221
0.027396996732660498
0.00255230397238404
0.33090564677773593
0.2447560980070375
0.13518665798322915
0.3529144954715132
0.2829028327936929
0.257690774019919
0.36917329201864857
0.3124336976826093
0.2660808232987907
0.3645220571437491
0.3326762938514116
0.2660808232987907


# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [81]:
#Train a linear regression to predict wine quality (Continous)
data_y = data['quality']
del data['quality']
del data['type']

for col in column_names:
    data[col]=((data[col] - data[col].mean())/ data[col].std())
    
tx_train,tx_test,ty_train,ty_test, = train_test_split(data, data_y, test_size=0.3)

from sklearn import datasets,linear_model
lr_multiple = linear_model.LinearRegression()
lr_multiple.fit(tx_train,ty_train)
y_pred_multiple= lr_multiple.predict(tx_test)
print('valor de llos coeficientes')
print(column_names)
print(lr_multiple.coef_)
print('las variables que aportan positivamente a la prediccion de la calidad del vino son fixed acidity,citric acid, chlorides, pH')
#print(lr_multiple.score(tx_train,ty_train))
from sklearn.metrics import mean_squared_error
print('RMSE')
mean_squared_error(ty_test,y_pred_multiple)


valor de llos coeficientes
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']
[ 0.10665084 -0.23546415  0.00502327  0.31887528 -0.03568649  0.07928867
 -0.06927384 -0.35075145  0.08169809  0.10222532  0.24526099 -0.41398983]

RMSE


0.5257802192357292

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [46]:
from sklearn.linear_model import Ridge
lr_ridge = linear_model.LinearRegression()
lr_ridge = Ridge(alpha=1)
lr_ridge.fit(tx_train,ty_train) 
print(lr_ridge.coef_)
y_pred_ridge = lr_multiple.predict(tx_test)
print(lr_ridge.score(tx_train,ty_train))
mean_squared_error(ty_test,y_pred_ridge)
print ('----------------//-----------------')
lr_ridge = linear_model.LinearRegression()
lr_ridge = Ridge(alpha=0.1)
lr_ridge.fit(tx_train,ty_train) 
print(lr_ridge.coef_)
y_pred_ridge = lr_multiple.predict(tx_test)
print(lr_ridge.score(tx_train,ty_train))
mean_squared_error(ty_test,y_pred_ridge)

[ 0.11625809 -0.248466   -0.00980369  0.2921536  -0.02362161  0.09467536
 -0.0786319  -0.31119308  0.08727171  0.09964646  0.2599978  -0.34103644]
0.29144015405565477
----------------//-----------------
[ 0.11709811 -0.24870181 -0.00979047  0.29395188 -0.02361059  0.09450858
 -0.07815295 -0.31392183  0.08773564  0.0997215   0.25899528 -0.34480123]
0.2914409519068612


0.5100609746621253

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [50]:
from sklearn.metrics import mean_squared_error
lr_lasso = linear_model.Lasso(alpha=1)
lr_lasso.fit(tx_train,ty_train) 
print(lr_lasso.coef_)
y_pred_lasso = lr_lasso.predict(tx_test)
print(lr_lasso.score(tx_train,ty_train))
mean_squared_error(ty_test,y_pred_lasso)
print('-----------//----------')

lr_lasso = linear_model.Lasso(alpha=0.01)
lr_lasso.fit(tx_train,ty_train) 
print(lr_lasso.coef_)
y_pred_lasso = lr_lasso.predict(tx_test)
print(lr_lasso.score(tx_train,ty_train))
mean_squared_error(ty_test,y_pred_lasso)
print('-----------//----------')

lr_lasso = linear_model.Lasso(alpha=0.1)
lr_lasso.fit(tx_train,ty_train) 
print(lr_lasso.coef_)
y_pred_lasso = lr_lasso.predict(tx_test)
print(lr_lasso.score(tx_train,ty_train))
mean_squared_error(ty_test,y_pred_lasso)


[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.  0.]
0.0
-----------//----------
[ 0.         -0.22199049 -0.          0.07680455 -0.01461034  0.08247742
 -0.08451538 -0.          0.01883194  0.07476498  0.37779254 -0.        ]
0.28216132875363764
-----------//----------
[-0.         -0.12991388  0.          0.         -0.          0.
 -0.         -0.          0.          0.          0.27753881  0.        ]
0.23175040771437527


0.5585522615047995

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [57]:
ty_test_B = np.where(pd.DataFrame(ty_test)['quality']< 6 , 0 , 1)
ty_train_B =np.where(pd.DataFrame(ty_train)['quality']< 6 , 0 , 1)

from sklearn import linear_model
logistica = linear_model.LogisticRegression(solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

[[ 0.0522569  -0.79810979 -0.0557798   0.47246186 -0.06982154  0.29698317
  -0.35522678 -0.30598713  0.10526114  0.30087517  0.98922642 -0.58310852]]


0.74180778535298

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [60]:
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 00.1,solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 0.1,solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 1,solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')

from sklearn import linear_model
logistica = linear_model.LogisticRegression(penalty='l2',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression( penalty='l1',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 00.1, penalty='l1',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 00.1, penalty='l2',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 0.1, penalty='l1',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 0.1, penalty='l2',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 1, penalty='l2',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

print('-----------//----------')
from sklearn import linear_model
logistica = linear_model.LogisticRegression(C = 1, penalty='l1',solver='liblinear')
logistica.fit(tx_train,ty_train_B)
predictions = logistica.predict(tx_test)
print(logistica.coef_)
logistica.score(tx_train,ty_train_B)

[[ 0.04017209 -0.74016518 -0.05473523  0.38416673 -0.0653724   0.30191409
  -0.39888395 -0.19351851  0.10050729  0.30499509  0.99159826 -0.2140798 ]]
-----------//----------
[[ 0.04017209 -0.74016518 -0.05473523  0.38416673 -0.0653724   0.30191409
  -0.39888395 -0.19351851  0.10050729  0.30499509  0.99159826 -0.2140798 ]]
-----------//----------
[[ 0.0522569  -0.79810979 -0.0557798   0.47246186 -0.06982154  0.29698317
  -0.35522678 -0.30598713  0.10526114  0.30087517  0.98922642 -0.58310852]]
-----------//----------
[[ 0.0522569  -0.79810979 -0.0557798   0.47246186 -0.06982154  0.29698317
  -0.35522678 -0.30598713  0.10526114  0.30087517  0.98922642 -0.58310852]]
-----------//----------
[[ 0.03463674 -0.79816491 -0.05438217  0.44570048 -0.0690655   0.29540203
  -0.35453729 -0.26487865  0.09502285  0.29684519  1.00649509 -0.55540648]]
-----------//----------
[[-0.00543243 -0.72697244 -0.05285229  0.25305806 -0.04764574  0.28523192
  -0.38463345  0.          0.0632343   0.27549392  1.072

0.7400483835495931