# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept: appreciation by wine tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if human-assessed quality could be related to the chemical properties of wine, so that the certification, quality assessment and assurance process is more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties of the wines, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median of their ranks.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price they can expect for their produce, without heavy reliance on the volatility of wine tasters' opinions.

In [1]:
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data_w.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
410,7.2,0.25,0.39,18.95,0.038,42.0,155.0,0.9999,2.97,0.47,9.0,6
2982,7.0,0.2,0.31,8.0,0.05,29.0,213.0,0.99596,3.28,0.57,10.4,6
2400,9.2,0.19,0.42,2.0,0.047,16.0,104.0,0.99517,3.09,0.66,10.0,4
3913,7.2,0.25,0.32,1.5,0.054,24.0,105.0,0.99154,3.17,0.48,11.1,6
3658,5.8,0.275,0.3,5.4,0.043,41.0,149.0,0.9926,3.33,0.42,10.8,7


In [4]:
data = data_w.assign(type = 'white')
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
6492,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5,red
6493,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6,red
6494,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6,red
6495,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5,red
6496,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6,red


# Exercise 6.1

Show the frequency table of quality by type of wine.

In [5]:
data[['type','quality']].pivot_table(index='type', columns='quality', aggfunc=len, fill_value=0)
#pd.crosstab(data['type'],data['quality'])

type \ quality,3,4,5,6,7,8,9
red,10,53,681,638,199,18,0
white,20,163,1457,2198,880,175,5
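
Because there are roughly three times as many white wines as red ones, row-wise proportions can be easier to compare than raw counts. The following is a minimal sketch on a made-up toy table (the `toy` frame and its values are assumptions for illustration, not the wine data), using `pd.crosstab` with `normalize="index"`:

```python
import pandas as pd

# Toy stand-in for the combined wine table (values are made up for illustration).
toy = pd.DataFrame({
    "type": ["red", "red", "red", "white", "white", "white", "white"],
    "quality": [5, 6, 5, 6, 6, 7, 5],
})

# Absolute counts per wine type and quality score.
counts = pd.crosstab(toy["type"], toy["quality"])

# Row-wise proportions: each row sums to 1, so the two types become comparable.
shares = pd.crosstab(toy["type"], toy["quality"], normalize="index")

print(counts)
print(shares.round(2))
```

Applied to the real data, the same call would take `data['type']` and `data['quality']` as arguments.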


# SVM

# Exercise 6.2

* Data preparation


In [6]:
# Define a column with quality encoded as 1 (quality > 5) or 0
data['quality1'] = (data['quality'] > 5).astype(int)
data['quality1'].value_counts()

d = {'white': 1, 'red': 0}
data['type_bool'] = data['type'].map(d)

In [7]:
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality1,type_bool
6039,8.2,0.38,0.32,2.5,0.08,24.0,71.0,0.99624,3.27,0.85,11.0,6,red,1,0
3592,5.6,0.28,0.27,3.9,0.043,52.0,158.0,0.99202,3.35,0.44,10.7,7,white,1,1
469,7.2,0.29,0.53,18.15,0.047,59.0,182.0,0.9992,3.09,0.52,9.6,5,white,0,1
5205,10.3,0.41,0.42,2.4,0.213,6.0,14.0,0.9994,3.19,0.62,9.5,6,red,1,0
3533,6.6,0.22,0.3,14.7,0.045,50.0,136.0,0.99704,3.14,0.37,10.6,6,white,1,1


* Create two linear SVMs, for the white and red wines, respectively.

In [8]:
from sklearn import preprocessing
from sklearn.svm import SVC
from sklearn import metrics
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

In [9]:
# Fit the model for the red wines.
red_data = pd.DataFrame(data[data.type_bool == 0])

X_red = np.array(red_data[['fixed acidity','volatile acidity', 'citric acid',
                   'residual sugar','chlorides','free sulfur dioxide',
                   'total sulfur dioxide','density','pH',
                   'sulphates','alcohol']])

y_red = np.array(red_data[['quality1']])

X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_red, y_red, test_size=0.2, random_state=42)

# Declare the SVM model
clf = SVC(kernel='linear')
# Standardize X (training split only)
standardized_X = preprocessing.scale(X_train_r)
# Fit the model on the training data
clf.fit(standardized_X, y_train_r)
# Predict the response for the test data
# NOTE: X_test_r is passed unscaled, although the model was fitted on standardized features
y_pred_r = clf.predict(X_test_r)
# Accuracy of the predictions against y_test_r
print('Accuracy of the linear model for red wines:',accuracy_score(y_test_r, y_pred_r))

Accuracy of the linear model for red wines: 0.603125
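
The cell above standardizes only the training split and then predicts on the raw test split, so the model sees test features on a different scale from the one it was trained on. One common way to avoid this is to put the scaler and the classifier in a single pipeline. The sketch below is an illustration on synthetic data, not a re-run of the wine experiment; `make_classification` and the rescaling factors are assumptions chosen only to mimic features with very different ranges.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 11 chemical features and the binary quality label.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           random_state=42)
# Put the features on very different scales (assumption, for illustration).
X = X * np.logspace(-2, 2, 11)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# The scaler is fitted on the training split only; the pipeline then applies
# the same means and standard deviations to the test split inside score().
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print("test accuracy:", round(accuracy, 3))
```

With the wine data, `X_train_r`, `X_test_r` and the flattened labels (`y_train_r.ravel()`) would take the place of the synthetic arrays.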


In [10]:
# Fit the model for the white wines.
white_data = pd.DataFrame(data[data.type_bool == 1])

X_white = np.array(white_data[['fixed acidity','volatile acidity', 'citric acid',
                   'residual sugar','chlorides','free sulfur dioxide',
                   'total sulfur dioxide','density','pH',
                   'sulphates','alcohol']])

y_white = np.array(white_data[['quality1']])

X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_white, y_white, test_size=0.2, random_state=42)

clf = SVC(kernel='linear')
standardized_X = preprocessing.scale(X_train_w)
clf.fit(standardized_X, y_train_w)
y_pred_w = clf.predict(X_test_w)
print('Accuracy of the linear model for white wines:',accuracy_score(y_test_w, y_pred_w))

Accuracy of the linear model for white wines: 0.6806122448979591


# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [11]:
print('Accuracy of the red wine models')

clf = SVC(kernel='poly')
standardized_X = preprocessing.scale(X_train_r)
clf.fit(standardized_X, y_train_r)
# NOTE: the predictions are stored in y_pred_w, but the score uses y_pred_r,
# which still holds the linear model's predictions from In [9]
y_pred_w = clf.predict(X_test_r)
print('poly:',round(accuracy_score(y_test_r, y_pred_r),3))

clf = SVC(kernel='rbf')
standardized_X = preprocessing.scale(X_train_r)
clf.fit(standardized_X, y_train_r)
y_pred_w = clf.predict(X_test_r)
print('rbf:',round(accuracy_score(y_test_r, y_pred_r),3))

clf = SVC(kernel='sigmoid')
standardized_X = preprocessing.scale(X_train_r)
clf.fit(standardized_X, y_train_r)
y_pred_w = clf.predict(X_test_r)
print('sigmoid:',round(accuracy_score(y_test_r, y_pred_r),3))

Accuracy of the red wine models
poly: 0.603
rbf: 0.603
sigmoid: 0.603
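
In the red-wine cell the predictions are assigned to `y_pred_w` while the score is computed from `y_pred_r`, which is why the three kernels report the same value as the linear model. A loop that keeps prediction and scoring together avoids this kind of mix-up. The sketch below uses synthetic data from `make_classification` (an assumption, so the numbers are not wine results) and scales inside a pipeline:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one wine type (11 features, binary label).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

scores = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # A fresh pipeline per kernel: scaling is learned on the training split.
    model = make_pipeline(StandardScaler(), SVC(kernel=kernel))
    model.fit(X_train, y_train)
    # Predict and score in the same iteration, with one local variable.
    y_pred = model.predict(X_test)
    scores[kernel] = accuracy_score(y_test, y_pred)

for kernel, acc in scores.items():
    print(f"{kernel}: {acc:.3f}")
```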


In [12]:
print('Accuracy of the white wine models')

clf = SVC(kernel='poly')
standardized_X = preprocessing.scale(X_train_w)
clf.fit(standardized_X, y_train_w)
y_pred_w = clf.predict(X_test_w)
print('poly:',round(accuracy_score(y_test_w, y_pred_w),3))

clf = SVC(kernel='rbf')
standardized_X = preprocessing.scale(X_train_w)
clf.fit(standardized_X, y_train_w)
y_pred_w = clf.predict(X_test_w)
print('rbf:',round(accuracy_score(y_test_w, y_pred_w),3))

clf = SVC(kernel='sigmoid')
standardized_X = preprocessing.scale(X_train_w)
clf.fit(standardized_X, y_train_w)
y_pred_w = clf.predict(X_test_w)
print('sigmoid:',round(accuracy_score(y_test_w, y_pred_w),3))

Accuracy of the white wine models
poly: 0.672
rbf: 0.672
sigmoid: 0.674


# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [13]:
from sklearn.metrics import confusion_matrix 
from sklearn.metrics import classification_report

In [14]:
C=[ 0.1, 1, 10, 100, 1000]
gamma=[0.01, 0.001, 0.0001]

In [15]:
# Search for the best parameters for the poly model on the red wines

standardized_X = preprocessing.scale(X_train_r)

for c in C:
    for g in gamma:
        ## Poly
        clf = SVC(kernel='poly',degree=7,C=c,gamma=g)
        clf.fit(standardized_X, y_train_r)
        y_pred_r = clf.predict(X_test_r)  
        print("C: ",c,"/ gamma: ",g, "/ Accuracy: ",accuracy_score(y_test_r,y_pred_r))
        print(classification_report(y_test_r,y_pred_r))

C:  0.1 / gamma:  0.01 / Accuracy:  0.49375
             precision    recall  f1-score   support

          0       0.46      0.77      0.57       141
          1       0.60      0.28      0.38       179

avg / total       0.54      0.49      0.47       320

C:  0.1 / gamma:  0.001 / Accuracy:  0.55625
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       141
          1       0.56      0.99      0.71       179

avg / total       0.31      0.56      0.40       320

C:  0.1 / gamma:  0.0001 / Accuracy:  0.559375
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       141
          1       0.56      1.00      0.72       179

avg / total       0.31      0.56      0.40       320

C:  1 / gamma:  0.01 / Accuracy:  0.41875
             precision    recall  f1-score   support

          0       0.42      0.89      0.57       141
          1       0.36      0.05      0.09       179

avg / total     

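The best parameters found for the poly SVM on the red wines are:

C:  0.1
gamma:  0.0001
Accuracy:  0.559375

With these parameters the model predicts class 1 for every test wine (recall 1.00 for class 1 and 0.00 for class 0), so the accuracy equals the share of class 1 in the test set, 179/320.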

In [16]:
# Search for the best parameters for the poly model on the white wines

standardized_X = preprocessing.scale(X_train_w)

for c in C:
    for g in gamma:
        ## Poly
        clf = SVC(kernel='poly',degree=7,C=c,gamma=g)
        clf.fit(standardized_X, y_train_w)
        y_pred_w = clf.predict(X_test_w)  
        print("C: ",c,"/ gamma: ",g, "/ Accuracy: ",accuracy_score(y_test_w,y_pred_w))
        #print(confusion_matrix(y_test_w,y_pred_w)) 
        print(classification_report(y_test_w,y_pred_w))

C:  0.1 / gamma:  0.01 / Accuracy:  0.32653061224489793
             precision    recall  f1-score   support

          0       0.33      1.00      0.49       321
          1       0.00      0.00      0.00       659

avg / total       0.11      0.33      0.16       980

C:  0.1 / gamma:  0.001 / Accuracy:  0.6795918367346939
             precision    recall  f1-score   support

          0       0.59      0.07      0.13       321
          1       0.68      0.98      0.80       659

avg / total       0.65      0.68      0.58       980

C:  0.1 / gamma:  0.0001 / Accuracy:  0.6724489795918367
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       321
          1       0.67      1.00      0.80       659

avg / total       0.45      0.67      0.54       980

C:  1 / gamma:  0.01 / Accuracy:  0.32653061224489793
             precision    recall  f1-score   support

          0       0.33      1.00      0.49       321
          1       0.00   

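The best parameters found for the poly SVM on the white wines are:

C:  0.1
gamma:  0.001
Accuracy:  0.6795918367346939

This is only slightly above the 0.672 obtained when every wine is assigned to class 1 (the C = 0.1, gamma = 0.0001 run).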
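
The nested loops over `C` and `gamma` pick the winner by test-set accuracy. A cross-validated search is a common alternative that keeps the test split out of the selection. The sketch below runs `GridSearchCV` over the same grid on synthetic data; the data, the step names `scale` and `svc`, the 5-fold setting and the choice of the `rbf` kernel (used here only because it fits quickly) are all assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one wine type (11 features, binary label).
X, y = make_classification(n_samples=200, n_features=11, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Scaling is part of the pipeline, so it is re-fitted inside every CV fold.
pipe = Pipeline([("scale", StandardScaler()), ("svc", SVC(kernel="rbf"))])

# Same grid as in the exercise statement.
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print("mean CV accuracy:", round(search.best_score_, 3))
# The held-out test split is used once, after the parameters are chosen.
print("test accuracy:", round(search.score(X_test, y_test), 3))
```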

# Exercise 6.5

Compare the results with other methods

In [17]:
# Search for the best parameters for the rbf model on the red wines

standardized_X = preprocessing.scale(X_train_r)

for c in C:
    for g in gamma:
        ## RBF (the degree argument is ignored by the rbf kernel)
        clf = SVC(kernel='rbf',degree=7,C=c,gamma=g)
        clf.fit(standardized_X, y_train_r)
        y_pred_r = clf.predict(X_test_r)  
        print("C: ",c,"/ gamma: ",g, "/ Accuracy: ",accuracy_score(y_test_r,y_pred_r))
        print(classification_report(y_test_r,y_pred_r))

C:  0.1 / gamma:  0.01 / Accuracy:  0.434375
             precision    recall  f1-score   support

          0       0.44      0.99      0.61       141
          1       0.00      0.00      0.00       179

avg / total       0.19      0.43      0.27       320

C:  0.1 / gamma:  0.001 / Accuracy:  0.559375
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       141
          1       0.56      1.00      0.72       179

avg / total       0.31      0.56      0.40       320

C:  0.1 / gamma:  0.0001 / Accuracy:  0.559375
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       141
          1       0.56      1.00      0.72       179

avg / total       0.31      0.56      0.40       320

C:  1 / gamma:  0.01 / Accuracy:  0.440625
             precision    recall  f1-score   support

          0       0.44      1.00      0.61       141
          1       0.00      0.00      0.00       179

avg / total  

In [18]:
# Search for the best parameters for the rbf model on the white wines

standardized_X = preprocessing.scale(X_train_w)

for c in C:
    for g in gamma:
        ## RBF (the degree argument is ignored by the rbf kernel)
        clf = SVC(kernel='rbf',degree=7,C=c,gamma=g)
        clf.fit(standardized_X, y_train_w)
        y_pred_w = clf.predict(X_test_w)  
        print("C: ",c,"/ gamma: ",g, "/ Accuracy: ",accuracy_score(y_test_w,y_pred_w))
        #print(confusion_matrix(y_test_w,y_pred_w)) 
        print(classification_report(y_test_w,y_pred_w))

C:  0.1 / gamma:  0.01 / Accuracy:  0.32653061224489793
             precision    recall  f1-score   support

          0       0.33      1.00      0.49       321
          1       0.00      0.00      0.00       659

avg / total       0.11      0.33      0.16       980

C:  0.1 / gamma:  0.001 / Accuracy:  0.6724489795918367
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       321
          1       0.67      1.00      0.80       659

avg / total       0.45      0.67      0.54       980

C:  0.1 / gamma:  0.0001 / Accuracy:  0.6724489795918367
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       321
          1       0.67      1.00      0.80       659

avg / total       0.45      0.67      0.54       980

C:  1 / gamma:  0.01 / Accuracy:  0.6724489795918367
             precision    recall  f1-score   support

          0       0.00      0.00      0.00       321
          1       0.67    

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)


In [19]:
X = np.array(data[['fixed acidity','volatile acidity', 'citric acid',
                   'residual sugar','chlorides','free sulfur dioxide',
                   'total sulfur dioxide','density','pH',
                   'sulphates','alcohol']])

y = np.array(data[['quality']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [20]:
# import
from sklearn.linear_model import LinearRegression
# Initialize
linreg = LinearRegression(fit_intercept=False)
# Fit
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=False, n_jobs=1, normalize=False)

In [21]:
y_pred = linreg.predict(X_test)

* Analyze the coefficients

In [22]:
linreg.coef_

array([[ 0.01123191, -1.44251231, -0.10053734,  0.02319971, -0.65374668,
         0.00636245, -0.00244575,  2.00929887,  0.13921813,  0.6621809 ,
         0.32959046]])

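Looking at the coefficients, the large majority of them appear to indicate variables that matter for the linear model. Because the features are on different scales, however, the coefficient magnitudes are not directly comparable.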

* Evaluate the RMSE

In [23]:
from math import sqrt
print('RMSE:', sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE: 0.6920917977650142
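
The value printed by this cell is the square root of the mean squared error, that is, the RMSE the exercise asks for. As a small worked example with made-up numbers (the arrays below are assumptions, not wine predictions):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up true quality scores and predictions.
y_true = np.array([5, 6, 7, 5])
y_hat = np.array([5.5, 6.0, 6.0, 5.5])

# Squared errors are 0.25, 0, 1, 0.25, so their mean is 0.375.
mse = mean_squared_error(y_true, y_hat)
# RMSE is the square root of the MSE and is in the same units as quality.
rmse = np.sqrt(mse)

print("MSE:", mse)
print("RMSE:", round(rmse, 3))
```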


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.


In [24]:
# alpha=0.1 (the normalize argument was removed in scikit-learn 1.2, so this cell needs an older release)
from sklearn.linear_model import Ridge
ridgereg01 = Ridge(alpha=0.1, normalize=True)
ridgereg01.fit(X_train, y_train)
y_pred = ridgereg01.predict(X_test)

In [25]:
ridgereg01.coef_

array([[ 2.90911851e-02, -1.15638289e+00,  2.58061347e-02,
         2.69505448e-02, -8.80349904e-01,  4.85795553e-03,
        -1.96916472e-03, -2.88186411e+01,  2.27099214e-01,
         6.68265776e-01,  2.58453034e-01]])

In [26]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE: 0.6923968765515969


In [27]:
# alpha=1
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=1, normalize=True)
ridgereg.fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)

In [28]:
ridgereg.coef_

array([[ 6.86548358e-04, -5.72959670e-01,  1.73567247e-01,
         6.00456832e-03, -1.25598585e+00,  1.66021236e-03,
        -6.35899542e-04, -2.18758539e+01,  7.70463388e-02,
         3.12457632e-01,  1.37364728e-01]])

In [29]:
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

RMSE: 0.7300937224503552
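
The ridge cells rely on `normalize=True`, an argument that recent scikit-learn releases no longer accept. A common replacement is to standardize in a pipeline before the ridge step. The sketch below shows that pattern on synthetic data from `make_regression` (an assumption); note that `StandardScaler` followed by `Ridge(alpha=...)` is not numerically identical to the old `normalize=True` behaviour, so the coefficients and RMSE would differ from the ones above.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 features and the continuous quality target.
X, y = make_regression(n_samples=300, n_features=11, noise=10.0,
                       random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

results = {}
for alpha in [0.1, 1]:
    # Standardize, then fit ridge with the given penalty strength.
    model = make_pipeline(StandardScaler(), Ridge(alpha=alpha))
    model.fit(X_train, y_train)
    rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
    results[alpha] = (model.named_steps["ridge"].coef_, rmse)

for alpha, (coef, rmse) in results.items():
    print(f"alpha={alpha}: ||coef||={np.linalg.norm(coef):.3f}, RMSE={rmse:.3f}")
```

A larger `alpha` shrinks the coefficient vector further, which is the effect to look for when comparing the two fits.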


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [30]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.0005, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
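# NOTE: y_pred is not recomputed from lassoreg; it still holds the ridge (alpha=1)
# predictions from In [27], so the value printed here is the ridge RMSE
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

[ 0.00000000e+00 -1.09406553e+00  0.00000000e+00  6.89306915e-03
 -0.00000000e+00  4.77392494e-04 -0.00000000e+00  0.00000000e+00
  0.00000000e+00  3.69277283e-01  2.95976005e-01]
RMSE: 0.7300937224503552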


In [31]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=0.01, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
# NOTE: y_pred still holds the ridge (alpha=1) predictions, not the lasso ones
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]
RMSE: 0.7300937224503552


In [32]:
from sklearn.linear_model import Lasso
lassoreg = Lasso(alpha=1, normalize=True)
lassoreg.fit(X_train, y_train)
print(lassoreg.coef_)
# NOTE: y_pred still holds the ridge (alpha=1) predictions, not the lasso ones
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

[-0. -0.  0. -0. -0.  0. -0. -0.  0.  0.  0.]
RMSE: 0.7300937224503552
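
In the three lasso cells `y_pred` is never reassigned, so each one prints the RMSE of the earlier ridge fit. The sketch below shows one way to recompute the predictions for every `alpha` from the exercise statement and to count how many coefficients the lasso sets to zero. It runs on synthetic data (`make_regression` with four informative features is an assumption), so the numbers are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 11 features, only 4 of which drive the target.
X, y = make_regression(n_samples=300, n_features=11, n_informative=4,
                       noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

results = {}
for alpha in [0.01, 0.1, 1]:
    model = make_pipeline(StandardScaler(), Lasso(alpha=alpha))
    model.fit(X_train, y_train)
    # Predictions come from this model, not from an earlier cell.
    y_pred_lasso = model.predict(X_test)
    coef = model.named_steps["lasso"].coef_
    results[alpha] = {
        "rmse": float(np.sqrt(mean_squared_error(y_test, y_pred_lasso))),
        "n_zero": int(np.sum(coef == 0)),
    }

for alpha, res in results.items():
    print(f"alpha={alpha}: RMSE={res['rmse']:.3f}, zero coefficients={res['n_zero']}")
```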


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [33]:
X = np.array(data[['fixed acidity','volatile acidity', 'citric acid',
                   'residual sugar','chlorides','free sulfur dioxide',
                   'total sulfur dioxide','density','pH',
                   'sulphates','alcohol']])

y = np.array(data[['quality1']])

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

In [34]:
# fit a logistic regression model and store the class predictions
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear',C=1e9)
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)

In [35]:
logreg.coef_

array([[ 0.03797515, -4.41668794, -0.55810775,  0.07106798, -1.0722782 ,
         0.01780072, -0.00787689, -5.02832196,  0.51686001,  1.94753138,
         0.90954524]])

In [36]:
print(classification_report(y_test,y_pred))

             precision    recall  f1-score   support

          0       0.67      0.60      0.63       468
          1       0.79      0.83      0.81       832

avg / total       0.74      0.75      0.74      1300
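
Since the logistic regression above is fitted on unscaled features, the size of a coefficient partly reflects the units of its variable. For the "analyze the coefficients" step, one option is to standardize first and then rank the coefficients by absolute value. The following is a sketch on synthetic data; the data set and the feature indices it prints are assumptions and carry no meaning for the wine variables.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 features and the binary quality label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=4,
                           random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# A very large C makes the penalty negligible, as in the cell above.
model = make_pipeline(StandardScaler(),
                      LogisticRegression(C=1e9, solver="liblinear"))
model.fit(X_train, y_train)

# On standardized inputs the coefficients are per standard deviation,
# so their absolute values can be ranked against each other.
coefs = model.named_steps["logisticregression"].coef_.ravel()
order = np.argsort(-np.abs(coefs))
for idx in order[:3]:
    print(f"feature {idx}: {coefs[idx]:+.3f}")

f1 = f1_score(y_test, model.predict(X_test))
print("F1 (class 1):", round(f1, 3))
```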



# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the f1score

In [37]:
from sklearn.linear_model import LogisticRegression
from sklearn import datasets
from sklearn.preprocessing import StandardScaler

In [38]:
C = [0.01, 0.1, 1]

In [42]:
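# NOTE: every iteration overwrites the previous one, so the reports below are for the last value, C = 1
for c in C: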
    clf_l1_LR = LogisticRegression(C=c, penalty='l1', tol=0.01, solver='saga')
    clf_l2_LR = LogisticRegression(C=c, penalty='l2', tol=0.01, solver='saga')
    clf_l1_LR.fit(X_train, y_train)
    clf_l2_LR.fit(X_train, y_train)
    
    coef1 = clf_l1_LR.coef_.ravel()
    coef2 = clf_l2_LR.coef_.ravel()
    
    y_pred1 = clf_l1_LR.predict(X_test)
    y_pred2 = clf_l2_LR.predict(X_test)

In [43]:
print(classification_report(y_test,y_pred1))

             precision    recall  f1-score   support

          0       0.50      0.17      0.25       468
          1       0.66      0.91      0.76       832

avg / total       0.60      0.64      0.58      1300



In [44]:
print(classification_report(y_test,y_pred2))

             precision    recall  f1-score   support

          0       0.49      0.12      0.19       468
          1       0.65      0.93      0.77       832

avg / total       0.59      0.64      0.56      1300
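
The loop in In [42] keeps only the models from its last iteration, so the two reports cover a single value of `C`. To compare coefficients and F1 across the whole grid, the results can be stored per combination. The sketch below does this on synthetic data; the data set, the `liblinear` solver (chosen here because it supports both penalties) and the scaling step are assumptions rather than the notebook's setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 11 features and the binary quality label.
X, y = make_classification(n_samples=500, n_features=11, n_informative=4,
                           random_state=7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7)

results = {}
for c in [0.01, 0.1, 1.0]:
    for penalty in ["l1", "l2"]:
        model = make_pipeline(
            StandardScaler(),
            LogisticRegression(C=c, penalty=penalty, solver="liblinear"),
        )
        model.fit(X_train, y_train)
        coef = model.named_steps["logisticregression"].coef_.ravel()
        # Keep one entry per (C, penalty) instead of overwriting variables.
        results[(c, penalty)] = {
            "f1": f1_score(y_test, model.predict(X_test)),
            "n_zero": int(np.sum(coef == 0)),
        }

for (c, penalty), res in results.items():
    print(f"C={c}, {penalty}: F1={res['f1']:.3f}, zero coefficients={res['n_zero']}")
```

Smaller `C` means stronger regularization; with the `l1` penalty this typically shows up as more coefficients at exactly zero.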

