# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has seen a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept: appreciation by wine tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if tasters' quality judgments could be related to the chemical properties of wine, so that certification, quality assessment and quality assurance become more controlled.

Two datasets are available: one on red wine with 1599 varieties, and one on white wine with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 properties of the wines: one is Quality, based on sensory data, and the rest are chemical properties, including density, acidity, alcohol content, etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety is tasted by three independent tasters, and the final rank assigned is the median of their ranks.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the variable judgments of wine tasters.

In [1]:
import pandas as pd
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn import metrics
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression, LinearRegression, Ridge, Lasso
from sklearn.preprocessing import StandardScaler

import warnings
warnings.filterwarnings('ignore')

In [2]:
#data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
#data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

data_r = pd.read_csv('Wine_data_red.csv')
data_w = pd.read_csv('Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')

# stack the red wines below the white ones
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
2642,7.6,0.23,0.4,5.2,0.066,14.0,91.0,0.99488,3.17,0.8,9.7,5,white
5992,6.6,0.725,0.09,5.5,0.117,9.0,17.0,0.99655,3.35,0.49,10.8,6,red
5284,7.8,0.54,0.26,2.0,0.088,23.0,48.0,0.9981,3.41,0.74,9.2,6,red
2093,6.6,0.22,0.53,15.1,0.052,22.0,136.0,0.9986,2.94,0.35,9.4,5,white
2762,7.3,0.32,0.35,1.4,0.05,8.0,163.0,0.99244,3.24,0.42,10.7,5,white


# Exercise 6.1

Show the frequency table of quality by type of wine

In [4]:
# frequency table of wine types
data['type'].value_counts()

white    4898
red      1599
Name: type, dtype: int64

In [5]:
#frecuencia= data.pivot_table(values='pH', index=['type','quality'], aggfunc='count').unstack()

frecuencia = data.pivot_table(values='pH', index=['type'],columns='quality', aggfunc='count')
print("Frequency table by wine type and quality level \n")
frecuencia

Frequency table by wine type and quality level 



type / quality,3,4,5,6,7,8,9
red,10.0,53.0,681.0,638.0,199.0,18.0,NaN
white,20.0,163.0,1457.0,2198.0,880.0,175.0,5.0
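
An alternative way to build this kind of table is `pd.crosstab`, which counts co-occurrences directly and fills empty cells with 0 instead of NaN. The sketch below uses a small made-up frame (`toy`) rather than the wine data, so the counts are illustrative only.

```python
import pandas as pd

# Toy stand-in for the combined wine table (values are made up for illustration).
toy = pd.DataFrame({
    "type": ["red", "red", "white", "white", "white"],
    "quality": [5, 6, 5, 5, 7],
})

# Rows are wine types, columns are quality scores, cells are counts.
freq = pd.crosstab(toy["type"], toy["quality"])
print(freq)
```

Because the cells are integer counts, a missing combination such as red wines with quality 7 shows up as 0.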


# SVM

# Exercise 6.2

* Create a binary target for each type of wine

In [6]:
data['quality2'] = data['quality']
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,6


In [7]:
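# quality > 6 yields a boolean column (True = high quality)
data['quality'] = data['quality']>6

# These replace() calls look for the strings 'True'/'False', not booleans,
# so they leave the column unchanged and the target stays True/False.
data['quality'] = data['quality'].replace('True',1)
data['quality'] = data['quality'].replace('False',0)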

In [8]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,False,white,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,False,white,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,False,white,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6
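
A minimal sketch of the binary encoding on a made-up `quality` Series (not the wine data): the comparison already produces booleans, and `astype(int)` is one way to obtain 0/1 labels if they are wanted. The string 'True' is not equal to the boolean True, which is why a string-based replace does not touch a boolean column.

```python
import pandas as pd

# Made-up quality scores, for illustration only.
quality = pd.Series([5, 6, 7, 8])

is_good = quality > 6            # boolean Series
as_int = is_good.astype(int)     # explicit 0/1 encoding

print(True == 'True')            # False: a boolean never equals a string
print(as_int.tolist())           # [0, 0, 1, 1]
```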


In [9]:
datared = data[data['type'] == 'red']
datawhite = data[data['type'] == 'white']
datared.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
4898,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,False,red,5
4899,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,False,red,5
4900,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,False,red,5
4901,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,False,red,6
4902,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,False,red,5


* Create two linear SVMs, for the white and red wines respectively.

In [10]:
X = datared[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
y = datared['quality']

- Standardize the features (not the quality)

#### RED WINE

- Split the dataset into training and testing sets

In [11]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20)

scaler = StandardScaler()
X_train = X_train.astype(float)
X_test = X_test.astype(float)

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

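- Training the algorithm

In [12]:
# Note: this fits on the full, unscaled X and y rather than on the scaled
# X_train and y_train, while the predictions below use the scaled X_test.
svclassifier = SVC(kernel='linear').fit(X, y)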

In [13]:
svclassifier.decision_function

<bound method BaseSVC.decision_function of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>
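
A minimal sketch of one way to keep scaling and fitting on the same training split, using a scikit-learn `Pipeline`. The data come from `make_classification` as a synthetic stand-in for the 11 chemical features, and the names `X_demo`, `y_demo`, `X_tr`, `X_te` and `model` are assumptions for illustration, not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the 11 chemical features (not the wine data).
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0
)

# The pipeline learns the scaling from the training split only and reuses
# it at prediction time, so training and test data share one scale.
model = make_pipeline(StandardScaler(), SVC(kernel="linear"))
model.fit(X_tr, y_tr)
pred = model.predict(X_te)
print(pred.shape)
```

Bundling the scaler with the classifier makes it harder to fit on one representation of the data and predict on another.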

- Making the predictions

In [14]:
y_pred = svclassifier.predict(X_test)

- Evaluating the algorithm

In [15]:
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))

[[265   0]
 [ 55   0]]
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



- Classification accuracy

In [16]:
accuracy_score(y_test, y_pred)

0.828125

Classification accuracy is the number of correct predictions as a proportion of all predictions made. Here the model predicts False for every test wine, so the 0.83 accuracy equals the share of the majority class (265 of 320).
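
As a quick check, the accuracy can be recomputed from the confusion matrix printed above for the red-wine linear SVM: the diagonal holds the correct predictions. The variable names `cm` and `accuracy` are only illustrative.

```python
import numpy as np

# Confusion matrix reported above for the red-wine linear SVM:
# rows are true classes, columns are predicted classes.
cm = np.array([[265, 0],
               [55, 0]])

accuracy = np.trace(cm) / cm.sum()   # correct predictions / all predictions
print(accuracy)                      # 0.828125
```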

#### WHITE WINE

- Create X_ and y_

In [17]:
X_ = datawhite[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
y_ = datawhite['quality']

In [18]:
#Split the dataset
X_train_, X_test_, y_train_, y_test_ = train_test_split(X_, y_, test_size = 0.20) 

#Standardize X
X_train_ = X_train_.astype(float)
X_test_ = X_test_.astype(float)

scaler.fit(X_train_)
X_train_ = scaler.transform(X_train_)
X_test_ = scaler.transform(X_test_)

In [19]:
#Training the algorithm
# Note: this fits on the unscaled X_, y_ rather than on X_train_, y_train_
svclassifier_ = SVC(kernel='linear').fit(X_, y_)

In [20]:
svclassifier_.decision_function

<bound method BaseSVC.decision_function of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='linear', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>

In [21]:
#Making the predictions
y_pred_ = svclassifier_.predict(X_test_)

In [22]:
print(confusion_matrix(y_test_,y_pred_))  
print(classification_report(y_test_,y_pred_))

[[759   0]
 [221   0]]
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



In [23]:
print("Classification accuracy:", accuracy_score(y_test_, y_pred_))

Classification accuracy: 0.7744897959183673


# Exercise 6.3

Test the two SVMs using different kernels ('poly', 'rbf', 'sigmoid')


#### RED WINE

- Polynomial Kernel

In [24]:
svclassifier = SVC(kernel='poly', degree=8)

In [25]:
svclassifier.fit(X_train, y_train)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=8, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [26]:
y_pred = svclassifier.predict(X_test)

In [27]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[259   6]
 [ 42  13]]
              precision    recall  f1-score   support

       False       0.86      0.98      0.92       265
        True       0.68      0.24      0.35        55

   micro avg       0.85      0.85      0.85       320
   macro avg       0.77      0.61      0.63       320
weighted avg       0.83      0.85      0.82       320



- Gaussian Kernel

In [28]:
svclassifier = SVC(kernel='rbf').fit(X, y)
svclassifier.decision_function

<bound method BaseSVC.decision_function of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='rbf', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>

In [29]:
y_pred = svclassifier.predict(X_test)

In [30]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[265   0]
 [ 55   0]]
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



- Sigmoid Kernel

In [31]:
svclassifier = SVC(kernel='sigmoid').fit(X, y)
svclassifier.decision_function

<bound method BaseSVC.decision_function of SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto_deprecated',
  kernel='sigmoid', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)>

In [32]:
y_pred = svclassifier.predict(X_test)

In [33]:
print(confusion_matrix(y_test, y_pred))  
print(classification_report(y_test, y_pred))

[[164 101]
 [ 17  38]]
              precision    recall  f1-score   support

       False       0.91      0.62      0.74       265
        True       0.27      0.69      0.39        55

   micro avg       0.63      0.63      0.63       320
   macro avg       0.59      0.65      0.56       320
weighted avg       0.80      0.63      0.68       320



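- Comparing the kernels

Comparing the kernels on the red-wine test set, the Gaussian (RBF) kernel performs worst in practice: it never predicts the positive class (recall 0.00 for True), so its 0.83 accuracy is just the majority-class share. Between the sigmoid and polynomial kernels, the polynomial kernel reaches the higher accuracy (0.85 versus 0.63), although the sigmoid kernel recovers more of the positive class (recall 0.69 versus 0.24). Only the polynomial model was fit on the scaled training split; the RBF and sigmoid models were fit on the unscaled X and y, so the comparison is not like for like.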
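
A sketch of how the kernels could be compared on identical footing, with every model fit on the same scaled training split. It runs on synthetic data from `make_classification`, not on the wine data, so the scores it prints say nothing about the results above; `gamma="scale"` and all variable names are assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=1
)

results = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    # Same scaler + classifier recipe for every kernel.
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, gamma="scale"))
    clf.fit(X_tr, y_tr)
    results[kernel] = accuracy_score(y_te, clf.predict(X_te))

for kernel, acc in results.items():
    print(kernel, round(acc, 3))
```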

#### WHITE WINE

- Polynomial Kernel

In [34]:
svclassifier_ = SVC(kernel='poly', degree=8)

In [35]:
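# Note: X_train, y_train are the red-wine training split;
# the white-wine split is X_train_, y_train_
svclassifier_.fit(X_train, y_train)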

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=8, gamma='auto_deprecated',
  kernel='poly', max_iter=-1, probability=False, random_state=None,
  shrinking=True, tol=0.001, verbose=False)

In [36]:
y_pred_ = svclassifier_.predict(X_test_)
print(confusion_matrix(y_test_, y_pred_))  
print(classification_report(y_test_, y_pred_))

[[739  20]
 [212   9]]
              precision    recall  f1-score   support

       False       0.78      0.97      0.86       759
        True       0.31      0.04      0.07       221

   micro avg       0.76      0.76      0.76       980
   macro avg       0.54      0.51      0.47       980
weighted avg       0.67      0.76      0.69       980



- Gaussian Kernel

In [37]:
svclassifier_ = SVC(kernel='rbf').fit(X_, y_)
y_pred_ = svclassifier_.predict(X_test_)
print(confusion_matrix(y_test_, y_pred_))  
print(classification_report(y_test_, y_pred_))

[[759   0]
 [221   0]]
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



- Sigmoid Kernel

In [38]:
svclassifier_ = SVC(kernel='sigmoid').fit(X_, y_)
y_pred_ = svclassifier_.predict(X_test_)
print(confusion_matrix(y_test_, y_pred_))  
print(classification_report(y_test_, y_pred_))

[[543 216]
 [159  62]]
              precision    recall  f1-score   support

       False       0.77      0.72      0.74       759
        True       0.22      0.28      0.25       221

   micro avg       0.62      0.62      0.62       980
   macro avg       0.50      0.50      0.50       980
weighted avg       0.65      0.62      0.63       980



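- Comparing the results

On the white-wine test set the Gaussian (RBF) kernel again predicts False for every wine (recall 0.00 for True), so its 0.77 accuracy equals the majority-class share. Between the sigmoid and polynomial kernels, the polynomial kernel is more accurate (0.76 versus 0.62) but finds almost no high-quality wines (recall 0.04 versus 0.28). The polynomial model here was fit on the red-wine training split (X_train, y_train), so these figures should be read with caution.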
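
To put these accuracies in context, the sketch below computes what a classifier that always answers False would score on the white-wine test set, using the class counts from the reports above (759 False, 221 True). Balanced accuracy, the mean of the per-class recalls, is included as one possible complementary measure; it is an addition for illustration and not part of the original exercise.

```python
import numpy as np

# Class counts taken from the white-wine test reports above.
y_true = np.array([False] * 759 + [True] * 221)
y_majority = np.zeros_like(y_true)             # always predict False

accuracy = (y_true == y_majority).mean()       # 759 / 980

# Recall per class, then their mean (balanced accuracy).
recall_false = (~y_majority[~y_true]).mean()   # all False wines are recovered
recall_true = y_majority[y_true].mean()        # no True wine is recovered
balanced = (recall_false + recall_true) / 2

print(round(accuracy, 4), balanced)
```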

# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

#### RED WINE

- Varying the parameters

In [39]:
C=[0.1, 1, 10, 100, 1000]
gamma=[0.01, 0.001, 0.0001]

In [40]:
for c in C:
    for g in gamma:
        # Note: fit on the unscaled X, y; only C and gamma change between runs
        svclassifier = SVC(kernel='rbf',C=c,gamma=g).fit(X, y)
        y_pred = svclassifier.predict(X_test)
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Parametro 'C'=", c,"y Parametro gamma", g)
        print(classification_report(y_test, y_pred))
        

Accuracy: 0.828125
Parametro 'C'= 0.1 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 0.1 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 0.1 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        

Accuracy: 0.828125
Parametro 'C'= 1 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 1 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 10 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

 

Accuracy: 0.828125
Parametro 'C'= 10 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 10 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



Accuracy: 0.828125
Parametro 'C'= 100 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320

Accuracy: 0.828125
Parametro 'C'= 100 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



Accuracy: 0.828125
Parametro 'C'= 100 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



Accuracy: 0.828125
Parametro 'C'= 1000 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.83      1.00      0.91       265
        True       0.00      0.00      0.00        55

   micro avg       0.83      0.83      0.83       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.83      0.75       320



Accuracy: 0.825
Parametro 'C'= 1000 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.83      1.00      0.90       265
        True       0.00      0.00      0.00        55

   micro avg       0.82      0.82      0.82       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.82      0.75       320



Accuracy: 0.825
Parametro 'C'= 1000 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.83      1.00      0.90       265
        True       0.00      0.00      0.00        55

   micro avg       0.82      0.82      0.82       320
   macro avg       0.41      0.50      0.45       320
weighted avg       0.69      0.82      0.75       320



- With the Gaussian kernel, accuracy is 0.828125 for every listed combination except C=1000 with gamma=0.001 or gamma=0.0001, where it drops to 0.825. Recall for the True class is 0.00 in every run, so this grid does not single out a best setting.
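
A sketch of how the same C and gamma grid could be searched with cross-validation through `GridSearchCV`, instead of scoring each setting on the test set. It uses synthetic data from `make_classification` as a stand-in, so the selected parameters are not a result for the wine data; `cv=5` and the variable names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the wine features.
X_demo, y_demo = make_classification(n_samples=300, n_features=11, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=2
)

# make_pipeline names the SVC step "svc", hence the "svc__" prefixes.
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}
pipe = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validation on the training split picks the setting.
search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X_tr, y_tr)

print(search.best_params_)
print(round(search.score(X_te, y_te), 3))   # accuracy of the refit best model
```

Selecting parameters by cross-validation keeps the test set untouched until the final evaluation.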

#### WHITE WINE

In [41]:
for c in C:
    for g in gamma:
        # Note: X, y are still the red-wine data here; the white-wine data are X_, y_
        svclassifier_ = SVC(kernel='rbf',C=c,gamma=g).fit(X, y)
        y_pred_ = svclassifier_.predict(X_test_)
        print("Accuracy:", accuracy_score(y_test_, y_pred_))
        print("Parametro 'C'=", c,"y Parametro gamma", g)
        print(classification_report(y_test_, y_pred_))

Accuracy: 0.7744897959183673
Parametro 'C'= 0.1 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980

Accuracy: 0.7744897959183673
Parametro 'C'= 0.1 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 0.1 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 1 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 1 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980

Accuracy: 0.7744897959183673
Parametro 'C'= 1 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 10 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 10 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980

Accuracy: 0.7744897959183673
Parametro 'C'= 10 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 100 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980

Accuracy: 0.7744897959183673
Parametro 'C'= 100 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 100 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 1000 y Parametro gamma 0.01
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980

Accuracy: 0.773469387755102
Parametro 'C'= 1000 y Parametro gamma 0.001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



Accuracy: 0.7744897959183673
Parametro 'C'= 1000 y Parametro gamma 0.0001
              precision    recall  f1-score   support

       False       0.77      1.00      0.87       759
        True       0.00      0.00      0.00       221

   micro avg       0.77      0.77      0.77       980
   macro avg       0.39      0.50      0.44       980
weighted avg       0.60      0.77      0.68       980



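- With the Gaussian kernel, accuracy is 0.7745 for every combination except C=1000 with gamma=0.001, where it is 0.7735. Recall for the True class is 0.00 in every run, so this grid does not single out a best setting.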

# Exercise 6.5

Compare the results with other methods
 - Comparison with a logistic regression model

#### RED WINE

In [42]:
logreg = LogisticRegression(solver='liblinear',C=1e9).fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))  
print(classification_report(y_test,y_pred))

Accuracy: 0.84375
[[256   9]
 [ 41  14]]
              precision    recall  f1-score   support

       False       0.86      0.97      0.91       265
        True       0.61      0.25      0.36        55

   micro avg       0.84      0.84      0.84       320
   macro avg       0.74      0.61      0.64       320
weighted avg       0.82      0.84      0.82       320



#### WHITE WINE

In [43]:
logreg_ = LogisticRegression(solver='liblinear',C=1e9).fit(X_train_, y_train_)
y_pred_ = logreg_.predict(X_test_)
print("Accuracy:", accuracy_score(y_test_, y_pred_))
print(confusion_matrix(y_test_, y_pred_))  
print(classification_report(y_test_, y_pred_))

Accuracy: 0.7959183673469388
[[719  40]
 [160  61]]
              precision    recall  f1-score   support

       False       0.82      0.95      0.88       759
        True       0.60      0.28      0.38       221

   micro avg       0.80      0.80      0.80       980
   macro avg       0.71      0.61      0.63       980
weighted avg       0.77      0.80      0.77       980



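- Logistic regression behaves similarly for the two wine types. Its accuracy (0.84 for red, 0.80 for white) is higher than that of the linear-kernel SVM (0.83 and 0.77), and unlike that SVM it identifies some high-quality wines (recall 0.25 and 0.28 for True). For red wine, the degree-8 polynomial SVM is marginally more accurate (0.85).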

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continuous)

In [44]:
# remove categorical features
#data.drop(['type', 'quality'], axis=1, inplace=True)
#data.dropna(inplace=True)
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,False,white,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,False,white,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,False,white,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6


In [45]:
X = data[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
y = data['quality2']

In [46]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

X_train = X_train.astype(float)
X_test = X_test.astype(float)

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

In [47]:
linreg = LinearRegression().fit(X_train, y_train)
y_pred = linreg.predict(X_test)

In [48]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,False,white,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,False,white,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,False,white,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6


* Analyze the coefficients

In [49]:
list(zip(linreg.coef_, X.columns))

[(0.10170329402356523, 'fixed acidity'),
 (-0.22413539715698688, 'volatile acidity'),
 (-0.029335163607170103, 'citric acid'),
 (0.2137163716272402, 'residual sugar'),
 (-0.015114269037877923, 'chlorides'),
 (0.0929454967849375, 'free sulfur dioxide'),
 (-0.13235220174879392, 'total sulfur dioxide'),
 (-0.17189226472924307, 'density'),
 (0.07299203681413294, 'pH'),
 (0.11315602615308655, 'sulphates'),
 (0.3089344513324735, 'alcohol')]

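- The coefficients show that alcohol has the largest effect on predicted quality (0.31), followed by volatile acidity (-0.22) and residual sugar (0.21). Because the features were standardized, the coefficient magnitudes are directly comparable.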
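
To make the ranking explicit, the sketch below sorts the linear-regression coefficients listed above by absolute value. The values are copied from that output and rounded to four decimals; the names `coefs` and `ranked` are only illustrative.

```python
# Coefficients from the linear regression output above, rounded to 4 decimals.
coefs = {
    "fixed acidity": 0.1017,
    "volatile acidity": -0.2241,
    "citric acid": -0.0293,
    "residual sugar": 0.2137,
    "chlorides": -0.0151,
    "free sulfur dioxide": 0.0929,
    "total sulfur dioxide": -0.1324,
    "density": -0.1719,
    "pH": 0.0730,
    "sulphates": 0.1132,
    "alcohol": 0.3089,
}

# Largest absolute coefficient first: on standardized features this is
# the variable whose one-standard-deviation change moves the prediction most.
ranked = sorted(coefs.items(), key=lambda kv: abs(kv[1]), reverse=True)

for name, value in ranked[:3]:
    print(name, value)
```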

- Evaluate the RMSE

In [50]:
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7170517385288945


The RMSE is the square root of the mean squared difference between the observed and predicted values, so it is expressed in the same units as the quality score.
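
A small worked example of the RMSE computation on made-up numbers (not the wine predictions), mirroring the `np.sqrt(metrics.mean_squared_error(...))` call above step by step.

```python
import numpy as np

# Made-up observed and predicted quality scores, for illustration only.
y_true = np.array([5.0, 6.0, 7.0, 5.0])
y_hat = np.array([5.5, 6.0, 6.0, 5.5])

errors = y_true - y_hat            # [-0.5, 0.0, 1.0, -0.5]
mse = np.mean(errors ** 2)         # (0.25 + 0 + 1 + 0.25) / 4 = 0.375
rmse = np.sqrt(mse)                # square root brings it back to quality units

print(mse, rmse)
```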

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [51]:
# alpha=0.1
# normalize=True is only accepted by scikit-learn < 1.2, where the parameter was removed
ridgereg = Ridge(alpha=0.1, normalize=True).fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7203295233692668


- The RMSE (0.7203) is higher than with linear regression (0.7171)

In [52]:
list(zip(ridgereg.coef_, X.columns))

[(0.04515783954940397, 'fixed acidity'),
 (-0.19688630741690105, 'volatile acidity'),
 (-0.007611887096904056, 'citric acid'),
 (0.12686013885823458, 'residual sugar'),
 (-0.033507390157222934, 'chlorides'),
 (0.07082333973190955, 'free sulfur dioxide'),
 (-0.09749806688977407, 'total sulfur dioxide'),
 (-0.0907534373608146, 'density'),
 (0.0402666874059643, 'pH'),
 (0.09542033889115942, 'sulphates'),
 (0.30343999238743563, 'alcohol')]

In [53]:
# try alpha=1
ridgereg = Ridge(alpha=1, normalize=True).fit(X_train, y_train)
y_pred = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7607058336527054


- The RMSE (0.7607) is again higher than with the linear regression (0.7171).

In [54]:
list(zip(ridgereg.coef_, data.columns))

[(0.001812758444931827, 'fixed acidity'),
 (-0.09719105501471262, 'volatile acidity'),
 (0.021642682840787686, 'citric acid'),
 (0.027157607507033936, 'residual sugar'),
 (-0.044099129702806175, 'chlorides'),
 (0.02306459978193872, 'free sulfur dioxide'),
 (-0.03276076849473246, 'total sulfur dioxide'),
 (-0.06700209874471388, 'density'),
 (0.013208758863247141, 'pH'),
 (0.04445544520651518, 'sulphates'),
 (0.1621037278968578, 'alcohol')]

- With ridge regression the RMSE increases relative to the linear regression, while the coefficients shrink toward zero, more strongly for alpha=1 than for alpha=0.1.
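The shrinkage can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions: `X_demo`, `y_demo` and the alpha grid are made up, and a `StandardScaler` pipeline stands in for `normalize=True` on scikit-learn versions without that argument. The two scale the features differently, so the alpha values here are not comparable with the ones used above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for illustration, not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 5))
true_coef = np.array([1.0, -0.5, 0.0, 0.25, 2.0])
y_demo = X_demo @ true_coef + rng.normal(scale=0.5, size=200)

# Unpenalized baseline; model[-1] is the final estimator of the pipeline.
ols = make_pipeline(StandardScaler(), LinearRegression()).fit(X_demo, y_demo)
norms = {"ols": np.linalg.norm(ols[-1].coef_)}

# A larger alpha means a stronger L2 penalty and a smaller coefficient vector.
for a in (1, 100, 10000):
    model = make_pipeline(StandardScaler(), Ridge(alpha=a)).fit(X_demo, y_demo)
    norms[a] = np.linalg.norm(model[-1].coef_)
    print(a, model[-1].coef_.round(3))
```

The coefficient norm decreases as alpha grows, but ridge generally leaves the coefficients non-zero rather than setting them exactly to zero.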

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [55]:
lassoreg = Lasso(alpha=0.01, normalize=True).fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)
list(zip(lassoreg.coef_, data.columns))

[(-0.0, 'fixed acidity'),
 (-0.0, 'volatile acidity'),
 (0.0, 'citric acid'),
 (-0.0, 'residual sugar'),
 (-0.0, 'chlorides'),
 (0.0, 'free sulfur dioxide'),
 (-0.0, 'total sulfur dioxide'),
 (-0.0, 'density'),
 (0.0, 'pH'),
 (0.0, 'sulphates'),
 (0.0, 'alcohol')]

In [56]:
lassoreg = Lasso(alpha=0.1, normalize=True).fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)
list(zip(lassoreg.coef_, data.columns))

[(-0.0, 'fixed acidity'),
 (-0.0, 'volatile acidity'),
 (0.0, 'citric acid'),
 (-0.0, 'residual sugar'),
 (-0.0, 'chlorides'),
 (0.0, 'free sulfur dioxide'),
 (-0.0, 'total sulfur dioxide'),
 (-0.0, 'density'),
 (0.0, 'pH'),
 (0.0, 'sulphates'),
 (0.0, 'alcohol')]

In [57]:
lassoreg = Lasso(alpha=1, normalize=True).fit(X_train, y_train)
y_pred = lassoreg.predict(X_test)
list(zip(lassoreg.coef_, data.columns))

[(-0.0, 'fixed acidity'),
 (-0.0, 'volatile acidity'),
 (0.0, 'citric acid'),
 (-0.0, 'residual sugar'),
 (-0.0, 'chlorides'),
 (0.0, 'free sulfur dioxide'),
 (-0.0, 'total sulfur dioxide'),
 (-0.0, 'density'),
 (0.0, 'pH'),
 (0.0, 'sulphates'),
 (0.0, 'alcohol')]

- With lasso regression at alpha = 0.01, 0.1 and 1, every coefficient is shrunk to exactly zero, so the coefficients do not vary across these values and no variable is retained as a predictor of wine quality. A model with all-zero coefficients predicts only the intercept, which is why the RMSE is the same (0.8709) for all three values of alpha and higher than that of the linear regression (0.7171).

In [58]:
alpha=[0.01,0.1,1]

In [59]:
for a in alpha:
    lassoreg = Lasso(alpha=a, normalize=True).fit(X_train, y_train)
    y_pred = lassoreg.predict(X_test)
    print("Parametro 'alpha'=", a)
    print("RMSE", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

Parametro 'alpha'= 0.01
RMSE 0.8709345888926285
Parametro 'alpha'= 0.1
RMSE 0.8709345888926285
Parametro 'alpha'= 1
RMSE 0.8709345888926285
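Identical RMSE values are what one would expect once the penalty zeroes every coefficient: the fitted model reduces to its intercept, which is the mean of the training target. A minimal sketch on synthetic data follows. `X_demo`, `y_demo` and the two alpha values are assumptions chosen to show both regimes, not settings from this notebook.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data (an assumption for illustration, not the wine data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([1.0, -0.5, 0.25]) + rng.normal(scale=0.5, size=200)

# A penalty large enough to zero out every coefficient.
strong = Lasso(alpha=10.0).fit(X_demo, y_demo)
# A milder penalty that keeps at least the strongest feature.
mild = Lasso(alpha=0.1).fit(X_demo, y_demo)

print("strong:", strong.coef_, strong.intercept_, y_demo.mean())
print("mild:  ", mild.coef_)

# With all-zero coefficients every prediction equals the intercept.
constant_pred = strong.predict(X_demo)
```

Trying smaller alpha values is the usual way to find the range where lasso keeps some coefficients and drops others.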


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [60]:
X = data[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH','sulphates','alcohol']]
y = data['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

X_train = X_train.astype(float)
X_test = X_test.astype(float)

scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)


In [61]:
logreg = LogisticRegression(solver='liblinear',C=1e9).fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test,y_pred))

Accuracy: 0.8178461538461539
[[1237   53]
 [ 243   92]]


In [62]:
print(logreg.coef_)

[[ 0.59312348 -0.53570656 -0.07113704  0.83529613 -0.25721834  0.19944132
  -0.34490719 -0.89698585  0.39935637  0.3527559   0.65706125]]


- Looking at the coefficients, which are comparable in magnitude because the features were scaled, density (-0.897) and residual sugar (0.835) have the largest absolute values, followed by alcohol (0.657).
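One way to read the coefficient vector printed above is to pair it with the feature names and sort by absolute value. The sketch below copies the printed values. The odds-ratio column is the exponential of each coefficient and is per one unit of the scaled feature, which equals one standard deviation only if `scaler` is a `StandardScaler` (an assumption, since its definition is not shown here).

```python
import math

features = ['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
            'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
            'pH', 'sulphates', 'alcohol']
# Values copied from the printed logreg.coef_ above.
coefs = [0.59312348, -0.53570656, -0.07113704, 0.83529613, -0.25721834, 0.19944132,
         -0.34490719, -0.89698585, 0.39935637, 0.3527559, 0.65706125]

# Sort by magnitude so the sign does not hide strong negative effects.
ranked = sorted(zip(features, coefs), key=lambda fc: abs(fc[1]), reverse=True)
for name, c in ranked:
    print(f"{name:22s} {c:+.3f}  odds ratio: {math.exp(c):.2f}")
```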

In [63]:
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

       False       0.84      0.96      0.89      1290
        True       0.63      0.27      0.38       335

   micro avg       0.82      0.82      0.82      1625
   macro avg       0.74      0.62      0.64      1625
weighted avg       0.79      0.82      0.79      1625



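The F1-score differs sharply between the two classes: 0.89 for the False class but only 0.38 for the True class, where recall is 0.27. Despite an accuracy of 0.82, the model identifies only about a quarter of the wines labelled True.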
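The True-class row of the report can be recomputed from the confusion matrix printed earlier (`[[1237 53] [243 92]]`), where rows are the true classes and columns the predicted ones. This is only a sanity check of the definitions, not a new result.

```python
# Counts copied from the confusion matrix above (rows: true class, columns: prediction).
tn, fp = 1237, 53
fn, tp = 243, 92

precision = tp / (tp + fp)   # share of predicted True that are correct
recall = tp / (tp + fn)      # share of actual True that are found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two
print(round(precision, 2), round(recall, 2), round(f1, 2))  # prints 0.63 0.27 0.38
```

The low recall is what pulls the F1-score for the True class down to 0.38.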

In [64]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,quality2
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,False,white,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,False,white,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,False,white,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,False,white,6


# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the f1score

In [65]:
C=[0.01, 0.1, 1]
penalty = ['l1', 'l2']

In [66]:
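for c in C:
    for p in penalty:
        logreg = LogisticRegression(C=c, penalty=p, solver='liblinear').fit(X_train, y_train)
        # Predict with the model fitted in this iteration; reusing stored predictions
        # would print identical metrics for every setting.
        y_pred = logreg.predict(X_test)
        print("Accuracy:", accuracy_score(y_test, y_pred))
        print("Parametro 'C'=", c, "y Penalty", p)
        print(classification_report(y_test, y_pred))
        print("Coeficientes:", logreg.coef_)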

Accuracy: 0.7959183673469388
Parametro 'C'= 0.01 y Penalty l1
              precision    recall  f1-score   support

       False       0.82      0.95      0.88       759
        True       0.60      0.28      0.38       221

   micro avg       0.80      0.80      0.80       980
   macro avg       0.71      0.61      0.63       980
weighted avg       0.77      0.80      0.77       980

Coeficientes: [[ 0.         -0.26203498  0.          0.          0.          0.
   0.          0.          0.          0.01822901  0.77118696]]
Accuracy: 0.7959183673469388
Parametro 'C'= 0.01 y Penalty l2
              precision    recall  f1-score   support

       False       0.82      0.95      0.88       759
        True       0.60      0.28      0.38       221

   micro avg       0.80      0.80      0.80       980
   macro avg       0.71      0.61      0.63       980
weighted avg       0.77      0.80      0.77       980

Coeficientes: [[ 0.16983683 -0.33357089  0.02004888  0.2713157  -0.2104919   0

Accuracy: 0.7959183673469388
Parametro 'C'= 0.1 y Penalty l2
              precision    recall  f1-score   support

       False       0.82      0.95      0.88       759
        True       0.60      0.28      0.38       221

   micro avg       0.80      0.80      0.80       980
   macro avg       0.71      0.61      0.63       980
weighted avg       0.77      0.80      0.77       980

Coeficientes: [[ 0.4375043  -0.51431614 -0.04996592  0.62614408 -0.28064981  0.18325519
  -0.30025829 -0.60993186  0.30447318  0.30950057  0.73280542]]
Accuracy: 0.7959183673469388
Parametro 'C'= 1 y Penalty l1
              precision    recall  f1-score   support

       False       0.82      0.95      0.88       759
        True       0.60      0.28      0.38       221

   micro avg       0.80      0.80      0.80       980
   macro avg       0.71      0.61      0.63       980
weighted avg       0.77      0.80      0.77       980

Coeficientes: [[ 0.56080706 -0.53375643 -0.06464439  0.79447169 -0.2591976

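- The F1-score in the recorded output does not change across the values of `C` and `penalty`. The accuracy (0.7959) and the reports are identical to every digit, and their support (980) differs from the 1625 test rows of Exercise 6.9, which indicates that the recorded metrics were computed from stored predictions rather than from each newly fitted model, so they cannot be used to compare the settings. The coefficients do come from each fitted model and vary with the parameters: with C = 0.01 and the l1 penalty only volatile acidity, sulphates and alcohol keep non-zero coefficients, and in the outputs shown in full alcohol has the largest absolute coefficient.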
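To compare the settings on predictions from each fitted model, `predict` has to be called inside the loop. The sketch below shows that pattern on a synthetic, imbalanced dataset. `make_classification`, the `_demo` names and the `results` dictionary are assumptions for illustration, so the numbers it prints say nothing about the wine data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for the wine data (an assumption for illustration).
X_demo, y_demo = make_classification(n_samples=2000, n_features=11, n_informative=5,
                                     weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

# Fit the scaler on the training split only, then apply it to both splits.
scaler_demo = StandardScaler().fit(X_tr)
X_tr, X_te = scaler_demo.transform(X_tr), scaler_demo.transform(X_te)

results = {}
for c in (0.01, 0.1, 1):
    for p in ("l1", "l2"):
        model = LogisticRegression(C=c, penalty=p, solver="liblinear").fit(X_tr, y_tr)
        y_hat = model.predict(X_te)  # predictions from the model fitted in this iteration
        results[(c, p)] = (f1_score(y_te, y_hat), np.count_nonzero(model.coef_))
        print(c, p, round(results[(c, p)][0], 3), results[(c, p)][1])
```

Storing the F1-score next to the number of non-zero coefficients makes it easier to see how much sparsity the l1 penalty buys at each value of `C`.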