# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends in part on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, the presence of sugar, and other chemical properties. For the wine market, it would be of interest if human tasting quality could be related to the chemical properties of wine, so that the certification, quality assessment, and assurance processes are more controlled.

Two datasets are available: one on red wine, with 1599 different varieties, and one on white wine, with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties of the wines, including density, acidity, alcohol content, etc. All chemical properties of the wines are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters, and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price to expect for their produce, without heavy reliance on the volatility of wine tasters.

In [2]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')
#DataFrame.append was removed in pandas 2.0; pd.concat is the equivalent
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.tail(5)
data.shape

(6497, 13)

# Exercise 6.1

Show the frequency table of quality by type of wine.

In [4]:
pd.crosstab(index=data["type"],     
                      columns=[data["quality"]])

| type \ quality | 3 | 4 | 5 | 6 | 7 | 8 | 9 |
|---|---|---|---|---|---|---|---|
| red | 10 | 53 | 681 | 638 | 199 | 18 | 0 |
| white | 20 | 163 | 1457 | 2198 | 880 | 175 | 5 |


Next, we manually classify each wine as good or bad in both datasets, the white wines and the red wines.

In [5]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')
data_r = data_r.assign(type = 'red')
data_w = data_w.assign(type = 'white')

In [6]:
data_r.loc[(data_r['quality'] >= 7) & (data_r['quality'] <= 9), 'quality_2'] = 1
data_r.loc[(data_r['quality'] >=3) & (data_r['quality'] <= 6), 'quality_2'] = 0
#data_r
data_w.loc[(data_w['quality'] >= 7) & (data_w['quality'] <= 9), 'quality_2'] = 1
data_w.loc[(data_w['quality'] >=3) & (data_w['quality'] <= 6), 'quality_2'] = 0
#data_w

data_r=data_r.drop(['quality'], axis = 1)
data_r.rename(columns={'quality_2':'quality'}, inplace=True)
data_r['quality'] = data_r['quality'].astype('category')
#data_r

data_w=data_w.drop(['quality'], axis = 1)
data_w.rename(columns={'quality_2':'quality'}, inplace=True)
data_w['quality'] = data_w['quality'].astype('category')
#data_w
data_w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 13 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
type                    4898 non-null object
quality                 4898 non-null category
dtypes: category(1), float64(11), object(1)
memory usage: 464.1+ KB
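The relabeling above can also be written in a single step. This is a minimal sketch, assuming the same threshold (quality of 7 or more counts as good). The frame `toy` and the column name `good` are made up for illustration and are not part of the original notebook. Because the observed quality values only range from 3 to 9, a single comparison gives the same labels as the two range conditions.

```python
import pandas as pd

# Hypothetical toy frame standing in for data_r / data_w.
toy = pd.DataFrame({"quality": [3, 5, 6, 7, 8, 9]})

# 1 if quality >= 7 (good), 0 otherwise (bad): the same threshold used above.
toy["good"] = (toy["quality"] >= 7).astype(int)

print(toy["good"].tolist())
```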


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs for the white and red wines, respectively.


In [7]:
#Create a binary target for each type of wine

X_r = data_r.drop(['quality', 'type'], axis = 1)
y_r = data_r['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_r = LabelEncoder()
y_r = labelencoder_y_r.fit_transform(y_r)
y_r

X_w = data_w.drop(['quality', 'type'], axis = 1)
y_w = data_w['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_w = LabelEncoder()
y_w = labelencoder_y_w.fit_transform(y_w)
y_w


array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [8]:
#Test & Train datasets
#sklearn.cross_validation was removed from scikit-learn; train_test_split lives in model_selection
from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.3, random_state = 0)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size = 0.3, random_state = 0)



In [9]:
#Standardize the features (not the quality)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_r = sc.fit_transform(X_train_r)
X_test_r = sc.transform(X_test_r)

X_train_w = sc.fit_transform(X_train_w)
X_test_w = sc.transform(X_test_w)
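The cell above refits the same `sc` object on the white wines, so the red-wine scaling parameters are no longer stored once it finishes. One way to keep each scaler attached to its model is a scikit-learn pipeline. The sketch below is illustrative only: it uses synthetic data from `make_classification` as a stand-in for one wine dataset, and the names `model` and `test_accuracy` are assumptions, not part of the original notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for one wine dataset (11 features, binary target).
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# The pipeline fits the scaler on the training data only and reuses
# those means and variances when it transforms the test data.
model = make_pipeline(StandardScaler(), SVC(kernel="linear", random_state=0))
model.fit(X_train, y_train)

test_accuracy = model.score(X_test, y_test)
print(test_accuracy)
```

Building one pipeline per wine type would keep the red and white scalers separate, and passing the pipeline to `cross_val_score` would refit the scaler inside each fold.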

In [10]:
#Create two linear SVMs for the white and red wines, respectively.

from sklearn.svm import SVC
classifier_r = SVC(kernel = 'linear', random_state = 0)
classifier_r.fit(X_train_r, y_train_r)

classifier_w = SVC(kernel = 'linear', random_state = 0)
classifier_w.fit(X_train_w, y_train_w)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=0, shrinking=True,
  tol=0.001, verbose=False)

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [11]:
from sklearn.model_selection import cross_val_score

#red wines
    #rbf
classifier_r_rbf = SVC(kernel = 'rbf', random_state = 0)
classifier_r_rbf.fit(X_train_r, y_train_r)
y_pred_r_rbf = classifier_r_rbf.predict(X_test_r)

accuracies_r_rbf = cross_val_score(estimator = classifier_r_rbf, X = X_train_r,
                             y = y_train_r, cv = 10)

    #poly
classifier_r_poly = SVC(kernel = 'poly', random_state = 0)
classifier_r_poly.fit(X_train_r, y_train_r)
y_pred_r_poly = classifier_r_poly.predict(X_test_r)

accuracies_r_poly = cross_val_score(estimator = classifier_r_poly, X = X_train_r,
                             y = y_train_r, cv = 10)

   #sigmoid
classifier_r_sig = SVC(kernel = 'sigmoid', random_state = 0)
classifier_r_sig.fit(X_train_r, y_train_r)
y_pred_r_sig = classifier_r_sig.predict(X_test_r)

accuracies_r_sig = cross_val_score(estimator = classifier_r_sig, X = X_train_r,
                             y = y_train_r, cv = 10)

#white wines 
    #rbf
classifier_w_rbf = SVC(kernel = 'rbf', random_state = 0)
classifier_w_rbf.fit(X_train_w, y_train_w)
y_pred_w_rbf = classifier_w_rbf.predict(X_test_w)

accuracies_w_rbf = cross_val_score(estimator = classifier_w_rbf, X = X_train_w,
                             y = y_train_w, cv = 10)

    #poly
classifier_w_poly = SVC(kernel = 'poly', random_state = 0)
classifier_w_poly.fit(X_train_w, y_train_w)
y_pred_w_poly = classifier_w_poly.predict(X_test_w)

accuracies_w_poly = cross_val_score(estimator = classifier_w_poly, X = X_train_w,
                             y = y_train_w, cv = 10)

   #sigmoid
classifier_w_sig = SVC(kernel = 'sigmoid', random_state = 0)
classifier_w_sig.fit(X_train_w, y_train_w)
y_pred_w_sig = classifier_w_sig.predict(X_test_w)

accuracies_w_sig = cross_val_score(estimator = classifier_w_sig, X = X_train_w,
                             y = y_train_w, cv = 10)


print("Results:")
print("\n")
print("Accuracy score of the SVM with rbf kernel on the red wine dataset :"+" "+str(accuracies_r_rbf.mean()))
print("Accuracy score of the SVM with poly kernel on the red wine dataset :"+" "+str(accuracies_r_poly.mean()))
print("Accuracy score of the SVM with sigmoid kernel on the red wine dataset :"+" "+str(accuracies_r_sig.mean()))
print("\n")
print("Accuracy score of the SVM with rbf kernel on the white wine dataset :"+" "+str(accuracies_w_rbf.mean()))
print("Accuracy score of the SVM with poly kernel on the white wine dataset :"+" "+str(accuracies_w_poly.mean()))
print("Accuracy score of the SVM with sigmoid kernel on the white wine dataset :"+" "+str(accuracies_w_sig.mean()))

Results:


Accuracy score of the SVM with rbf kernel on the red wine dataset : 0.8678281710914455
Accuracy score of the SVM with poly kernel on the red wine dataset : 0.8597762269222446
Accuracy score of the SVM with sigmoid kernel on the red wine dataset : 0.8266836368606281


Accuracy score of the SVM with rbf kernel on the white wine dataset : 0.831992931228562
Accuracy score of the SVM with poly kernel on the white wine dataset : 0.8054282151748617
Accuracy score of the SVM with sigmoid kernel on the white wine dataset : 0.7403666280873835


Thus, the SVM with an rbf kernel achieves the best cross-validated accuracy on both datasets.
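The six nearly identical blocks in the cell above could be condensed into a loop over kernel names. The following is a sketch on synthetic data (`make_classification` stands in for a standardized training set), and the dictionary `mean_accuracy` is a name introduced here for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for a standardized training set.
X, y = make_classification(n_samples=200, n_features=11, random_state=0)

# Mean 10-fold cross-validated accuracy for each kernel.
mean_accuracy = {}
for kernel in ["rbf", "poly", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel, random_state=0), X, y, cv=10)
    mean_accuracy[kernel] = scores.mean()

for kernel, acc in mean_accuracy.items():
    print(kernel, round(acc, 3))
```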

Another option is the GridSearchCV class. GridSearchCV behaves like any other classifier, except that the parameters of the underlying classifier are optimized by cross-validation. Below we use it on the white wine dataset and observe that the accuracy score is essentially the same.

In [12]:
from sklearn.model_selection import GridSearchCV
classifier_w = SVC(random_state = 0)
parameters_w = [{'kernel': ['rbf', 'poly', 'sigmoid']}]
grid_search_w = GridSearchCV(estimator = classifier_w,
                           param_grid = parameters_w,
                           scoring = 'accuracy',
                           cv = 10,)
grid_search_w.fit(X_train_w, y_train_w)
best_accuracy_w = grid_search_w.best_score_
best_parameters_w = grid_search_w.best_params_
#here is the best accuracy 
best_accuracy_w

0.8319719953325554

In [13]:
#and here is best parameters
best_parameters_w

{'kernel': 'rbf'}

# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [14]:
#Grid search for best model and parameters

    #red wines
parameters_r = [{'C': [0.1,1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]
grid_search_r = GridSearchCV(estimator = classifier_r_rbf,
                           param_grid = parameters_r,
                           scoring = 'accuracy',
                           cv = 10,)
grid_search_r.fit(X_train_r, y_train_r)
best_accuracy_r = grid_search_r.best_score_
best_parameters_r = grid_search_r.best_params_
#here is the best accuracy 
best_accuracy_r

0.872207327971403

In [15]:
#and here is best parameters
best_parameters_r

{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}

In [16]:
    #white wines
parameters_w = [{'C': [0.1,1, 10, 100, 1000], 'kernel': ['rbf'],
               'gamma': [0.01, 0.001, 0.0001]}]
grid_search_w = GridSearchCV(estimator = classifier_w_rbf,
                           param_grid = parameters_w,
                           scoring = 'accuracy',
                           cv = 10,)
grid_search_w.fit(X_train_w, y_train_w)
best_accuracy_w = grid_search_w.best_score_
best_parameters_w = grid_search_w.best_params_
#here is the best accuracy 
best_accuracy_w

0.8281796966161027

In [17]:
#and here is best parameters
best_parameters_w

{'C': 1000, 'gamma': 0.01, 'kernel': 'rbf'}

Thus, in both cases the best performance is obtained with C=1000, gamma=0.01 and an rbf kernel.
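Note that `best_score_` is a cross-validated accuracy computed on the training folds, and that both selected values (C=1000 and gamma=0.01) sit at the edge of the search grid, so a wider grid might be worth trying. The sketch below shows one way to also check a tuned model on the held-out test split. It runs on synthetic data with a smaller grid and 5 folds to keep it quick; those choices and the name `search` are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for one standardized wine dataset.
X, y = make_classification(n_samples=300, n_features=11, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

param_grid = {"C": [0.1, 1, 10], "gamma": [0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, scoring="accuracy", cv=5)
search.fit(X_train, y_train)

# best_score_ is the mean cross-validated accuracy on the training data.
print(search.best_params_, search.best_score_)

# score() evaluates the refitted best model on data it has not seen.
print(search.score(X_test, y_test))
```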

# Exercise 6.5

Compare the results with other methods

We fit a logistic regression model and then compare its accuracy with that of the SVM models from Exercise 6.4.

In [18]:
#Logistic regression in red wine dataset

from sklearn.linear_model import LogisticRegression
logreg_r = LogisticRegression(solver='liblinear',C=1e9)
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
logreg_cv=GridSearchCV(logreg_r,grid,cv=10)
logreg_cv.fit(X_train_r,y_train_r)
                                        
best_accuracy_log_r = logreg_cv.best_score_
best_parameters_log_r = logreg_cv.best_params_         
                    
#here is the best accuracy 
best_accuracy_log_r 

0.8695263628239499

In [19]:
#and here is best parameters
best_parameters_log_r 

{'C': 0.01, 'penalty': 'l2'}

In [20]:
#Logistic regression in white wine dataset
logreg_w = LogisticRegression(solver='liblinear',C=1e9)
grid={"C":np.logspace(-3,3,7), "penalty":["l1","l2"]}
logreg_cv=GridSearchCV(logreg_w,grid,cv=10)
logreg_cv.fit(X_train_w,y_train_w)
                                        
best_accuracy_log_w = logreg_cv.best_score_
best_parameters_log_w = logreg_cv.best_params_         
                    
#here is the best accuracy 
best_accuracy_log_w 

0.8060093348891482

In [21]:
#and here is best parameters
best_parameters_log_w 

{'C': 0.01, 'penalty': 'l2'}

Therefore, in both cases the cross-validated accuracy is higher with the SVM model than with logistic regression (0.872 vs. 0.870 for the red wines, 0.828 vs. 0.806 for the white wines).

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [22]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')
data_w.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
fixed acidity           4898 non-null float64
volatile acidity        4898 non-null float64
citric acid             4898 non-null float64
residual sugar          4898 non-null float64
chlorides               4898 non-null float64
free sulfur dioxide     4898 non-null float64
total sulfur dioxide    4898 non-null float64
density                 4898 non-null float64
pH                      4898 non-null float64
sulphates               4898 non-null float64
alcohol                 4898 non-null float64
quality                 4898 non-null int64
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


In [23]:
#Sets X and y
X_r = data_r.drop(['quality'], axis = 1)
y_r = data_r['quality']

X_w = data_w.drop(['quality'], axis = 1)
y_w = data_w['quality']

In [24]:
#Test & Train data

from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.3, random_state = 0)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size = 0.3, random_state = 0)

In [25]:
#Standardize the features (not the quality)

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train_r = sc.fit_transform(X_train_r)
X_test_r = sc.transform(X_test_r)

X_train_w = sc.fit_transform(X_train_w)
X_test_w = sc.transform(X_test_w)

In [38]:
#Train a linear regression to predict wine quality (continuous).
#red wines
from sklearn.linear_model import LinearRegression
linreg_r = LinearRegression()
linreg_r.fit(X_train_r, y_train_r)
print(linreg_r.coef_)
y_pred_r = linreg_r.predict(X_test_r)

#white wines
linreg_w = LinearRegression()
linreg_w.fit(X_train_w, y_train_w)
print(linreg_w.coef_)
y_pred_w = linreg_w.predict(X_test_w)
       

[ 0.03523454 -0.22500299 -0.01917881  0.03301609 -0.09404393  0.02142016
 -0.10287699 -0.03175893 -0.06166729  0.15260506  0.28471337]
[ 0.09636772 -0.18212585 -0.00178702  0.48284855 -0.00936514  0.0871764
 -0.01019314 -0.5914199   0.13181841  0.07656047  0.15829533]


In [39]:
#Analyze the coefficients

The results show the following (each statement holds with the other features held fixed):
1. In both cases, an increase in fixed acidity is associated with an increase in wine quality.
2. In both cases, an increase in volatile acidity is associated with a decrease in wine quality.
3. In both cases, an increase in citric acid is associated with a decrease in wine quality.
4. In both cases, an increase in residual sugar is associated with an increase in wine quality.
5. In both cases, an increase in chlorides is associated with a decrease in wine quality.
6. In both cases, an increase in free sulfur dioxide is associated with an increase in wine quality.
7. In both cases, an increase in total sulfur dioxide is associated with a decrease in wine quality.
8. In both cases, an increase in density is associated with a decrease in wine quality.
9. For red wine, an increase in pH is associated with a decrease in wine quality, whereas for white wine an increase in pH is associated with an increase in wine quality.
10. In both cases, an increase in sulphates is associated with an increase in wine quality.
11. In both cases, an increase in alcohol is associated with an increase in wine quality.
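Reading signs off a bare coefficient array is error-prone, so it can help to label the coefficients with their feature names. The sketch below copies the two printed coefficient vectors from the output above and assumes the feature order matches the column order shown by `data_w.info()`. The names `coefs` and `sign_differs` are introduced here for illustration.

```python
import pandas as pd

feature_names = [
    "fixed acidity", "volatile acidity", "citric acid", "residual sugar",
    "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
    "pH", "sulphates", "alcohol",
]

# Coefficients copied from the printed output above.
coef_red = [0.03523454, -0.22500299, -0.01917881, 0.03301609, -0.09404393,
            0.02142016, -0.10287699, -0.03175893, -0.06166729, 0.15260506,
            0.28471337]
coef_white = [0.09636772, -0.18212585, -0.00178702, 0.48284855, -0.00936514,
              0.0871764, -0.01019314, -0.5914199, 0.13181841, 0.07656047,
              0.15829533]

coefs = pd.DataFrame({"red": coef_red, "white": coef_white}, index=feature_names)

# Features whose coefficient has a different sign in the two models.
sign_differs = coefs.index[(coefs["red"] * coefs["white"]) < 0].tolist()

print(coefs)
print(sign_differs)
```

With these values, pH is the only feature whose sign differs between the red and white models, which matches item 9 of the list.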

In [40]:
#Evaluate the RMSE

from sklearn import metrics
import numpy as np

print("RMSE of the linear regression on the red wine dataset :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r))))
print("RMSE of the linear regression on the white wine dataset :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w))))


RMSE of the linear regression on the red wine dataset : 0.6330721652193917
RMSE of the linear regression on the white wine dataset : 0.7797679548193168
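The RMSE is expressed in the same units as the target, so a value of about 0.63 means the predictions are typically off by roughly 0.63 quality points. As a small worked example of the formula, here it is computed by hand with NumPy on made-up values; `y_true` and `y_pred` are illustrative, not taken from the wine data.

```python
import numpy as np

# Made-up true and predicted quality scores.
y_true = np.array([5, 6, 7, 5])
y_pred = np.array([5.5, 6.0, 6.0, 5.5])

# Root of the mean of the squared errors.
rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
print(rmse)
```

The squared errors here are 0.25, 0, 1 and 0.25, so the mean is 0.375 and the RMSE is its square root, about 0.612.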


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [43]:
#Estimate a ridge regression with alpha equals 0.1 and 1.

#red wines
    #alpha 0.1 

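from sklearn.linear_model import Ridge
#Note: the normalize argument was removed in scikit-learn 1.2, so this cell requires an older version
ridgereg_r_01 = Ridge(alpha=0.1, normalize=True)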
ridgereg_r_01.fit(X_train_r, y_train_r)
y_pred_r_01 = ridgereg_r_01.predict(X_test_r)
   
    #alpha 1.0 
ridgereg_r_1 = Ridge(alpha=1, normalize=True)
ridgereg_r_1.fit(X_train_r, y_train_r)
y_pred_r_1 = ridgereg_r_1.predict(X_test_r)


#white wines
    #alpha 0.1 

ridgereg_w_01 = Ridge(alpha=0.1, normalize=True)
ridgereg_w_01.fit(X_train_w, y_train_w)
y_pred_w_01 = ridgereg_w_01.predict(X_test_w)
   
    #alpha 1.0 
ridgereg_w_1 = Ridge(alpha=1, normalize=True)
ridgereg_w_1.fit(X_train_w, y_train_w)
y_pred_w_1 = ridgereg_w_1.predict(X_test_w)


In [48]:
#Compare the coefficients with the linear regression

print(ridgereg_r_01.coef_)
print("\n")
print(ridgereg_r_1.coef_)
print("\n")

print(ridgereg_w_01.coef_)
print("\n")
print(ridgereg_w_1.coef_)

[ 0.04671119 -0.20098627  0.0110126   0.03533417 -0.08771638  0.01206531
 -0.09294553 -0.05696875 -0.03941277  0.14213138  0.24847562]


[ 0.03198728 -0.12515816  0.04854226  0.01505283 -0.05036445 -0.00898519
 -0.05605514 -0.05090195 -0.01553189  0.0819124   0.1500815 ]


[-0.00705278 -0.1649712   0.0011319   0.17096024 -0.03894095  0.0937976
 -0.03558273 -0.15091429  0.04693054  0.04856757  0.31119153]


[-0.02264332 -0.08658794  0.00600274  0.03011663 -0.05449677  0.04654489
 -0.03102156 -0.07614674  0.02447019  0.02576662  0.15879778]


From the above, we can conclude in general that with alpha equal to 0.1 the results are broadly similar to those of the linear regression, whereas with alpha equal to 1 the penalty is larger and, as a result, the coefficients are smaller in magnitude than in the linear regression.
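The shrinking effect can be reproduced on synthetic data. The sketch below compares the Euclidean norm of the coefficient vector for ordinary least squares and for ridge regression at increasing values of alpha. It is illustrative only: it uses `make_regression` rather than the wine data, it does not pass `normalize=True` as the cells above do, and the alpha values are chosen to make the effect visible.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression, Ridge

# Synthetic regression problem with 11 features.
X, y = make_regression(n_samples=200, n_features=11, noise=10.0, random_state=0)

# Norm of the unpenalized coefficient vector.
ols_norm = np.linalg.norm(LinearRegression().fit(X, y).coef_)

# Norm of the ridge coefficient vector for increasing penalties.
ridge_norms = [np.linalg.norm(Ridge(alpha=a).fit(X, y).coef_)
               for a in (1, 100, 10000)]

print(ols_norm, ridge_norms)
```

The norm decreases as alpha grows, while the individual coefficients generally stay nonzero.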

In [45]:
#Evaluate the RMSE

print("RMSE of the ridge regression on the red wine dataset with alpha equal to 0.1 :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r_01))))
print("RMSE of the ridge regression on the red wine dataset with alpha equal to 1 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r_1))))
print("RMSE of the ridge regression on the white wine dataset with alpha equal to 0.1 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w_01))))
print("RMSE of the ridge regression on the white wine dataset with alpha equal to 1 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w_1))))


RMSE of the ridge regression on the red wine dataset with alpha equal to 0.1 : 0.6344194631893529
RMSE of the ridge regression on the red wine dataset with alpha equal to 1 : 0.6569663486954711
RMSE of the ridge regression on the white wine dataset with alpha equal to 0.1 : 0.7797233024279017
RMSE of the ridge regression on the white wine dataset with alpha equal to 1 : 0.8120347448335604


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [46]:
#Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.

#red wines 
    
    #alpha 0.01
    
from sklearn.linear_model import Lasso
lassoreg_r_001 = Lasso(alpha=0.01, normalize=True)
lassoreg_r_001.fit(X_train_r, y_train_r)
y_pred_r_001 = lassoreg_r_001.predict(X_test_r)

    #alpha 0.1

lassoreg_r_01 = Lasso(alpha=0.1, normalize=True)
lassoreg_r_01.fit(X_train_r, y_train_r)
y_pred_r_01 = lassoreg_r_01.predict(X_test_r)

    #alpha 1

lassoreg_r_1 = Lasso(alpha=1, normalize=True)
lassoreg_r_1.fit(X_train_r, y_train_r)
y_pred_r_1 = lassoreg_r_1.predict(X_test_r)

#white wines 
    
    #alpha 0.01
    
lassoreg_w_001 = Lasso(alpha=0.01, normalize=True)
lassoreg_w_001.fit(X_train_w, y_train_w)
y_pred_w_001 = lassoreg_w_001.predict(X_test_w)

    #alpha 0.1

lassoreg_w_01 = Lasso(alpha=0.1, normalize=True)
lassoreg_w_01.fit(X_train_w, y_train_w)
y_pred_w_01 = lassoreg_w_01.predict(X_test_w)

    #alpha 1

lassoreg_w_1 = Lasso(alpha=1, normalize=True)
lassoreg_w_1.fit(X_train_w, y_train_w)
y_pred_w_1 = lassoreg_w_1.predict(X_test_w)

In [47]:
#Compare the coefficients with the linear regression

print(lassoreg_r_001.coef_)
print("\n")
print(lassoreg_r_01.coef_)
print("\n")
print(lassoreg_r_1.coef_)
print("\n")

print(lassoreg_w_001.coef_)
print("\n")
print(lassoreg_w_01.coef_)
print("\n")
print(lassoreg_w_1.coef_)
print("\n")

[ 0.         -0.00601074  0.          0.         -0.         -0.
 -0.         -0.         -0.          0.          0.04489403]


[ 0. -0.  0.  0. -0. -0. -0. -0. -0.  0.  0.]


[ 0. -0.  0.  0. -0. -0. -0. -0. -0.  0.  0.]


[-0. -0. -0. -0. -0.  0. -0. -0.  0.  0.  0.]


[-0. -0. -0. -0. -0.  0. -0. -0.  0.  0.  0.]


[-0. -0. -0. -0. -0.  0. -0. -0.  0.  0.  0.]




The result above shows that even when the penalty is small (alpha equal to 0.01), lasso regression sets the coefficients of almost all variables to 0 (all of them for the white wines), removing them from the model. By contrast, in the linear regression no coefficient equals 0, and in the ridge model the coefficients are small and closer to 0 but never exactly 0.
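The way lasso removes variables can be illustrated by counting nonzero coefficients as alpha grows. This is a sketch on synthetic data, not on the wine data: `make_regression` generates the features, `normalize=True` is not used, and the alpha values are picked so that the largest one is well above the point where every coefficient is driven to zero. The dictionary `nonzero` is a name introduced for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic problem where only 3 of the 11 features carry signal.
X, y = make_regression(n_samples=200, n_features=11, n_informative=3,
                       noise=5.0, random_state=0)

# Number of coefficients that remain nonzero for each penalty strength.
nonzero = {}
for alpha in (0.01, 1.0, 1000.0):
    model = Lasso(alpha=alpha).fit(X, y)
    nonzero[alpha] = int(np.sum(model.coef_ != 0))

print(nonzero)
```

When every coefficient is zero, the model predicts the training mean for all observations, which is why the RMSE stops changing once alpha is large enough.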

In [49]:
#Evaluate the RMSE

print("RMSE of the lasso regression on the red wine dataset with alpha equal to 0.01 :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r_001))))
print("RMSE of the lasso regression on the red wine dataset with alpha equal to 0.1 :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r_01))))
print("RMSE of the lasso regression on the red wine dataset with alpha equal to 1 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_r, y_pred_r_1))))
print("RMSE of the lasso regression on the white wine dataset with alpha equal to 0.01 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w_001))))
print("RMSE of the lasso regression on the white wine dataset with alpha equal to 0.1 :"+" "+str(np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w_01))))
print("RMSE of the lasso regression on the white wine dataset with alpha equal to 1 :"+" "+str (np.sqrt(metrics.mean_squared_error(y_test_w, y_pred_w_1))))


RMSE of the lasso regression on the red wine dataset with alpha equal to 0.01 : 0.7463573531910372
RMSE of the lasso regression on the red wine dataset with alpha equal to 0.1 : 0.7698374031730867
RMSE of the lasso regression on the red wine dataset with alpha equal to 1 : 0.7698374031730867
RMSE of the lasso regression on the white wine dataset with alpha equal to 0.01 : 0.9016157829038678
RMSE of the lasso regression on the white wine dataset with alpha equal to 0.1 : 0.9016157829038678
RMSE of the lasso regression on the white wine dataset with alpha equal to 1 : 0.9016157829038678


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the F1 score

In [28]:
#Create a binary target

data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')
data_r = data_r.assign(type = 'red')
data_w = data_w.assign(type = 'white')
data_r.loc[(data_r['quality'] >= 7) & (data_r['quality'] <= 9), 'quality_2'] = 1
data_r.loc[(data_r['quality'] >=3) & (data_r['quality'] <= 6), 'quality_2'] = 0
#data_r
data_w.loc[(data_w['quality'] >= 7) & (data_w['quality'] <= 9), 'quality_2'] = 1
data_w.loc[(data_w['quality'] >=3) & (data_w['quality'] <= 6), 'quality_2'] = 0
#data_w

data_r=data_r.drop(['quality'], axis = 1)
data_r.rename(columns={'quality_2':'quality'}, inplace=True)
data_r['quality'] = data_r['quality'].astype('category')
#data_r

data_w=data_w.drop(['quality'], axis = 1)
data_w.rename(columns={'quality_2':'quality'}, inplace=True)
data_w['quality'] = data_w['quality'].astype('category')
#data_w

In [29]:
X_r = data_r.drop(['quality', 'type'], axis = 1)
y_r = data_r['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_r = LabelEncoder()
y_r = labelencoder_y_r.fit_transform(y_r)
y_r

X_w = data_w.drop(['quality', 'type'], axis = 1)
y_w = data_w['quality']
from sklearn.preprocessing import LabelEncoder
labelencoder_y_w = LabelEncoder()
y_w = labelencoder_y_w.fit_transform(y_w)
y_w

array([0, 0, 0, ..., 0, 1, 0], dtype=int64)

In [30]:
#Test & Train datasets
from sklearn.model_selection import train_test_split
X_train_r, X_test_r, y_train_r, y_test_r = train_test_split(X_r, y_r, test_size = 0.3, random_state = 0)
X_train_w, X_test_w, y_train_w, y_test_w = train_test_split(X_w, y_w, test_size = 0.3, random_state = 0)


#from sklearn.preprocessing import StandardScaler
#sc = StandardScaler()
#X_train_r = sc.fit_transform(X_train_r)
#X_test_r = sc.transform(X_test_r)

#X_train_w = sc.fit_transform(X_train_w)
#X_test_w = sc.transform(X_test_w)

In [31]:
#Train a logistic regression to predict wine quality (binary)

#red wine
from sklearn.linear_model import LogisticRegression
logreg_r = LogisticRegression(C=1e9,solver='liblinear')
logreg_r.fit(X_train_r, y_train_r)
y_pred_r = logreg_r.predict(X_test_r)


#white wine
logreg_w = LogisticRegression(C=1e9,solver='liblinear')
logreg_w.fit(X_train_w, y_train_w)
y_pred_w = logreg_w.predict(X_test_w)

In [32]:
#Analyze the coefficients

print(logreg_r.coef_)
print("\n")
print(logreg_w.coef_)

[[ 1.07228790e-01 -4.65023271e+00 -2.28456937e-02  1.38909930e-01
  -7.69158260e+00  5.18432033e-03 -1.28377509e-02 -6.84868428e+00
   3.09590278e-01  3.10429491e+00  9.56426157e-01]]


[[ 6.56445277e-02 -3.94049063e+00 -8.99937341e-01  6.07584200e-02
  -1.18131449e+01  1.40757620e-02 -2.85416857e-03 -7.76549929e+00
   1.19536881e+00  1.16976609e+00  9.16254534e-01]]


The coefficients of a logistic regression are not very informative on their own: each one gives the change in the log-odds when the corresponding independent variable increases by one unit. For example, for the red wines, a one-unit increase in fixed acidity implies an increase of 0.107 in the log-odds of the wine being good, holding the other variables fixed. For this reason, the corresponding odds ratio is often used instead of the raw coefficient.
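The odds ratio is obtained by exponentiating the coefficient. As a small worked example, the sketch below does this for the red-wine fixed acidity coefficient printed above. Since the standardization step is commented out in this exercise, one unit here means one unit of the original feature. The variable names are introduced for illustration.

```python
import numpy as np

# Red-wine coefficient for fixed acidity, copied from the output above.
coef_fixed_acidity = 1.07228790e-01

# exp(coefficient) is the odds ratio for a one-unit increase in the feature.
odds_ratio = np.exp(coef_fixed_acidity)
print(round(odds_ratio, 3))
```

The result is about 1.11, meaning the odds of a red wine being classified as good are multiplied by roughly 1.11 for each additional unit of fixed acidity, with the other variables held fixed.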

In [35]:
#Evaluate the f1 score

from sklearn.metrics import f1_score
print("f1 score on the red wine dataset :"+" "+str (f1_score(y_test_r,y_pred_r)))
print("f1 score on the white wine dataset :"+" "+str (f1_score(y_test_w,y_pred_w)))

f1 score on the red wine dataset : 0.39999999999999997
f1 score on the white wine dataset : 0.3257918552036199
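The F1 score is the harmonic mean of precision and recall for the positive class, so it can be low even when accuracy looks reasonable. That is plausible here because good wines are the minority: in the frequency table of Exercise 6.1, 217 of the 1599 red wines and 1060 of the 4898 white wines have quality 7 or higher. The sketch below uses made-up labels to show how the score is assembled; the arrays are illustrative, not taken from the wine data.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

# Made-up labels: 4 positives out of 10, only one of them detected.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 1, 1, 0, 0, 0])

precision = precision_score(y_true, y_pred)  # 1 true positive / 2 predicted positives
recall = recall_score(y_true, y_pred)        # 1 true positive / 4 actual positives

# Harmonic mean of precision and recall.
f1_manual = 2 * precision * recall / (precision + recall)
accuracy = (y_true == y_pred).mean()

print(accuracy, f1_manual, f1_score(y_true, y_pred))
```

In this toy case the accuracy is 0.6 while the F1 score is only about 0.33, because the classifier misses most of the positive class.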


# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the F1 score

In [19]:
#Estimate a regularized logistic regression

#red wines
    #C=0,01, penalty l1 and L2
    
logreg_r_001l1 = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')
logreg_r_001l1.fit(X_train_r, y_train_r)
y_pred_r_001l1 = logreg_r_001l1.predict(X_test_r)

logreg_r_001l2 = LogisticRegression(C=0.01, penalty='l2',solver='liblinear')
logreg_r_001l2.fit(X_train_r, y_train_r)
y_pred_r_001l2 = logreg_r_001l2.predict(X_test_r)


    #C=0,1, penalty l1 and L2
    
logreg_r_01l1 = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg_r_01l1.fit(X_train_r, y_train_r)
y_pred_r_01l1 = logreg_r_01l1.predict(X_test_r)

logreg_r_01l2 = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg_r_01l2.fit(X_train_r, y_train_r)
y_pred_r_01l2 = logreg_r_01l2.predict(X_test_r)


    #C=1, penalty l1 and L2
    
logreg_r_1l1 = LogisticRegression(C=1, penalty='l1',solver='liblinear')
logreg_r_1l1.fit(X_train_r, y_train_r)
y_pred_r_1l1 = logreg_r_1l1.predict(X_test_r)


logreg_r_1l2 = LogisticRegression(C=1, penalty='l2',solver='liblinear')
logreg_r_1l2.fit(X_train_r, y_train_r)
y_pred_r_1l2 = logreg_r_1l2.predict(X_test_r)

    

#white wines
    #C=0,01, penalty l1 and L2
    
logreg_w_001l1 = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')
logreg_w_001l1.fit(X_train_w, y_train_w)
y_pred_w_001l1 = logreg_w_001l1.predict(X_test_w)


logreg_w_001l2 = LogisticRegression(C=0.01, penalty='l2',solver='liblinear')
logreg_w_001l2.fit(X_train_w, y_train_w)
y_pred_w_001l2 = logreg_w_001l2.predict(X_test_w)


    #C=0,1, penalty l1 and L2
    
logreg_w_01l1 = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg_w_01l1.fit(X_train_w, y_train_w)
y_pred_w_01l1 = logreg_w_01l1.predict(X_test_w)


logreg_w_01l2 = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg_w_01l2.fit(X_train_w, y_train_w)
y_pred_w_01l2 = logreg_w_01l2.predict(X_test_w)


    #C=1, penalty l1 and L2
    
logreg_w_1l1 = LogisticRegression(C=1, penalty='l1',solver='liblinear')
logreg_w_1l1.fit(X_train_w, y_train_w)
y_pred_w_1l1 = logreg_w_1l1.predict(X_test_w)


logreg_w_1l2 = LogisticRegression(C=1, penalty='l2',solver='liblinear')
logreg_w_1l2.fit(X_train_w, y_train_w)
y_pred_w_1l2 = logreg_w_1l2.predict(X_test_w)

In [20]:
#Compare the coefficients

print(logreg_r_001l1.coef_)
print("\n")
print(logreg_r_001l2.coef_)
print("\n")
print(logreg_r_01l1.coef_)
print("\n")
print(logreg_r_01l2.coef_)
print("\n")
print(logreg_r_1l1.coef_)
print("\n")
print(logreg_r_1l2.coef_)
print("\n")
print(logreg_w_001l1.coef_)
print("\n")
print(logreg_w_001l2.coef_)
print("\n")
print(logreg_w_01l1.coef_)
print("\n")
print(logreg_w_01l2.coef_)
print("\n")
print(logreg_w_1l1.coef_)
print("\n")
print(logreg_w_1l2.coef_)
print("\n")


[[ 0.         -0.05312673  0.          0.          0.          0.
   0.          0.          0.          0.          0.30036999]]


[[ 0.10581735 -0.24410939  0.13874937  0.08748767 -0.11793492 -0.03857878
  -0.11303616 -0.16796152  0.00272116  0.20888344  0.38480811]]


[[ 0.10454161 -0.69145307  0.          0.04783128 -0.12941093  0.
  -0.20035459  0.          0.          0.36373742  0.88213212]]


[[ 0.33516164 -0.52464543  0.13527128  0.23952232 -0.25557932 -0.01573835
  -0.27817828 -0.36590727  0.10820021  0.46105     0.64638141]]


[[ 0.54489943 -0.73078731  0.04599536  0.33931111 -0.31130653  0.
  -0.37206908 -0.49938641  0.20192802  0.59483627  0.76365044]]


[[ 0.5908664  -0.69332112  0.08077686  0.3629774  -0.32481745  0.01631072
  -0.38890623 -0.56147098  0.24227814  0.60677438  0.72006846]]


[[ 0.         -0.09706075  0.          0.          0.          0.
   0.          0.          0.          0.          0.73745331]]


[[ 0.03536665 -0.22689488 -0.02945548  0.24823684 -0

In general, the stronger the penalty (that is, the smaller the value of C), the smaller the coefficients become compared with those of the logistic regression in exercise 6.9. In addition, with an l1 (lasso) penalty the coefficients of most variables are exactly 0 when the penalty is very strong.
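
The shrinkage and sparsity described above can be checked numerically by counting zero coefficients and summing their absolute values for each setting. This is a sketch under stated assumptions: the data are synthetic (`make_classification`, standardized with `StandardScaler`), not the wine data, and `summary` is a hypothetical name. The exact counts depend on the data, so none are quoted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic, standardized stand-in for the wine features (assumption).
X, y = make_classification(n_samples=500, n_features=11, n_informative=4,
                           random_state=0)
X = StandardScaler().fit_transform(X)

summary = {}
for penalty in ('l1', 'l2'):
    for C in (0.01, 0.1, 1):
        model = LogisticRegression(C=C, penalty=penalty, solver='liblinear')
        coef = model.fit(X, y).coef_.ravel()
        summary[(penalty, C)] = {
            'n_zero': int(np.sum(coef == 0)),       # coefficients set exactly to 0
            'abs_sum': float(np.abs(coef).sum()),   # overall coefficient magnitude
        }
        print(penalty, C, summary[(penalty, C)])
```

With l1, `n_zero` should grow as C shrinks, while with l2 the coefficients typically get smaller without becoming exactly zero.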

In [27]:
# Compare the f1 score

from sklearn.metrics import f1_score
print("F1 score, C=0.01, l1 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_001l1)))
print("F1 score, C=0.01, l2 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_001l2)))
print("\n")
print("F1 score, C=0.1, l1 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_01l1)))
print("F1 score, C=0.1, l2 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_01l2)))
print("\n")
print("F1 score, C=1, l1 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_1l1)))
print("F1 score, C=1, l2 penalty, red wines: " + str(f1_score(y_test_r, y_pred_r_1l2)))
print("-------------------------------------------------------------------------------")
print("\n")


print("F1 score, C=0.01, l1 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_001l1)))
print("F1 score, C=0.01, l2 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_001l2)))
print("\n")
print("F1 score, C=0.1, l1 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_01l1)))
print("F1 score, C=0.1, l2 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_01l2)))
print("\n")
print("F1 score, C=1, l1 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_1l1)))
print("F1 score, C=1, l2 penalty, white wines: " + str(f1_score(y_test_w, y_pred_w_1l2)))


F1 score, C=0.01, l1 penalty, red wines: 0.0
F1 score, C=0.01, l2 penalty, red wines: 0.3950617283950617


F1 score, C=0.1, l1 penalty, red wines: 0.40963855421686746
F1 score, C=0.1, l2 penalty, red wines: 0.43902439024390244


F1 score, C=1, l1 penalty, red wines: 0.4367816091954023
F1 score, C=1, l2 penalty, red wines: 0.4367816091954023
-------------------------------------------------------------------------------


F1 score, C=0.01, l1 penalty, white wines: 0.20460358056265981
F1 score, C=0.01, l2 penalty, white wines: 0.2997658079625293


F1 score, C=0.1, l1 penalty, white wines: 0.3257918552036199
F1 score, C=0.1, l2 penalty, white wines: 0.32286995515695066


F1 score, C=1, l1 penalty, white wines: 0.33185840707964603
F1 score, C=1, l2 penalty, white wines: 0.3274336283185841




The F1 score for the red wine data is higher than for the white wine data at every value of C, with one exception: the l1 model at C=0.01 scores 0.0 on red wines. For red wines, at small values of C (strong regularization) the l2 penalty gives a better F1 score (closer to 1) than the l1 penalty. As C increases and the regularization weakens, the two penalties give similar scores. The same pattern holds for white wines.

By default, the function used in exercise 6.9 applies C=1 with an l2 penalty, so those results should be compared with the matching configuration in this exercise (6.10). For red wines, the F1 score with C=1 and an l2 penalty is about 0.44 here, higher than the 0.39 obtained in exercise 6.9. For white wines, the F1 score is about 0.33 here, essentially the same as in exercise 6.9 (reported as 0.32).
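
An F1 score of exactly 0.0 means the model produced no true positives. One plausible reading, not confirmed by the output shown here, is that the heavily penalized l1 model predicted the negative class for every test sample. The toy sketch below uses made-up labels (`y_true`, `y_pred_none`, and `y_pred_some` are hypothetical) to show how `f1_score` behaves in that case. It assumes a scikit-learn version that supports the `zero_division` argument.

```python
import numpy as np
from sklearn.metrics import f1_score

# Made-up labels for illustration only (not taken from the wine data).
y_true = np.array([0, 0, 0, 1, 1, 0, 1, 0])

# A model that never predicts the positive class: no true positives at all.
y_pred_none = np.zeros_like(y_true)

# A model with 2 true positives, 1 false positive, and 1 false negative.
y_pred_some = np.array([0, 0, 1, 1, 0, 0, 1, 0])

# zero_division=0 returns 0.0 for the undefined precision without a warning.
f1_none = f1_score(y_true, y_pred_none, zero_division=0)
f1_some = f1_score(y_true, y_pred_some)

print(f1_none)
print(f1_some)
```

In the second case precision and recall are both 2/3, so the F1 score is also 2/3.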


In [None]:
##############