# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties, covering their chemical properties and their ranking by tasters. The wine industry has shown a recent growth spurt as social drinking is on the rise. The price of wine depends to some extent on a rather abstract concept, wine appreciation by tasters, whose opinions can vary widely. Another key factor in wine certification and quality assessment is physicochemical testing, which is laboratory-based and takes into account factors such as acidity, pH level, sugar content and other chemical properties. For the wine market, it would be of interest if human tasting quality could be related to the chemical properties of wine, so that the certification, quality assessment and assurance processes are more controlled.

Two datasets are available: one on red wine with 1599 different varieties, and one on white wine with 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines, one of which is Quality, based on sensory data; the rest are chemical properties of the wines, including density, acidity, alcohol content etc. All chemical properties are continuous variables. Quality is an ordinal variable with possible rankings from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to give vineyards guidance on the quality and price they can expect for their produce, without heavy reliance on the volatility of wine tasters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')

# DataFrame.append was removed in pandas 2.0; pd.concat gives the same result
data = pd.concat([data, data_r.assign(type = 'red')], ignore_index=True)
data.sample()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
4270,6.6,0.21,0.3,9.9,0.041,64.0,174.0,0.995,3.07,0.5,10.1,6,white


# Exercise 6.1

Show the frequency table of quality by type of wine

In [4]:
pd.crosstab(index=data["type"],columns=data["quality"])

type / quality,3,4,5,6,7,8,9
red,10,53,681,638,199,18,0
white,20,163,1457,2198,880,175,5


# SVM

# Exercise 6.2

* Standardize the features (not the quality)
* Create a binary target for each type of wine
* Create two linear SVMs, for the white and red wines respectively.


In [5]:
from sklearn import preprocessing

In [6]:
print(data['alcohol'].shape)

(6497,)


In [7]:
data.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'type'],
      dtype='object')

In [8]:
data_norm = pd.DataFrame(index=data.index)

In [9]:
for r in data.loc[:, ~data.columns.isin(['quality', 'type'])].columns:
    data_norm[r]=preprocessing.scale(data[r])

In [10]:
data_norm[['quality','type']]=data[['quality','type']]
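
For reference, `preprocessing.scale` with its default settings centers each column to zero mean and divides it by the population standard deviation. The following is a minimal standard-library sketch of that computation. The `standardize` helper and the sample values are made up for illustration and are not part of the notebook's pipeline.

```python
import statistics

def standardize(values):
    """Return z-scores: (value - mean) / population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # ddof=0, the same convention as NumPy's default std
    return [(v - mean) / std for v in values]

# Illustrative alcohol readings (an assumption, not the full dataset)
alcohol = [8.8, 9.5, 10.1, 9.9, 11.0]
z_scores = standardize(alcohol)
print([round(z, 3) for z in z_scores])
```

After this transformation every feature has mean 0 and standard deviation 1, so features measured on very different scales contribute comparably to the SVM's distance computations.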

In [11]:
data_norm.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,6,white
1,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,6,white
2,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,6,white
3,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white
4,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white


In [12]:
data_norm['responce'] = np.where(data_norm['quality']>=7,1,0)
data_norm['responce'] = data_norm['responce'].astype(bool)

In [13]:
data_norm.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,responce
0,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,6,white,False
1,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,6,white,False
2,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,6,white,False
3,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white,False
4,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white,False


In [14]:
White = data_norm[data_norm.type == 'white']
White.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,responce
0,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,6,white,False
1,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,6,white,False
2,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,6,white,False
3,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white,False
4,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,6,white,False


In [15]:
Red = data_norm[data_norm.type == 'red']
Red.head(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,responce
4898,0.142473,2.188833,-2.192833,-0.744778,0.569958,-1.10014,-1.446359,1.034993,1.81309,0.193097,-0.915464,5,red,False
4899,0.451036,3.282235,-2.192833,-0.59764,1.197975,-0.31132,-0.862469,0.701486,-0.115073,0.999579,-0.580068,5,red,False
4900,0.451036,2.5533,-1.917553,-0.660699,1.026697,-0.874763,-1.092486,0.768188,0.25812,0.797958,-0.580068,5,red,False
4901,3.073817,-0.362438,1.661085,-0.744778,0.541412,-0.762074,-0.986324,1.101694,-0.363868,0.32751,-0.580068,6,red,False
4902,0.142473,2.188833,-2.192833,-0.744778,0.569958,-1.10014,-1.446359,1.034993,1.81309,0.193097,-0.915464,5,red,False


In [16]:
from sklearn.svm import SVC # "Support Vector Classifier"

In [17]:
Yw = White['responce']
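# Drop the non-feature columns by name; a mask built from data.columns
# does not line up with White, which also holds the 'responce' column
Xw = White.drop(columns=['quality', 'type', 'responce'])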

In [18]:
Wclf = SVC(kernel='linear')
Wclf.fit(Xw, Yw)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [19]:
Wclf.coef_

array([[ 2.47636041e-04, -1.00694862e-04, -5.31948264e-05,
         4.85026914e-04, -1.08896571e-04, -1.08800847e-05,
        -3.04724786e-05, -6.82528241e-04,  1.69735041e-04,
         1.24110509e-04, -3.95746011e-05]])

In [20]:
print(Wclf.score(Xw,Yw))

0.7835851367905268


In [21]:
Yr = Red['responce']
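# Drop the non-feature columns by name; a mask built from data.columns
# does not line up with Red, which also holds the 'responce' column
Xr = Red.drop(columns=['quality', 'type', 'responce'])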

In [22]:
Rclf = SVC(kernel='linear')
Rclf.fit(Xr, Yr)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [23]:
Rclf.coef_

array([[-8.47544703e-06, -1.19424604e-05,  5.64523332e-05,
         1.63944647e-04, -2.99573176e-05,  1.45597943e-04,
        -3.37909625e-04, -1.15810347e-04,  5.48648460e-06,
         1.29456129e-04,  1.01566526e-04]])

In [24]:
print(Rclf.score(Xr,Yr))

0.8642901813633521
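
A useful reference point for these two accuracies is the majority-class baseline, i.e. the accuracy obtained by always predicting "not high quality" (quality below 7). The sketch below derives it from the Exercise 6.1 frequency table; the `majority_baseline` helper is illustrative.

```python
# Quality counts per wine type, copied from the Exercise 6.1 frequency table
counts = {
    "red":   {3: 10, 4: 53, 5: 681, 6: 638, 7: 199, 8: 18, 9: 0},
    "white": {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5},
}

def majority_baseline(quality_counts, threshold=7):
    """Accuracy obtained by always predicting the more frequent class."""
    total = sum(quality_counts.values())
    positives = sum(n for q, n in quality_counts.items() if q >= threshold)
    return max(positives, total - positives) / total

baselines = {wine: majority_baseline(c) for wine, c in counts.items()}
for wine, acc in baselines.items():
    print(f"{wine}: {acc:.4f}")
```

The baseline is about 0.7836 for white and 0.8643 for red, the same values as the two linear SVM scores above. This suggests that, on the training data, the linear models do little more than predict the majority class.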


# Exercise 6.3

Test the two SVMs using the different kernels ('poly', 'rbf', 'sigmoid')


## Poly by type

In [25]:
Wclf_p = SVC(kernel='poly', gamma='auto')
Wclf_p.fit(Xw, Yw)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [26]:
print(Wclf_p.score(Xw,Yw))

0.8144140465496121


In [27]:
Rclf_p = SVC(kernel='poly', gamma='auto')
Rclf_p.fit(Xr, Yr)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [28]:
print(Rclf_p.score(Xr,Yr))

0.8961851156973109


## rbf by type

In [29]:
Wclf_r = SVC(kernel='rbf', gamma='auto')
Wclf_r.fit(Xw, Yw)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [30]:
print(Wclf_r.score(Xw,Yw))

0.8301347488770927


In [31]:
Rclf_r = SVC(kernel='rbf', gamma='auto')
Rclf_r.fit(Xr, Yr)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [32]:
print(Rclf_r.score(Xr,Yr))

0.8961851156973109


## sigmoid by type

In [33]:
Wclf_s = SVC(kernel='sigmoid', gamma='auto')
Wclf_s.fit(Xw, Yw)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [34]:
print(Wclf_s.score(Xw,Yw))

0.7182523478971009


In [35]:
Rclf_s = SVC(kernel='sigmoid', gamma='auto')
Rclf_s.fit(Xr, Yr)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [36]:
print(Rclf_s.score(Xr,Yr))

0.7986241400875547


# Exercise 6.4
Using the best SVM, find the parameters that give the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]
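
To see what `gamma` controls: the RBF kernel is K(x, x') = exp(-gamma * ||x - x'||^2), so a smaller `gamma` makes distant points look more alike. The sketch below evaluates it for two made-up standardized points. The `rbf_similarity` helper and both points are hypothetical and only illustrate the formula.

```python
import math

def rbf_similarity(x, y, gamma):
    """RBF kernel value exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)

# Two hypothetical wines with 11 standardized features each
wine_a = [0.0] * 11
wine_b = [1.0, -1.0, 0.5, 2.0, -0.5, 1.5, -1.0, 0.0, 1.0, -2.0, 0.5]

similarities = {g: rbf_similarity(wine_a, wine_b, g) for g in (0.01, 0.001, 0.0001)}
for g, k in similarities.items():
    print(f"gamma={g}: K={k:.4f}")
```

With `gamma = 0.0001` the kernel value is very close to 1 even for clearly different points, so the classifier tends to have little to separate the classes with unless `C` is large.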

In [37]:
C_range = pd.DataFrame([0.1, 1, 10, 100, 1000], columns=['c'],index=['0.1', '1', '10', '100', '1000'])
gamma_range = pd.DataFrame([0.01, 0.001, 0.0001],columns=['g'],index=['0.01', '0.001', '0.0001'])

In [38]:
results = pd.DataFrame(index=C_range.index, columns=gamma_range.index)

In [39]:
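results = pd.DataFrame(index=C_range.index, columns=gamma_range.index)
# Series.iteritems was removed in pandas 2.0; items() yields the same (label, value) pairs
for C in C_range['c'].items():
    for G in gamma_range['g'].items():
        Wclf_p = SVC(kernel='rbf', gamma=G[1], C=C[1])
        Wclf_p.fit(Xw, Yw)
        # .loc[row, column] avoids chained assignment
        results.loc[C[0], G[0]] = Wclf_p.score(Xw, Yw)
results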

Unnamed: 0,0.01,0.001,0.0001
0.1,0.783585,0.783585,0.783585
1.0,0.793181,0.783585,0.783585
10.0,0.818293,0.783585,0.783585
100.0,0.82646,0.801552,0.783585
1000.0,0.835647,0.817272,0.783585


In [40]:
results2 = pd.DataFrame(index=C_range.index, columns=gamma_range.index)

In [41]:
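# Series.iteritems was removed in pandas 2.0; items() yields the same (label, value) pairs
for C in C_range['c'].items():
    for G in gamma_range['g'].items():
        Rclf_p = SVC(kernel='rbf', gamma=G[1], C=C[1])
        Rclf_p.fit(Xr, Yr)
        # .loc[row, column] avoids chained assignment
        results2.loc[C[0], G[0]] = Rclf_p.score(Xr, Yr)
results2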

Unnamed: 0,0.01,0.001,0.0001
0.1,0.86429,0.86429,0.86429
1.0,0.86429,0.86429,0.86429
10.0,0.88868,0.86429,0.86429
100.0,0.894934,0.86429,0.86429
1000.0,0.917448,0.888055,0.86429
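
Both tables report accuracy on the same data used for fitting, which tends to favor the largest `C`. A cross-validated search is one way to check whether that choice holds up on unseen data. The sketch below uses scikit-learn's `GridSearchCV` with the same grid. It runs on synthetic data from `make_classification` as a stand-in (an assumption, not the wine data), so its numbers say nothing about the wine results.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in with 11 features (assumption: replace with Xw, Yw or Xr, Yr)
X, y = make_classification(n_samples=300, n_features=11, n_informative=5, random_state=0)

# Same grid as in the exercise; the "svc__" prefix targets the SVC step of the pipeline
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.01, 0.001, 0.0001],
}
pipeline = make_pipeline(StandardScaler(), SVC(kernel="rbf"))

# 5-fold cross-validation: each combination is scored on data it was not fitted on
search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)

best_params = search.best_params_
best_cv_accuracy = search.best_score_
print(best_params, round(best_cv_accuracy, 3))
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so the validation folds do not leak into the standardization.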


# Exercise 6.5

Compare the results with other methods

To compare the results with the other methods, we take the best-performing parameter combination (C = 1000, gamma = 0.01, on the red wines) and apply it to the different kernel types, with the following results:

In [105]:
Rclf_pr = SVC(kernel='rbf', gamma=0.01, C=1000)
Rclf_pr.fit(Xr, Yr)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [106]:
Rclf_pp = SVC(kernel='poly', gamma=0.01, C=1000)
Rclf_pp.fit(Xr, Yr)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [107]:
Rclf_ps = SVC(kernel='sigmoid', gamma=0.01, C=1000)
Rclf_ps.fit(Xr, Yr)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [108]:
print("First (rbf): ", Rclf_pr.score(Xr,Yr))
print("Second (poly): ", Rclf_pp.score(Xr,Yr))
# Score the sigmoid model here; the recorded run scored Rclf_pr (rbf) a second time,
# so no sigmoid result is available in the output below
print("Third (sigmoid): ", Rclf_ps.score(Xr,Yr))

First (rbf):  0.9174484052532833
Second (poly):  0.8974358974358975


This shows that even when the parameters are optimal for one configuration on this dataset, the results can change noticeably when the same parameters are applied to another model (0.917 for rbf versus 0.897 for poly on the red wines). To obtain the best result, the parameters should therefore always be tuned for the specific model being analyzed.

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (continuous)

* Analyze the coefficients

* Evaluate the RMSE

In [42]:
datareg = data.copy()  # work on a copy so the edits below do not modify data

In [43]:
datareg['type']= np.where(datareg['type']=='white',1,0)
datareg['type'] = datareg['type'].astype(bool)

In [44]:
datareg.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,True
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,True
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,True
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,True
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,True
5,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,True
6,6.2,0.32,0.16,7.0,0.045,30.0,136.0,0.9949,3.18,0.47,9.6,6,True
7,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,True
8,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,True
9,8.1,0.22,0.43,1.5,0.044,28.0,129.0,0.9938,3.22,0.45,11.0,6,True


In [45]:
Xreg=datareg.drop('quality', axis=1)
Yreg=datareg['quality']

In [46]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(Xreg, Yreg, random_state=1)

In [47]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, Y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [114]:
Vars0=np.array(list(X_train))
coeficientes=linreg.coef_
result0=pd.DataFrame(coeficientes, index=Vars0, columns = [1])  # use Vars0 defined above; Vars is only created in a later cell
result0

Unnamed: 0,1
fixed acidity,0.097622
volatile acidity,-1.550473
citric acid,-0.136419
residual sugar,0.066747
chlorides,-0.76794
free sulfur dioxide,0.003998
total sulfur dioxide,-0.001057
density,-113.045446
pH,0.51589
sulphates,0.701082


In [49]:
Y_pred = linreg.predict(X_test)

In [50]:
from sklearn import metrics
import numpy as np
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred)))

0.7176907067288502
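
The RMSE is expressed in the same units as the target, so a value of about 0.72 means the predictions are off by roughly 0.72 quality points in the root-mean-square sense. A minimal sketch of the computation, with a hand-written `rmse` helper and hypothetical values:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical quality scores and predictions
actual = [6, 5, 7, 6]
predicted = [5.5, 5.5, 6.0, 6.5]
error = rmse(actual, predicted)
print(round(error, 4))
```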


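In this regression, density has by far the largest coefficient in absolute value (about -113), so it appears to dominate the model. This needs a caveat: the features were not standardized here, and density varies over a very narrow range (about 0.994 to 1.001 in the rows shown above), so a large coefficient is needed for it to have any effect on the prediction. The coefficient size alone therefore does not establish that density is the most relevant variable. Note also that the value 0.72 is an RMSE in quality points, not a share of variance explained.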
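
One caveat when reading raw coefficients is that their magnitude depends on the units of each feature. The sketch below fits a one-feature least-squares line twice, once on a narrow-range feature and once on the same feature multiplied by 1000. The `ols_slope` helper and the numbers are made up for illustration.

```python
def ols_slope(x, y):
    """Slope of a one-feature least-squares fit with intercept."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    variance = sum((a - mean_x) ** 2 for a in x)
    return covariance / variance

# Hypothetical feature with a narrow range (like density) and a target
density = [0.990, 0.992, 0.994, 0.996, 0.998]
quality = [7.0, 6.5, 6.0, 5.5, 5.0]

slope_raw = ols_slope(density, quality)
# Same feature expressed in units 1000 times larger
slope_rescaled = ols_slope([d * 1000 for d in density], quality)
print(round(slope_raw, 3), round(slope_rescaled, 3))
```

The fit is identical in both cases, but the coefficient changes by a factor of 1000. Comparing coefficients across features is only meaningful once the features are on a common scale.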

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [51]:
from sklearn.linear_model import Ridge
ridgereg = Ridge(alpha=0.1, normalize=True)
ridgereg.fit(X_train, Y_train)

Ridge(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=None,
   normalize=True, random_state=None, solver='auto', tol=0.001)

In [52]:
ridgereg.coef_

array([ 2.88566714e-02, -1.28227049e+00, -2.39357437e-02,  2.97639425e-02,
       -1.18663870e+00,  3.80364255e-03, -1.28915306e-03, -3.82373057e+01,
        2.08500233e-01,  5.91549553e-01,  2.50777308e-01, -1.56871332e-01])

In [53]:
Y_pred1 = ridgereg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred1)))

0.7199798419799722


In [54]:
ridgereg2 = Ridge(alpha=1, normalize=True)
ridgereg2.fit(X_train, Y_train)

Ridge(alpha=1, copy_X=True, fit_intercept=True, max_iter=None, normalize=True,
   random_state=None, solver='auto', tol=0.001)

In [115]:
result0['2']= ridgereg2.coef_
result0

Unnamed: 0,1,2
fixed acidity,0.097622,0.001598
volatile acidity,-1.550473,-0.585135
citric acid,-0.136419,0.148863
residual sugar,0.066747,0.005687
chlorides,-0.76794,-1.274221
free sulfur dioxide,0.003998,0.001296
total sulfur dioxide,-0.001057,-0.000583
density,-113.045446,-22.285153
pH,0.51589,0.082972
sulphates,0.701082,0.300782


In [56]:
Y_pred2 = ridgereg2.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred2)))

0.7607146212084613


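Ridge regression shrinks the dominant coefficient: the density coefficient drops from about -113 in the linear regression to about -38 with alpha = 0.1 and about -22 with alpha = 1. On this test split, however, the shrinkage does not improve the error: the RMSE is 0.7200 with alpha = 0.1 and 0.7607 with alpha = 1, compared with 0.7177 for the ordinary linear regression.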
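
The shrinkage effect is easiest to see with a single feature. Assuming centered data and an unpenalized intercept, minimizing sum((y - w*x)^2) + alpha * w^2 gives w = sum(x*y) / (sum(x^2) + alpha). The sketch below applies that formula to made-up data; the `ridge_slope` helper is illustrative and is not what scikit-learn runs internally.

```python
def ridge_slope(x, y, alpha):
    """One-feature ridge coefficient on centered data: sum(xy) / (sum(x^2) + alpha)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    numerator = sum(a * b for a, b in zip(xc, yc))
    denominator = sum(a * a for a in xc) + alpha
    return numerator / denominator

# Hypothetical feature and target values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

slopes = {alpha: ridge_slope(x, y, alpha) for alpha in (0.0, 0.1, 1.0, 10.0)}
for alpha, w in slopes.items():
    print(f"alpha={alpha}: slope={w:.4f}")
```

With alpha = 0 the result is the ordinary least-squares slope. Larger alpha values pull the coefficient toward zero but never make it exactly zero.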

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [57]:
from sklearn.linear_model import Lasso
lassoreg0 = Lasso(alpha=0.01, normalize=True)
lassoreg0.fit(X_train, Y_train)

Lasso(alpha=0.01, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=True, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [118]:
result0['3']= lassoreg0.coef_
result0

Unnamed: 0,1,2,3
fixed acidity,0.097622,0.001598,-0.0
volatile acidity,-1.550473,-0.585135,-0.0
citric acid,-0.136419,0.148863,0.0
residual sugar,0.066747,0.005687,-0.0
chlorides,-0.76794,-1.274221,-0.0
free sulfur dioxide,0.003998,0.001296,0.0
total sulfur dioxide,-0.001057,-0.000583,-0.0
density,-113.045446,-22.285153,-0.0
pH,0.51589,0.082972,0.0
sulphates,0.701082,0.300782,0.0


In [59]:
Y_pred3 = lassoreg0.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred3)))

0.8709345888926285


In [60]:
from sklearn.linear_model import Lasso
lassoreg1 = Lasso(alpha=0.1, normalize=True)
lassoreg1.fit(X_train, Y_train)

Lasso(alpha=0.1, copy_X=True, fit_intercept=True, max_iter=1000,
   normalize=True, positive=False, precompute=False, random_state=None,
   selection='cyclic', tol=0.0001, warm_start=False)

In [119]:
result0['4']= lassoreg1.coef_
result0

Unnamed: 0,1,2,3,4
fixed acidity,0.097622,0.001598,-0.0,-0.0
volatile acidity,-1.550473,-0.585135,-0.0,-0.0
citric acid,-0.136419,0.148863,0.0,0.0
residual sugar,0.066747,0.005687,-0.0,-0.0
chlorides,-0.76794,-1.274221,-0.0,-0.0
free sulfur dioxide,0.003998,0.001296,0.0,0.0
total sulfur dioxide,-0.001057,-0.000583,-0.0,-0.0
density,-113.045446,-22.285153,-0.0,-0.0
pH,0.51589,0.082972,0.0,0.0
sulphates,0.701082,0.300782,0.0,0.0


In [62]:
Y_pred4 = lassoreg1.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred4)))

0.8709345888926285


In [63]:
from sklearn.linear_model import Lasso
lassoreg2 = Lasso(alpha=1, normalize=True)
lassoreg2.fit(X_train, Y_train)

Lasso(alpha=1, copy_X=True, fit_intercept=True, max_iter=1000, normalize=True,
   positive=False, precompute=False, random_state=None, selection='cyclic',
   tol=0.0001, warm_start=False)

In [120]:
result0['5']= lassoreg2.coef_
result0


Unnamed: 0,1,2,3,4,5
fixed acidity,0.097622,0.001598,-0.0,-0.0,-0.0
volatile acidity,-1.550473,-0.585135,-0.0,-0.0,-0.0
citric acid,-0.136419,0.148863,0.0,0.0,0.0
residual sugar,0.066747,0.005687,-0.0,-0.0,-0.0
chlorides,-0.76794,-1.274221,-0.0,-0.0,-0.0
free sulfur dioxide,0.003998,0.001296,0.0,0.0,0.0
total sulfur dioxide,-0.001057,-0.000583,-0.0,-0.0,-0.0
density,-113.045446,-22.285153,-0.0,-0.0,-0.0
pH,0.51589,0.082972,0.0,0.0,0.0
sulphates,0.701082,0.300782,0.0,0.0,0.0


In [65]:
Y_pred5 = lassoreg2.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(Y_test, Y_pred5)))

0.8709345888926285


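With all three alpha values, the lasso shrinks every coefficient shown to exactly zero. The features therefore do not differ in importance: the model ignores them and predicts a constant. This is why the RMSE is identical (0.8709) for the three settings and higher than for the linear and ridge regressions. With normalize=True these penalties appear to be too strong for this data, and smaller alpha values would be needed for any variable to be kept.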
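
Why the lasso produces exact zeros can be seen in the one-feature case. Assuming centered data and the objective (1 / (2n)) * sum((y - w*x)^2) + alpha * |w|, the solution is a soft-thresholded version of the least-squares slope. The sketch below uses made-up data; `soft_threshold` and `lasso_slope` are illustrative helpers.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold, returning 0 inside [-threshold, threshold]."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_slope(x, y, alpha):
    """One-feature lasso coefficient on centered data."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    xc = [a - mean_x for a in x]
    yc = [b - mean_y for b in y]
    rho = sum(a * b for a, b in zip(xc, yc)) / n   # mean of x*y after centering
    scale = sum(a * a for a in xc) / n             # mean of x^2 after centering
    return soft_threshold(rho, alpha) / scale

# Hypothetical feature and target values
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.0]

slopes = {alpha: lasso_slope(x, y, alpha) for alpha in (0.0, 1.0, 5.0)}
for alpha, w in slopes.items():
    print(f"alpha={alpha}: slope={w:.4f}")
```

Once alpha exceeds the absolute value of `rho` (3.94 for this data), the coefficient is exactly zero. Raising alpha further changes nothing, which matches the identical results for the three alpha values above.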

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [66]:
dataregb = datareg.copy()  # copy so the binarized quality does not overwrite datareg

In [67]:
dataregb['quality']= np.where(dataregb['quality']>=7,1,0)
dataregb['quality'] = dataregb['quality'].astype(bool)

In [68]:
Xlog=dataregb.drop('quality', axis=1)
Ylog=dataregb['quality']

In [69]:
from sklearn.model_selection import train_test_split
X_tr, X_te, Y_tr, Y_te = train_test_split(Xlog, Ylog, random_state=2)

In [70]:
from sklearn.linear_model import LogisticRegression
logic = LogisticRegression(solver='liblinear',C=1e9)
logic.fit(X_tr, Y_tr)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [71]:
Y_pr = logic.predict(X_te)

In [72]:
from sklearn.metrics import confusion_matrix

In [73]:
e = confusion_matrix(Y_te,Y_pr)
dataframe=pd.DataFrame(e, columns=['No','Yes'], index=['No','Yes']) 
dataframe

Unnamed: 0,No,Yes
No,1245,66
Yes,235,79


In [74]:
from sklearn.metrics import f1_score

In [75]:
f1_score(Y_te, Y_pr, average='macro')

0.6181899647872207
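
The macro-averaged F1 can be reproduced by hand from the confusion matrix above, which also shows how different the two classes are. A minimal sketch follows; the `f1` helper is illustrative.

```python
# Confusion matrix from the cell above: rows are true labels, columns are predictions
tn, fp = 1245, 66   # true "No":  predicted No, predicted Yes
fn, tp = 235, 79    # true "Yes": predicted No, predicted Yes

def f1(true_pos, false_pos, false_neg):
    """F1 = 2*TP / (2*TP + FP + FN)."""
    return 2 * true_pos / (2 * true_pos + false_pos + false_neg)

f1_yes = f1(tp, fp, fn)   # treating "Yes" as the positive class
f1_no = f1(tn, fn, fp)    # treating "No" as the positive class
macro_f1 = (f1_yes + f1_no) / 2
print(round(f1_yes, 4), round(f1_no, 4), round(macro_f1, 4))
```

The F1 for the "Yes" class is only about 0.34, while the "No" class reaches about 0.89. The macro average of about 0.62 is the unweighted mean of the two.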

In [76]:
logic.coef_

array([[ 1.72729017e-01, -4.15602119e+00, -4.91566139e-01,
         7.12402746e-02, -5.71325644e+00,  1.52814643e-02,
        -5.01328776e-03, -8.09461191e+00,  9.69110363e-01,
         2.10612101e+00,  9.49094844e-01,  1.70300811e-01]])

The logistic regression produces a large number of false negatives (235) and false positives (66): only 79 of the 314 high-quality wines in the test set are identified. This may indicate overfitting, although the class imbalance (about 19% positives in the test set) is also a likely factor.

Likewise, the macro-averaged F1 score of about 0.62 indicates that the predictions are not very accurate.

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1', 'l2']
* Compare the coefficients and the f1score

In [77]:
Xlogn=Xlog.drop('type', axis=1)

In [78]:
for r in Xlogn.columns:
    Xlogn[r]=preprocessing.scale(data[r])

In [79]:
Xlogn['type']=Xlog['type']

In [80]:
Xlogn.head(10)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
0,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,True
1,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,True
2,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,True
3,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,True
4,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,True
5,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,True
6,-0.783214,-0.11946,-1.091713,0.327228,-0.314975,-0.029599,0.358392,0.067824,-0.239471,-0.411765,-0.747766,True
7,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,True
8,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,True
9,0.682458,-0.726906,0.766426,-0.828857,-0.343521,-0.142287,0.234537,-0.299033,0.009325,-0.546178,0.42612,True


In [81]:
X_tr0, X_te0, Y_tr0, Y_te0 = train_test_split(Xlogn, Ylog, random_state=2)

In [82]:
C_r = pd.DataFrame([0.01, 0.1, 1.0], columns=['cr'],index=['0.01', '0.1', '1.0'])
penalty = pd.DataFrame(['l1', 'l2'],columns=['p'],index=['l1', 'l2'])

In [83]:
resu = pd.DataFrame(columns=['l1', 'l2'],index=['0.01', '0.1', '1.0'])

In [84]:
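# Series.iteritems was removed in pandas 2.0; items() yields the same (label, value) pairs
for C in C_r['cr'].items():
    for P in penalty['p'].items():
        # Fit once per (C, penalty) pair; the extra range(12) loop only repeated the same fit
        logic0 = LogisticRegression(solver='liblinear', C=C[1], penalty=P[1])
        logic0.fit(X_tr0, Y_tr0)
        Y_pr0 = logic0.predict(X_te0)
        # .loc[row, column] avoids chained assignment
        resu.loc[C[0], P[0]] = f1_score(Y_te0, Y_pr0, average='macro')
resu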

Unnamed: 0,l1,l2
0.01,0.587446,0.611537
0.1,0.620496,0.623847
1.0,0.631748,0.634253


In [85]:
logic1 = LogisticRegression(solver='liblinear', C=0.01,penalty='l1')
logic1.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [86]:
logic2 = LogisticRegression(solver='liblinear', C=0.1,penalty='l1')
logic2.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [87]:
logic3 = LogisticRegression(solver='liblinear', C=1,penalty='l1')
logic3.fit(X_tr0, Y_tr0)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [89]:
logic4 = LogisticRegression(solver='liblinear', C=0.01,penalty='l2')
logic4.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [90]:
logic5 = LogisticRegression(solver='liblinear', C=0.1,penalty='l2')
logic5.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [91]:
logic6 = LogisticRegression(solver='liblinear', C=1,penalty='l2')
logic6.fit(X_tr0, Y_tr0)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [92]:
Vars=np.array(list(X_tr0))
coeficientes=logic1.coef_
result=pd.DataFrame(coeficientes, columns=Vars, index = [1])
result

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
1,0.0,-0.297371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052903,0.786971,0.0


In [93]:
coef2 = pd.DataFrame(logic2.coef_,columns=Vars, index=[2])
result = pd.concat([result, coef2], axis=0)

In [94]:
coef3 = pd.DataFrame(logic3.coef_,columns=Vars, index=[3])
result = pd.concat([result, coef3], axis=0)

In [95]:
coef4 = pd.DataFrame(logic4.coef_,columns=Vars, index=[4])
result = pd.concat([result, coef4], axis=0)

In [96]:
coef5 = pd.DataFrame(logic5.coef_,columns=Vars, index=[5])
result = pd.concat([result, coef5], axis=0)

In [97]:
coef6 = pd.DataFrame(logic6.coef_,columns=Vars, index=[6])
result = pd.concat([result, coef6], axis=0)

In [98]:
result

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,type
1,0.0,-0.297371,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.052903,0.786971,0.0
2,0.265388,-0.570453,0.0,0.502392,-0.196496,0.195091,-0.187168,-0.404482,0.162556,0.310096,0.889422,-0.225967
3,0.58433,-0.616696,-0.028748,0.994813,-0.223328,0.223154,-0.196622,-1.148595,0.34322,0.379102,0.614331,-0.799321
4,0.122263,-0.421296,0.04441,0.33324,-0.236277,0.161084,-0.089877,-0.35345,0.077412,0.217325,0.68522,-0.641674
5,0.420029,-0.60101,-0.018776,0.751439,-0.261502,0.221224,-0.184218,-0.790978,0.248221,0.336267,0.727386,-0.692034
6,0.589207,-0.620818,-0.032354,1.000439,-0.231897,0.227095,-0.199209,-1.155098,0.345558,0.380022,0.612143,-0.818632


In [121]:
resu

Unnamed: 0,l1,l2
0.01,0.587446,0.611537
0.1,0.620496,0.623847
1.0,0.631748,0.634253


The penalty type and the value of C determine how many variables the model keeps as relevant. With the L1 penalty and C = 0.01, only three coefficients are nonzero (volatile acidity, sulphates and alcohol). As the configuration moves toward L2 and C = 1, many more variables receive nonzero coefficients, until the model approaches a conventional, unregularized logistic regression. The macro F1 rises along the same path, from 0.587 to 0.634.
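
The number of variables kept by each configuration can be counted directly from the coefficient table above. The sketch below copies the six rows from that table and maps them to their settings (rows 1 to 3 are L1 with C = 0.01, 0.1, 1; rows 4 to 6 are L2 with the same C values). The variable names are illustrative.

```python
features = ["fixed acidity", "volatile acidity", "citric acid", "residual sugar",
            "chlorides", "free sulfur dioxide", "total sulfur dioxide", "density",
            "pH", "sulphates", "alcohol", "type"]

# Coefficient rows copied from the result table, keyed by (penalty, C)
coefficients = {
    ("l1", 0.01): [0.0, -0.297371, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.052903, 0.786971, 0.0],
    ("l1", 0.1): [0.265388, -0.570453, 0.0, 0.502392, -0.196496, 0.195091, -0.187168,
                  -0.404482, 0.162556, 0.310096, 0.889422, -0.225967],
    ("l1", 1.0): [0.58433, -0.616696, -0.028748, 0.994813, -0.223328, 0.223154, -0.196622,
                  -1.148595, 0.34322, 0.379102, 0.614331, -0.799321],
    ("l2", 0.01): [0.122263, -0.421296, 0.04441, 0.33324, -0.236277, 0.161084, -0.089877,
                   -0.35345, 0.077412, 0.217325, 0.68522, -0.641674],
    ("l2", 0.1): [0.420029, -0.60101, -0.018776, 0.751439, -0.261502, 0.221224, -0.184218,
                  -0.790978, 0.248221, 0.336267, 0.727386, -0.692034],
    ("l2", 1.0): [0.589207, -0.620818, -0.032354, 1.000439, -0.231897, 0.227095, -0.199209,
                  -1.155098, 0.345558, 0.380022, 0.612143, -0.818632],
}

# Features with a nonzero coefficient under each configuration
selected = {key: [f for f, w in zip(features, row) if w != 0.0]
            for key, row in coefficients.items()}
for (penalty_name, c_value), kept in selected.items():
    print(f"{penalty_name}, C={c_value}: {len(kept)} nonzero coefficients")
```

Only the L1 penalty sets coefficients exactly to zero: 3 variables survive at C = 0.01 and 11 at C = 0.1, while every L2 configuration keeps all 12.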