# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [3]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
5370,12.5,0.37,0.55,2.6,0.083,25.0,68.0,0.9995,3.15,0.82,10.4,6,red
1166,6.1,0.45,0.27,0.8,0.039,13.0,82.0,0.9927,3.23,0.32,9.5,5,white
1786,7.4,0.33,0.26,2.6,0.04,29.0,115.0,0.9913,3.07,0.52,11.8,7,white
1756,6.7,0.27,0.25,8.0,0.053,54.0,202.0,0.9961,3.22,0.43,9.3,5,white
4508,5.8,0.26,0.3,2.6,0.034,75.0,129.0,0.9902,3.2,0.38,11.5,4,white


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [4]:
data.groupby(['quality', 'type']).size().unstack(fill_value=0)

type,red,white
quality,Unnamed: 1_level_1,Unnamed: 2_level_1
3,10,20
4,53,163
5,681,1457
6,638,2198
7,199,880
8,18,175
9,0,5


# SVM

# Exercise 6.2

* Standarized the features (not the quality)
* Create a binary target for each type of wine
* Create two Linear SVM's for the white and red wines, repectively.


In [5]:
#Standarized the features (not the quality)
from sklearn import preprocessing

data_rnorm = pd.DataFrame(index=data_r.index)
data_wnorm = pd.DataFrame(index=data_w.index)

for r in data_r.loc[:, data_r.columns != 'quality'].columns:
    data_rnorm[r]=preprocessing.scale(data_r[r])

for w in data_w.loc[:, data_w.columns != 'quality'].columns:
    data_wnorm[w]=preprocessing.scale(data_w[w])
    
data_rnorm['quality']=data_r['quality']
data_wnorm['quality']=data_w['quality']

In [6]:
#Create a binary target for each type of wine
data_rnorm['good']=data_rnorm['quality']
data_wnorm['good']=data_wnorm['quality']

data_rnorm.loc[data_rnorm['good'] <= 6, 'good'] = 0
data_rnorm.loc[data_rnorm['good'] >=7, 'good'] = 1
data_rnorm['good']=data_rnorm['good'].astype(bool)

data_wnorm.loc[data_wnorm['good'] <= 6, 'good'] = 0
data_wnorm.loc[data_wnorm['good'] >=7, 'good'] = 1
data_wnorm['good']=data_wnorm['good'].astype(bool)

In [7]:
#Create two Linear SVM's for the white and red wines, repectively.
from sklearn.svm import SVC
#para los vinos rojos
X_r=data_rnorm.loc[:, ~data_r.columns.isin(['quality', 'good'])]
y_r=data_rnorm['good']
clf_red = SVC(kernel='linear')
clf_red.fit(X_r, y_r)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [8]:
#para los vinos blancos
X_w=data_wnorm.loc[:, ~data_w.columns.isin(['quality', 'good'])]
y_w=data_wnorm['good']
clf_white = SVC(kernel='linear')
clf_white.fit(X_w, y_w)


SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='linear',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [9]:
#poly
X_w=data_wnorm.loc[:, ~data_w.columns.isin(['quality', 'good'])]
y_w=data_wnorm['good']
clf_whitep = SVC(kernel='poly')
clf_whitep.fit(X_w, y_w)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [10]:
#rbf
X_w=data_wnorm.loc[:, ~data_w.columns.isin(['quality', 'good'])]
y_w=data_wnorm['good']
clf_whiterbf = SVC(kernel='rbf')
clf_whiterbf.fit(X_w, y_w)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [11]:
#sigmoid
X_w=data_wnorm.loc[:, ~data_w.columns.isin(['quality', 'good'])]
y_w=data_wnorm['good']
clf_whitesig = SVC(kernel='sigmoid')
clf_whitesig.fit(X_w, y_w)

SVC(C=1.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma='auto', kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [12]:
a=[clf_white, clf_whitep, clf_whitesig, clf_whiterbf]
for f in a:
    print(f.score(X_w, y_w))

0.7835851367905268
0.8156390363413638
0.7133523887300939
0.8409554920375664


In [13]:
C_range=pd.DataFrame([0.1, 1, 10, 100, 1000], columns=['c'], index=[0.1, 1, 10, 100, 1000])
gamma=pd.DataFrame([0.01, 0.001, 0.0001], columns=['g'], index=[0.01, 0.001, 0.0001])
ResultsSVM=pd.DataFrame(index=C_range.index, columns=gamma.index)
for c in C_range['c'].iteritems():
    for g in gamma['g'].iteritems():
        wclf_p=SVC(kernel='rbf', gamma=g[1], C=c[1])
        wclf_p.fit(X_w, y_w)
        ResultsSVM[g[0]][c[0]]=wclf_p.score(X_w, y_w)

In [14]:
ResultsSVM

Unnamed: 0,0.01,0.001,0.0001
0.1,0.783585,0.783585,0.783585
1.0,0.800735,0.783585,0.783585
10.0,0.819314,0.783585,0.783585
100.0,0.834422,0.806452,0.783585
1000.0,0.846876,0.818293,0.783585


El mejor modelo es con C=1000 y gamma=0.1

# Exercise 6.5

Compare the results with other methods

In [15]:
Rclf_pr = SVC(kernel='rbf', gamma=0.01, C=1000)
Rclf_pr.fit(X_w, y_w)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [16]:
Rclf_pp = SVC(kernel='poly', gamma=0.01, C=1000)
Rclf_pp.fit(X_w, y_w)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='poly',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [17]:
Rclf_ps = SVC(kernel='sigmoid', gamma=0.01, C=1000)
Rclf_ps.fit(X_w, y_w)

SVC(C=1000, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape='ovr', degree=3, gamma=0.01, kernel='sigmoid',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

In [18]:
print( "primera rbf: ", Rclf_pr.score(X_w, y_w))
print( "Segunda poly: ", Rclf_pp.score(X_w, y_w))
print( "Tercera sigmoid: ", Rclf_pr.score(X_w, y_w))

primera rbf:  0.8468762760310331
Segunda poly:  0.8164556962025317
Tercera sigmoid:  0.8468762760310331


En este caso el pero modelo es el segundo, mientras que el primero y el tercero parecen ser idénticos

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [19]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white


In [20]:
#se crea una variable dummy para el tipo de vino
data['type01']=data['type']
data['type01']=np.where(data['type01']=='white',0,1)
data['type01']=data['type01'].astype(bool)

In [21]:
data.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,type01
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,white,False
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,white,False
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,white,False
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,white,False


In [22]:
X = data.drop(['quality', 'type'], axis=1)
y = data['quality']

In [23]:
# split into training and testing sets
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

In [24]:
from sklearn.linear_model import LinearRegression
linreg = LinearRegression()
linreg.fit(X_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)

In [25]:
Vars=np.array(list(X))
coeficientes=linreg.coef_
result=pd.DataFrame(coeficientes, index=Vars)
result.rename(columns={list(result)[0]:'Coeficientes'}, inplace=True)
result

Unnamed: 0,Coeficientes
fixed acidity,0.097622
volatile acidity,-1.550473
citric acid,-0.136419
residual sugar,0.066747
chlorides,-0.76794
free sulfur dioxide,0.003998
total sulfur dioxide,-0.001057
density,-113.045446
pH,0.51589
sulphates,0.701082


In [26]:
from sklearn import metrics
y_pred = linreg.predict(X_test)
print(np.sqrt(metrics.mean_squared_error(y_test, y_pred)))

0.7176907067288502


# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [27]:
datanorm=pd.DataFrame(preprocessing.scale(data.loc[:, ~ data.columns.isin(['type', 'type01'])]))
datanorm.columns=np.array(data.columns[0:12])
datanorm['type01']=data['type01']
datanorm.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type01
0,-0.166089,-0.423183,0.284686,3.206929,-0.314975,0.815565,0.959976,2.102214,-1.359049,-0.546178,-1.418558,0.207999,False
1,-0.706073,-0.240949,0.147046,-0.807837,-0.20079,-0.931107,0.287618,-0.232332,0.506915,-0.277351,-0.831615,0.207999,False
2,0.682458,-0.362438,0.559966,0.306208,-0.172244,-0.029599,-0.33166,0.134525,0.25812,-0.613385,-0.328521,0.207999,False
3,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,0.207999,False
4,-0.011808,-0.666161,0.009406,0.642523,0.056126,0.928254,1.243074,0.301278,-0.177272,-0.882212,-0.496219,0.207999,False


In [28]:
from sklearn.linear_model import Ridge
ridgereg_01 = Ridge(alpha=0.1, normalize=True)
ridgereg_01.fit(X_train, y_train)
y_pred = ridgereg_01.predict(X_test)
RMSEalpha_01=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rid_Alpha_01=pd.DataFrame(ridgereg_01.coef_)

In [29]:
ridgereg_1 = Ridge(alpha=1, normalize=True)
ridgereg_1.fit(X_train, y_train)
y_pred = ridgereg_1.predict(X_test)
RMSEalpha_1=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rid_Alpha_1=pd.DataFrame(ridgereg_1.coef_)

In [30]:
#Compare the coefficients with the linear regression
result_rigde=pd.DataFrame(index=Vars)
result_rigde['MCO']=result['Coeficientes']
result_rigde['rid_Alpha01']=ridgereg_01.coef_
result_rigde['rid_Alpha_1']=ridgereg_1.coef_
result_rigde

Unnamed: 0,MCO,rid_Alpha01,rid_Alpha_1
fixed acidity,0.097622,0.028857,0.001598
volatile acidity,-1.550473,-1.28227,-0.585135
citric acid,-0.136419,-0.023936,0.148863
residual sugar,0.066747,0.029764,0.005687
chlorides,-0.76794,-1.186639,-1.274221
free sulfur dioxide,0.003998,0.003804,0.001296
total sulfur dioxide,-0.001057,-0.001289,-0.000583
density,-113.045446,-38.237306,-22.285153
pH,0.51589,0.2085,0.082972
sulphates,0.701082,0.59155,0.300782


In [31]:
#Evaluate the RMSE
RMSE=[['Para alpha=1', RMSEalpha_1], ['Para alpha=0.1', RMSEalpha_01]]
print(RMSE)

[['Para alpha=1', 0.7607146212084613], ['Para alpha=0.1', 0.7199798419799723]]


# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [32]:
from sklearn.linear_model import Lasso
#Lasso con alpha 0.01
lassoreg_001 = Ridge(alpha=0.01, normalize=True)
lassoreg_001.fit(X_train, y_train)
y_pred = lassoreg_001.predict(X_test)
RMSE_lasso_alpha_001=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
lass_Alpha_001=pd.DataFrame(lassoreg_001.coef_)

#Lasso con alpha 0.1
lassoreg_01 = Ridge(alpha=0.1, normalize=True)
lassoreg_01.fit(X_train, y_train)
y_pred = lassoreg_01.predict(X_test)
RMSE_lasso_alpha_01=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
lass_Alpha_01=pd.DataFrame(lassoreg_01.coef_)


#Lasso con alpha 1
lassoreg_1 = Ridge(alpha=1, normalize=True)
lassoreg_1.fit(X_train, y_train)
y_pred = lassoreg_1.predict(X_test)
RMSE_lasso_alpha_1=np.sqrt(metrics.mean_squared_error(y_test, y_pred))
lass_Alpha_1=pd.DataFrame(lassoreg_1.coef_)

In [33]:
#Compare the coefficients with the linear regression
result_lasso=pd.DataFrame(index=Vars)
result_lasso['MCO']=result['Coeficientes']
result_lasso['rid_lasso001']=lassoreg_001.coef_
result_lasso['rid_lasso01']=lassoreg_01.coef_
result_lasso['rid_lasso1']=lassoreg_1.coef_
result_lasso

Unnamed: 0,MCO,rid_lasso001,rid_lasso01,rid_lasso1
fixed acidity,0.097622,0.073793,0.028857,0.001598
volatile acidity,-1.550473,-1.521911,-1.28227,-0.585135
citric acid,-0.136419,-0.126037,-0.023936,0.148863
residual sugar,0.066747,0.055037,0.029764,0.005687
chlorides,-0.76794,-0.864181,-1.186639,-1.274221
free sulfur dioxide,0.003998,0.004144,0.003804,0.001296
total sulfur dioxide,-0.001057,-0.001177,-0.001289,-0.000583
density,-113.045446,-85.79224,-38.237306,-22.285153
pH,0.51589,0.40496,0.2085,0.082972
sulphates,0.701082,0.668581,0.59155,0.300782


In [34]:
#Evaluate the RMSE
RMSE_lasso=['Para alpha=0.01', RMSE_lasso_alpha_001, 'Para alpha=0.1', RMSE_lasso_alpha_01, 'Para alpha=1',RMSE_lasso_alpha_1]
print(RMSE_lasso)

['Para alpha=0.01', 0.7171904221904549, 'Para alpha=0.1', 0.7199798419799723, 'Para alpha=1', 0.7607146212084613]


# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [35]:
#Create a binary target
data['good']=data['quality']
data.loc[data['good'] <= 6, 'good'] = 0
data.loc[data['good'] >=7, 'good'] = 1
data['good']=data['good'].astype(bool)

In [36]:
data.sample(7)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,type01,good
1733,8.1,0.3,0.31,1.1,0.041,49.0,123.0,0.9914,2.99,0.45,11.1,6,white,False,False
1950,8.0,0.25,0.13,17.2,0.036,49.0,219.0,0.9996,2.96,0.46,9.7,5,white,False,False
4580,5.7,0.2,0.24,13.8,0.047,44.0,112.0,0.99837,2.97,0.66,8.8,6,white,False,False
5686,10.0,0.56,0.24,2.2,0.079,19.0,58.0,0.9991,3.18,0.56,10.1,6,red,True,False
2158,7.4,0.18,0.27,1.3,0.048,26.0,105.0,0.994,3.52,0.66,10.6,6,white,False,False
4445,5.0,0.35,0.25,7.8,0.031,24.0,116.0,0.99241,3.39,0.4,11.3,6,white,False,False
1766,6.6,0.32,0.26,7.7,0.054,56.0,209.0,0.9961,3.17,0.45,8.8,5,white,False,False


In [37]:
from sklearn.linear_model import LogisticRegression
X = data.drop(['quality', 'type','good' ], axis=1)
y = data['good']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
logreg = LogisticRegression(solver='liblinear',C=1e9)
logreg.fit(X_train, y_train)

LogisticRegression(C=1000000000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [38]:
#Analyze the coefficients
Vars=np.array(list(X))
coeficientes=np.array(logreg.coef_)
print(Vars, coeficientes)

['fixed acidity' 'volatile acidity' 'citric acid' 'residual sugar'
 'chlorides' 'free sulfur dioxide' 'total sulfur dioxide' 'density' 'pH'
 'sulphates' 'alcohol' 'type01'] [[ 1.76607597e-01 -3.96894632e+00 -5.32474860e-01  6.57558075e-02
  -1.38321858e+01  1.13575199e-02 -4.82096919e-03 -8.15955223e+00
   1.10801524e+00  1.70475566e+00  8.64064047e-01  1.17338207e-01]]


In [39]:
from sklearn.metrics import f1_score
y_pred = logreg.predict(X_test)
f1_score(y_test, y_pred, average='macro')

0.6117011816819322

# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [40]:
Xlogn=data.drop(['type', 'good'], axis=1)
Xlogn.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type01
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6,False
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6,False
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6,False
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,False
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6,False


In [41]:
C_r = pd.DataFrame([0.01, 0.1, 1.0], columns=['cr'],index=['0.01', '0.1', '1.0'])
penalty = pd.DataFrame(['l1', 'l2'],columns=['p'],index=['l1', 'l2'])

In [42]:
X_tr0, X_te0, Y_tr0, Y_te0 = train_test_split(Xlogn, y, random_state=2)

In [43]:
resu = pd.DataFrame(columns=['l1', 'l2'],index=['0.01', '0.1', '1.0'])

In [44]:
for C in C_r['cr'].iteritems():
    for P in penalty['p'].iteritems():
        for i in range(12):
            logic0 = LogisticRegression(solver='liblinear', C=C[1],penalty=P[1])
            logic0.fit(X_tr0, Y_tr0)
            Y_pr0 = logic0.predict(X_te0)
            resu[P[0]][C[0]]=f1_score(Y_te0, Y_pr0, average='macro')
resu

Unnamed: 0,l1,l2
0.01,0.958473,0.895309
0.1,0.993084,0.988997
1.0,1.0,0.999012


In [45]:
logic1 = LogisticRegression(solver='liblinear', C=0.01,penalty='l1')
logic1.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [46]:
logic2 = LogisticRegression(solver='liblinear', C=0.1,penalty='l1')
logic2.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [47]:
logic3 = LogisticRegression(solver='liblinear', C=1,penalty='l1')
logic3.fit(X_tr0, Y_tr0)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l1', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [48]:
logic4 = LogisticRegression(solver='liblinear', C=0.01,penalty='l2')
logic4.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.01, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [49]:
logic5 = LogisticRegression(solver='liblinear', C=0.1,penalty='l2')
logic5.fit(X_tr0, Y_tr0)

LogisticRegression(C=0.1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [50]:
logic6 = LogisticRegression(solver='liblinear', C=1,penalty='l2')
logic6.fit(X_tr0, Y_tr0)

LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

In [51]:
Vars=np.array(list(X_tr0))
coeficientes=logic1.coef_
result=pd.DataFrame(coeficientes, columns=Vars, index = [1])

In [52]:
coef2 = pd.DataFrame(logic2.coef_,columns=Vars, index=[2])
result = pd.concat([result, coef2], axis=0)

In [53]:
coef3 = pd.DataFrame(logic3.coef_,columns=Vars, index=[3])
result = pd.concat([result, coef3], axis=0)

In [54]:
coef4 = pd.DataFrame(logic4.coef_,columns=Vars, index=[4])
result = pd.concat([result, coef4], axis=0)

In [55]:
coef5 = pd.DataFrame(logic5.coef_,columns=Vars, index=[5])
result = pd.concat([result, coef5], axis=0)

In [56]:
coef6 = pd.DataFrame(logic6.coef_,columns=Vars, index=[6])
result = pd.concat([result, coef6], axis=0)
result

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type01
1,-0.564058,0.0,0.0,-0.045127,0.0,0.0,-0.011943,0.0,-3.058861,0.0,-0.306674,2.796014,0.0
2,-1.104443,0.0,0.0,-0.146058,0.0,-0.004005,-0.018678,0.0,-10.911585,0.0,-0.745802,8.287156,0.220647
3,-0.683961,0.0,0.0,-0.090865,0.0,-0.003606,-0.013148,-21.350302,-6.888487,0.0,-0.507251,11.675189,0.0
4,-0.538827,-0.24045,0.08424,-0.046813,-0.060602,0.00138,-0.012407,-0.39008,-1.285366,-0.096194,-0.34171,2.078795,-0.242313
5,-0.75615,-0.488271,0.177243,-0.085189,-0.271128,-0.004071,-0.013109,-1.803385,-4.511828,-0.19546,-0.541632,4.786591,0.185852
6,-1.04592,-0.845954,0.009289,-0.129822,-1.010003,-0.006426,-0.012387,-5.564253,-8.883806,-0.432328,-0.70912,8.771712,1.298124


Se aprecia que al modificar los parámetros el nivel de penalidad y el C, cambia las variables que el modelo deja tomar como relevantes. Entre más se acercan a L2 y C =1 se empieza a categorizar muchas más variables como variables relevantes, hasta que alcanza la configuración de una regresión logística convencional