# Exercise 6

## SVM & Regularization


For this homework we consider a set of observations on a number of red and white wine varieties involving their chemical properties and ranking by tasters. Wine industry shows a recent growth spurt as social drinking is on the rise. The price of wine depends on a rather abstract concept of wine appreciation by wine tasters, opinion among whom may have a high degree of variability. Pricing of wine depends on such a volatile factor to some extent. Another key factor in wine certification and quality assessment is physicochemical tests which are laboratory-based and takes into account factors like acidity, pH level, presence of sugar and other chemical properties. For the wine market, it would be of interest if human quality of tasting can be related to the chemical properties of wine so that certification and quality assessment and assurance process is more controlled.

Two datasets are available of which one dataset is on red wine and have 1599 different varieties and the other is on white wine and have 4898 varieties. All wines are produced in a particular area of Portugal. Data are collected on 12 different properties of the wines one of which is Quality, based on sensory data, and the rest are on chemical properties of the wines including density, acidity, alcohol content etc. All chemical properties of wines are continuous variables. Quality is an ordinal variable with possible ranking from 1 (worst) to 10 (best). Each variety of wine is tasted by three independent tasters and the final rank assigned is the median rank given by the tasters.

A predictive model developed on this data is expected to provide guidance to vineyards regarding quality and price expected on their produce without heavy reliance on volatility of wine tasters.

In [239]:
import pandas as pd
import numpy as np

In [240]:
data_r = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_red.csv')
data_w = pd.read_csv('https://github.com/albahnsen/PracticalMachineLearningClass/raw/master/datasets/Wine_data_white.csv')

In [241]:
data = data_w.assign(type = 'white')

data = data.append(data_r.assign(type = 'red'), ignore_index=True)
data.sample(5)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type
621,6.5,0.26,0.43,8.9,0.083,50.0,171.0,0.9965,2.85,0.5,9.0,5,white
1486,6.2,0.15,0.49,0.9,0.033,17.0,51.0,0.9932,3.3,0.7,9.4,6,white
5825,8.4,0.67,0.19,2.2,0.093,11.0,75.0,0.99736,3.2,0.59,9.2,4,red
73,8.6,0.23,0.46,1.0,0.054,9.0,72.0,0.9941,2.95,0.49,9.1,6,white
5258,8.2,0.7,0.23,2.0,0.099,14.0,81.0,0.9973,3.19,0.7,9.4,5,red


# Exercise 6.1

Show the frecuency table of the quality by type of wine

In [242]:
func = lambda x:x.count()
table=pd.pivot_table(data,  index=['quality'], columns='type', values='density', aggfunc=func)
table

type,red,white
quality,Unnamed: 1_level_1,Unnamed: 2_level_1
3,10.0,20.0
4,53.0,163.0
5,681.0,1457.0
6,638.0,2198.0
7,199.0,880.0
8,18.0,175.0
9,,5.0


<font color=blue>*Se evidencia que las mayores frecuencias tanto del vino rojo como blanco se  encuentran entre tipo de la calidad número 5 y la calidad número 6.*</font>

# SVM

# Exercise 6.2

* Create a quality binary target quality
* Standarized the features (not the quality)
* Create two Linear SVM's for the white and red wines, repectively.


In [243]:
data['Calidad2']=data['quality']>=6
datawhite=data[data['type']=='white']
datared=data[data['type']=='red']

<font color=blue>*Se crea un campo de tipo booleano donde dada la escala de calidad le defino si el vino es bueno o malo que para mi criterio se considera bueno independiente del vino si la calidad es mayor o igual a 6 el cual lo denomino Calidad2,  luego se crearon dos datasets uno para cada tipo de vino y con sus nombres respectivamente.*</font>

In [244]:
## Dataset con información de los vinos rojos
datared.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,type,Calidad2
4898,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,False
4899,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red,False
4900,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red,False
4901,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red,True
4902,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red,False


<font color=blue>Se van a crear 2 modelos uno para cada tipo de vino, pero antes debemos estandarizar los datasets con la variables explicativas, excepto los campos 'quality'	'type'	</font>

In [245]:
## Revisamos los tipos de datos que tiene el dataframe
datawhite.dtypes

fixed acidity           float64
volatile acidity        float64
citric acid             float64
residual sugar          float64
chlorides               float64
free sulfur dioxide     float64
total sulfur dioxide    float64
density                 float64
pH                      float64
sulphates               float64
alcohol                 float64
quality                   int64
type                     object
Calidad2                   bool
dtype: object

In [246]:
##Creamos una estructura con las que serán nuestra variables explicativas del dataset del vino blanco como del vino rojo
Xw=datawhite[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide',
              'total sulfur dioxide','density','pH','sulphates','alcohol']].values
Xr=datared[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide',
              'total sulfur dioxide','density','pH','sulphates','alcohol']].values

## Creamos nuestras variables de respuesta
Yw=datawhite['Calidad2'].values
Yr=datared['Calidad2'].values

In [271]:
## Dividir la base en entrenamiento y test para cada uno de los vinos
from sklearn.model_selection import train_test_split  
Xw_train, Xw_test, Yw_train, Yw_test = train_test_split(Xw, Yw, test_size=0.3, random_state=0) 
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(Xr, Yr, test_size=0.3, random_state=0)  

## Estandarizamos los valores de las variables explicativas
from sklearn.preprocessing import StandardScaler
scaler  =  StandardScaler ()

Xw_train_s = scaler.fit_transform(Xw_train)
Xw_test_s = scaler.transform (Xw_test)

Xr_train_s = scaler.fit_transform(Xr_train)
Xr_test_s = scaler.transform (Xr_test)

In [248]:
##Crear modelo SVC lineal para el vino blanco 
from sklearn.svm import SVC 

clf_lin = SVC(kernel='linear')

clf_lin.fit(Xw_train_s, Yw_train)
y_pred_w = clf_lin.predict(Xw_test_s)

##Crear modelo SVC lineal para el vino rojo
clf_lin.fit(Xr_train_s, Yr_train)
y_pred_r = clf_lin.predict(Xr_test_s)

# Exercise 6.3

Test the two SVM's using the different kernels (‘poly’, ‘rbf’, ‘sigmoid’)


In [249]:
## Cambiar el kernel a poly y entrenar  dos datases de vino
clf_poly = SVC(kernel='poly')

clf_poly.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_p=clf_poly.predict(Xr_test_s)

clf_poly.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_p=clf_poly.predict(Xw_test_s)

In [250]:
## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf')

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

In [251]:
## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_sig = SVC(kernel='sigmoid')

clf_sig.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_sig=clf_sig.predict(Xr_test_s)

clf_sig.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_sig=clf_sig.predict(Xw_test_s)

Obtener el accuracy de los modelos de clasificacion por soportes entrenados con los diferentes kernels

In [252]:
from sklearn.metrics import accuracy_score
## Vino blanco 
acc_w_lin=accuracy_score(Yw_test,y_pred_w)
acc_w_poly=accuracy_score(Yw_test,y_pred_w_p)
acc_w_rbf=accuracy_score(Yw_test,y_pred_w_rbf)
acc_w_sig=accuracy_score(Yw_test,y_pred_w_sig)
## Vino rojo
acc_r_lin=accuracy_score(Yr_test,y_pred_r)
acc_r_poly=accuracy_score(Yr_test,y_pred_r_p)
acc_r_rbf=accuracy_score(Yr_test,y_pred_r_rbf)
acc_r_sig=accuracy_score(Yr_test,y_pred_r_sig)

print('Metrica de clasificación de la calidad para los modelos entrenados con el dataset de VINO BLANCO por tipo de kernel empleado:')
print('\n Lin: '+ str(round(acc_w_lin,2))+' \n Polyn: '+ str(round(acc_w_poly,2))+' \n RBF: '+ str(round(acc_w_rbf,2))+' \n Sig: '+ str(round(acc_w_sig,2))+' \n')
print('Metrica de clasificación de la calidad para los modelos entrenados con el dataset de VINO ROJO por tipo de kernel empleado:')
print('\n Lin: '+ str(round(acc_r_lin,2))+' \n Polyn: '+ str(round(acc_r_poly,2))+' \n RBF: '+ str(round(acc_r_rbf,2))+' \n Sig: '+ str(round(acc_r_sig,2)))

Metrica de clasificación de la calidad para los modelos entrenados con el dataset de VINO BLANCO por tipo de kernel empleado:

 Lin: 0.74 
 Polyn: 0.73 
 RBF: 0.77 
 Sig: 0.65 

Metrica de clasificación de la calidad para los modelos entrenados con el dataset de VINO ROJO por tipo de kernel empleado:

 Lin: 0.75 
 Polyn: 0.72 
 RBF: 0.74 
 Sig: 0.67


<font color=Blue> *El kernel con mejor clasificacion es RBF* </font>

# Exercise 6.4
Using the best SVM find the parameters that gives the best performance

'C': [0.1, 1, 10, 100, 1000], 'gamma': [0.01, 0.001, 0.0001]

In [253]:
## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=0.1,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=0.1,  gamma=0.01:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=0.1,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=0.1,  gamma=0.001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=0.1,  gamma=0.0001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=0.1,  gamma=0.0001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2))+'\n')

#############################################################################

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1.0,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1.0,  gamma=0.01:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1.0,  gamma=0.001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1.0,  gamma=0.001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1.0,  gamma=0.0001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1.0,  gamma=0.0001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2))+'\n')

#############################################################################

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=10.0,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=10.0,  gamma=0.01:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=10.0,  gamma=0.001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=10.0,  gamma=0.001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=10.0,  gamma=0.0001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=10.0,  gamma=0.0001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2))+'\n')


#############################################################################

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=100.0,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=100.0,  gamma=0.01:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=100.0,  gamma=0.001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=100.0,  gamma=0.001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=100.0,  gamma=0.0001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=100.0,  gamma=0.0001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2))+'\n')

#############################################################################

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1000.0,  gamma=0.01)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1000.0,  gamma=0.01:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1000.0,  gamma=0.001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1000.0,  gamma=0.001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2)))

## Cambiar el kernel a rbf y entrenar  dos datases de vino
clf_rbf = SVC(kernel='rbf', C=1000.0,  gamma=0.0001)

clf_rbf.fit(Xr_train_s, Yr_train)   ## Vino rojo
y_pred_r_rbf=clf_rbf.predict(Xr_test_s)

clf_rbf.fit(Xw_train_s, Yw_train)   ## Vino blanco
y_pred_w_rbf=clf_rbf.predict(Xw_test_s)

acc_w=accuracy_score(Yw_test,y_pred_w_rbf)
acc_r=accuracy_score(Yr_test,y_pred_r_rbf)

print('C=1000.0,  gamma=0.0001:')    
print('Vino Blanco -->  '+ str(round(acc_w,2))+'. Inicial-->'+str(round(acc_w_rbf,2)))
print('Vino Rojo --> '+ str(round(acc_r,2))+'. Inicial-->'+str(round(acc_r_rbf,2))+'\n')

C=0.1,  gamma=0.01:
Vino Blanco -->  0.67. Inicial-->0.77
Vino Rojo --> 0.73. Inicial-->0.74
C=0.1,  gamma=0.001:
Vino Blanco -->  0.67. Inicial-->0.77
Vino Rojo --> 0.73. Inicial-->0.74
C=0.1,  gamma=0.0001:
Vino Blanco -->  0.64. Inicial-->0.77
Vino Rojo --> 0.53. Inicial-->0.74

C=1.0,  gamma=0.01:
Vino Blanco -->  0.74. Inicial-->0.77
Vino Rojo --> 0.74. Inicial-->0.74
C=1.0,  gamma=0.001:
Vino Blanco -->  0.67. Inicial-->0.77
Vino Rojo --> 0.73. Inicial-->0.74
C=1.0,  gamma=0.0001:
Vino Blanco -->  0.64. Inicial-->0.77
Vino Rojo --> 0.53. Inicial-->0.74

C=10.0,  gamma=0.01:
Vino Blanco -->  0.76. Inicial-->0.77
Vino Rojo --> 0.75. Inicial-->0.74
C=10.0,  gamma=0.001:
Vino Blanco -->  0.72. Inicial-->0.77
Vino Rojo --> 0.74. Inicial-->0.74
C=10.0,  gamma=0.0001:
Vino Blanco -->  0.67. Inicial-->0.77
Vino Rojo --> 0.73. Inicial-->0.74

C=100.0,  gamma=0.01:
Vino Blanco -->  0.76. Inicial-->0.77
Vino Rojo --> 0.73. Inicial-->0.74
C=100.0,  gamma=0.001:
Vino Blanco -->  0.75. Inicial

<font color=Blue> *De las opciones dadas en los parámetros, que brindan un mejor desempeño (aunque son muy cercanos y similares) comparado con el accuracy de SVM inicial, es el costo de penalización de clasificación más alto, y el gama mas bajo que serían C=1000.0,  gamma=0.01:* </font>

# Exercise 6.5

Compare the results with other methods

In [254]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear',C=1e9)
logreg.fit(Xr_train_s, Yr_train)  
y_pred_rlog = logreg.predict(Xr_test_s) 

logreg.fit(Xw_train_s, Yw_train)  
y_pred_wlog = logreg.predict(Xw_test_s) 

acc_wlog=accuracy_score(Yw_test,y_pred_wlog)
acc_rlog=accuracy_score(Yr_test,y_pred_rlog)

print('Vino Blanco -->  '+ str(round(acc_wlog,2)))
print('Vino Rojo --> '+ str(round(acc_rlog,2)))

Vino Blanco -->  0.74
Vino Rojo --> 0.75


<font color=Blue> *La regresión logística aquí me da también un accuracy cercano a los SVM para la clasificación de la calidad de los vinos blanco y rojo* </font>

# Regularization

# Exercise 6.6


* Train a linear regression to predict wine quality (Continous)

* Analyze the coefficients

* Evaluate the RMSE

In [255]:
# Para predecir mi calidad continua tomo el campo quality
## Creamos nuestras variables de respuesta
Yw_c=datawhite['quality'].values
Yr_c=datared['quality'].values

## Dividir la base en entrenamiento y test para cada uno de los vinos
from sklearn.model_selection import train_test_split  
Xw_train, Xw_test, Yw_train, Yw_test = train_test_split(Xw, Yw_c, test_size=0.3, random_state=0) 
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(Xr, Yr_c, test_size=0.3, random_state=0)  

from sklearn.linear_model import LinearRegression
linreg = LinearRegression()

linreg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n'+str(linreg.coef_))
y_pred_rlin = linreg.predict(Xr_test_s) 


linreg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n'+str(linreg.coef_))
y_pred_wlin = linreg.predict(Xw_test_s) 


 Coeficientes Vino Rojo 
[ 0.03523454 -0.22500299 -0.01917881  0.03301609 -0.09404393  0.02142016
 -0.10287699 -0.03175893 -0.06166729  0.15260506  0.28471337]

 Coeficientes Vino Blanco 
[ 0.09636772 -0.18212585 -0.00178702  0.48284855 -0.00936514  0.0871764
 -0.01019314 -0.5914199   0.13181841  0.07656047  0.15829533]


<font color=Blue> *Los coeficientes son pequeños y están cercanos a cero  por lo que con la regulacion vamos a identificar cuales coeficientes entrarían a mi modelo* </font>

In [256]:
from math import sqrt
from sklearn import metrics
RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wlin))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test, y_pred_rlin))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

Vino blanco RMSE: 0.7797679548193168
Vino rojo RMSE: 0.6330721652193918


<font color=Blue> *Tenemos unos RMSE altos por lo que nos expresa que hay diferencias entre los valores de calidad predichos y los valores de calidad reales para cada uno de los datasets de vino* </font>

# Exercise 6.7

* Estimate a ridge regression with alpha equals 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

*ridge regression*

In [257]:
from sklearn.linear_model import Ridge
#alpha=0.1
ridgereg = Ridge(alpha=0.1, normalize=True)

ridgereg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n'+str(ridgereg.coef_))
y_pred_rridg0 = ridgereg.predict(Xr_test_s) 

ridgereg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n'+str(ridgereg.coef_))
y_pred_wridg0 = ridgereg.predict(Xw_test_s) 

###############################################################

#alpha=1.0
ridgereg = Ridge(alpha=1.0, normalize=True)

ridgereg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n'+str(ridgereg.coef_))
y_pred_rridg1 = ridgereg.predict(Xr_test_s) 

ridgereg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n'+str(ridgereg.coef_))
y_pred_wridg1 = ridgereg.predict(Xw_test_s) 


 Coeficientes Vino Rojo 
[ 0.04671119 -0.20098627  0.0110126   0.03533417 -0.08771638  0.01206531
 -0.09294553 -0.05696875 -0.03941277  0.14213138  0.24847562]

 Coeficientes Vino Blanco 
[-0.00705278 -0.1649712   0.0011319   0.17096024 -0.03894095  0.0937976
 -0.03558273 -0.15091429  0.04693054  0.04856757  0.31119153]

 Coeficientes Vino Rojo 
[ 0.03198728 -0.12515816  0.04854226  0.01505283 -0.05036445 -0.00898519
 -0.05605514 -0.05090195 -0.01553189  0.0819124   0.1500815 ]

 Coeficientes Vino Blanco 
[-0.02264332 -0.08658794  0.00600274  0.03011663 -0.05449677  0.04654489
 -0.03102156 -0.07614674  0.02447019  0.02576662  0.15879778]


<font color=Blue> *Los coeficientes de la regresión ridge son más grandes que los coeficientes de la regresión lineal si se utiliza un alpha pequeño que fue de 0.1, lo contrario  utilizar un alpha=1.0 donde los coeficientes se reducen respecto a los de la regresión lineal.*</font>  
<font color=Green> **Conclusión**: los coeficientes empleando regresión ridge con un alpha=1.0 son más pequeños que la regresión lineal</font>

In [258]:
print('Con alpha 0.1:')

RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wridg0 ))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test,y_pred_rridg0 ))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

print('Con alpha 1.0:')

RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wridg1 ))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test, y_pred_rridg1 ))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

Con alpha 0.1:
Vino blanco RMSE: 0.7797233024279017
Vino rojo RMSE: 0.6344194631893529
Con alpha 1.0:
Vino blanco RMSE: 0.8120347448335603
Vino rojo RMSE: 0.656966348695471


<font color=Blue> *Como la regresión rigde con un alpha de 0.1 es una regresión lineal,los RMSE son similares entre estos dos ejercicios. 
    
Se tienen un error RMSE más alto con la regresión ridge y un alpha de 1.0 que con la regresión lineal*</font>

# Exercise 6.8

* Estimate a lasso regression with alpha equals 0.01, 0.1 and 1.
* Compare the coefficients with the linear regression
* Evaluate the RMSE

In [259]:
from sklearn.linear_model import Lasso
#alpha=0.01
lassoreg = Lasso(alpha=0.01, normalize=True)

lassoreg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n')
print(lassoreg.coef_)
y_pred_rlass0 = lassoreg.predict(Xr_test_s) 

lassoreg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n')
print(lassoreg.coef_)
y_pred_wlass0 = lassoreg.predict(Xw_test_s) 

###############################################################
#alpha=0.1
lassoreg = Lasso(alpha=0.1, normalize=True)

lassoreg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n')
print(lassoreg.coef_)
y_pred_rlass01 = lassoreg.predict(Xr_test_s) 

lassoreg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n')
print(lassoreg.coef_)
y_pred_wlass01 = lassoreg.predict(Xw_test_s) 

###############################################################

#alpha=1.0
lassoreg = Lasso(alpha=1.0, normalize=True)

lassoreg.fit(Xr_train_s, Yr_train) 
print('\n Coeficientes Vino Rojo \n')
print(lassoreg.coef_)
y_pred_rlass1 = lassoreg.predict(Xr_test_s) 

ridgereg.fit(Xw_train_s, Yw_train)
print('\n Coeficientes Vino Blanco \n')
print(lassoreg.coef_)
y_pred_wlass1 = lassoreg.predict(Xw_test_s) 


 Coeficientes Vino Rojo 

[ 0.         -0.00601074  0.          0.         -0.         -0.
 -0.         -0.         -0.          0.          0.04489403]

 Coeficientes Vino Blanco 

[-0. -0. -0. -0. -0.  0. -0. -0.  0.  0.  0.]

 Coeficientes Vino Rojo 

[ 0. -0.  0.  0. -0. -0. -0. -0. -0.  0.  0.]

 Coeficientes Vino Blanco 

[-0. -0. -0. -0. -0.  0. -0. -0.  0.  0.  0.]

 Coeficientes Vino Rojo 

[ 0. -0.  0.  0. -0. -0. -0. -0. -0.  0.  0.]

 Coeficientes Vino Blanco 

[ 0. -0.  0.  0. -0. -0. -0. -0. -0.  0.  0.]


<font color=Blue> *Se identifica que con el alfa más alto se penalizan los coeficientes demasiado hasta dejarlos en 0*</font>

In [260]:
print('Con alpha 0.01:')

RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wlass0 ))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test,y_pred_rlass0 ))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

print('Con alpha 0.1:')

RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wlass01 ))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test, y_pred_rlass01 ))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

print('Con alpha 1.0:')

RMSE_W = sqrt(metrics.mean_squared_error(Yw_test, y_pred_wlass1 ))
RMSE_R = sqrt(metrics.mean_squared_error(Yr_test, y_pred_rlass1 ))
print('Vino blanco RMSE:', RMSE_W)
print('Vino rojo RMSE:', RMSE_R)

Con alpha 0.01:
Vino blanco RMSE: 0.9016157829038678
Vino rojo RMSE: 0.7463573531910372
Con alpha 0.1:
Vino blanco RMSE: 0.9016157829038678
Vino rojo RMSE: 0.7698374031730867
Con alpha 1.0:
Vino blanco RMSE: 0.9173632627457485
Vino rojo RMSE: 0.7698374031730867


<font color=Blue> *Pero con estos coeficientes  regularizados incluso el RMSE es más alto que los valores de la regresión lineal en el literal 6.6*</font>

# Exercise 6.9

* Create a binary target

* Train a logistic regression to predict wine quality (binary)

* Analyze the coefficients

* Evaluate the f1score

In [268]:
#Regresion logistica No Regularizada

Yw=datawhite['Calidad2'].values
Yr=datared['Calidad2'].values

Xw=datawhite[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide',
              'total sulfur dioxide','density','pH','sulphates','alcohol']].values
Xr=datared[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide',
              'total sulfur dioxide','density','pH','sulphates','alcohol']].values

from sklearn.model_selection import train_test_split  
Xw_train, Xw_test, Yw_train, Yw_test = train_test_split(Xw, Yw, test_size=0.3, random_state=0) 
Xr_train, Xr_test, Yr_train, Yr_test = train_test_split(Xr, Yr, test_size=0.3, random_state=0)  

logreg.fit(Xr_train, Yr_train)  

print('\n Coeficientes Vino Rojo \n')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test) 

logreg.fit(Xw_train, Yw_train)  
y_pred_wlog = logreg.predict(Xw_test) 

print('\n Coeficientes Vino Blanco \n')
print(logreg.coef_)


 Coeficientes Vino Rojo 

[[ 1.76600860e-04 -2.81077502e+00 -5.43776994e-01  3.77630268e-02
  -1.43025943e+00  1.83700127e-02 -1.72805109e-02 -1.02359854e+00
  -1.59096912e+00  1.84585685e+00  8.18645383e-01]]

 Coeficientes Vino Blanco 

[[-2.48722852e-01 -5.34851309e+00 -2.01215938e-01  5.64052825e-02
  -7.65574694e-01  1.54999406e-02 -3.63377029e-03 -2.31314189e+00
  -5.99766443e-01  1.20527141e+00  9.28992538e-01]]


<font color=blue> *Tenemos unos coeficiente muy altos por lo que revisaremos más adelante  qué pasa si en la regularizacion*</font> 

In [269]:
from sklearn.metrics import f1_score

f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')
print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))


 F1 Score 

Vino Blanco -->  0.81
Vino Rojo --> 0.76


# Exercise 6.10

* Estimate a regularized logistic regression using:
* C = 0.01, 0.1 & 1.0
* penalty = ['l1, 'l2']
* Compare the coefficients and the f1score

In [272]:
print('\n C = 0.01 penalty=l1  \n')
##C = 0.01 penalty='l1'
logreg = LogisticRegression(C=0.01, penalty='l1',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))

#####################################################################

print('\n C = 0.1 penalty=l2  \n')
##C = 0.1 penalty='l2'
logreg = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))

#######################################################################
print('\n C = 0.1 penalty=l1  \n')
##C = 0.1 penalty='l1'
logreg = LogisticRegression(C=0.1, penalty='l1',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))

#####################################################################

print('\n C = 0.1 penalty=l2  \n')
##C = 0.1 penalty='l2'
logreg = LogisticRegression(C=0.1, penalty='l2',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))

################################################################################

print('\n C = 1.0 penalty=l1  \n')
##C = 0.01 penalty='l1'
logreg = LogisticRegression(C=1.0, penalty='l1',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))

#####################################################################

print('\n C = 1.0 penalty=l2  \n')
##C = 1.0 penalty='l2'
logreg = LogisticRegression(C=1.0, penalty='l2',solver='liblinear')
logreg.fit(Xw_train_s, Yw_train)
print('Coeficientes Vino Blanco')
print(logreg.coef_)
y_pred_wlog = logreg.predict(Xw_test_s)
f1_wlog=f1_score(Yw_test,y_pred_wlog,average='binary')

logreg.fit(Xr_train_s, Yr_train)
print('Coeficientes Vino Rojo')
print(logreg.coef_)
y_pred_rlog = logreg.predict(Xr_test_s) 
f1_rlog=f1_score(Yr_test,y_pred_rlog,average='binary')

print('\n F1 Score \n')
print('Vino Blanco -->  '+ str(round(f1_wlog,2)))
print('Vino Rojo --> '+ str(round(f1_rlog,2)))


 C = 0.01 penalty=l1  

Coeficientes Vino Blanco
[[ 0.         -0.42327251  0.          0.01602842  0.          0.04358978
   0.          0.          0.          0.          0.77825124]]
Coeficientes Vino Rojo
[[ 0.         -0.24246861  0.          0.          0.          0.
  -0.01922545  0.          0.          0.          0.44893893]]

 F1 Score 

Vino Blanco -->  0.8
Vino Rojo --> 0.73

 C = 0.1 penalty=l2  

Coeficientes Vino Blanco
[[-0.00157654 -0.62798939 -0.02584504  0.68928294 -0.03278152  0.2306784
  -0.07980088 -0.64293464  0.12298759  0.19818929  0.87952009]]
Coeficientes Vino Rojo
[[ 0.1879359  -0.55352889 -0.1410814   0.0994284  -0.20672211  0.13054003
  -0.49703206 -0.09974538 -0.07467492  0.47284181  0.78670739]]

 F1 Score 

Vino Blanco -->  0.82
Vino Rojo --> 0.76

 C = 0.1 penalty=l1  

Coeficientes Vino Blanco
[[-0.04654983 -0.62817604 -0.01535439  0.53097126 -0.02583383  0.19842833
  -0.04748152 -0.40132266  0.07019489  0.16269065  1.00209436]]
Coeficientes Vino 

<font color=blue>*Efectivamente cuando jugamos con las combinaciones de costo de penalizacion y el tipo de penalización,
nos permite identificar las variables que realmente nos servirían al modelo, evidenciando que el ponderado de éxitos en la clasificacion no cambia mucho y los puedo maximizar con las variables que aplican*</font> 

Por Ana Milena Rodriguez