The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g. there is no data about grape types, wine brand, wine selling price, etc.).

Input variables (based on physicochemical tests):
1 - fixed acidity 
2 - volatile acidity 
3 - citric acid 
4 - residual sugar 
5 - chlorides 
6 - free sulfur dioxide 
7 - total sulfur dioxide 
8 - density 
9 - pH 


1 - Report the r2, RMSE, and MAE of your model.
2 - See if you can improve the model using lasso regression. What alpha gives the best results?
3 - See if you can improve the model using ridge regression. What alpha gives the best results?
4 - Is there any other way to improve the results?


In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")

%matplotlib inline

In [2]:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression, Lasso, Ridge
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error
from sklearn.pipeline import make_pipeline

In [3]:
df_wine=pd.read_csv('winequality-red.csv')

In [4]:
df_wine.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [24]:
df_train, df_test = train_test_split(df_wine,test_size=0.30, random_state=30)
y_alc_train = df_train['alcohol']
x_train = df_train[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

y_alc_test = df_test['alcohol']
x_test = df_test[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

alc_linreg = LinearRegression()

In [25]:
alc_linreg.fit(x_train,y_alc_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

In [26]:
y_alc_predict = alc_linreg.predict(x_test)

In [37]:
def meas_perf(name, actual, predicted):
    """Print r2 (as a percentage), MAE, MSE and RMSE for one model."""
    r2_name = r2_score(actual, predicted) * 100
    mae_name = mean_absolute_error(actual, predicted)
    mse_name = mean_squared_error(actual, predicted)
    rmse_name = np.sqrt(mse_name)  # RMSE is the square root of the MSE

    print("r2_%s: %.2f" % (name, r2_name))
    print("mae_%s: %.2f" % (name, mae_name))
    print("mse_%s: %.2f" % (name, mse_name))
    print("rmse_%s: %.2f" % (name, rmse_name))

In [39]:
meas_perf('alcohol_linreg',y_alc_test,y_alc_predict)

r2_alcohol_linreg: 63.71
mae_alcohol_linreg: 0.51
mse_alcohol_linreg: 0.44
rmse_alcohol_linreg: 0.67
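
For reference, the four numbers printed by meas_perf can be reproduced from their definitions. The sketch below is only illustrative: manual_metrics is a hypothetical helper name and the values are toy numbers, not wine data. Note that meas_perf reports r2 multiplied by 100, so 63.71 corresponds to an r2 of about 0.64.

```python
import numpy as np

def manual_metrics(actual, predicted):
    """Compute r2, MAE, MSE and RMSE directly from their definitions."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    residuals = actual - predicted

    mse = np.mean(residuals ** 2)            # mean squared error
    mae = np.mean(np.abs(residuals))         # mean absolute error
    ss_res = np.sum(residuals ** 2)          # residual sum of squares
    ss_tot = np.sum((actual - actual.mean()) ** 2)  # total sum of squares
    r2 = 1.0 - ss_res / ss_tot               # coefficient of determination
    return {"r2": r2, "mae": mae, "mse": mse, "rmse": np.sqrt(mse)}

# Toy values chosen for easy arithmetic (not taken from the wine data)
example = manual_metrics([9.0, 10.0, 11.0, 12.0], [9.5, 10.0, 10.5, 12.0])
print(example)
```

Here r2 comes out as 0.9, MAE as 0.25 and MSE as 0.125, which should agree with r2_score, mean_absolute_error and mean_squared_error on the same inputs.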


In [40]:
df_train, df_test = train_test_split(df_wine,test_size=0.30, random_state=30)
y_qual_train = df_train['quality']
x_train = df_train[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

y_qual_test = df_test['quality']
x_test = df_test[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

qual_linreg = LinearRegression()
qual_linreg.fit(x_train,y_qual_train)
y_qual_predict = qual_linreg.predict(x_test)

meas_perf('quality_linreg',y_qual_test,y_qual_predict)

r2_quality_linreg: 25.61
mae_quality_linreg: 0.56
mse_quality_linreg: 0.50
rmse_quality_linreg: 0.71


In [42]:
df_train, df_test = train_test_split(df_wine,test_size=0.30, random_state=30)
y_sul_train = df_train['sulphates']
x_train = df_train[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

y_sul_test = df_test['sulphates']
x_test = df_test[['fixed acidity','volatile acidity','citric acid','residual sugar','chlorides','free sulfur dioxide','total sulfur dioxide','density','pH']]

sul_linreg = LinearRegression()
sul_linreg.fit(x_train,y_sul_train)
y_sul_predict = sul_linreg.predict(x_test)

meas_perf('sulphates_linreg',y_sul_test,y_sul_predict)

r2_sulphates_linreg: 32.29
mae_sulphates_linreg: 0.10
mse_sulphates_linreg: 0.02
rmse_sulphates_linreg: 0.14
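
The three cells above repeat the same split, fit and score steps with only the target column changing. One way to express that once is a small helper, sketched here under assumptions: fit_and_score is a hypothetical name, and the DataFrame is a synthetic stand-in rather than df_wine.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

def fit_and_score(df, feature_cols, target, seed=30):
    """Split 70/30, fit a LinearRegression on feature_cols, return test r2."""
    df_train, df_test = train_test_split(df, test_size=0.30, random_state=seed)
    model = LinearRegression().fit(df_train[feature_cols], df_train[target])
    return r2_score(df_test[target], model.predict(df_test[feature_cols]))

# Synthetic stand-in for df_wine (column names are made up)
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(size=(200, 3)), columns=["a", "b", "c"])
demo["target_1"] = 2 * demo["a"] - demo["b"] + rng.normal(scale=0.1, size=200)
demo["target_2"] = demo["c"] + rng.normal(scale=0.1, size=200)

# One call per target instead of one copied cell per target
scores = {t: fit_and_score(demo, ["a", "b", "c"], t)
          for t in ["target_1", "target_2"]}
print(scores)
```

With the notebook's data, the same call would take df_wine, the nine physicochemical column names, and each of 'alcohol', 'quality' and 'sulphates' in turn.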


In [55]:
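best_r2, best_alpha = -np.inf, None
for step in range(1, 100):
    alpha = step / 100
    alc_lasso = Lasso(alpha=alpha, random_state=30)
    alc_lasso.fit(x_train, y_alc_train)
    alc_lasso_predict = alc_lasso.predict(x_test)
    r2 = r2_score(y_alc_test, alc_lasso_predict) * 100
    # Remember the alpha that produced the best score; printing the loop
    # variable after the loop would always report the last value tried (0.99).
    if r2 > best_r2:
        best_r2, best_alpha = r2, alpha
print('lasso_alpha_alc: ', best_alpha)
print('lasso_r2_alc:', best_r2)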

alpha_alc:  0.99
r2_alc: 11.043444558375814


In [53]:
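best_r2, best_alpha = -np.inf, None
for step in range(1, 100):
    alpha = step / 100
    qual_lasso = Lasso(alpha=alpha, random_state=30)
    qual_lasso.fit(x_train, y_qual_train)
    qual_lasso_predict = qual_lasso.predict(x_test)
    r2 = r2_score(y_qual_test, qual_lasso_predict) * 100
    # Remember the alpha that produced the best score; printing the loop
    # variable after the loop would always report the last value tried (0.99).
    if r2 > best_r2:
        best_r2, best_alpha = r2, alpha
print('lasso_alpha_qual: ', best_alpha)
print('lasso_r2_qual:', best_r2)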

alpha_qual:  0.99
r2_qual: 16.473556827837776


In [54]:
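best_r2, best_alpha = -np.inf, None
for step in range(1, 100):
    alpha = step / 100
    sul_lasso = Lasso(alpha=alpha, random_state=30)
    sul_lasso.fit(x_train, y_sul_train)
    sul_lasso_predict = sul_lasso.predict(x_test)
    r2 = r2_score(y_sul_test, sul_lasso_predict) * 100
    # Remember the alpha that produced the best score; printing the loop
    # variable after the loop would always report the last value tried (0.99).
    if r2 > best_r2:
        best_r2, best_alpha = r2, alpha
print('lasso_alpha_sul: ', best_alpha)
print('lasso_r2_sul:', best_r2)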

alpha_sul:  0.99
r2_sul: 3.83067953408075
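
Lasso penalises raw coefficient sizes, so features on very different scales are shrunk unevenly. In the rows shown earlier, density varies only in the third or fourth decimal place while total sulfur dioxide is in the tens. That may be part of why the lasso scores fall below the plain linear regression. A minimal sketch of one common remedy, standardising the features and letting LassoCV pick alpha by cross-validation, is given below. The data is synthetic and the variable names are assumptions, so this is an illustration rather than a result for the wine data.

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: one feature with a wide spread, one with a tiny spread
rng = np.random.default_rng(30)
n = 300
wide = rng.normal(scale=10.0, size=n)
narrow = rng.normal(scale=0.002, size=n)   # tiny spread, loosely like density
X = np.column_stack([wide, narrow])
y = 0.05 * wide + 300.0 * narrow + rng.normal(scale=0.1, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=30)

# Standardise first so the L1 penalty treats both features comparably,
# then let 5-fold cross-validation on the training set choose alpha.
model = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=30))
model.fit(X_train, y_train)

best_alpha = model.named_steps["lassocv"].alpha_
test_r2 = model.score(X_test, y_test)
print("alpha chosen by cross-validation:", best_alpha)
print("test r2:", test_r2)
```

Because alpha is chosen on the training folds only, the test score stays an honest estimate of performance on unseen rows.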


In [58]:
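best_r2, best_alpha = -np.inf, None
for step in range(1, 100):
    alpha = step / 100
    qual_ridge = Ridge(alpha=alpha, random_state=30)
    qual_ridge.fit(x_train, y_qual_train)
    qual_ridge_predict = qual_ridge.predict(x_test)
    r2 = r2_score(y_qual_test, qual_ridge_predict) * 100
    # Remember the alpha that produced the best score; printing the loop
    # variable after the loop would always report the last value tried (0.99).
    if r2 > best_r2:
        best_r2, best_alpha = r2, alpha
print('ridge_alpha_qual: ', best_alpha)
print('ridge_r2_qual:', best_r2)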

ridge_alpha_qual:  0.99
ridge_r2_qual: 19.839077929978156
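
The alpha searches above score every candidate on the test set, so the winning alpha is tuned to those particular rows. A hedged alternative is to compare alphas with k-fold cross-validation on the training data and keep the test set for a final check. The helper below, best_alpha_by_cv, is a hypothetical name, and the data is synthetic.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import cross_val_score

def best_alpha_by_cv(estimator_cls, alphas, X, y, cv=5):
    """Return (best_alpha, best_mean_r2) using k-fold CV on the given data."""
    best_alpha, best_score = None, -np.inf
    for alpha in alphas:
        scores = cross_val_score(estimator_cls(alpha=alpha), X, y,
                                 cv=cv, scoring="r2")
        if scores.mean() > best_score:
            best_alpha, best_score = alpha, scores.mean()
    return best_alpha, best_score

# Synthetic training data; two of the five features carry no signal
rng = np.random.default_rng(30)
X_demo = rng.normal(size=(200, 5))
coefs = np.array([1.0, -2.0, 0.0, 0.5, 0.0])
y_demo = X_demo @ coefs + rng.normal(scale=0.5, size=200)

# Same grid as the loops above: 0.01, 0.02, ..., 0.99
alphas = np.arange(1, 100) / 100
results = {cls.__name__: best_alpha_by_cv(cls, alphas, X_demo, y_demo)
           for cls in (Lasso, Ridge)}
for name, (alpha, score) in results.items():
    print(name, alpha, round(score, 3))
```

The same helper could be called with x_train and any of the three training targets. scikit-learn's LassoCV and RidgeCV offer a built-in version of this search.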


In [None]:
#4. Any other way to improve the results?
"""Use a polynomial model. independent variables should consider duration and not just chemical content. Consider removing outliers"""