<img src="header.png" align="left"/>

# Exercise Regression of Wine Quality

The aim of the example is to estimate the quality of a wine from physical measurands. To do this, we use different types of regression.
We use a dataset of wines from Portugal created by Paulo Cortez [1]. The details of the creation of the data can be found at the following link. [http://www3.dsi.uminho.pt/pcortez/wine5.pdf](http://www3.dsi.uminho.pt/pcortez/wine5.pdf). 

```
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
```



## Import of python modules

In [6]:
#
# Turn of some warnings
#
from warnings import simplefilter
# ignore all future warnings
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=Warning)

#
# Import modules
#
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import metrics 

In [7]:
#
# Load data from a CSV file. The feature separator is a ';'
#
df = pd.read_csv('data/winequality/winequality-red.csv', sep=';')

In [8]:
#
# Show dimensions of the dataframe
#
print(df.shape)

(1599, 12)


<div class="alert alert-block alert-info">

## Task
    
Print the first 20 records for checking (1 point)

</div>

In [10]:
#
# Print also the last records for checking.
#
df.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6


In [11]:
#
# Labels are stored in y_complete, x_complete contains only the features. 
# Drop removes a column if axis=1
#
y_complete = df['quality']
x_complete = df.drop(['quality'], axis=1)

<div class="alert alert-block alert-info">

## Task
    
Search the internet for the description of Dataframe.drop in pandas and give a short description here (1 point)

</div>

In [12]:
#
# Check labels
#
y_complete.head()

0    5
1    5
2    5
3    6
4    5
Name: quality, dtype: int64

In [13]:
#
# Check features
#
x_complete.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


In [14]:
#
# Split data into training and test data
#
# Note this special feature of python to assign many variables from the return values of a function
#
x_train, x_test, y_train, y_test = train_test_split ( x_complete, y_complete, train_size=0.8, random_state=42 )

In [15]:
#
# Check shape of training data
#
x_train.shape

(1279, 11)

In [16]:
#
# Setup a model for linear regression
#
regressor = LinearRegression()

In [17]:
#
# Train (fit) the model with training features and training labels
#
regressor.fit(x_train,y_train)

LinearRegression()

In [18]:
#
# Check the resulting parameters of the model
#
print(regressor.coef_)

[ 2.30853339e-02 -1.00130443e+00 -1.40821461e-01  6.56431104e-03
 -1.80650315e+00  5.62733439e-03 -3.64444893e-03 -1.03515936e+01
 -3.93687732e-01  8.41171623e-01  2.81889567e-01]


<div class="alert alert-block alert-info">

## Task
    
Explain the values in regressor.coef_ . How can we intuitively understand those parameters? (4 points)

</div>

In [19]:
#
# Run the model with training and test data and store the results
#
prediction_train = regressor.predict(x_train)
prediction_test = regressor.predict(x_test) 

In [20]:
#
# Check the prediction results
#
prediction_train

array([5.68864364, 6.05664943, 5.69269687, ..., 4.9703554 , 6.61115563,
       6.69768634])

In [21]:
# 
# Measure the quality of the model estimation on the test data
# Using root mean sqare
#
print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))

test  root mean squared error: 0.6245199307980128


In [22]:
#
# Measure the quality of the model estimation on the training data
#
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

train root mean squared error: 0.6512995910592836


In [23]:
#
# Using mean absolute error and mean squared error
#
print('test mean absolute error:     {}'.format(metrics.mean_absolute_error(y_test, prediction_test)))
print('test mean squared error:      {}'.format(metrics.mean_squared_error(y_test, prediction_test)))

test mean absolute error:     0.5035304415524369
test mean squared error:      0.3900251439639547


In [41]:
#
# Support function for counting usable predictions
# Note: continuous regression result has to be rounded to categorical quality class (label)
#
def countAccuracy(prediction,y):
    prediction_quality_test = np.round_(prediction)
    y_test_data = y.values

    correct, incorrect = 0,0
    for index in range(prediction_test.shape[0]):
        if prediction_quality_test[index] == y_test_data[index]:
            correct= correct + 1
        else:
            incorrect= incorrect + 1

    print('count accuracy: {:.4f}'.format((correct/(correct+incorrect))))

In [42]:
#
# Use function to measure accuracy of test data prediction
#
countAccuracy(prediction_test,y_test)

count accuracy: 0.5594


In [43]:
#
# Use function to measure accuracy of training data prediction
#
countAccuracy(prediction_train,y_train)

count accuracy: 0.5687


# Test a Random Forest Regression Model

<div class="alert alert-block alert-info">

## Task
    
Implement a new regressor model using the class RandomForestRegressor (2 points)

</div>

In [1]:
random_regressor = ...
random_regressor.fit(x_train, y_train)

AttributeError: 'ellipsis' object has no attribute 'fit'

In [45]:
#
# Predict and measure quality
#
prediction_train = random_regressor.predict(x_train)
prediction_test = random_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

test  root mean squared error: 0.5662099875487892
train root mean squared error: 0.24482126364696796


In [46]:
#
# Count accuracy
#
countAccuracy(prediction_test,y_test)

count accuracy: 0.6438


In [47]:
#
# Is there a way to measure the quality in a more relaxed way?
#

In [48]:
#
# New function for measuring accuracy in a more relaxed way
#
def countAccuracyRelaxed(prediction,y):
    prediction_quality_test = np.round_(prediction)
    y_test_data = y.values

    correct, incorrect = 0,0
    for index in range(prediction_test.shape[0]):
        if prediction_quality_test[index] == y_test_data[index]:
            correct= correct + 1
        elif prediction_quality_test[index] == y_test_data[index] + 1: 
            correct= correct + 1
        elif prediction_quality_test[index] == y_test_data[index] - 1:
            correct= correct + 1
        else:
            incorrect= incorrect + 1

    print('count accuracy relaxed: {:.4f}'.format((correct/(correct+incorrect))))

In [49]:
#
# Test new measurement
#
countAccuracyRelaxed(prediction_test,y_test)

count accuracy relaxed: 0.9750


# Test a Neuronal Network

In [50]:
#
# Build a neural network regressor (using MLPRegressor)
#
nn_regressor = MLPRegressor(hidden_layer_sizes=(20,40,10), random_state=42, max_iter=4000, activation='relu')
nn_regressor.fit(x_train, y_train);

In [51]:
#
# Predict and measure quality
#
prediction_train = nn_regressor.predict(x_train)
prediction_test = nn_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

test  root mean squared error: 0.6860031880948588
train root mean squared error: 0.6789002656809678


In [52]:
#
# Measure quality
#
countAccuracy(prediction_test,y_test)

count accuracy: 0.5594


In [53]:
#
# Measure quality relaxed
#
countAccuracyRelaxed(prediction_test,y_test)

count accuracy relaxed: 0.9719
