<img src="header.png" align="left"/>

# Exercise Regression of Wine Quality

The aim of this exercise is to estimate the quality of a wine from physicochemical measurements. To do this, we use different types of regression.
We use a dataset of wines from Portugal created by Paulo Cortez [1]. Details on how the data was created can be found at [http://www3.dsi.uminho.pt/pcortez/wine5.pdf](http://www3.dsi.uminho.pt/pcortez/wine5.pdf).

```
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
```

**NOTE**

Document your results by adding a markdown cell or a Python cell (as a comment) and writing your statements into it. For some tasks, the result cell is already provided.


[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ditomax/mlexercises/blob/master/03%20Exercise%20Regression%20wine%20quality.ipynb)



## Import of Python modules

In [None]:
#
# Prepare colab
#
COLAB=False
try:
    %tensorflow_version 2.x
    print("running on google colab")
    COLAB=True
except:
    print("not running on google colab")

#
# Turn off some warnings
#
from warnings import simplefilter
# ignore future warnings, then all remaining warning categories
simplefilter(action='ignore', category=FutureWarning)
simplefilter(action='ignore', category=Warning)

#
# Import modules
#
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import metrics 

In [None]:
#
# Load data from a CSV file. The feature separator is a ';'
#


if COLAB:
    df = pd.read_csv('https://raw.githubusercontent.com/ditomax/mlexercises/master/data/winequality/winequality-red.csv', sep=';')
else:
    df = pd.read_csv('data/winequality/winequality-red.csv', sep=';')

In [None]:
#
# Show dimensions of the dataframe
#
print(df.shape)

<div class="alert alert-block alert-info">

## Task
    
Print the first 20 records for checking (1 point)

</div>

In [None]:
#
# your code here
#

In [None]:
#
# Print also the last records for checking.
#
df.tail()

In [None]:
#
# Labels are stored in y_complete, x_complete contains only the features. 
# Drop removes a column if axis=1
#
y_complete = df['quality']
x_complete = df.drop(['quality'], axis=1)

<div class="alert alert-block alert-info">

## Task
    
Search the internet for the description of `DataFrame.drop` in pandas and give a short description here (1 point)

</div>

In [None]:
#
# your description here
#

In [None]:
#
# Check labels
#
y_complete.head()

In [None]:
#
# Check features
#
x_complete.head()

In [None]:
#
# Split data into training and test data
#
# Note: Python can unpack the multiple return values of a function into several variables in one statement
#
x_train, x_test, y_train, y_test = train_test_split(x_complete, y_complete, train_size=0.8, random_state=42)
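
The split and the tuple unpacking can be tried on a tiny table first. The following is a minimal sketch, assuming a hypothetical toy dataframe `toy` with made-up columns `a`, `b` and `label`; none of these names come from the wine dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, 2 feature columns, 1 label column
toy = pd.DataFrame({
    'a': np.arange(10),
    'b': np.arange(10) * 2,
    'label': np.arange(10) % 3,
})
toy_x = toy.drop(['label'], axis=1)
toy_y = toy['label']

# train_test_split returns four objects; unpacking assigns them in one statement
tx_train, tx_test, ty_train, ty_test = train_test_split(
    toy_x, toy_y, train_size=0.8, random_state=42
)

# With train_size=0.8, 8 of the 10 rows go to training and 2 to testing
print(tx_train.shape, tx_test.shape)
print(ty_train.shape, ty_test.shape)
```

Fixing `random_state` makes the split reproducible, so repeated runs select the same rows.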

In [None]:
#
# Check shape of training data
#
x_train.shape

In [None]:
#
# Setup a model for linear regression
#
regressor = LinearRegression()

In [None]:
#
# Train (fit) the model with training features and training labels
#
regressor.fit(x_train,y_train)

In [None]:
#
# Check the resulting parameters of the model
#
print(regressor.coef_)

<div class="alert alert-block alert-info">

## Task
    
Explain the values in `regressor.coef_`. How can we intuitively understand those parameters? (4 points)

</div>

In [None]:
#
# your description here
#

In [None]:
#
# Run the model with training and test data and store the results
#
prediction_train = regressor.predict(x_train)
prediction_test = regressor.predict(x_test) 

In [None]:
#
# Check the prediction results
#
prediction_train

In [None]:
# 
# Measure the quality of the model estimation on the test data
# using the root mean squared error (RMSE)
#
print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))

In [None]:
#
# Measure the quality of the model estimation on the training data
#
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

In [None]:
#
# Using mean absolute error and mean squared error
#
print('test mean absolute error:     {}'.format(metrics.mean_absolute_error(y_test, prediction_test)))
print('test mean squared error:      {}'.format(metrics.mean_squared_error(y_test, prediction_test)))
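
To see how the three error measures relate, they can be computed by hand on a few values. This is a small sketch with made-up numbers (`y_true` and `y_pred` are hypothetical and not taken from the wine data); it compares the manual results with the functions from `sklearn.metrics`.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predictions, chosen only for illustration
y_true = np.array([5, 6, 5, 7])
y_pred = np.array([5.5, 5.0, 5.0, 6.0])

errors = y_pred - y_true            # per-sample error
mae = np.mean(np.abs(errors))       # mean absolute error
mse = np.mean(errors ** 2)          # mean squared error
rmse = np.sqrt(mse)                 # root mean squared error

print('manual  MAE: {}  MSE: {}  RMSE: {}'.format(mae, mse, rmse))
print('sklearn MAE: {}  MSE: {}'.format(
    metrics.mean_absolute_error(y_true, y_pred),
    metrics.mean_squared_error(y_true, y_pred)))
```

Because the errors are squared before averaging, the MSE and RMSE weight large deviations more heavily than the MAE does. The RMSE has the same unit as the label, which makes it easier to compare with the quality scale.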

In [None]:
#
# Support function for counting usable predictions
# Note: continuous regression result has to be rounded to categorical quality class (label)
#
def countAccuracy(prediction,y):
    # round the continuous predictions to the nearest quality class
    prediction_quality = np.round(prediction)
    y_data = y.values

    correct, incorrect = 0,0
    # loop over all predictions passed to the function
    for index in range(prediction.shape[0]):
        if prediction_quality[index] == y_data[index]:
            correct = correct + 1
        else:
            incorrect = incorrect + 1

    print('count accuracy: {:.4f}'.format((correct/(correct+incorrect))))
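
The counting loop in `countAccuracy` can also be written as a single vectorized comparison. The following is one possible sketch, not part of the exercise; the function name `count_accuracy_vectorized` and the toy values are hypothetical.

```python
import numpy as np
import pandas as pd

def count_accuracy_vectorized(prediction, y):
    """Return the fraction of rounded predictions that equal the label."""
    rounded = np.round(prediction)
    # The element-wise comparison yields booleans; their mean is the hit rate
    return float(np.mean(rounded == np.asarray(y)))

# Hypothetical predictions and labels, chosen only for illustration
toy_prediction = np.array([5.4, 5.6, 6.2, 4.9])
toy_labels = pd.Series([5, 5, 6, 5])

toy_accuracy = count_accuracy_vectorized(toy_prediction, toy_labels)
print('count accuracy: {:.4f}'.format(toy_accuracy))
```

Returning the value instead of printing it makes it easier to compare several models later.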

In [None]:
#
# Use function to measure accuracy of test data prediction
#
countAccuracy(prediction_test,y_test)

In [None]:
#
# Use function to measure accuracy of training data prediction
#
countAccuracy(prediction_train,y_train)

# Test a Random Forest Regression Model

<div class="alert alert-block alert-info">

## Task
    
Implement a new regressor model using the class RandomForestRegressor (2 points)

</div>

In [None]:
random_regressor = ...
random_regressor.fit(x_train, y_train)

In [None]:
#
# Predict and measure quality
#
prediction_train = random_regressor.predict(x_train)
prediction_test = random_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

In [None]:
#
# Count accuracy
#
countAccuracy(prediction_test,y_test)

In [None]:
#
# Is there a way to measure the quality in a more relaxed way?
#

In [None]:
#
# New function for measuring accuracy in a more relaxed way
#
def countAccuracyRelaxed(prediction,y):
    # round the continuous predictions to the nearest quality class
    prediction_quality = np.round(prediction)
    y_data = y.values

    correct, incorrect = 0,0
    # loop over all predictions passed to the function
    for index in range(prediction.shape[0]):
        # a prediction counts as correct if it is at most one class away from the label
        if prediction_quality[index] == y_data[index]:
            correct = correct + 1
        elif prediction_quality[index] == y_data[index] + 1:
            correct = correct + 1
        elif prediction_quality[index] == y_data[index] - 1:
            correct = correct + 1
        else:
            incorrect = incorrect + 1

    print('count accuracy relaxed: {:.4f}'.format((correct/(correct+incorrect))))
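
The three "correct" branches of `countAccuracyRelaxed` all express the same condition: the rounded prediction is at most one quality class away from the label. A compact sketch of that idea follows; the name `count_accuracy_relaxed_vectorized`, the `tolerance` parameter and the toy values are hypothetical and not part of the exercise.

```python
import numpy as np
import pandas as pd

def count_accuracy_relaxed_vectorized(prediction, y, tolerance=1):
    """Return the fraction of rounded predictions within `tolerance` classes of the label."""
    rounded = np.round(prediction)
    distance = np.abs(rounded - np.asarray(y))
    return float(np.mean(distance <= tolerance))

# Hypothetical predictions and labels, chosen only for illustration
toy_prediction = np.array([5.4, 6.6, 3.2, 7.8])
toy_labels = pd.Series([5, 5, 6, 7])

relaxed = count_accuracy_relaxed_vectorized(toy_prediction, toy_labels)
strict = count_accuracy_relaxed_vectorized(toy_prediction, toy_labels, tolerance=0)
print('relaxed: {:.4f}  strict: {:.4f}'.format(relaxed, strict))
```

With `tolerance=0` the function reduces to the strict accuracy, so one function covers both measurements.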

In [None]:
#
# Test new measurement
#
countAccuracyRelaxed(prediction_test,y_test)

# Test a Neural Network

<div class="alert alert-block alert-info">

## Task
    
Experiment with a neural network for regression. Use MLPRegressor class for this task. (2 points)

</div>

In [None]:
#
# Build a neural network regressor (using MLPRegressor)
#
nn_regressor = ...
nn_regressor.fit(x_train, y_train);

In [None]:
#
# Predict and measure quality
#
prediction_train = nn_regressor.predict(x_train)
prediction_test = nn_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

In [None]:
#
# Measure quality
#
countAccuracy(prediction_test,y_test)

In [None]:
#
# Measure quality relaxed
#
countAccuracyRelaxed(prediction_test,y_test)