<img src="header_anwender.png" align="left"/>

# Anwendungsbeispiel Regression Wine Quality

Das Ziel des Beispieles ist es, die Qualität eines Weines aus physikalischen Messgrößen zu schätzen. Dazu verwenden wir verschiedene Arten der Regression.
Wir verwenden einen Datensatz von Weinen aus Portugal erstellt von Paulo Cortez [1]. Die Details der Erstellung der Daten sind unter folgendem Link zu finden [http://www3.dsi.uminho.pt/pcortez/wine5.pdf](http://www3.dsi.uminho.pt/pcortez/wine5.pdf). 

```
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
```



## Methode und Details der Daten


- Import der Module
- Laden der Daten



In [1]:
#
# Import der Module
#
import pandas as pd 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn import metrics 

In [30]:
#
# Laden der Daten aus einem CSV File. Der Separator ist hier ein ';'
#
df = pd.read_csv('data/winequality/winequality-red.csv', sep=';')

In [3]:
#
# Anzeige der Dimensionen des Datensatzes
#
print(df.shape)

(1599, 12)


In [4]:
#
# Anzeige der ersten Datensätze für Kontrolle
#
df.head(20)

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5
6,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5
7,7.3,0.65,0.0,1.2,0.065,15.0,21.0,0.9946,3.39,0.47,10.0,7
8,7.8,0.58,0.02,2.0,0.073,9.0,18.0,0.9968,3.36,0.57,9.5,7
9,7.5,0.5,0.36,6.1,0.071,17.0,102.0,0.9978,3.35,0.8,10.5,5


In [5]:
df.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.9949,3.45,0.58,10.5,5
1595,5.9,0.55,0.1,2.2,0.062,39.0,51.0,0.99512,3.52,0.76,11.2,6
1596,6.3,0.51,0.13,2.3,0.076,29.0,40.0,0.99574,3.42,0.75,11.0,6
1597,5.9,0.645,0.12,2.0,0.075,32.0,44.0,0.99547,3.57,0.71,10.2,5
1598,6.0,0.31,0.47,3.6,0.067,18.0,42.0,0.99549,3.39,0.66,11.0,6


In [6]:
#
# Labels werden in y gespeichert, die restlichen Daten in x (ohne quality). Drop löscht ein Feature
#
y_complete = df['quality']
x_complete = df.drop(['quality'], axis=1)


In [7]:
y_complete.head()

0    5
1    5
2    5
3    6
4    5
Name: quality, dtype: int64

In [8]:
x_complete.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4


In [9]:
#
# Aufteilen der Daten in Training Daten und Testdaten
# Beachte auch die neue Schreibweise bei der Zuweisung eines Rückgabewertes der Funktion
#
x_train, x_test, y_train, y_test = train_test_split ( x_complete, y_complete, train_size=0.8, random_state=42 )

In [10]:
x_train.shape

(1279, 11)

In [31]:
#
# Anlegen eines Modelles für lineare Regression
# Training des Modelles mit Daten (fit)
#
regressor = LinearRegression()
regressor.fit(x_train,y_train)

In [12]:
#
# Kurzer Blick auf die Parameter des Modelles
#
print(regressor.coef_)

[ 2.30853339e-02 -1.00130443e+00 -1.40821461e-01  6.56431104e-03
 -1.80650315e+00  5.62733439e-03 -3.64444893e-03 -1.03515936e+01
 -3.93687732e-01  8.41171623e-01  2.81889567e-01]


In [13]:
#
# Test durch Vorhersage mit dem Modell auf beiden Datensätzen (test und train)
#
prediction_train = regressor.predict(x_train)
prediction_test = regressor.predict(x_test) 

In [14]:
prediction_train

array([5.68864364, 6.05664943, 5.69269687, ..., 4.9703554 , 6.61115563,
       6.69768634])

In [32]:
prediction_train.shape

(1279,)

In [15]:
# 
# Auswertungen der Qualität des Modelles für Regression
# Unterschied zwischen test und train Qualität
#
print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

test  root mean squared error: 0.6245199307980126
train root mean squared error: 0.6512995910592836


In [16]:
#
# Auswertung weiterer Qualitätsparameter für test
#
print('test mean absolute error:     {}'.format(metrics.mean_absolute_error(y_test, prediction_test)))
print('test mean squared error:      {}'.format(metrics.mean_squared_error(y_test, prediction_test)))

test mean absolute error:     0.5035304415524375
test mean squared error:      0.3900251439639544


In [17]:
#
# Hilfsfunktion zum zählen
#
def countAccuracy(prediction,y):
    prediction_quality_test = np.round_(prediction)
    y_test_data = y.values

    correct, incorrect = 0,0
    for index in range(prediction_test.shape[0]):
        if prediction_quality_test[index] == y_test_data[index]:
            correct= correct + 1
        else:
            incorrect= incorrect + 1

    print('count accuracy: {}'.format((correct/(correct+incorrect))))

In [18]:
#
# Jetzt Aufruf der Funktion
# Accuracy für Test Daten
#
countAccuracy(prediction_test,y_test)

count accuracy: 0.571875


In [19]:
#
# Accuracy für Training Daten
#
countAccuracy(prediction_train,y_train)

count accuracy: 0.578125


In [36]:
#
# Test eines anderen Modelles für Regression (randomforest regression)
# 
random_regressor = RandomForestRegressor(n_estimators = 10, random_state = 42)
random_regressor.fit(x_train, y_train);

In [37]:
prediction_train = random_regressor.predict(x_train)
prediction_test = random_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

test  root mean squared error: 0.5662099875487892
train root mean squared error: 0.24482126364696796


In [38]:
countAccuracy(prediction_test,y_test)

count accuracy: 0.64375


In [23]:
#
# Wie könnten wir Qualität in diesem Kontext breiter definieren?
#

In [24]:
#
# Hilfsfunktion zum zählen
#
def countAccuracyRelaxed(prediction,y):
    prediction_quality_test = np.round_(prediction)
    y_test_data = y.values

    correct, incorrect = 0,0
    for index in range(prediction_test.shape[0]):
        if prediction_quality_test[index] == y_test_data[index]:
            correct= correct + 1
        elif prediction_quality_test[index] == y_test_data[index] + 1: 
            # wir betrachten die Qualität auch als richtig geschätzt, wenn es die nächsten oder vorherige Klasse war
            correct= correct + 1
        elif prediction_quality_test[index] == y_test_data[index] - 1:
            correct= correct + 1
        else:
            incorrect= incorrect + 1

    print('count accuracy: {}'.format((correct/(correct+incorrect))))


In [25]:
countAccuracyRelaxed(prediction_test,y_test)

count accuracy: 0.975


# Test eines Neuronalen Netzwerkes

In [47]:
nn_regressor = MLPRegressor(hidden_layer_sizes=(200,40,10), random_state=42, max_iter=20000, activation='relu')
nn_regressor.fit(x_train, y_train);

In [48]:
prediction_train = nn_regressor.predict(x_train)
prediction_test = nn_regressor.predict(x_test) 

print('test  root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_test, prediction_test))))
print('train root mean squared error: {}'.format(np.sqrt(metrics.mean_squared_error(y_train, prediction_train))))

test  root mean squared error: 0.6261059668753602
train root mean squared error: 0.6414603439966595


In [49]:
countAccuracy(prediction_test,y_test)

count accuracy: 0.553125


In [50]:
countAccuracyRelaxed(prediction_test,y_test)

count accuracy: 0.978125
