<img src="header.png" align="center"/>

# Application Example: Wine Quality Regression

The goal of this example is to estimate the quality of a wine from physicochemical measurements. To do so, we use several types of regression.
We use a dataset of Portuguese wines compiled by Paulo Cortez [1]. Details on how the data were created are available at [http://www3.dsi.uminho.pt/pcortez/wine5.pdf](http://www3.dsi.uminho.pt/pcortez/wine5.pdf).

```
[1] P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
```



## Method and Data Details

- Import the modules
- Load the data



In [1]:
# Import the modules
import pandas as pd 
from sklearn.model_selection import train_test_split 
from sklearn.linear_model import LinearRegression 
from sklearn import metrics 
import matplotlib.pyplot as plt 
import numpy as np 
import seaborn as sns

In [11]:
df = pd.read_csv('data/winequality/winequality-red.csv', sep=';')
df.head()

| | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
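
The file uses semicolons rather than commas as the column separator, which is why `sep=';'` is passed to `read_csv`. A minimal sketch of the difference, reading an in-memory string instead of the file: the string holds three of the columns and two rows copied from the `head()` output, and the names `raw`, `wrong`, and `right` are made up for illustration.

```python
import io

import pandas as pd

# Two rows in a semicolon-separated layout; values copied from the head() output
# (only three of the columns, to keep the example short).
raw = "alcohol;pH;quality\n9.4;3.51;5\n9.8;3.2;5\n"

# Default separator (comma): each line ends up in a single merged column.
wrong = pd.read_csv(io.StringIO(raw))

# Semicolon separator: the three columns are parsed as intended.
right = pd.read_csv(io.StringIO(raw), sep=";")

print(wrong.shape, right.shape)
print(right.dtypes)
```

Checking `df.shape` and `df.dtypes` right after loading is a cheap way to catch a wrong separator before any modelling starts.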


In [12]:
correlations = df.corr()['quality'].drop('quality')
print(correlations)

fixed acidity           0.124052
volatile acidity       -0.390558
citric acid             0.226373
residual sugar          0.013732
chlorides              -0.128907
free sulfur dioxide    -0.050656
total sulfur dioxide   -0.185100
density                -0.174919
pH                     -0.057731
sulphates               0.251397
alcohol                 0.476166
Name: quality, dtype: float64
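
Since the sign only gives the direction of the relationship, it can help to rank the features by the absolute value of their correlation with `quality`. The following is a minimal sketch on a small DataFrame whose values are invented for illustration (they are not taken from the wine data). The same reordering could be applied to the `correlations` Series printed above.

```python
import pandas as pd

# Toy data, invented for illustration only (not the wine dataset).
toy = pd.DataFrame({
    "alcohol":          [9.0, 10.0, 11.0, 12.0, 13.0],
    "volatile acidity": [0.8, 0.9, 0.5, 0.7, 0.4],
    "residual sugar":   [2.0, 1.5, 2.5, 1.8, 2.2],
    "quality":          [4, 5, 5, 6, 7],
})

# Pearson correlation of every feature with the target.
toy_corr = toy.corr()["quality"].drop("quality")

# Reorder by magnitude so strong negative correlations are not buried at the bottom.
order = toy_corr.abs().sort_values(ascending=False).index
ranked = toy_corr.reindex(order)
print(ranked)
```

Applied to the values above, such a ranking would put `alcohol`, `volatile acidity`, and `sulphates` first.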


In [14]:
y = df['quality']
x = df.drop(['quality'], axis=1)


In [15]:
# Split into training and test sets; random_state fixes the shuffle for reproducibility
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=3)
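
`train_test_split` is called here without a `test_size`, so scikit-learn falls back to its default and holds out 25 % of the rows for testing. The `random_state` argument only fixes the shuffle. A small sketch on synthetic data (100 made-up rows, not the wine data; all `*_demo`, `xa_*`, and `xb_*` names are hypothetical) shows the resulting sizes and the reproducibility:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and target (100 rows, 2 columns).
x_demo = np.arange(200).reshape(100, 2)
y_demo = np.arange(100)

# No test_size given: scikit-learn uses a 75/25 split by default.
xa_train, xa_test, ya_train, ya_test = train_test_split(x_demo, y_demo, random_state=3)
print(xa_train.shape, xa_test.shape)

# Same random_state: a second call produces the identical split.
xb_train, xb_test, yb_train, yb_test = train_test_split(x_demo, y_demo, random_state=3)
print(np.array_equal(xa_test, xb_test))
```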

In [16]:
# Fit an ordinary least-squares linear regression on the training data
regressor = LinearRegression()
regressor.fit(x_train, y_train)

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)

In [17]:
print(regressor.coef_)

[ 2.32736478e-02 -9.91535569e-01 -1.41267594e-01  8.11925585e-03
 -1.59192407e+00  5.50005690e-03 -3.54198549e-03 -6.06916616e+00
 -4.06325022e-01  8.23603060e-01  2.94180891e-01]
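
The raw coefficient array is hard to read without the column names. Below is a minimal sketch, on invented toy data rather than the wine dataset, of how `coef_` could be paired with the feature names in a pandas Series. The names `x_toy`, `y_toy`, `model`, and `coef_table` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data, invented for illustration: the target is exactly 2*a - 3*b + 1.
x_toy = pd.DataFrame({"a": [0.0, 1.0, 2.0, 3.0, 4.0],
                      "b": [1.0, 0.0, 2.0, 1.0, 3.0]})
y_toy = 2 * x_toy["a"] - 3 * x_toy["b"] + 1

model = LinearRegression().fit(x_toy, y_toy)

# Label each coefficient with the name of its feature column.
coef_table = pd.Series(model.coef_, index=x_toy.columns, name="coefficient")
print(coef_table)
print("intercept:", model.intercept_)
```

In the notebook the equivalent would be `pd.Series(regressor.coef_, index=x.columns)`. Coefficient magnitudes depend on the scale of each feature, so a large value such as the eighth one above (`density`, whose values in the table vary only in the third decimal place) does not by itself indicate high importance.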


In [18]:
# Predict the quality for the training set and for the held-out test set
train_pred = regressor.predict(x_train)
print(train_pred)
test_pred = regressor.predict(x_test)
print(test_pred)

[5.33390209 5.33458216 5.94987004 ... 6.39109929 6.20184044 5.27719203]
[5.09908272 5.65580865 5.90927233 6.13810421 5.00495043 5.44066916
 5.05213654 6.15418124 5.52055599 5.77519663 5.61796132 5.23498287
 5.23127869 5.31466808 6.46439345 5.04000017 5.85280918 5.19300859
 6.0919118  6.34255254 6.41600994 5.52588684 5.80534686 4.93255733
 5.16159004 5.48207651 5.13834113 6.59480979 5.89478275 5.73709
 6.09133736 6.29529369 4.91616391 5.88376873 5.10515437 5.96400538
 6.80732578 5.03724291 5.25485064 5.88376873 5.17431803 4.84899008
 6.4903037  5.40465942 5.30375415 5.83513199 5.70825368 5.23988973
 5.24870634 5.46267267 5.08516492 5.61701512 6.01804854 6.32751521
 5.4628648  5.36127481 5.10151339 4.92009423 5.2240759  5.08722001
 4.79258875 5.43567381 5.25054561 5.6798788  5.85050157 6.52603804
 5.37941315 5.71598525 5.16966353 5.98159839 5.63912543 5.6004759
 5.74068429 5.22739422 5.98184324 5.51332746 5.40647057 5.68342011
 5.64578506 5.73709    6.23278066 5.29710528 4.66398697 6.042
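
The predictions are continuous values, whereas `quality` in the data is an integer score. If integer predictions are wanted, one option is to round them and count how often the rounded value matches the true score. The sketch below uses a few of the printed predictions shortened to two decimals, paired with invented true scores, so the resulting share is purely illustrative.

```python
import numpy as np

# Predictions shortened from the output above; the true scores are invented.
pred_demo = np.array([5.09, 5.66, 5.91, 6.14, 4.66, 6.81])
true_demo = np.array([5, 6, 6, 6, 4, 7])

# Round to the nearest integer score.
# Note: np.round rounds exact .5 values to the nearest even integer.
rounded = np.round(pred_demo).astype(int)

# Share of predictions that hit the exact score after rounding.
exact_match = float(np.mean(rounded == true_demo))
print(rounded, exact_match)
```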

In [23]:
# Compute the RMSE on the training and test sets (argument order: y_true, y_pred)
train_rmse = metrics.mean_squared_error(y_train, train_pred) ** 0.5
print(train_rmse)
test_rmse = metrics.mean_squared_error(y_test, test_pred) ** 0.5
print(test_rmse)
# Round the test predictions to the nearest integer quality score
# (np.round replaces np.round_, which was removed in NumPy 2.0)
predicted_data = np.round(test_pred)
#print(predicted_data)
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, test_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, test_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, test_pred)))

0.6524682504422629
0.6269476348621658
Mean Absolute Error: 0.483685159820527
Mean Squared Error: 0.39306333685926353
Root Mean Squared Error: 0.6269476348621658
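
A test RMSE of about 0.63 is easier to judge against a reference point. One common choice, which the notebook itself does not use, is a baseline that always predicts the mean of the training targets, reported together with the R² score. The sketch below uses invented numbers (not results from the wine data), and the helper `rmse` and all `*_demo` names are hypothetical.

```python
import numpy as np
from sklearn import metrics

# Invented example values (not results from the wine data).
y_train_demo = np.array([5, 5, 6, 6, 7, 5])
y_test_demo = np.array([5, 6, 7, 5])
model_pred = np.array([5.2, 5.8, 6.5, 5.4])

# Baseline: always predict the mean of the training targets.
baseline_pred = np.full(len(y_test_demo), y_train_demo.mean())


def rmse(y_true, y_pred):
    """Root mean squared error as the square root of scikit-learn's MSE."""
    return float(np.sqrt(metrics.mean_squared_error(y_true, y_pred)))


baseline_rmse = rmse(y_test_demo, baseline_pred)
model_rmse = rmse(y_test_demo, model_pred)
r2 = metrics.r2_score(y_test_demo, model_pred)

print("baseline RMSE:", baseline_rmse)
print("model RMSE:   ", model_rmse)
print("R^2:          ", r2)
```

A model is only useful to the extent that its RMSE is clearly below the baseline value. In the notebook the same comparison would use `y_train.mean()`, `y_test`, and `test_pred`.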
