# Wine quality

Can we predict the quality of a red wine from 11 measured features? Let's try to build a model using ensemble methods.

![Red wine](wine.jpg "A delicious glass of quality-8 red wine.")

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix
from sklearn.neural_network import MLPClassifier

In [None]:
# Read the data and show the first few rows.
df = pd.read_csv('winequality-red.csv', delimiter=';')
df.head()

## Data exploration

Let's first see if we can spot any correlations by eye. Try different pairs of features in the plots below and see which pairs separate the quality levels well!

In [None]:
# Make some feature plots.
def plot_features(feat_x, feat_y):
    plt.figure()
    plt.scatter(df[feat_x], df[feat_y], alpha=0.5, c=df['quality'], cmap='viridis')
    cbar = plt.colorbar()
    cbar.set_label('quality')
    plt.xlabel(feat_x)
    plt.ylabel(feat_y)
    plt.show()
    
plot_features('fixed acidity', 'sulphates')
plot_features('chlorides', 'total sulfur dioxide')
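
Looking at plots pair by pair gets slow with 11 features, so a numerical check can help decide which pairs to plot. The sketch below ranks features by their Pearson correlation with `quality`. It runs on a small synthetic table (`demo`) so that it is self-contained; the coefficients used to generate `quality` are invented for illustration and say nothing about real wine. In the notebook you would presumably call `df.corr()` on the real data instead.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the wine table (assumption: not real measurements).
rng = np.random.default_rng(0)
n = 500
alcohol = rng.normal(10.4, 1.0, n)
volatile_acidity = rng.normal(0.53, 0.18, n)
ph = rng.normal(3.3, 0.15, n)

# Hypothetical rule: quality rises with alcohol, falls with volatile acidity,
# and ignores pH. Rounded and clipped to the 3..8 scale.
raw = 5.6 + 0.5 * (alcohol - 10.4) - 2.0 * (volatile_acidity - 0.53) + rng.normal(0, 0.6, n)
quality = np.clip(np.round(raw), 3, 8).astype(int)

demo = pd.DataFrame({'alcohol': alcohol,
                     'volatile acidity': volatile_acidity,
                     'pH': ph,
                     'quality': quality})

# Pearson correlation of every feature with the target, strongest first.
corr_with_quality = demo.corr()['quality'].drop('quality')
order = corr_with_quality.abs().sort_values(ascending=False).index
ranking = corr_with_quality.reindex(order)
print(ranking)
```

Keep in mind that Pearson correlation only captures linear relations, so a feature with a low score can still be useful to a tree-based model.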

## Class imbalance

The quality classes in this dataset are imbalanced. Let's visualize the class distribution by making a histogram.

In [None]:
# Make train/test split. Also, check histogram of y.
X = df[['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 
        'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol']].astype(float)
y = df['quality'].astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)
plt.hist(y, np.linspace(2.5, 8.5, 7))
plt.xlabel('quality')
plt.ylabel('number of wines')
plt.show()
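
With rare classes, a purely random split can leave very few examples of a class in the test set. One possible remedy is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both parts. The sketch below uses made-up label counts (`y_demo`) and placeholder features (`X_demo`), both assumptions for illustration; in the notebook you would pass `stratify=y` with the real `X` and `y`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced labels (assumption: stand-in for df['quality']).
y_demo = pd.Series([3] * 10 + [4] * 50 + [5] * 680 + [6] * 640 + [7] * 200 + [8] * 20,
                   name='quality')
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature column

# Relative frequency of each class.
print(y_demo.value_counts(normalize=True).sort_index())

# stratify= keeps the class proportions (nearly) identical in train and test.
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, stratify=y_demo, random_state=1)
train_share = y_tr.value_counts(normalize=True).sort_index()
test_share = y_te.value_counts(normalize=True).sort_index()
print(pd.DataFrame({'train': train_share, 'test': test_share}))
```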

## Trying different models

Now let's try two ensemble methods to predict wine quality from the features we measured.

Questions:
1. What do the parameters `n_estimators`, `max_depth` and `random_state` mean? What are good values for these parameters? Did you find other useful parameters?
2. What are the differences in the results between the random forest and gradient boosting methods?
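
If changing parameters by hand gets tedious, a small loop can show how `n_estimators` and `max_depth` affect train and test accuracy. The sketch below is one way to do that, run on a synthetic 11-feature problem from `make_classification` (an assumption, so that the example is self-contained); the parameter values in the loops are arbitrary starting points, not recommendations. To use it in the notebook, swap in `X_train`, `X_test`, `y_train` and `y_test`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic 11-feature, 3-class problem (assumption: stand-in for the wine data).
X_demo, y_demo = make_classification(n_samples=600, n_features=11, n_informative=6,
                                     n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

scores = {}
for n_estimators in (3, 30, 100):
    for max_depth in (2, 8, None):  # None: grow each tree until its leaves are pure
        rf = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth,
                                    random_state=0)
        rf.fit(X_tr, y_tr)
        scores[(n_estimators, max_depth)] = (rf.score(X_tr, y_tr), rf.score(X_te, y_te))

for (n_estimators, max_depth), (train_acc, test_acc) in scores.items():
    print(f"n_estimators={n_estimators:>3}, max_depth={max_depth}: "
          f"train={train_acc:.2f}, test={test_acc:.2f}")
```

A large gap between train and test accuracy is a hint that the model is overfitting.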

In [None]:
# Change the RandomForestClassifier parameters to make a better model!
# See: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
clf = RandomForestClassifier(n_estimators=3, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy on train set: ", clf.score(X_train, y_train))
print("Accuracy on test set: ", clf.score(X_test, y_test))

# Plot histograms of the true and predicted labels
plt.hist(y_test, np.linspace(2.5, 8.5, 7), alpha=0.5, label="true labels")
plt.hist(clf.predict(X_test), np.linspace(2.5, 8.5, 7), alpha=0.5, label="predicted labels")
plt.legend()
plt.show()

# Show the confusion matrix on the test set
# Columns: actual class; rows: predicted class (the matrix is transposed below)
classes = [3, 4, 5, 6, 7, 8]
conf_matrix = confusion_matrix(y_test, clf.predict(X_test), labels=classes)
pd.DataFrame(data=conf_matrix.T, columns=classes, index=classes)
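
The orientation of a confusion matrix is easy to mix up, so here is a tiny hand-made example (the labels are illustrative, not wine data). `confusion_matrix` returns actual classes on the rows and predicted classes on the columns; transposing it, as in the cell above, puts predictions on the rows. Naming the axes, as sketched here, is one way to make the table self-explanatory.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Tiny hand-made example (assumption: illustrative labels only).
y_true = [5, 5, 5, 6, 6, 7]
y_pred = [5, 5, 6, 6, 5, 6]
classes = [5, 6, 7]

# scikit-learn convention: rows are actual classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred, labels=classes)

# After transposing: rows are predicted classes, columns are actual classes.
cm_t = pd.DataFrame(cm.T, index=classes, columns=classes)
cm_t.index.name = 'predicted'
cm_t.columns.name = 'actual'
print(cm_t)
```

In the transposed table each column sums to the number of test samples of that actual class, and an all-zero row means the model never predicted that class.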

In [None]:
# Change the GradientBoostingClassifier parameters to make a better model!
# See: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.GradientBoostingClassifier.html
clf = GradientBoostingClassifier(n_estimators=2, learning_rate=1, max_depth=2, random_state=0)
clf.fit(X_train, y_train)
print("Accuracy on train set: ", clf.score(X_train, y_train))
print("Accuracy on test set: ", clf.score(X_test, y_test))

# Plot histograms of the true and predicted labels
plt.hist(y_test, np.linspace(2.5, 8.5, 7), alpha=0.5, label="true labels")
plt.hist(clf.predict(X_test), np.linspace(2.5, 8.5, 7), alpha=0.5, label="predicted labels")
plt.legend()
plt.show()

# Show the confusion matrix on the test set
# Columns: actual class; rows: predicted class (the matrix is transposed below)
classes = [3, 4, 5, 6, 7, 8]
conf_matrix = confusion_matrix(y_test, clf.predict(X_test), labels=classes)
pd.DataFrame(data=conf_matrix.T, columns=classes, index=classes)
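
A single train/test split can make one model look better than the other by chance. To compare the random forest and gradient boosting more fairly, one option is cross-validation with `cross_val_score`. The sketch below again uses a synthetic dataset from `make_classification` (an assumption for self-containment), and the parameter values are arbitrary; replace `X_demo` and `y_demo` with `X` and `y` to try it on the wine data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic 11-feature, 3-class problem (assumption: stand-in for the wine data).
X_demo, y_demo = make_classification(n_samples=400, n_features=11, n_informative=6,
                                     n_classes=3, random_state=0)

models = {
    'random forest': RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0),
    'gradient boosting': GradientBoostingClassifier(n_estimators=50, learning_rate=0.1,
                                                    max_depth=2, random_state=0),
}

# 5-fold cross-validation: each model is trained and scored on 5 different splits.
cv_scores = {name: cross_val_score(model, X_demo, y_demo, cv=5)
             for name, model in models.items()}

for name, fold_scores in cv_scores.items():
    print(f"{name}: mean accuracy {fold_scores.mean():.2f} (std {fold_scores.std():.2f})")
```

If the difference between the two means is smaller than the spread across folds, the data probably does not support calling one model better.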