In [None]:
%pip install --upgrade pip
%pip install pandas
%pip install seaborn
%pip install numpy

After installing the basic packages we need for graphing, we will import the data from the CSV files using the pandas library. We will use the seaborn library, which is built on matplotlib, to graph the data.

In [None]:
import pandas as pd
import seaborn as sns
import numpy as np


# data for red wine and white wine. 
red_wine = pd.read_csv("./winequality-red.csv", sep=";")
white_wine = pd.read_csv("./winequality-white.csv", sep=";")

# remove any empty rows
red_wine = red_wine.dropna()
white_wine = white_wine.dropna()


In [None]:
red_wine.head()

In [None]:
white_wine.head()

The import seems to have worked, so now we can graph the wine quality distribution.

In [None]:
# plot red wine quality distribution
sns.countplot(x="quality", data=red_wine)
print(red_wine["quality"].describe()) 

In [None]:
sns.countplot(x="quality", data=white_wine)
# describe white wine quality
print(white_wine["quality"].describe())
print(white_wine["quality"].value_counts())

The quality scores look roughly normally distributed. We can also see that the average quality is `5.8` for the white wine and `5.6` for the red wine.

Using the pandas library we can create a new column called `alcohol_category` and assign the values to it.

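We are going to use the `25th` and `75th` percentiles as cut points: values up to the `25th` percentile are low, values between the `25th` and `75th` percentiles are medium, and values above the `75th` percentile are high.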

In [None]:
print(red_wine["alcohol"].describe())

# 3-valued alcohol category
red_wine["alcohol_category"] = pd.cut(red_wine["alcohol"], 
                                      bins=[
                                        red_wine["alcohol"].min()-1,
                                        red_wine["alcohol"].quantile(0.25),
                                        red_wine["alcohol"].quantile(0.75),
                                        red_wine["alcohol"].max()
                                            ], 
                                      labels=["low", "medium", "high"])

`pd.cut` ignored the minimum value, so I subtracted 1 from the lowest bin edge to make sure the minimum is included.
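The reason the minimum gets dropped is that `pd.cut` builds right-closed intervals `(left, right]` by default, so a value equal to the lowest edge falls outside the first bin. A minimal sketch on toy values (not the wine data) shows the effect and the `include_lowest=True` alternative to shifting the edge:

```python
import pandas as pd

# Toy values standing in for the alcohol column (illustrative, not from the dataset).
alcohol = pd.Series([8.4, 9.5, 10.2, 11.1, 14.9])

# Bin edges: minimum, 25th percentile, 75th percentile, maximum.
edges = [alcohol.min(), alcohol.quantile(0.25), alcohol.quantile(0.75), alcohol.max()]
labels = ["low", "medium", "high"]

# Default behaviour: intervals are (left, right], so the minimum falls outside the first bin.
default_bins = pd.cut(alcohol, bins=edges, labels=labels)

# include_lowest=True closes the first interval on the left, so the minimum is kept.
closed_bins = pd.cut(alcohol, bins=edges, labels=labels, include_lowest=True)

print("unbinned values (default):", default_bins.isna().sum())
print("unbinned values (include_lowest=True):", closed_bins.isna().sum())
```

Either approach works; `include_lowest=True` just avoids picking an arbitrary offset for the lowest edge.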

In [None]:
print(white_wine["alcohol"].describe())

# 3-valued alcohol category
# low: alcohol <= 25th percentile
# medium: 25th percentile < alcohol <= 75th percentile
# high: alcohol > 75th percentile

white_wine["alcohol_category"] = pd.cut(white_wine["alcohol"],
                                        bins=[
                                            white_wine["alcohol"].min()-1,
                                            white_wine["alcohol"].quantile(0.25),
                                            white_wine["alcohol"].quantile(0.75),
                                            white_wine["alcohol"].max()
                                        ], 
                                        labels=["low", "medium", "high"])


white_wine["alcohol_category"].value_counts()

C) Describe the distribution of the quality of the wine for each alcohol category.

In [None]:
# compare low, medium, high alcohol category
sns.countplot(x="alcohol_category", data=red_wine, hue="quality")

The low alcohol category has the highest number of wines with quality `5` and `6`, and the lowest number of wines with quality `8`.
The quality is roughly normally distributed for the low alcohol category, with an average quality of `5.4`.


The medium alcohol category has the highest number of wines with quality `5` and `6`, and the lowest number of wines with quality `3` and `8`, with an average quality of `5.7`.

In the high alcohol category, the most common quality is `6`, with `7` not far behind. The lowest quality is a `4`, compared to the low and medium categories where the lowest quality is a `3`.

It seems that there is a positive correlation between alcohol content and quality, despite high alcohol red wine having only the second highest count.

In [None]:
sns.countplot(x="alcohol_category", data=white_wine, hue="quality")

Low alcohol for white wine has the highest number of wines with quality `5` and `6`, and the lowest number of wines with quality `9`. This was surprising, since the highest rating for red wine was an `8`. The average quality is `5.6`.

Medium alcohol for white wine has the highest number of wines with quality `6`, and the lowest number of wines with quality `3`. The average quality is `6`. It might have been better to state this earlier, but it seems that users rate white wine higher than red wine, as people who tend to enjoy it less are in the minority.

High alcohol for white wine has the highest number of wines with quality `6` and `7`, and the lowest number of wines with quality `3`. The average quality is `6.4`.
Looking at the trend, it seems that not only do people rate white wine higher than red wine, but they also rate it higher when the alcohol content is higher.

D) Plot residual sugar variable and identify suitable threshold for classifying wines as sweet or dry.

In [None]:
# plot residual sugar distribution for red wine
sns.histplot(x="residual sugar", data=red_wine, kde=True)

print(red_wine["residual sugar"].describe())

Mean is `2.5`. 

In [None]:
# plot residual sugar distribution for white wine
sns.histplot(x="residual sugar", data=white_wine, kde=True)

print(white_wine["residual sugar"].describe())

In [None]:
red_wine["type"] = "r"
white_wine["type"] = "w"

wine = pd.concat([red_wine, white_wine], ignore_index=True)
wine.head()

sns.displot(x="residual sugar", data=wine, hue="type")
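# mean of the combined data (printed, since only the last expression of a cell is displayed)
print("mean residual sugar:", wine["residual sugar"].mean())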

# percentile formula 
def find_percentile(data, mean):
    # number of values less than or equal to mean / total number of values
    return (data[data <= mean].count() / data.count()) * 100 

print("wine residual sugar percentile:", find_percentile(wine["residual sugar"], wine["residual sugar"].mean()))

In [None]:
wine["isSweet"] = wine["residual sugar"] > wine["residual sugar"].quantile(0.62)
print(wine["isSweet"].value_counts())
wine.head()

When looking at the residual sugar of red wine, the mean is `2.5` and the median is `2.2`.

When looking at the residual sugar of white wine, the mean is `6.4` and the median is `5.2`. The std is `5.1`, which is quite high, and the mean sits well above the median, which suggests that the data is right-skewed rather than normally distributed.
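A large standard deviation on its own does not say much about normality, so a more direct check is to compare the mean with the median and look at the skewness. The sketch below uses made-up values, not the wine data; on the real data the same calls would be applied to `white_wine["residual sugar"]`.

```python
import pandas as pd

# Toy right-skewed values (illustrative only, not taken from the wine data).
sugar = pd.Series([1.2, 1.6, 2.0, 2.3, 5.0, 8.1, 14.5, 20.7])

summary = {
    "mean": sugar.mean(),
    "median": sugar.median(),
    "skew": sugar.skew(),  # a value above 0 indicates a longer right tail
}
print(summary)
```

A mean above the median together with a positive skew points to a long right tail.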

Looking at the sugar in wine chart [here](https://winefolly.com/deep-dive/sugar-in-wine-chart/), it seems that the threshold is around `6cal`. If I put the threshold at `6`, the records would not be evenly split. A residual sugar value of around `5.4` on the combined data seems to be the best threshold for classifying wines as sweet or dry, and it is still "close" to the threshold of the official source.
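To compare candidate thresholds, it helps to see what share of the wines ends up labelled sweet for each one. This is a minimal sketch on toy values; `sweet_share` is a hypothetical helper and not part of the notebook.

```python
import pandas as pd

def sweet_share(residual_sugar: pd.Series, threshold: float) -> float:
    """Fraction of wines whose residual sugar is strictly above the threshold."""
    return float((residual_sugar > threshold).mean())

# Toy values (illustrative only, not the wine data).
sugar = pd.Series([1.0, 1.5, 2.0, 2.5, 4.0, 5.8, 8.0, 12.0, 15.0, 18.0])

for threshold in (5.4, 6.0):
    print(threshold, sweet_share(sugar, threshold))
```

On the real data the same function could be called with `wine["residual sugar"]` to see how far each threshold is from an even split.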

The split seems to have worked, so now we can graph the wine quality distribution against sweetness. It's not perfectly even, but it's close enough, and it's a good threshold for classifying wines as sweet or dry.

In [None]:
# quality vs isSweet
sns.countplot(x="quality", hue="isSweet", data=wine)
# print wine counts by quality and isSweet
print(wine.groupby(["quality", "isSweet"]).size())

From this data we can see that among lower quality wines, there are more dry wines than sweet wines. Among higher quality wines, the number of sweet wines increases, although there are still more dry wines.

This suggests that the higher the quality of the wine, the more likely it is to be sweet, as sweet wines make up a greater percentage of the higher quality wines than of the lower quality wines.
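Raw counts make the percentage claim hard to read off the plot, so one option is to compute the share of sweet wines per quality level directly. The sketch below uses a toy frame with the same column names; the values are made up.

```python
import pandas as pd

# Toy frame with the same column names as the notebook (values are illustrative).
toy = pd.DataFrame({
    "quality": [5, 5, 5, 5, 7, 7, 7, 7],
    "isSweet": [True, False, False, False, True, True, False, False],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of sweet wines within each quality level.
sweet_share_by_quality = toy.groupby("quality")["isSweet"].mean()
print(sweet_share_by_quality)
```

Running the same `groupby` on `wine` would give the percentages that the paragraph above refers to.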

In [None]:
# quality counts split by alcohol category
sns.countplot(x="quality", hue="alcohol_category", data=wine)

In [None]:
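# analyze correlation between quality and features
# numeric_only=True skips text columns such as "type", which recent pandas versions refuse to correlate
wine.corr(
    method="pearson",
    numeric_only=True
)["quality"].sort_values(ascending=False)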

When looking at the correlation between the quality and other variables, it seems that the quality is positively correlated with alcohol content, and negatively correlated with residual sugar content. Sulphates, citric acid, free sulfur dioxide and pH also show some correlation with quality, although weaker than the other two. This could be because the relationship is not linear, which the Pearson coefficient does not capture well.
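One way to probe the non-linearity idea is to compare Pearson with Spearman correlation, which is based on ranks and picks up any monotonic relationship. The sketch below uses synthetic data where the relationship is monotonic but clearly not linear; it does not say anything about the wine features themselves.

```python
import numpy as np
import pandas as pd

# Synthetic monotonic but non-linear relationship (illustrative only).
x = np.arange(1, 21, dtype=float)
frame = pd.DataFrame({"x": x, "y": np.exp(x / 3.0)})

pearson = frame.corr(method="pearson").loc["x", "y"]
spearman = frame.corr(method="spearman").loc["x", "y"]

print("pearson:", pearson)
print("spearman:", spearman)
```

If Spearman is noticeably stronger than Pearson for a feature, that is a hint that its relationship with quality is monotonic but not linear.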

In [None]:
# correlation matrix (computed once, numeric columns only)
corr = wine.corr(numeric_only=True)
sns.heatmap(corr[corr > 0.1], annot=True, vmin=0, vmax=1)

The correlation matrix shows a lot of useful correlations between the variables. It ignores correlation values lower than 0.1, as anything lower is too weak to be of interest.

I immediately noticed a high correlation between residual sugar and isSweet, but that makes sense since the isSweet variable is based on the residual sugar variable. Furthermore, the correlation between density and residual sugar is also high, and density also shows a positive correlation with fixed acidity.

The second highest was between total sulfur dioxide and free sulfur dioxide, which makes sense since free sulfur dioxide is part of the total.

In [None]:
# negative correlations only (numeric columns only)
corr = wine.corr(numeric_only=True)
sns.heatmap(corr[corr < 0], annot=True, vmin=-1, vmax=0)

When looking at negative correlations, the strongest was between alcohol and density, which makes sense since ethanol is less dense than water, so a higher alcohol content lowers the density of the wine. The second strongest was between alcohol and residual sugar.

I find these negative correlations interesting because they show that the higher the alcohol content, the lower the density and the lower the residual sugar content.

These relationships suggest that it may be possible to predict the quality of the wine from the other variables.

In [None]:
# classify wine quality into 2 categories

# low: <6
# high: >=6
print(wine["quality"].describe())

wine["quality_category"] = wine["quality"].apply(lambda x: "low" if x < 6 else "high")

print(wine["quality_category"].value_counts())
wine.head()

Quality is split into two categories: low and high. Anything greater than or equal to 6 is high quality; anything lower is low quality.
Changing the threshold greatly affects the accuracy of the model. For example, when changing the threshold to 5 the accuracy would go up to 97% but with a low AUC and macro average. Surprisingly, when changing the threshold to 7 the accuracy not only increases but I get macro averages too, although I get low AUC and ROC curves.
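A likely explanation, which I have not verified on this data, is class imbalance: when almost every wine lands in one class, a model can score high accuracy by predicting that class all the time, while AUC and macro averages stay low. The sketch below computes that majority-class baseline for a few cut-offs on toy scores; `majority_baseline` is a hypothetical helper.

```python
import pandas as pd

def majority_baseline(quality: pd.Series, threshold: int) -> float:
    """Accuracy of always predicting the most common class for a given cut-off."""
    high_share = (quality >= threshold).mean()
    return float(max(high_share, 1 - high_share))

# Toy quality scores (illustrative only, not the wine data).
quality = pd.Series([3, 4, 5, 5, 5, 6, 6, 6, 7, 8])

for threshold in (5, 6, 7):
    print(threshold, majority_baseline(quality, threshold))
```

Comparing a model's accuracy against this baseline for the same threshold gives a fairer picture than accuracy alone.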

In [None]:
%pip install scikit-learn

In [None]:
# split data into train and test

from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt

train, test = train_test_split(wine, test_size=0.25, random_state=42)

print("train:", train.shape)
print("test:", test.shape)

In [None]:
features = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
    "alcohol_category",
    "isSweet"
]

# work on copies so the assignments below do not trigger SettingWithCopyWarning
train = train.copy()
test = test.copy()

# encode the categorical columns as integers so scikit-learn can use them
train["alcohol_category"] = train["alcohol_category"].map({"low": 0, "medium": 1, "high": 2}).astype(int)
train["isSweet"] = train["isSweet"].map({False: 0, True: 1})
train["quality_category"] = train["quality_category"].map({"low": 0, "high": 1})

test["alcohol_category"] = test["alcohol_category"].map({"low": 0, "medium": 1, "high": 2}).astype(int)
test["isSweet"] = test["isSweet"].map({False: 0, True: 1})
test["quality_category"] = test["quality_category"].map({"low": 0, "high": 1})


Now that I have split the data into training and testing data, I can use the training data to train the model and the testing data to test the model. 

One problem I had was attempting to use the `train_test_split` function from the `sklearn` library: I kept getting an error saying that the data was not in the correct format. I fixed it by converting the categories into integers, so instead of "low" and "high" the values are `0` and `1`.

In [None]:
# k-fold cross validation

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import classification_report

def k_fold_cross_validation(model, X, y, k=10):
    kfold = KFold(n_splits=k, shuffle=False)
    scores = cross_val_score(model, X, y, cv=kfold)
    print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))

In [None]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(train[features], train["quality_category"])
k_fold_cross_validation(rf, train[features], train["quality_category"])

rf_predictions = rf.predict(test[features])

print("Random Forest")
print(classification_report(test["quality_category"], rf_predictions, zero_division=0))

Using the random forest classifier and evaluating the model using k-fold cross validation, I got an accuracy of `0.83` which is pretty good. Now I can run metrics on the model to see how well it performed.

In [None]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression(random_state=42, max_iter=5000)
k_fold_cross_validation(lr, train[features], train["quality_category"])
lr.fit(train[features], train["quality_category"])

lr_predictions = lr.predict(test[features])
print("Logistic Regression")
print(classification_report(test["quality_category"], lr_predictions, zero_division=0))

Using logistic regression we were able to get an accuracy of about `0.75`, which is not as good as the random forest classifier.

In [None]:
from sklearn.tree import DecisionTreeClassifier

dt = DecisionTreeClassifier(random_state=42)
k_fold_cross_validation(dt, train[features], train["quality_category"])
dt.fit(train[features], train["quality_category"])

dt_predictions = dt.predict(test[features])
print("Decision Tree")
print(classification_report(test["quality_category"], dt_predictions, zero_division=0))

Decision tree classifier gave us an accuracy of `0.77` which is not as good as the random forest classifier, but better than logistic regression. 

In [None]:
from sklearn.svm import SVC

svm = SVC()
k_fold_cross_validation(svm, train[features], train["quality_category"])
svm.fit(train[features], train["quality_category"])

svm_predictions = svm.predict(test[features])
print("SVM")
print(classification_report(test["quality_category"], svm_predictions, zero_division=0))


Quite possibly the worst model, but it's still better than random guessing.
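One possible reason for the weak result, which is an assumption and not something tested in this notebook, is that the features are on very different scales, and kernel SVMs are sensitive to that. A minimal sketch of scaling inside a pipeline, using synthetic data in place of the wine features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic data standing in for the wine features (illustrative only).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X[:, 0] *= 1000  # exaggerate one feature's scale

# The scaler is fitted inside each cross-validation fold, so no test data leaks into it.
scaled_svm = make_pipeline(StandardScaler(), SVC())
scores = cross_val_score(scaled_svm, X, y, cv=5)

print("mean accuracy:", scores.mean())
```

With the notebook's data, the pipeline could be passed to `k_fold_cross_validation` in place of the bare `SVC()`.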

In [None]:
linear_svm = SVC(C=1, kernel="linear")
k_fold_cross_validation(linear_svm, train[features], train["quality_category"])
linear_svm.fit(train[features], train["quality_category"])

linear_svm_predictions = linear_svm.predict(test[features])
print("Linear SVM")
print(classification_report(test["quality_category"], linear_svm_predictions, zero_division=0))

I did not expect a massive difference between the models. 

In [None]:
kernel = 'poly'
degree = 3
C = 5
coef0 = 1

poly_svm = SVC(C=C, kernel=kernel, degree=degree, coef0=coef0)
k_fold_cross_validation(poly_svm, train[features], train["quality_category"])
poly_svm.fit(train[features], train["quality_category"])

poly_svm_predictions = poly_svm.predict(test[features])
print("Polynomial SVM")
print(classification_report(test["quality_category"], poly_svm_predictions, zero_division=0))

Using the poly kernel, we were able to get an accuracy of `0.71`, which is less than the previous model, but I am not sure how to tune `degree`, `C` and `coef0` to get a better accuracy.
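Rather than adjusting these by hand, a small grid search can try combinations and report the best one by cross-validated accuracy. The grid values below are arbitrary examples, and the data is synthetic, so this is only a sketch of the mechanics.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data standing in for the wine features (illustrative only).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Example values only; a sensible range depends on the data.
param_grid = {"C": [0.1, 1, 5], "degree": [2, 3], "coef0": [0, 1]}

search = GridSearchCV(SVC(kernel="poly"), param_grid=param_grid, cv=3)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best cross-validated accuracy:", search.best_score_)
```

Keeping the grid small matters here, since every combination is refitted once per fold.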

In [None]:
kernel = "rbf"

# degree and coef0 are ignored by the RBF kernel, so only C (set in the previous cell) is passed
rbf_svm = SVC(kernel=kernel, C=C)
k_fold_cross_validation(rbf_svm, train[features], train["quality_category"])
rbf_svm.fit(train[features], train["quality_category"])

rbf_svm_predictions = rbf_svm.predict(test[features])
print("RBF SVM")
print(classification_report(test["quality_category"], rbf_svm_predictions, zero_division=0))

In [None]:
from sklearn.metrics import roc_curve, auc

def plot_roc_curve(y_test, y_pred, model_name):
    fpr, tpr, thresholds = roc_curve(y_test, y_pred, pos_label=1)
    roc_auc = auc(fpr, tpr)
    print("AUC for " + model_name + ": " + str(roc_auc))
    plt.title('ROC - ' + model_name)
    plt.plot(fpr, tpr, 'b', label='AUC = %0.2f' % roc_auc)
    plt.legend(loc='lower right')
    plt.plot([0, 1], [0, 1], 'r--')
    plt.ylabel('True Positive Rate')
    plt.xlabel('False Positive Rate')
    plt.show()

# best model 
plot_roc_curve(test["quality_category"], rf_predictions, "Random Forest")
plot_roc_curve(test["quality_category"], lr_predictions, "Logistic Regression")
plot_roc_curve(test["quality_category"], dt_predictions, "Decision Tree")
# best model for svm
plot_roc_curve(test["quality_category"], linear_svm_predictions, "Linear SVM")

The largest Area Under the Curve (AUC) was for the random forest classifier, which is not surprising since it had the highest accuracy; the opposite holds for the SVM model.
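Note that the curves above are drawn from hard `0`/`1` predictions, which gives only a single operating point per model. A ROC curve is usually built from continuous scores instead. The sketch below shows the idea with `predict_proba` on synthetic data; the variable names are illustrative and not from the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the wine features (illustrative only).
X, y = make_classification(n_samples=400, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Probability of the positive class (column 1), not the hard 0/1 prediction.
scores = model.predict_proba(X_test)[:, 1]

fpr, tpr, thresholds = roc_curve(y_test, scores, pos_label=1)
auc_value = roc_auc_score(y_test, scores)

print("AUC:", auc_value)
print("points on the curve:", len(fpr))
```

For models without `predict_proba`, such as `SVC` with default settings, `decision_function` returns scores that can be used the same way.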

In [None]:
# mean squared error
# on 0/1 labels this equals the misclassification rate (1 - accuracy)

from sklearn.metrics import mean_squared_error

# logistic regression
print("Logistic Regression")
print("MSE:", mean_squared_error(test["quality_category"], lr_predictions))

# SVM variants
print("SVM (default kernel)")
print("MSE:", mean_squared_error(test["quality_category"], svm_predictions))
print("Linear SVM")
print("MSE:", mean_squared_error(test["quality_category"], linear_svm_predictions))
print("Polynomial SVM")
print("MSE:", mean_squared_error(test["quality_category"], poly_svm_predictions))
print("RBF SVM")
print("MSE:", mean_squared_error(test["quality_category"], rbf_svm_predictions))



In [None]:
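# instead of low and high, we can use the actual quality score
from sklearn.model_selection import GridSearchCV

# ASSUMPTION: the original param_grid is not shown in this notebook; this small grid is illustrative only
param_grid = {"n_estimators": [200, 600], "max_depth": [10, 30, None],
              "max_features": ["sqrt", "log2"], "min_samples_split": [2, 5]}

rf = RandomForestClassifier(n_estimators=100, random_state=42)
grid_search = GridSearchCV(estimator=rf, param_grid=param_grid, cv=3, n_jobs=-1, verbose=2)
grid_search.fit(train[features], train["quality"])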

Hyperparameter tuning is the process of finding the best combination of hyperparameters for a model.

This took about 2 hours to run, and I was not able to get the best parameters for the model.
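If the full grid is too slow, one alternative is a randomized search, which samples a fixed number of combinations rather than trying all of them. This is a sketch on synthetic data with arbitrary example values, not the search that was run above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for the wine features (illustrative only).
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Example values only; 27 combinations in total.
param_distributions = {
    "n_estimators": [20, 50, 100],
    "max_depth": [5, 10, None],
    "min_samples_split": [2, 5, 10],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_distributions,
    n_iter=5,          # only 5 sampled combinations are fitted, regardless of grid size
    cv=3,
    random_state=42,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
```

The running time then scales with `n_iter` times the number of folds, which makes it easier to budget.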

In [None]:
grid_search.best_params_

In [None]:
rf = RandomForestClassifier(bootstrap=True, max_depth=30, max_features='sqrt', min_samples_leaf=1, min_samples_split=2, n_estimators=600)

k_fold_cross_validation(rf, train[features], train["quality"])
rf.fit(train[features], train["quality"])

rf_predictions = rf.predict(test[features])

print("Random Forest")
print(classification_report(test["quality"], rf_predictions, zero_division=0))