# **Introduction**

**This dataset contains chemical measurements for red variants of Portuguese Vinho Verde wine. The project has two aims: to predict wine quality from these measurements, and to examine how each chemical property relates to quality.**

# **Data**

* **Id**
* **Fixed acidity**
* **Volatile acidity** 
* **Citric acid**
* **Residual sugar**
* **Chlorides**
* **Free sulfur dioxide** 
* **Total sulfur dioxide** 
* **Density**
* **pH**
* **Sulphates**
* **Alcohol**
* **Quality**

# **Libraries**

In [1]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
sns.set()

import warnings
warnings.filterwarnings("ignore")

import plotly.express as px
import plotly.graph_objects as go

import plotly.figure_factory as ff
from plotly.subplots import make_subplots

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

from collections import Counter
from imblearn.over_sampling import RandomOverSampler

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay, classification_report
from sklearn.feature_selection import RFE

# **Explore Data**

In [2]:
df_raw = pd.read_csv("../input/wine-quality-dataset/WineQT.csv")
df_raw.head()

In [3]:
df_raw.info()

In [4]:
df_raw.describe().transpose()

# **Visual EDA**

In [5]:
corr = df_raw.drop("Id", axis=1).corr()
fig = ff.create_annotated_heatmap(
    z=corr.to_numpy().round(3),
    x=list(corr.index.values),
    y=list(corr.columns.values),       
    zmin=-1, zmax=1,
    colorscale='RdBu',
    showscale = True,
)
fig.update_layout(title_text='<b>Correlation matrix</b>',
                  title_x=0.5,
                  title_font={'size': 24},
                  xaxis={'side': 'bottom'},
                  yaxis_autorange='reversed',                   
                  )
fig.show()

**Alcohol, volatile acidity, sulphates, and citric acid, in that order, are the features most strongly correlated with wine quality.**
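
**A ranking like this can be read straight off the correlation matrix instead of the heatmap. The sketch below is illustrative only: `rank_by_target_correlation` is a hypothetical helper, and the data frame is synthetic, not the wine dataset.**

```python
import numpy as np
import pandas as pd


def rank_by_target_correlation(df, target):
    """Sort features by the absolute Pearson correlation with `target`."""
    corr = df.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic stand-in for the wine data; the numbers are made up.
rng = np.random.default_rng(0)
n = 2000
alcohol = rng.normal(10.0, 1.0, n)
volatile_acidity = rng.normal(0.5, 0.2, n)
demo = pd.DataFrame({
    "alcohol": alcohol,
    "volatile acidity": volatile_acidity,
    "pH": rng.normal(3.3, 0.15, n),
    "quality": alcohol - 2.0 * volatile_acidity + rng.normal(0.0, 1.0, n),
})

# Signed correlations, ordered from strongest to weakest relationship.
ranked = rank_by_target_correlation(demo, "quality")
print(ranked.round(3))
```

**With the real data, the equivalent call would be `rank_by_target_correlation(df_raw.drop("Id", axis=1), "quality")`.**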

In [6]:
def add_histogram(feature, row, col, axis_no):
    fig.add_trace(go.Histogram(x=df_raw[feature], histnorm='probability', name=feature), row=row, col=col)
    fig.update_layout({'xaxis{}'.format(axis_no): {'title':{'text':feature}}})
                      
                       
def add_box(feature, row, col, axis_no):
    # boxmean=True also marks the mean on the box plot.
    fig.add_trace(go.Box(x=df_raw[feature], boxmean=True), row=row, col=col)
    fig.update_layout({'xaxis{}'.format(axis_no): {'title':{'text':feature}}})


columns = df_raw.drop(["quality", "Id"], axis=1).columns.to_list()
fig = make_subplots(rows=len(columns), cols=2)
# Subplot axes are numbered row by row: row r uses xaxis(2r-1) and xaxis(2r).
for row, feature in enumerate(columns, start=1):
    add_histogram(feature=feature, row=row, col=1, axis_no=2 * row - 1)
    add_box(feature=feature, row=row, col=2, axis_no=2 * row)

fig.update_layout(title_text='<b>Distributions</b>',
                  title_x=0.5, title_font={'size': 24},
                  height=2400, showlegend=False)
fig.update_yaxes(showticklabels=False) 
fig.show()

In [7]:
columns = df_raw.drop(["Id", "quality"], axis=1).columns.to_list()
mean = df_raw[columns].mean().round(4)
median = df_raw[columns].median().round(4)
std = df_raw[columns].std().round(4)
skewness = df_raw[columns].skew().round(4)
kurtosis = df_raw[columns].kurtosis().round(4)

fig = go.Figure(data=[go.Table(
    header=dict(values=['Features', 'Median', 'Mean', 'Std', 'Skewness', 'Kurtosis'], height = 25),
    cells=dict(values=[columns, median, mean, std, skewness, kurtosis], height = 25 ))])

fig.show()

* **Fixed Acidity** has a highly right-skewed distribution with positive excess kurtosis.
* **Volatile Acidity** has a moderately right-skewed distribution with positive excess kurtosis.
* **Citric Acid** has an approximately symmetric distribution with negative excess kurtosis.
* **Residual Sugar** has a highly right-skewed distribution with positive excess kurtosis.
* **Chlorides** has a highly right-skewed distribution with positive excess kurtosis.
* **Free Sulfur Dioxide** has a highly right-skewed distribution with positive excess kurtosis.
* **Total Sulfur Dioxide** has a highly right-skewed distribution with positive excess kurtosis.
* **Density** has an approximately symmetric distribution with positive excess kurtosis.
* **pH** has an approximately symmetric distribution with positive excess kurtosis.
* **Sulphates** has a highly right-skewed distribution with positive excess kurtosis.
* **Alcohol** has a moderately right-skewed distribution with positive excess kurtosis.
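
**Labels of this kind can be derived mechanically from the skewness and kurtosis values. The sketch below assumes a common rule of thumb (|skewness| below 0.5 is approximately symmetric, 0.5 to 1 is moderate, above 1 is high); the notebook does not state which thresholds were used, so both the cut-offs and the `describe_shape` helper are assumptions. The data is synthetic.**

```python
import numpy as np
import pandas as pd


def describe_shape(series):
    """Label a distribution from its sample skewness and excess kurtosis."""
    skew = series.skew()
    kurt = series.kurtosis()  # pandas returns excess kurtosis (normal = 0)
    if abs(skew) < 0.5:
        shape = "approximately symmetric"
    else:
        degree = "moderately" if abs(skew) <= 1 else "highly"
        side = "right" if skew > 0 else "left"
        shape = f"{degree} {side}-skewed"
    sign = "positive" if kurt > 0 else "negative"
    return f"{shape}, {sign} excess kurtosis"


# Synthetic samples: a long right tail and a flat, tail-free distribution.
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "exponential": rng.exponential(1.0, 5000),
    "uniform": rng.uniform(0.0, 1.0, 5000),
})

labels = {name: describe_shape(demo[name]) for name in demo.columns}
for name, label in labels.items():
    print(f"{name}: {label}")
```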

## **Quality**

In [8]:
counts = df_raw["quality"].value_counts()
# Name the columns explicitly so the result does not depend on the pandas version.
temp = counts.rename_axis("Quality").reset_index(name="Count")
temp["Percentage"] = (temp["Count"] / len(df_raw) * 100).round(1).astype(str) + "%"

fig = px.bar(temp, x = 'Quality', y = 'Count', text = 'Percentage')
fig.update_layout(xaxis_title = 'Quality',
                  yaxis_title = 'Count',
                  title_text='<b>Quality distribution</b>',
                  title_font_size=24,
                  font_size= 13,
                  title_x=0.5, title_y=0.95)
fig.show()

In [9]:
fig, axes = plt.subplots(11, 2, figsize=(20, 50))

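# Select the features by name rather than relying on the column order.
features = df_raw.drop(["quality", "Id"], axis=1).columns
for i, feature in enumerate(features):
    sns.boxplot(ax=axes[i,0], x="quality", y=feature, data=df_raw)
    sns.lineplot(ax=axes[i,1], x="quality", y=feature, data=df_raw)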
    
fig.suptitle('Relations between chemicals and qualities', 
             y = 0.90, fontsize = 24, fontweight = 'bold')

plt.show()

**In general, as wine quality increases,**

* **fixed acidity** increases.
* **volatile acidity** decreases.
* **citric acid** increases.
* **chlorides** decreases.
* **density** decreases.
* **sulphates** increases.
* **alcohol** increases.

# **Outliers**

In [10]:
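def iqr_outliers(data, feature):
    # Limits come from the full dataset (df_raw), so every quality group is
    # judged against the same thresholds; the count is taken over `data`.
    Q1 = df_raw[feature].quantile(0.25)
    Q3 = df_raw[feature].quantile(0.75)

    IQR = Q3 - Q1
    low_lim = Q1 - 1.5 * IQR
    up_lim = Q3 + 1.5 * IQR

    # Number of values below the lower limit or above the upper limit.
    low_out = len(data[data[feature] < low_lim])
    up_out = len(data[data[feature] > up_lim])

    return low_out + up_out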
    

temp = {}
columns = df_raw.drop(["quality", "Id"], axis=1).columns.to_list()
for quality in range(3,9):
    outliers = []
    data = df_raw[df_raw["quality"] == quality].copy()
    for j in range(11):
        outliers.append(iqr_outliers(data, columns[j]))
    outliers.append(sum(outliers))
    temp[quality] = outliers

columns.append("total")
df_outlier = pd.DataFrame.from_dict(temp, orient ='index', columns=columns) 
df_outlier.loc['total'] = df_outlier.sum()

In [11]:
header = ["quality"]
header.extend(columns)
# One cell column for the index, then one per entry in `columns` (features plus "total").
cells = [df_outlier.index.tolist()] + [df_outlier[col] for col in columns]

fig = go.Figure(data=[go.Table(
    header=dict(values=header, height = 30),
    cells=dict(values=cells, height = 30))])

fig.update_layout(title_text='<b>Outliers</b>',
                  title_x=0.5,
                  font_size=14,
                  )
fig.show()

**The table shows the number of outliers per chemical for each quality level. Because the outliers account for a large part of the dataset, they are left as they are: replacing or dropping them would discard a significant amount of valuable data.**
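
**One way to gauge how much data would be lost is to count the rows that hold at least one outlier in any feature. The sketch below applies the same 1.5 × IQR rule to a small made-up array; `iqr_outlier_mask` and `rows_with_outliers` are hypothetical helpers, not part of the notebook.**

```python
import numpy as np


def iqr_outlier_mask(values):
    """Boolean mask of values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)


def rows_with_outliers(X):
    """Boolean mask of rows that contain an outlier in at least one column."""
    X = np.asarray(X, dtype=float)
    masks = [iqr_outlier_mask(X[:, j]) for j in range(X.shape[1])]
    return np.column_stack(masks).any(axis=1)


# Made-up data: row 9 is extreme in column 0, row 0 is extreme in column 1.
X_demo = np.column_stack([
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 100],
    [0, 11, 12, 13, 14, 15, 16, 17, 18, 19],
])
mask = rows_with_outliers(X_demo)
fraction = mask.mean()
print(f"{mask.sum()} of {len(mask)} rows affected ({fraction:.0%})")
```

**Dropping every flagged row removes a sample even when only one of its features is extreme, which is why the affected share grows quickly with the number of features.**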

# **Train test split** 

In [12]:
X = df_raw.drop(["quality", "Id"], axis=1).copy()
y = df_raw["quality"].copy()

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=42)

print('X_train shape : {} - y_train shape : {}'.format(X_train.shape, y_train.shape))
print('X_test shape  : {} - y_test shape  : {}'.format(X_test.shape, y_test.shape))

# **Handling imbalanced dataset**

In [13]:
counter = Counter(y_train)
print('Before :', counter)
ros = RandomOverSampler(random_state=42)

X_train_rs, y_train_rs = ros.fit_resample(X_train, y_train)
counter = Counter(y_train_rs)
print('After :', counter)

**Random oversampling balances the classes by duplicating minority-class samples, drawn at random with replacement, until every class is as large as the majority class. It is applied to the training set only, so the test set keeps its original class distribution.**
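
**The idea can be sketched in a few lines of NumPy. This is a simplified illustration under the assumption that every class is raised to the majority count; `random_oversample` is a hypothetical function, not imbalanced-learn's implementation, and the data is made up.**

```python
from collections import Counter

import numpy as np


def random_oversample(X, y, seed=42):
    """Duplicate minority-class rows at random until all classes match the majority."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()

    # Keep every original row, then append the randomly drawn duplicates.
    keep = [np.arange(len(y))]
    for cls, count in zip(classes, counts):
        if count < target:
            idx = np.flatnonzero(y == cls)
            keep.append(rng.choice(idx, size=target - count, replace=True))
    selected = np.concatenate(keep)
    return X[selected], y[selected]


# Made-up labels with a 6 / 3 / 1 class imbalance.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.array([5] * 6 + [6] * 3 + [3] * 1)

X_res, y_res = random_oversample(X_demo, y_demo)
print("Before:", Counter(y_demo.tolist()))
print("After :", Counter(y_res.tolist()))
```

**Because the new rows are exact copies, no new information is added; the classifier simply sees the rare classes more often during training.**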

# **Feature Selection** 

In [14]:
rf = RandomForestClassifier(random_state=42)
rfe_rf = RFE(estimator = rf, n_features_to_select = 8, verbose = 1)
rfe_rf.fit(X_train_rs, y_train_rs)

**Recursive feature elimination (RFE) selects features by repeatedly fitting the estimator and discarding the least important feature, so the candidate set shrinks at each step until the requested number of features remains. The aim is to remove redundant and irrelevant features from the dataset and thereby improve the predictive power of the machine learning algorithm.**

In [15]:
data = dict(zip(X_train_rs.columns, rfe_rf.ranking_))
df = pd.DataFrame.from_dict(data, orient ='index')

df.rename(columns = {0:'Rank'}, inplace = True)
df.sort_values('Rank')

**The selected features are assigned rank 1. The eliminated features are assigned ranks of 2 and above according to when they were eliminated: the earlier a feature is dropped, the larger its rank.**
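
**The relationship between elimination order and rank can be illustrated with a small hand-rolled version of the procedure. This sketch is an assumption-laden stand-in: it uses the magnitude of least-squares coefficients as the importance measure instead of a random forest, removes one feature per step, and runs on synthetic data. `rfe_ranking` is a hypothetical function, not scikit-learn's `RFE`.**

```python
import numpy as np


def rfe_ranking(X, y, n_features_to_select):
    """Recursive feature elimination with |OLS coefficient| as importance."""
    X = np.asarray(X, dtype=float)
    # Standardise so coefficient magnitudes are comparable across features.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = np.asarray(y, dtype=float) - np.mean(y)

    remaining = list(range(X.shape[1]))
    ranking = np.ones(X.shape[1], dtype=int)  # survivors keep rank 1
    while len(remaining) > n_features_to_select:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = remaining[int(np.argmin(np.abs(coef)))]
        # Features dropped earlier receive larger ranks.
        ranking[weakest] = len(remaining) - n_features_to_select + 1
        remaining.remove(weakest)
    return ranking


# Synthetic data: feature 0 matters most, feature 3 not at all.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 4))
y_demo = 3.0 * X_demo[:, 0] + 1.0 * X_demo[:, 1] + 0.2 * X_demo[:, 2] \
    + rng.normal(0.0, 0.1, 300)

ranking = rfe_ranking(X_demo, y_demo, n_features_to_select=2)
print(ranking)
```

**Here the irrelevant feature is removed first and so gets the largest rank, the weakly relevant one is removed second, and the two survivors share rank 1.**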

# **Random Forest Classifier** 

In [16]:
mask = X_train_rs.columns[rfe_rf.support_].tolist()

rf = RandomForestClassifier(random_state=42)
rf.fit(X_train_rs[mask], y_train_rs)

y_pred = rf.predict(X_test[mask])
# Fix the label order so the matrix rows and columns always match rf.classes_.
cm = confusion_matrix(y_test, y_pred, labels=rf.classes_)
print(classification_report(y_test, y_pred))

**Because the dataset is imbalanced, overall accuracy would be dominated by the majority classes. The model is therefore evaluated with per-class precision and recall together with the confusion matrix.**

In [17]:
fig = ff.create_annotated_heatmap(
    z=cm,
    x=rf.classes_.tolist(),
    y=rf.classes_.tolist(),       
    zmin=0, zmax=int(cm.max()),  # counts, not correlations: scale from 0 to the largest cell
    colorscale='RdBu',
    showscale = True,
)
fig.update_layout(title_text='<b>Confusion matrix</b>',
                  title_x=0.5,
                  font_size=14,
                  xaxis={'side': 'bottom', 'title' : 'Predicted class'},
                  yaxis={'title' : 'True class', 'autorange' : 'reversed'},                   
                  )
fig.show()

* **The model classifies none of the quality-3 wines correctly.**
* **The model correctly classifies fewer than 35 percent of the quality-4 and quality-8 wines.**
* **The model correctly classifies almost half of the quality-7 wines.**
* **The model correctly classifies more than 70 percent of the quality-5 and quality-6 wines.**
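
**The percentages above are per-class recall values: the diagonal of the confusion matrix divided by the row totals (rows are true classes). A minimal sketch of that calculation follows; the matrix is made up for illustration and is not the notebook's result, and `per_class_recall` is a hypothetical helper.**

```python
import numpy as np


def per_class_recall(cm):
    """Share of each true class (row) that was predicted correctly."""
    cm = np.asarray(cm, dtype=float)
    row_totals = cm.sum(axis=1)
    # Classes with no true samples get a recall of 0 instead of a division error.
    return np.divide(np.diag(cm), row_totals,
                     out=np.zeros_like(row_totals), where=row_totals > 0)


# Made-up 3-class confusion matrix (rows: true class, columns: predicted class).
cm_demo = np.array([
    [0, 1, 0],
    [2, 15, 3],
    [0, 4, 6],
])
recall = per_class_recall(cm_demo)
for cls, value in zip(["A", "B", "C"], recall):
    print(f"class {cls}: {value:.0%} classified correctly")
```

**With the notebook's own matrix, `per_class_recall(cm)` paired with `rf.classes_` would give one value per quality level.**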

In [18]:
data ={'Feature':X_train_rs[mask].columns,
       'Importance': rf.feature_importances_}
  
importances_rf = pd.DataFrame.from_dict(data) 
importances_rf.sort_values("Importance", ascending=False, inplace=True)

fig = px.bar(importances_rf, x='Feature', y='Importance', text_auto='.4f')
fig.update_layout(title_text='<b>Random Forest Feature Importance</b>', 
                  title_x=0.5, title_y=0.95, font_size=13,)
fig.show()