# Environment setup

## Google Drive mount
I use Google Colaboratory as my default platform, so I need to set up my environment to work with Google Drive. You can skip this bit if you're working locally.

1. Mount Google Drive on the runtime to be able to read and write files. This will ask you to log in to your Google Account and provide an authorization code.
2. Create a symbolic link to the working directory.
3. Change the directory to the one where I cloned my repository.


In [1]:
# mount Google Drive on the runtime
from google.colab import drive
drive.mount('/content/gdrive', force_remount=True)

Mounted at /content/gdrive


In [2]:
# create a symbolic link to a working directory
!ln -s /content/gdrive/My\ Drive/Colab\ Notebooks/datacourage_wine /mydrive

# navigate to the working directory
%cd /mydrive

ln: failed to create symbolic link '/mydrive/datacourage_wine': File exists
/content/gdrive/My Drive/Colab Notebooks/datacourage_wine


## Libraries & functions
Let's now import the libraries and functions we're going to use in this notebook. The main ones are:

- `tqdm.notebook` - loop progress bar for notebooks
- `timeit` - cell runtime check
- `numpy` - linear algebra
- `pandas` - data manipulation & analysis


In [3]:
import tqdm.notebook as tq
import timeit
import numpy as np
import pandas as pd
from scipy import stats
import re
import plotly
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.preprocessing import OneHotEncoder

from sklearn.decomposition import KernelPCA, PCA

from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, GradientBoostingClassifier, RandomForestClassifier, VotingClassifier, StackingClassifier, ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier, RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

from sklearn.svm import SVC

# Load data
Load the data using the `pd.read_csv` function. Despite the `.csv` extension, the files are semicolon-separated, so we need to set the `sep` parameter to `';'`. After loading, let's check the shape of each dataset.
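To see why the `sep` parameter matters, here is a minimal sketch on a made-up two-row sample (the inline text and its values are illustrative and are not read from the actual files). With the default comma separator pandas finds no delimiter and packs every line into a single column, while `sep=';'` gives the expected columns.

```python
import io

import pandas as pd

# Made-up sample in the same semicolon-separated layout (illustrative values).
raw = 'fixed acidity;volatile acidity;quality\n7.4;0.7;5\n7.8;0.88;5\n'

# Default separator (comma): each line ends up in one single column.
wrong = pd.read_csv(io.StringIO(raw))

# Semicolon separator: three proper columns.
right = pd.read_csv(io.StringIO(raw), sep=';')

print(wrong.shape)
print(right.shape)
```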

## Red wine

In [4]:
df_red = pd.read_csv('winequality-red.csv', sep=';')
df_red.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5


In [5]:
df_red.shape

(1599, 12)

## White wine

In [6]:
df_white = pd.read_csv('winequality-white.csv', sep=';')
df_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.045,45.0,170.0,1.001,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.049,14.0,132.0,0.994,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,0.9951,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.058,47.0,186.0,0.9956,3.19,0.4,9.9,6


In [7]:
df_white.shape

(4898, 12)

## Merge datasets
Both datasets have the same features, so we're free to combine the red and white wine datasets to get more observations and, hopefully, an estimator that generalises better later on. Let's merge them using the `pd.concat()` function.

Before doing so, let's add a `color` feature representing red or white color.

In [8]:
df_red['color'] = 'red'
df_white['color'] = 'white'

df = pd.concat([df_red, df_white], ignore_index=True)
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


## Rename columns
For better readability, let's rename the columns, replacing spaces with underscores.

In [9]:
old_column_names = df.columns.values.tolist()
new_column_names = [re.sub(r'\s', '_', col_name) for col_name in old_column_names]  # raw string avoids an invalid-escape warning

df.columns = new_column_names
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,red


## Convert categorical values
Let's convert the categorical values in our dataset to dummy-encoded columns for sklearn compatibility. This could just as well have been done at the very beginning ;)

In [10]:
df = pd.get_dummies(df, columns=['color'])
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color_red,color_white
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1,0
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1,0
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1,0
4,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0


# Data exploration

## Values info
Let's print an overview of the type information in the DataFrame and find potential issues.

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed_acidity         6497 non-null   float64
 1   volatile_acidity      6497 non-null   float64
 2   citric_acid           6497 non-null   float64
 3   residual_sugar        6497 non-null   float64
 4   chlorides             6497 non-null   float64
 5   free_sulfur_dioxide   6497 non-null   float64
 6   total_sulfur_dioxide  6497 non-null   float64
 7   density               6497 non-null   float64
 8   pH                    6497 non-null   float64
 9   sulphates             6497 non-null   float64
 10  alcohol               6497 non-null   float64
 11  quality               6497 non-null   int64  
 12  color_red             6497 non-null   uint8  
 13  color_white           6497 non-null   uint8  
dtypes: float64(11), int64(1), uint8(2)
memory usage: 621.9 KB


The data seems to be really clean. There are no missing values in the dataset and the dtypes are very consistent, so we're good to go.

## Statistical info
Let's generate a DataFrame of statistical measures for each column to find potential issues. 

In [12]:
df.describe()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color_red,color_white
count,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0,6497.0
mean,7.215307,0.339666,0.318633,5.443235,0.056034,30.525319,115.744574,0.994697,3.218501,0.531268,10.491801,5.818378,0.246114,0.753886
std,1.296434,0.164636,0.145318,4.757804,0.035034,17.7494,56.521855,0.002999,0.160787,0.148806,1.192712,0.873255,0.430779,0.430779
min,3.8,0.08,0.0,0.6,0.009,1.0,6.0,0.98711,2.72,0.22,8.0,3.0,0.0,0.0
25%,6.4,0.23,0.25,1.8,0.038,17.0,77.0,0.99234,3.11,0.43,9.5,5.0,0.0,1.0
50%,7.0,0.29,0.31,3.0,0.047,29.0,118.0,0.99489,3.21,0.51,10.3,6.0,0.0,1.0
75%,7.7,0.4,0.39,8.1,0.065,41.0,156.0,0.99699,3.32,0.6,11.3,6.0,0.0,1.0
max,15.9,1.58,1.66,65.8,0.611,289.0,440.0,1.03898,4.01,2.0,14.9,9.0,1.0,1.0


At first glance, there might be outliers in several columns: some `min` and `max` values lie many standard deviations (`std`) away from the `mean`. Let's investigate further.
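As a quick check of that impression, the sketch below takes a few rows of the `describe()` output above and expresses the distance of `min` and `max` from the `mean` in units of `std`. The numbers are copied by hand from the table; the dictionary layout and variable names are just for illustration.

```python
# (mean, std, min, max) copied from the describe() output above
summary = {
    'residual_sugar': (5.443235, 4.757804, 0.6, 65.8),
    'chlorides': (0.056034, 0.035034, 0.009, 0.611),
    'free_sulfur_dioxide': (30.525319, 17.7494, 1.0, 289.0),
    'alcohol': (10.491801, 1.192712, 8.0, 14.9),
}

distances = {}
for name, (mean, std, lo, hi) in summary.items():
    below = (mean - lo) / std  # how far the minimum sits below the mean
    above = (hi - mean) / std  # how far the maximum sits above the mean
    distances[name] = (below, above)
    print(f'{name}: min is {below:.1f} std below, max is {above:.1f} std above the mean')
```

Columns whose maximum sits well beyond three standard deviations are the first candidates to look at in the box plots.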

## Box plot
I'll use a box plot to visualize potential outliers in the dataset. To do so let's create:

1. A list of features to investigate (excluding wine colors and quality labels).
2. An initial box plot using Plotly's `go.Box()` function.
3. A dropdown menu that updates the graph to show another feature's box plot.
4. A layout object to specify the figure's properties.
5. A figure object using `go.Figure()` to put everything together.

In [13]:
feature_list = [
                'fixed_acidity',
                'volatile_acidity',
                'citric_acid',
                'residual_sugar',
                'chlorides',
                'free_sulfur_dioxide',
                'total_sulfur_dioxide',
                'density',
                'pH',
                'sulphates',
                'alcohol' 
]



initial_data = [
        go.Box(
            y = df['fixed_acidity'],
            boxpoints='suspectedoutliers', # display outliers
            name = 'fixed_acidity',
            marker = dict(color='orange')
            ) 
]

updatemenus = [
               dict(
                    buttons=list([
                                  dict(
                                  label=feat, # dropdown menu item name
                                  method='update', #  modify data and layout attributes
                                  args=[
                                        {'y': [df[feat]], 'name': [feat]}, # update trace data and trace name
                                        {'yaxis.title.text': feat} # update y-axis title (layout attribute)
                                       ]
                                      ) for feat in feature_list # for each feature
                                ]),
                    direction='down',
                    showactive=True,
                    x=0.5, # x positioning
                    xanchor='center',
                    y=1.07, # y positioning
                    yanchor='top'
                   )
              ]

layout = go.Layout(
                   title=go.layout.Title(text='Box plot for:', x=0.5),
                   width=800,
                   height=800,
                   updatemenus=updatemenus               
                  )

fig = go.Figure(
    data=initial_data,
    layout=layout
)

fig.show()

As we can see, there are some outliers in the dataset which can negatively influence the performance of the model. I'll address the issue later on.

## Correlation heatmap
Let's investigate the correlation between the features. The further a coefficient is from 0 (towards -1 or 1), the stronger the negative or positive linear correlation between the two features. I'll use Plotly's `go.Heatmap()` function to visualize it.

In [14]:
figure = go.Heatmap(
                    z=df.corr(), # compute correlation matrix
                    x=df.columns,
                    y=df.columns,
                    colorscale='Viridis'
                   )                      

layout = go.Layout(
                   title=go.layout.Title(text="Feature correlation heatmap", x=0.5),
                   width=800,
                   height=800
                  )

fig = go.Figure(
                data=figure,
                layout=layout
               )
fig.show()
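For reference, `df.corr()` computes the Pearson correlation coefficient by default. The sketch below reproduces one coefficient by hand on a small made-up DataFrame (the values are invented for illustration and only the column names mirror the wine data).

```python
import numpy as np
import pandas as pd

# Made-up values, only meant to show the mechanics.
toy = pd.DataFrame({
    'alcohol': [9.0, 10.0, 11.0, 12.0, 13.0],
    'quality': [5.0, 5.0, 6.0, 7.0, 7.0],
    'density': [0.999, 0.997, 0.996, 0.994, 0.992],
})

corr = toy.corr()  # Pearson correlation matrix

# Pearson r by hand: covariance divided by the product of the spreads.
x = toy['alcohol'] - toy['alcohol'].mean()
y = toy['quality'] - toy['quality'].mean()
r_manual = (x * y).sum() / np.sqrt((x ** 2).sum() * (y ** 2).sum())

print(corr.round(3))
print(f'manual r(alcohol, quality) = {r_manual:.3f}')
```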

## Label correlation
From the heatmap we can see that alcohol content has the strongest positive correlation with wine quality, whereas density has the strongest negative one. Let's see how the other features relate to quality by plotting a simple bar chart with `go.Bar()`.

In [15]:
correlations = df.corr()['quality'].sort_values(ascending=False)

data = [
        go.Bar(
            x=correlations.index,
            y=correlations.values,
            marker_color='orange',
    )            
        ]

layout = go.Layout(
                   title=go.layout.Title(text='Quality correlation', x=0.5),
                   xaxis=go.layout.XAxis(title='Feature name'),
                   yaxis=go.layout.YAxis(title='Correlation value'),
                   bargap=0.2,
                   width=800,
                   height=800
                  )

fig = go.Figure(
    data=data,
    layout=layout
    )

fig.show()

As we can see, alcohol has the largest positive correlation with quality, whereas density and volatile acidity have the strongest negative correlations.

## Class balance
Another thing to check is how well each label is represented. Let's see whether the number of observations per class is balanced, since an imbalance can make it hard for our classifier to detect the minority classes.

In [16]:
data = [
        go.Histogram(
            x=df['quality'],
            xbins=dict(
                size=0.5
            ),
            marker_color='orange',
    )            
        ]

layout = go.Layout(
                   title=go.layout.Title(text="Class distribution", x=0.5),
                   xaxis=go.layout.XAxis(title='Class name'),
                   yaxis=go.layout.YAxis(title='Number of observations'),
                   bargap=0.2,
                   width=800,
                   height=800
                  )

fig = go.Figure(
    data=data,
    layout=layout
    )

fig.show()

Wines rated 5, 6 and 7 are heavily overrepresented (several thousand observations in total), whereas the lowest- and highest-rated wines account for only around 400 cases. This is a clear class imbalance, which will have to be addressed later on.
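The same check can be done numerically with `value_counts()`. This is a minimal sketch on a made-up label Series standing in for `df['quality']`, so the counts below are illustrative and not the real class sizes.

```python
import pandas as pd

# Hypothetical labels standing in for df['quality'].
quality = pd.Series([5, 6, 6, 5, 7, 6, 5, 6, 4, 8, 6, 5])

counts = quality.value_counts().sort_index()                # observations per class
shares = quality.value_counts(normalize=True).sort_index()  # fraction per class

balance = pd.DataFrame({'count': counts, 'share': shares})
print(balance)

# Ratio between the largest and the smallest class.
imbalance_ratio = counts.max() / counts.min()
print(f'largest class is {imbalance_ratio:.0f}x the smallest')
```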

# Data preprocessing
Let's preprocess our dataset to get it ready for model building.

## Outliers
Outliers, meaning the observations which are significantly different from the rest of the data, can badly influence the performance of our classifier.

To get rid of them I'll use the Z-score, which expresses the distance of an observation from the mean as a multiple of the standard deviation. Observations lying more than three standard deviations away from the mean are commonly treated as outliers. The formula is:


$Z_{score} = \frac{\mathrm{value} - \mathrm{mean}}{\mathrm{std}}$


We'll use the `zscore()` function from the `scipy` library to calculate the Z-score, take its absolute value with `np.abs()`, and then filter out the observations whose absolute Z-score is larger than 3.

To make sure we're not losing too much data, I'll compare the shapes of the original and filtered DataFrames.
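To make the formula concrete, here is a small sketch on made-up numbers that computes the Z-score by hand and compares it with `scipy.stats.zscore`. By default SciPy uses the population standard deviation (`ddof=0`), which matches NumPy's default `std()`. The array is invented for illustration.

```python
import numpy as np
from scipy import stats

# 19 evenly spaced "ordinary" values plus one extreme value (made-up data).
values = np.concatenate([np.linspace(9.0, 11.0, 19), [20.0]])

# Z-score by hand: (value - mean) / std, with the population std (ddof=0).
z_manual = (values - values.mean()) / values.std()

# Same thing via SciPy.
z_scipy = stats.zscore(values)

# Keep only the observations within 3 standard deviations of the mean.
keep = np.abs(z_scipy) < 3
filtered = values[keep]

print(np.allclose(z_manual, z_scipy))
print(f'{len(values) - len(filtered)} observation(s) flagged as outliers')
```

Note that in very small samples no point can reach a Z-score of 3, so this rule only makes sense with enough observations.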



### Find outliers

In [17]:
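# boolean mask with the same shape as df: True = keep the value
df_filter = df.copy()

for f in df.columns:

  if f in ['color_white', 'color_red', 'quality']:
    # never filter on the label or the dummy-encoded color columns
    df_filter[f] = True
  else:
    # True where the value lies within 3 standard deviations of the mean
    z_score = np.abs(stats.zscore(df[f]))
    is_inlier = np.where(z_score>3, False, True)
    df_filter[f] = pd.Series(is_inlier, index=df[f].index)

df_filter.head()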

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color_red,color_white
0,True,True,True,True,True,True,True,True,True,True,True,True,True,True
1,True,False,True,True,True,True,True,True,True,True,True,True,True,True
2,True,True,True,True,True,True,True,True,True,True,True,True,True,True
3,False,True,True,True,True,True,True,True,True,True,True,True,True,True
4,True,True,True,True,True,True,True,True,True,True,True,True,True,True


### Drop outlier rows

In [18]:
# keep only the rows where every column passed the filter
df = df[df_filter.all(axis=1)].reset_index(drop=True)
df.head()

Unnamed: 0,fixed_acidity,volatile_acidity,citric_acid,residual_sugar,chlorides,free_sulfur_dioxide,total_sulfur_dioxide,density,pH,sulphates,alcohol,quality,color_red,color_white
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0
1,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1,0
2,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0
3,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,1,0
4,7.9,0.6,0.06,1.6,0.069,15.0,59.0,0.9964,3.3,0.46,9.4,5,1,0


In [19]:
df.shape

(6009, 14)


In [None]:
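# shape of the dataset before filtering
org_shape = df.shape

# calculate the absolute z-score for the continuous features only
# (the quality label and the dummy-encoded color columns are left out, as above)
feature_cols = df.columns.drop(['quality', 'color_red', 'color_white'])
z_score = np.abs(stats.zscore(df[feature_cols]))
df = df[(z_score<3).all(axis=1)]

# shape of the filtered dataset
new_shape = df.shape

print(f'Shape of the original dataset: {org_shape}\nShape of the filtered dataset: {new_shape}')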

We lost around 500 observations, which is less than 10% of the whole dataset. That's a small enough share that we can carry on.
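The exact figures can be worked out from the outputs shown earlier: `df.info()` reported 6497 rows before filtering and `df.shape` reported 6009 rows afterwards. The variable names below are only for illustration.

```python
rows_before = 6497  # from the df.info() output above
rows_after = 6009   # from the df.shape output above

rows_lost = rows_before - rows_after
share_lost = rows_lost / rows_before

print(f'{rows_lost} rows removed ({share_lost:.1%} of the dataset)')
# prints: 488 rows removed (7.5% of the dataset)
```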

In [None]:
df['quality'].unique()


# Dataset split
Let's split the dataset into train and test sets, so that the models are evaluated on data they haven't seen during training and we avoid any data leakage.

In [None]:
X = df.drop('quality', axis=1)
y = df['quality']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
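Given the class imbalance noted earlier, one option (an assumption on my side, not something the cell above does) is to pass `stratify=y` to `train_test_split`, so that each class keeps the same proportion in the train and test sets. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the features and an imbalanced label (not the wine data).
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(200, 3))
y_toy = np.array([5] * 120 + [6] * 60 + [7] * 20)

# stratify=y_toy preserves the 60/30/10 class proportions in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

for label in np.unique(y_toy):
    print(label, (y_tr == label).sum(), (y_te == label).sum())
```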

In [None]:
def ClassificationReport(clf):
  clf.fit(X_train, y_train)
  preds = clf.predict(X_test)
  return classification_report(y_test, preds)

In [None]:
classifiers = {
    'k-nearest neighbors': KNeighborsClassifier(),
    'svc': SVC(),
    'gaussian process': GaussianProcessClassifier(),
    'decision tree': DecisionTreeClassifier(),
    'random forest': RandomForestClassifier(),
    'mlp': MLPClassifier(),
    'ada boost': AdaBoostClassifier(),
    'gaussian nb': GaussianNB(),
    'quadratic discriminant analysis': QuadraticDiscriminantAnalysis()
}

# Baseline model
The dataset seems to be fairly well-structured and doesn't contain any null values, so before we go on with further data exploration and engineering, let's fit a set of baseline classifiers to see how they perform on the data as it is.

In [None]:
for name, clf in classifiers.items():
  print(f'\n\n\n{name}')
  print(ClassificationReport(clf))

## Model
Let's fit a basic logistic regression classification estimator.

In [None]:
clf = LogisticRegression(max_iter=10000)
clf.fit(X_train, y_train)

## Classification report

In [None]:
preds = clf.predict(X_test)
print(classification_report(y_test, preds))

We can see that there's quite a significant class imbalance.
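To illustrate how an imbalance shows up in a classification report, here is a sketch with made-up labels in which a classifier always predicts the majority class. The overall accuracy looks fine while the recall for the minority class is zero. The labels and predictions are invented, and `zero_division=0` only silences the warning for the class that is never predicted.

```python
from sklearn.metrics import classification_report

# Made-up ground truth: 90 wines rated 6 and only 10 rated 8.
y_true = [6] * 90 + [8] * 10

# A degenerate classifier that always predicts the majority class.
y_pred = [6] * 100

report = classification_report(y_true, y_pred, output_dict=True, zero_division=0)

print(f"accuracy: {report['accuracy']:.2f}")
print(f"recall for class 6: {report['6']['recall']:.2f}")
print(f"recall for class 8: {report['8']['recall']:.2f}")
```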

In [None]:
# from sklearn.preprocessing import StandardScaler
# scaler = StandardScaler()
# X_scaled = scaler.fit_transform(X)

In [None]:
# from sklearn.decomposition import PCA
# pca = PCA(n_components=8)
# X_pca = pca.fit_transform(X_scaled)

In [None]:
# X_train, X_test, y_train, y_test = train_test_split(X_pca, y)

In [None]:
# clf = LogisticRegression(max_iter=10000)
# clf.fit(X_train, y_train)

In [None]:
# preds = clf.predict(X_test)
# print(classification_report(y_test, preds))

In [None]:
# plotting lib
import seaborn as sns
import matplotlib.pyplot as plt

import xgboost
from xgboost import XGBClassifier
### pre-processing lib
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import LabelEncoder,OneHotEncoder,StandardScaler
from sklearn.model_selection import train_test_split,GridSearchCV,KFold,cross_val_predict,RandomizedSearchCV
### classification lib required
from sklearn.ensemble import AdaBoostClassifier,BaggingClassifier,GradientBoostingClassifier,RandomForestClassifier,VotingClassifier,StackingClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.decomposition import KernelPCA,PCA
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import LogisticRegression,SGDClassifier,RidgeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.svm import SVC
## different metrics
from sklearn.metrics import accuracy_score,r2_score

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y)

svm=SVC(gamma='scale', probability=True)
svm.fit(X_train,y_train)

svm_pred=svm.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,svm_pred)
print("SVM accuracy is :{}".format(score))

In [None]:
random_f=RandomForestClassifier(n_estimators=250)
random_f.fit(X_train,y_train)
random_f_pred=random_f.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,random_f_pred)
print("random forest accuracy is :{}".format(score))

In [None]:
log=LogisticRegression(solver='liblinear')
log.fit(X_train,y_train)
pred=log.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("LogisticRegression accuracy is :{}".format(score))

In [None]:
Decision=DecisionTreeClassifier()
Decision.fit(X_train,y_train)
pred=Decision.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("DecisionTreeClassifier accuracy is :{}".format(score))

In [None]:
gaussian=GaussianNB()
gaussian.fit(X_train,y_train)
pred=gaussian.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("GaussianNB accuracy is :{}".format(score))

In [None]:
KNN=KNeighborsClassifier()
KNN.fit(X_train,y_train)
pred=KNN.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("KNeighborsClassifier accuracy is :{}".format(score))

In [None]:
Ada=AdaBoostClassifier()
Ada.fit(X_train,y_train)
pred=Ada.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("AdaBoostClassifier accuracy is :{}".format(score))

In [None]:
Bagging=BaggingClassifier(n_estimators=300)
Bagging.fit(X_train,y_train)
pred=Bagging.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("BaggingClassifier accuracy is :{}".format(score))

In [None]:
Ex_Tree=ExtraTreesClassifier(n_estimators=300)
Ex_Tree.fit(X_train,y_train)
pred=Ex_Tree.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("ExtraTreesClassifier accuracy is :{}".format(score))

In [None]:
XGB=XGBClassifier()
XGB.fit(X_train,y_train)
pred=XGB.predict(X_test)
print("*"* 30)
score=accuracy_score(y_test,pred)
print("XGBClassifier accuracy is :{}".format(score))
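The cells above all repeat the same fit, predict and score steps. As a sketch of how they could be collapsed into one loop, the example below runs three of the classifiers on synthetic data generated with `make_classification`. The data, the dictionary keys and the variable names are assumptions for illustration, and the scores say nothing about the wine dataset.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the wine features and labels (not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

models = {
    'KNeighborsClassifier': KNeighborsClassifier(),
    'DecisionTreeClassifier': DecisionTreeClassifier(random_state=0),
    'GaussianNB': GaussianNB(),
}

# Fit every model on the same split and collect its test accuracy.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

# Report the models from best to worst.
for name, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(f'{name} accuracy is: {score:.3f}')
```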

## Statistical distribution

### Box plot
Let's investigate the statistical features of our dataset using a box plot.

In [None]:
data = [
        go.Box(
            y = df[col_name],
            notched=True, # notched appearance
            showlegend=True,
            name=col_name,
            boxpoints='outliers' # display outliers
            ) for col_name in df.columns
]

fig = go.Figure(
    data=data
)

fig.show()

## Remove outliers
We can see there are quite a lot of outliers, which can skew our later predictions. Let's clear them.

In [None]:
# def get_iqr_values(df_in, col_name):
#     median = df_in[col_name].median()
#     q1 = df_in[col_name].quantile(0.25) # 25th percentile / 1st quartile
#     q3 = df_in[col_name].quantile(0.75) # 75th percentile / 3rd quartile
#     iqr = q3-q1 #Interquartile range
#     minimum  = q1-1.5*iqr # The minimum value or the |- marker in the box plot
#     maximum = q3+1.5*iqr # The maximum value or the -| marker in the box plot
#     return median, q1, q3, iqr, minimum, maximum

In [None]:
# def remove_outliers(df_in, col_name):
#     _, _, _, _, minimum, maximum = get_iqr_values(df_in, col_name)
#     df_out = df_in.loc[(df_in[col_name] > minimum) & (df_in[col_name] < maximum)]
#     return df_out

In [None]:
# for col_name in df_red.columns:
#   df_red = remove_outliers(df_red, col_name)
# df_red
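The commented-out helpers above apply the interquartile-range (IQR) rule one column at a time. A compact sketch of the same idea, applied to all columns at once on a made-up DataFrame, could look like this. The toy values are invented, and unlike the helpers above the bounds here are inclusive.

```python
import pandas as pd

# Made-up values; the last alcohol reading is an obvious outlier.
toy = pd.DataFrame({
    'alcohol': [9.4, 9.8, 10.0, 10.2, 9.9, 10.1, 9.7, 20.0],
    'pH': [3.2, 3.3, 3.1, 3.25, 3.15, 3.3, 3.2, 3.2],
})

q1 = toy.quantile(0.25)  # 1st quartile per column
q3 = toy.quantile(0.75)  # 3rd quartile per column
iqr = q3 - q1            # interquartile range per column

lower = q1 - 1.5 * iqr   # lower whisker of the box plot
upper = q3 + 1.5 * iqr   # upper whisker of the box plot

# Keep the rows whose values fall inside the whiskers in every column.
mask = ((toy >= lower) & (toy <= upper)).all(axis=1)
filtered = toy[mask]

print(f'{len(toy) - len(filtered)} row(s) removed')
```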


In [None]:
df_red.corr()

https://towardsdatascience.com/comparing-classification-models-for-wine-quality-prediction-6c5f26669a4f