# Classification vs Regression: Vinho Verde Wine Quality

The following project is based on the Wine Quality dataset found on UCI's Machine Learning Repository (https://archive.ics.uci.edu/ml/datasets/wine+quality). It contains two data sets (white and red wine), with 12 attributes each. 

The main task is to create a model that predicts the quality of the wine, which is a score between 0 and 10.

This is the first project I have attempted that can be solved with both regression and classification models. As a result, it concentrates on comparing the performance of one regression model and one classification model.

Since there are two datasets, we are going to combine them into one and add an extra column reflecting the wine type.

## Plan of Attack

<ol>
    <li>Data Cleaning: check for</li>
        <ul>
            <li>Null values</li>
            <li>Missing values</li>
            <li>Duplicates</li>
        </ul>
    <li>Exploratory Analysis: helps us to understand the dataset better</li>
    <li>Regression model: including</li>
        <ul>
            <li>Data Preprocessing</li>
            <li>Training model</li>
            <li>Evaluating model</li>
        </ul>
    <li>Classification model: including</li>
        <ul>
            <li>Data Preprocessing</li>
            <li>Training model</li>
            <li>Evaluating model</li>
        </ul>
    <li>Final words</li>
</ol>
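The data cleaning checks in step 1 can be sketched on a small toy dataframe. This is only an illustration: the values and the helper name `cleaning_report` are made up for the example and are not part of the project code. In pandas, null and missing values are both detected by `isna()`, so one call covers the first two checks.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical values (not the wine dataset)
toy = pd.DataFrame({
    "alcohol": [9.4, 9.8, np.nan, 9.4],
    "pH": [3.51, 3.20, 3.26, 3.51],
    "quality": [5, 5, 6, 5],
})

def cleaning_report(df):
    """Return per-column null counts and the number of fully duplicated rows."""
    null_counts = df.isna().sum()              # NaN / None per column
    n_duplicates = int(df.duplicated().sum())  # rows identical to an earlier row
    return null_counts, n_duplicates

null_counts, n_duplicates = cleaning_report(toy)
print(null_counts)
print("duplicate rows:", n_duplicates)
```

Here the last row repeats the first one, so it is counted as a duplicate, and the `alcohol` column reports one missing value.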

## Importing Libraries

In [1]:
# data manipulation
import pandas as pd
import numpy as np

# data visualisation
import matplotlib.pyplot as plt
import seaborn as sns

## Loading Datasets

First, we need to read the csv files. Although csv stands for comma-separated values, the values in these particular files are separated by ';', so we have to pass the 'sep' parameter to read_csv().

In [6]:
red_wine = pd.read_csv('datasets/winequality-red.csv', sep=';')
white_wine = pd.read_csv('datasets/winequality-white.csv', sep=';')
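To see why 'sep' matters, here is a minimal sketch that reads a short in-memory sample instead of the real files. The sample string is an assumption made for the illustration: it uses only three of the columns and two rows, laid out with the same ';' separator.

```python
import io
import pandas as pd

# Small ';'-separated sample (a subset of columns, for illustration only)
sample = "fixed acidity;pH;quality\n7.4;3.51;5\n7.8;3.2;5\n"

# Default separator is ',', so each line ends up in a single column
default_read = pd.read_csv(io.StringIO(sample))

# With sep=';' the values are split into separate columns
semicolon_read = pd.read_csv(io.StringIO(sample), sep=';')

print(default_read.shape)
print(semicolon_read.shape)
print(list(semicolon_read.columns))
```

Without 'sep', pandas does not raise an error. It silently produces a one-column dataframe, which is easy to miss until later steps fail.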

Since we want to unite both dataframes into one, we need a custom function that combines them and creates a new column, Type, which records which dataframe each wine comes from (red vs white).

In [7]:
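def unite_wines(red_wine, white_wine):
    # assign() returns labelled copies, so the input dataframes are not modified
    red = red_wine.assign(Type='Red')
    white = white_wine.assign(Type='White')
    # reset_index(drop=True) gives the combined dataframe one continuous index
    return pd.concat([red, white], sort=True).reset_index(drop=True)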

In [36]:
df_wine = unite_wines(red_wine, white_wine)
df_wine.head()

Unnamed: 0,Type,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
0,Red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7
1,Red,9.8,0.098,0.0,0.9968,7.8,25.0,3.2,5,2.6,0.68,67.0,0.88
2,Red,9.8,0.092,0.04,0.997,7.8,15.0,3.26,5,2.3,0.65,54.0,0.76
3,Red,9.8,0.075,0.56,0.998,11.2,17.0,3.16,6,1.9,0.58,60.0,0.28
4,Red,9.4,0.076,0.0,0.9978,7.4,11.0,3.51,5,1.9,0.56,34.0,0.7


In [10]:
df_wine.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6497 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   Type                  6497 non-null   object 
 1   alcohol               6497 non-null   float64
 2   chlorides             6497 non-null   float64
 3   citric acid           6497 non-null   float64
 4   density               6497 non-null   float64
 5   fixed acidity         6497 non-null   float64
 6   free sulfur dioxide   6497 non-null   float64
 7   pH                    6497 non-null   float64
 8   quality               6497 non-null   int64  
 9   residual sugar        6497 non-null   float64
 10  sulphates             6497 non-null   float64
 11  total sulfur dioxide  6497 non-null   float64
 12  volatile acidity      6497 non-null   float64
dtypes: float64(11), int64(1), object(1)
memory usage: 660.0+ KB


In [35]:
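# remove rows that are exact duplicates across all columns (including Type)
df_wine = df_wine.drop_duplicates()
print(len(df_wine))
# reset_index() returns a new dataframe and is not assigned here,
# so df_wine itself keeps the original row labels
df_wine.reset_index()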

5320


Unnamed: 0,index,Type,alcohol,chlorides,citric acid,density,fixed acidity,free sulfur dioxide,pH,quality,residual sugar,sulphates,total sulfur dioxide,volatile acidity
0,0,Red,9.4,0.076,0.00,0.99780,7.4,11.0,3.51,5,1.9,0.56,34.0,0.70
1,1,Red,9.8,0.098,0.00,0.99680,7.8,25.0,3.20,5,2.6,0.68,67.0,0.88
2,2,Red,9.8,0.092,0.04,0.99700,7.8,15.0,3.26,5,2.3,0.65,54.0,0.76
3,3,Red,9.8,0.075,0.56,0.99800,11.2,17.0,3.16,6,1.9,0.58,60.0,0.28
4,5,Red,9.4,0.075,0.00,0.99780,7.4,13.0,3.51,5,1.8,0.56,40.0,0.66
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5315,6492,White,11.2,0.039,0.29,0.99114,6.2,24.0,3.27,6,1.6,0.50,92.0,0.21
5316,6493,White,9.6,0.047,0.36,0.99490,6.6,57.0,3.15,5,8.0,0.46,168.0,0.32
5317,6494,White,9.4,0.041,0.19,0.99254,6.5,30.0,2.99,6,1.2,0.46,111.0,0.24
5318,6495,White,12.8,0.022,0.30,0.98869,5.5,20.0,3.34,7,1.1,0.38,110.0,0.29
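In the output above, the index column jumps from 3 to 5 because drop_duplicates() keeps the original row labels of the rows that survive. The following sketch shows the same effect on toy data (hypothetical values) and one way to get a clean, continuous index, assuming the old labels are no longer needed.

```python
import pandas as pd

# Toy data: row 2 is an exact copy of row 0
toy = pd.DataFrame({"alcohol": [9.4, 9.8, 9.4, 9.6], "quality": [5, 5, 5, 6]})

deduped = toy.drop_duplicates()   # drops row 2, keeps the original labels
print(list(deduped.index))        # labels now have a gap

# Assign the result; drop=True discards the old labels instead of
# adding them as an extra 'index' column
renumbered = deduped.reset_index(drop=True)
print(list(renumbered.index))
print(list(renumbered.columns))
```

Whether to keep the old labels is a design choice: they are useful for tracing rows back to the raw files, while a continuous index is simpler for positional work later on.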
