The Wine Quality dataset contains data on various physicochemical properties of red and white variants of Portuguese "Vinho Verde" wine. It is commonly used for classification and regression tasks in machine learning. The goal is typically to predict the quality of the wine based on its attributes.

### Dataset Details

- **Source**: The dataset is available from the UCI Machine Learning Repository.
- **Attributes**: It includes 11 input variables and 1 output variable.

### Attributes

1. **Fixed Acidity**: Most acids in wine are fixed, or nonvolatile (they do not evaporate readily).
2. **Volatile Acidity**: The amount of acetic acid in wine, which at high levels can lead to an unpleasant, vinegar taste.
3. **Citric Acid**: Found in small quantities, citric acid can add 'freshness' and flavor to wines.
4. **Residual Sugar**: The amount of sugar remaining after fermentation stops, important for determining the sweetness of the wine.
5. **Chlorides**: The amount of salt in the wine.
6. **Free Sulfur Dioxide**: The free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine.
7. **Total Sulfur Dioxide**: Amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine.
8. **Density**: The density of wine is close to that of water, depending on the percent alcohol and sugar content.
9. **pH**: Describes how acidic or basic the wine is.
10. **Sulphates**: A wine additive that can contribute to sulfur dioxide gas (SO2) levels and acts as an antimicrobial and antioxidant.
11. **Alcohol**: The percent alcohol content of the wine.

### Output Variable

- **Quality**: Wine quality score (between 0 and 10).

### Applications

- **Classification**: Predicting the quality category of the wine (e.g., low, medium, high).
- **Regression**: Predicting the exact quality score of the wine.
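
For the classification framing, the numeric quality score has to be mapped to categories first. A minimal sketch of one way to do this with `pd.cut` follows. The cut points (4 and 6) and the `scores` series are assumptions chosen for illustration; the dataset itself does not define low, medium, or high.

```python
import pandas as pd

# Hypothetical quality scores standing in for the dataset's `quality` column.
scores = pd.Series([3, 4, 5, 6, 7, 8], name="quality")

# Assumed thresholds: scores up to 4 are "low", 5 to 6 "medium", 7 and above "high".
# Bins are right-inclusive: (-1, 4], (4, 6], (6, 10].
quality_label = pd.cut(
    scores,
    bins=[-1, 4, 6, 10],
    labels=["low", "medium", "high"],
)

print(quality_label.tolist())
# ['low', 'low', 'medium', 'medium', 'high', 'high']
```

For the regression framing, the `quality` column can be used as the target unchanged.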

### Accessing the Dataset

You can download the Wine Quality dataset from the UCI Machine Learning Repository:
- [Wine Quality Dataset (UCI)](https://archive.ics.uci.edu/ml/datasets/wine+quality)
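
The CSV files used in this notebook are semicolon-separated, so the separator has to be passed explicitly to `pd.read_csv`. The sketch below shows this on a tiny in-memory stand-in rather than the real file. The `sample` text and the `df_sample` name are illustrative, and only three of the twelve columns are included.

```python
import io

import pandas as pd

# Tiny in-memory stand-in for a semicolon-separated wine file (illustrative only).
sample = io.StringIO(
    '"fixed acidity";"alcohol";"quality"\n'
    "7.4;9.4;5\n"
    "7.8;9.8;5\n"
)

# sep=";" tells pandas to split on semicolons instead of the default comma.
df_sample = pd.read_csv(sample, sep=";")

print(df_sample.shape)
print(df_sample.columns.tolist())
```

Without the separator argument, pandas would read each line as a single column.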


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from matplotlib import cm
warnings.filterwarnings('ignore')
pd.options.display.float_format = "{:.2f}".format


In [None]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


## LOAD DATASET

In [None]:
df_red = pd.read_csv("/content/drive/MyDrive/0.Latest_DS_Course/Statistics/Descriptive/8.EDA/data/winequality-red.csv", delimiter=";")

In [None]:
df_white = pd.read_csv("/content/drive/MyDrive/0.Latest_DS_Course/Statistics/Descriptive/8.EDA/data/winequality-white.csv", delimiter=";")

## UNDERSTANDING DATA

#### RED WINE DATASET

In [None]:
df_red.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.4,0.7,0.0,1.9,0.08,11.0,34.0,1.0,3.51,0.56,9.4,5
1,7.8,0.88,0.0,2.6,0.1,25.0,67.0,1.0,3.2,0.68,9.8,5
2,7.8,0.76,0.04,2.3,0.09,15.0,54.0,1.0,3.26,0.65,9.8,5
3,11.2,0.28,0.56,1.9,0.07,17.0,60.0,1.0,3.16,0.58,9.8,6
4,7.4,0.7,0.0,1.9,0.08,11.0,34.0,1.0,3.51,0.56,9.4,5


In [None]:
df_red.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
1594,6.2,0.6,0.08,2.0,0.09,32.0,44.0,0.99,3.45,0.58,10.5,5
1595,5.9,0.55,0.1,2.2,0.06,39.0,51.0,1.0,3.52,0.76,11.2,6
1596,6.3,0.51,0.13,2.3,0.08,29.0,40.0,1.0,3.42,0.75,11.0,6
1597,5.9,0.65,0.12,2.0,0.07,32.0,44.0,1.0,3.57,0.71,10.2,5
1598,6.0,0.31,0.47,3.6,0.07,18.0,42.0,1.0,3.39,0.66,11.0,6


In [None]:
df_red.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB


In [None]:
df_red.dtypes

Unnamed: 0,0
fixed acidity,float64
volatile acidity,float64
citric acid,float64
residual sugar,float64
chlorides,float64
free sulfur dioxide,float64
total sulfur dioxide,float64
density,float64
pH,float64
sulphates,float64


In [None]:
df_red.shape

(1599, 12)

In [None]:
df_red.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

#### WHITE WINE DATASET

In [None]:
df_white.head()


Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
0,7.0,0.27,0.36,20.7,0.04,45.0,170.0,1.0,3.0,0.45,8.8,6
1,6.3,0.3,0.34,1.6,0.05,14.0,132.0,0.99,3.3,0.49,9.5,6
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,1.0,3.26,0.44,10.1,6
3,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6
4,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6


In [None]:
df_white.tail()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality
4893,6.2,0.21,0.29,1.6,0.04,24.0,92.0,0.99,3.27,0.5,11.2,6
4894,6.6,0.32,0.36,8.0,0.05,57.0,168.0,0.99,3.15,0.46,9.6,5
4895,6.5,0.24,0.19,1.2,0.04,30.0,111.0,0.99,2.99,0.46,9.4,6
4896,5.5,0.29,0.3,1.1,0.02,20.0,110.0,0.99,3.34,0.38,12.8,7
4897,6.0,0.21,0.38,0.8,0.02,22.0,98.0,0.99,3.26,0.32,11.8,6


In [None]:
df_white.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         4898 non-null   float64
 1   volatile acidity      4898 non-null   float64
 2   citric acid           4898 non-null   float64
 3   residual sugar        4898 non-null   float64
 4   chlorides             4898 non-null   float64
 5   free sulfur dioxide   4898 non-null   float64
 6   total sulfur dioxide  4898 non-null   float64
 7   density               4898 non-null   float64
 8   pH                    4898 non-null   float64
 9   sulphates             4898 non-null   float64
 10  alcohol               4898 non-null   float64
 11  quality               4898 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 459.3 KB


In [None]:
df_white.dtypes

Unnamed: 0,0
fixed acidity,float64
volatile acidity,float64
citric acid,float64
residual sugar,float64
chlorides,float64
free sulfur dioxide,float64
total sulfur dioxide,float64
density,float64
pH,float64
sulphates,float64


In [None]:
df_white.shape

(4898, 12)

In [None]:
df_white.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')

In [None]:
df_red.columns, df_white.columns

(Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol', 'quality'],
       dtype='object'),
 Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
        'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
        'pH', 'sulphates', 'alcohol', 'quality'],
       dtype='object'))

In [None]:
df_red.to_csv("red.csv")

In [None]:
df_red.duplicated().sum()

240

In [None]:
df_white.duplicated().sum()

937


### Observations

- The red wine dataset contains 1,599 records and 12 columns (11 input attributes plus the quality score), while the white wine dataset contains 4,898 records with the same columns.
- Neither dataset has missing values, but both contain duplicate rows: 240 in the red dataset and 937 in the white dataset.
- For more comprehensive analysis, we plan to merge the two datasets and add a new categorical column to indicate the wine color.
- The data types across all columns are consistent and appropriate.
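
The row, column, missing-value, and duplicate counts behind these observations can be collected in one place. The following is a minimal sketch using a hypothetical helper, `summarize`, applied to a toy DataFrame; neither the helper nor the toy data is part of the original notebook.

```python
import pandas as pd


def summarize(df: pd.DataFrame) -> dict:
    """Return basic data-quality counts for one DataFrame (hypothetical helper)."""
    return {
        "rows": df.shape[0],
        "columns": df.shape[1],
        "missing": int(df.isnull().sum().sum()),     # total missing cells
        "duplicates": int(df.duplicated().sum()),    # rows repeating an earlier row
    }


# Toy data: the third row repeats the first one exactly.
toy = pd.DataFrame({"pH": [3.51, 3.20, 3.51], "quality": [5, 5, 5]})
summary = summarize(toy)

print(summary)
# {'rows': 3, 'columns': 2, 'missing': 0, 'duplicates': 1}
```

Calling `summarize(df_red)` and `summarize(df_white)` would give the same counts as the individual cells above.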


#### Merge both datasets, adding color column to distinguish between white and red wine records

In [None]:
df_red.shape, df_white.shape

((1599, 12), (4898, 12))

In [None]:
# create color array for red dataframe
color_red = np.repeat('red', df_red.shape[0])

# create color array for white dataframe
color_white = np.repeat('white', df_white.shape[0])

In [None]:
color_red, len(color_red)

(array(['red', 'red', 'red', ..., 'red', 'red', 'red'], dtype='<U3'), 1599)

In [None]:
df_red.columns == df_white.columns

array([ True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True,  True])

In [None]:
df_red.shape, df_white.shape

((1599, 12), (4898, 12))

In [None]:
# appending new column and confirming changes
df_white['color'] = color_white
df_white.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.0,0.27,0.36,20.7,0.04,45.0,170.0,1.0,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.05,14.0,132.0,0.99,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,1.0,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6,white


In [None]:
# appending new column and confirming changes
df_red['color'] = color_red
df_red.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.4,0.7,0.0,1.9,0.08,11.0,34.0,1.0,3.51,0.56,9.4,5,red
1,7.8,0.88,0.0,2.6,0.1,25.0,67.0,1.0,3.2,0.68,9.8,5,red
2,7.8,0.76,0.04,2.3,0.09,15.0,54.0,1.0,3.26,0.65,9.8,5,red
3,11.2,0.28,0.56,1.9,0.07,17.0,60.0,1.0,3.16,0.58,9.8,6,red
4,7.4,0.7,0.0,1.9,0.08,11.0,34.0,1.0,3.51,0.56,9.4,5,red


### CONCATENATING BOTH DATASETS

In [None]:
# append dataframes and confirm changes
wine_df = pd.concat([df_white, df_red], axis = 0, ignore_index = True)
wine_df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,color
0,7.0,0.27,0.36,20.7,0.04,45.0,170.0,1.0,3.0,0.45,8.8,6,white
1,6.3,0.3,0.34,1.6,0.05,14.0,132.0,0.99,3.3,0.49,9.5,6,white
2,8.1,0.28,0.4,6.9,0.05,30.0,97.0,1.0,3.26,0.44,10.1,6,white
3,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6,white
4,7.2,0.23,0.32,8.5,0.06,47.0,186.0,1.0,3.19,0.4,9.9,6,white


In [None]:
wine_df.to_csv("wine.csv")

In [None]:

wine_df.shape[0] == df_red.shape[0] + df_white.shape[0]

True

In [None]:
df_red.duplicated().sum(), df_white.duplicated().sum()

(240, 937)

In [None]:
df_red.duplicated().sum() + df_white.duplicated().sum()

1177

In [None]:
wine_df.duplicated().sum() == df_red.duplicated().sum() + df_white.duplicated().sum()

True
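
The combined duplicate count matches the sum of the per-dataset counts because the `color` column is added before concatenation, so a red row can never be an exact duplicate of a white row. The sketch below illustrates this on toy frames; the `red` and `white` frames and their values are made up for the example.

```python
import pandas as pd

# Toy frames: one white row has the same measurements as the red rows.
red = pd.DataFrame({"pH": [3.2, 3.2], "quality": [5, 5]})
white = pd.DataFrame({"pH": [3.2, 3.4], "quality": [5, 6]})

# Without a color label, rows are compared across colors as well.
plain = pd.concat([white, red], ignore_index=True)
dups_plain = int(plain.duplicated().sum())

# With a color label, only rows of the same color can match.
labeled = pd.concat(
    [white.assign(color="white"), red.assign(color="red")],
    ignore_index=True,
)
dups_labeled = int(labeled.duplicated().sum())

per_frame = int(red.duplicated().sum() + white.duplicated().sum())

print(dups_plain, dups_labeled, per_frame)
# 2 1 1
```

If the color column were dropped later, rows with identical measurements in both colors would count as duplicates again.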

In [None]:
wine_df.duplicated().sum()

1177

In [None]:
wine_df = wine_df.drop_duplicates()

In [None]:
wine_df.duplicated().sum()

0

In [None]:
wine_df.isnull().sum()

Unnamed: 0,0
fixed acidity,0
volatile acidity,0
citric acid,0
residual sugar,0
chlorides,0
free sulfur dioxide,0
total sulfur dioxide,0
density,0
pH,0
sulphates,0


In [None]:
wine_df.shape

(5320, 13)

In [None]:
wine_df.to_csv("wine.csv")

In [None]:
# casting color column and confirming changes
wine_df['color'] = wine_df['color'].astype('category')
wine_df['color'].dtype

CategoricalDtype(categories=['red', 'white'], ordered=False, categories_dtype=object)
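
The categories are listed as `['red', 'white']` even though the white rows come first in `wine_df`. When `astype('category')` infers the categories from string values, it sorts them, and each row then stores an integer code pointing into that list. A small sketch on a toy series (the `colors` name is illustrative):

```python
import pandas as pd

# Toy series with "white" appearing first, as in the concatenated frame.
colors = pd.Series(["white", "red", "white"]).astype("category")

# Inferred categories are sorted, not kept in order of appearance.
print(list(colors.cat.categories))
# ['red', 'white']

# Each value is stored as an integer code into the category list.
print(colors.cat.codes.tolist())
# [1, 0, 1]
```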

In [None]:
wine_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 5320 entries, 0 to 6496
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   fixed acidity         5320 non-null   float64 
 1   volatile acidity      5320 non-null   float64 
 2   citric acid           5320 non-null   float64 
 3   residual sugar        5320 non-null   float64 
 4   chlorides             5320 non-null   float64 
 5   free sulfur dioxide   5320 non-null   float64 
 6   total sulfur dioxide  5320 non-null   float64 
 7   density               5320 non-null   float64 
 8   pH                    5320 non-null   float64 
 9   sulphates             5320 non-null   float64 
 10  alcohol               5320 non-null   float64 
 11  quality               5320 non-null   int64   
 12  color                 5320 non-null   category
dtypes: category(1), float64(11), int64(1)
memory usage: 545.6 KB
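
The index line above reads `5320 entries, 0 to 6496` because `drop_duplicates()` keeps the original row labels of the surviving rows. If a contiguous index is preferred, `reset_index(drop=True)` is one option. The sketch below uses a toy frame and is an optional step that the notebook itself does not apply.

```python
import pandas as pd

# Toy frame with two duplicated values.
toy = pd.DataFrame({"pH": [3.0, 3.0, 3.3, 3.3, 3.5]})

# drop_duplicates keeps the first occurrence and its original label.
deduped = toy.drop_duplicates()
print(deduped.index.tolist())
# [0, 2, 4]

# reset_index(drop=True) renumbers the rows and discards the old labels.
renumbered = deduped.reset_index(drop=True)
print(renumbered.index.tolist())
# [0, 1, 2]
```

Position-based access with `.iloc` works the same either way; the gaps only matter for label-based access with `.loc`.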


In [None]:
wine_df.columns

Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality', 'color'],
      dtype='object')

## DATA CLEANING