# Wine Quality Analysis

## Summary
TODO


## Introduction 
### (I) Background Introduction:
Using the chemical properties of red and white vinho verde wine samples from the north of Portugal, we want to build a classifier that predicts whether a wine is high or low quality.
### (II) Data Description:
We will be working with two datasets, `winequality-red.csv` and `winequality-white.csv`, which contain 1599 and 4898 observations respectively. Both datasets contain the same 12 variables:
- `fixed acidity` (dbl) - concentration of fixed acids in the wine
- `volatile acidity` (dbl) - amount of acetic acid, which can give a vinegar taste
- `citric acid` (dbl) - concentration of citric acid
- `residual sugar` (dbl) - amount of sugar remaining after fermentation
- `chlorides` (dbl) - salt content of the wine
- `free sulfur dioxide` (dbl) - free SO₂ level
- `total sulfur dioxide` (dbl) - total SO₂ level
- `density` (dbl) - density of the wine
- `pH` (dbl) - acidity level of the wine
- `sulphates` (dbl) - potassium sulphate level (wine preservative)
- `alcohol` (dbl) - alcohol percentage by volume
- `quality` (int) - quality score assigned to the wine, on a scale from 0 to 10 (target variable)
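
The goal of predicting "high" versus "low" quality implies turning the integer `quality` score into a binary label. The sketch below shows one way to do that. The cutoff of 7 and the names `HIGH_QUALITY_MIN` and `label_quality` are assumptions made for illustration; this document does not state where the boundary between high and low quality lies.

```python
import pandas as pd

# Hypothetical cutoff: the analysis does not specify where "high" begins.
HIGH_QUALITY_MIN = 7


def label_quality(scores: pd.Series, cutoff: int = HIGH_QUALITY_MIN) -> pd.Series:
    """Map integer quality scores to "high" / "low" string labels."""
    return scores.ge(cutoff).map({True: "high", False: "low"})


# Toy scores for illustration only, not taken from the dataset.
toy_scores = pd.Series([5, 6, 7, 3, 8], name="quality")
labels = label_quality(toy_scores)
print(labels.tolist())
```

A different cutoff changes the class balance considerably, so the choice should be checked against the observed distribution of `quality` before modelling.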


## Loading Library and Dataset 

In [1]:
import pandas as pd 

# loading data
red = pd.read_csv("data/raw/winequality-red.csv", sep=";")
white = pd.read_csv("data/raw/winequality-white.csv", sep=";")

# information about variables and observation 
red.info()
white.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1599 entries, 0 to 1598
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   fixed acidity         1599 non-null   float64
 1   volatile acidity      1599 non-null   float64
 2   citric acid           1599 non-null   float64
 3   residual sugar        1599 non-null   float64
 4   chlorides             1599 non-null   float64
 5   free sulfur dioxide   1599 non-null   float64
 6   total sulfur dioxide  1599 non-null   float64
 7   density               1599 non-null   float64
 8   pH                    1599 non-null   float64
 9   sulphates             1599 non-null   float64
 10  alcohol               1599 non-null   float64
 11  quality               1599 non-null   int64  
dtypes: float64(11), int64(1)
memory usage: 150.0 KB
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4898 entries, 0 to 4897
Data columns (total 12 columns):
 #   Column        

## Cleaning and Merging Dataset 
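Column names were standardized to lowercase with underscores in place of spaces (for example, `fixed acidity` becomes `fixed_acidity`) so that they are easier to read and to reference in code. A `wine_type` column was added to each dataset to record whether a sample is red or white. The two datasets were then concatenated into a single dataset and saved as `wine.csv` in the processed data folder.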

In [2]:
# clean column names
red.columns = red.columns.str.strip().str.lower().str.replace(" ", "_")
white.columns = white.columns.str.strip().str.lower().str.replace(" ", "_")

# add wine type label 
red["wine_type"] = "red"
white["wine_type"] = "white"

# merge the two datasets
wine = pd.concat([red, white], ignore_index=True)

# save to processed folder
wine.to_csv("data/processed/wine.csv", index=False)
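
The cleaning and merging steps can be sanity-checked on small stand-in frames before trusting the result on the full data. The sketch below is a minimal example under stated assumptions: the helper `clean_columns` and the toy values are hypothetical and are not part of the original notebook.

```python
import pandas as pd


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with stripped, lowercase, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    return out


# Toy frames standing in for the red and white files (values are made up).
toy_red = pd.DataFrame({"fixed acidity": [7.4], "pH": [3.51], "quality": [5]})
toy_white = pd.DataFrame({"fixed acidity": [7.0, 6.3], "pH": [3.0, 3.3], "quality": [6, 6]})

toy_red = clean_columns(toy_red).assign(wine_type="red")
toy_white = clean_columns(toy_white).assign(wine_type="white")
toy_wine = pd.concat([toy_red, toy_white], ignore_index=True)

# The merged frame should keep every row and carry the cleaned column names.
assert len(toy_wine) == len(toy_red) + len(toy_white)
print(list(toy_wine.columns))
print(toy_wine["wine_type"].value_counts().to_dict())
```

The same two checks (row count and per-type counts) applied to `wine` should give 6497 rows in total, with 1599 red and 4898 white, matching the observation counts stated in the data description.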

In [4]:
wine.head()

| | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality | wine_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |


## Splitting into Training and Testing Dataset

In [3]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    wine,
    test_size=0.30,
    random_state=42,
    stratify=wine["quality"]
)

print(train.shape)
print(test.shape)

(4547, 13)
(1950, 13)
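
Passing `stratify=wine["quality"]` asks `train_test_split` to keep the proportion of each quality score roughly equal in the training and testing sets. The sketch below illustrates how that could be verified. It uses a synthetic frame (`toy`, with a made-up 70/30 class mix) rather than the wine data, so the printed shares are only an illustration of the check.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the merged data: 70 rows score 5, 30 rows score 6.
toy = pd.DataFrame({"x": range(100), "quality": [5] * 70 + [6] * 30})

toy_train, toy_test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy["quality"]
)

# Share of each class within each split; stratification should keep
# these close to the overall 0.7 / 0.3 mix.
train_share = toy_train["quality"].value_counts(normalize=True).sort_index()
test_share = toy_test["quality"].value_counts(normalize=True).sort_index()
print(pd.DataFrame({"train": train_share, "test": test_share}))
```

Running the same comparison on `train["quality"]` and `test["quality"]` would show whether rare scores are represented in both splits.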
