# Wine Quality Analysis

## Summary
`TODO`


## Introduction 
### (I) Background
This project explores the chemical properties of red and white "Vinho Verde" wine from the north of Portugal. We want to classify and predict whether a wine is high or low quality.

`TODO: more background info on the topic and state the question we try to answer`

### (II) Data Description
The data is sourced from the UCI Machine Learning Repository (CITATION). We will be working with the `winequality-red.csv` and `winequality-white.csv` datasets, which contain 1599 and 4898 observations respectively. Both datasets contain the same 12 variables:
- `fixed acidity` (dbl) – concentration of fixed acids in the wine
- `volatile acidity` (dbl) – amount of acetic acid, which can give a vinegar taste
- `citric acid` (dbl) – concentration of citric acid
- `residual sugar` (dbl) – amount of sugar remaining after fermentation
- `chlorides` (dbl) – salt content of the wine
- `free sulfur dioxide` (dbl) – free SO₂ level
- `total sulfur dioxide` (dbl) – total SO₂ level
- `density` (dbl) – density of the wine
- `pH` (dbl) – acidity level of the wine
- `sulphates` (dbl) – potassium sulphate level (wine preservative)
- `alcohol` (dbl) – alcohol percentage by volume
- `quality` (int) – quality score assigned to the wine, on a scale from 0 to 10 (target variable)


## Methods

### 1. Data Acquisition
The primary data source consists of two separate CSV files hosted on the UCI Machine Learning Repository, representing red and white wine samples. We begin by loading these files directly into our environment.

In [13]:
# load library and dataset 
import pandas as pd 

# load data from the original source on the web
red_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
white_url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"

# the raw files use semicolons as delimiters
red = pd.read_csv(red_url, sep=";")
white = pd.read_csv(white_url, sep=";")
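
The effect of the `sep=";"` argument can be seen on a small in-memory sample. This is a minimal sketch: the two rows below are made up for illustration and only imitate a semicolon-separated layout; they are not read from the UCI files.

```python
import io

import pandas as pd

# Made-up rows in a semicolon-separated layout (illustrative only).
sample = io.StringIO(
    "fixed acidity;volatile acidity;quality\n"
    "7.4;0.7;5\n"
    "7.8;0.88;5\n"
)

# With sep=";" each value lands in its own column.
parsed = pd.read_csv(sample, sep=";")
print(parsed.shape)
print(list(parsed.columns))

# With the default comma delimiter, every line is read as a single field.
sample.seek(0)
unparsed = pd.read_csv(sample)
print(unparsed.shape)
```

A quick look at `red.shape` and `white.shape` after loading is a cheap way to confirm that the delimiter was handled as intended.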

### 2. Data Cleaning and Standardization
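To ensure the datasets can be merged and analyzed seamlessly, we standardize the column names: we strip surrounding whitespace, convert all headers to lowercase, and replace spaces with underscores, so that each name is a valid Python identifier (for example, `fixed acidity` becomes `fixed_acidity`).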

In [17]:
# clean column names
red.columns = red.columns.str.strip().str.lower().str.replace(" ", "_")
white.columns = white.columns.str.strip().str.lower().str.replace(" ", "_")
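
As a small illustration of what this chain of string methods does, the sketch below applies it to an empty toy dataframe. The helper name `clean_columns` and the toy headers (including the extra leading and trailing spaces) are hypothetical and exist only for this example.

```python
import pandas as pd


def clean_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with stripped, lowercased, underscore-separated column names."""
    out = df.copy()
    out.columns = out.columns.str.strip().str.lower().str.replace(" ", "_")
    return out


# Toy headers with stray whitespace and mixed case (hypothetical input).
toy = pd.DataFrame(columns=[" fixed acidity", "pH", "free sulfur dioxide "])
cleaned = clean_columns(toy)
print(list(cleaned.columns))
```

Note that lowercasing also turns `pH` into `ph`, which is why the combined dataset uses `ph` as the column name.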

### 3. Data Integration 
Since we plan to analyze both wine types together, we preserve the identity of each observation before merging. We add a categorical column `wine_type` to each dataframe, concatenate the two into a single tidy dataset, and save the result to the processed data folder.

In [18]:
import os

# add wine type labels
red["wine_type"] = "red"
white["wine_type"] = "white"

# merge the datasets
wine = pd.concat([red, white], ignore_index=True)

# save to processed folder, creating it first if it does not exist
os.makedirs("data/processed", exist_ok=True)
wine.to_csv("data/processed/wine.csv", index=False)
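
The labelling and concatenation step can be checked on toy data. The sketch below uses two small made-up dataframes (`red_toy` and `white_toy` are hypothetical names, and the values are illustrative) to show that no rows are lost, that each row keeps its label, and that `ignore_index=True` produces a fresh index.

```python
import pandas as pd

# Made-up stand-ins for the red and white tables.
red_toy = pd.DataFrame({"alcohol": [9.4, 9.8], "quality": [5, 5]})
white_toy = pd.DataFrame({"alcohol": [8.8, 9.5, 10.1], "quality": [6, 6, 6]})

red_toy["wine_type"] = "red"
white_toy["wine_type"] = "white"

combined = pd.concat([red_toy, white_toy], ignore_index=True)

# Row count equals the sum of the inputs, and label counts match each input.
counts = combined["wine_type"].value_counts().to_dict()
print(len(combined), counts)

# ignore_index=True replaces the original indices with 0..n-1.
print(list(combined.index))
```

Applied to the real data, the same check should give 1599 + 4898 = 6497 rows in `wine`.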

In [14]:
wine.head().style.set_caption("Table 1: Combined Wine Dataset Preview")

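Table 1: Combined Wine Dataset Preview

|   | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph | sulphates | alcohol | quality | wine_type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |
| 1 | 7.8 | 0.88 | 0.0 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.2 | 0.68 | 9.8 | 5 | red |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.997 | 3.26 | 0.65 | 9.8 | 5 | red |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.998 | 3.16 | 0.58 | 9.8 | 6 | red |
| 4 | 7.4 | 0.7 | 0.0 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 | red |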


### 4. Data Partitioning (Train/Test Split)
To avoid data leakage and ensure we can evaluate our model's performance on unseen data, we split the data into a training set (70%) and a test set (30%). We stratify on the `quality` variable so that both sets have a similar distribution of quality scores.

In [5]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(
    wine,
    test_size=0.30,
    random_state=42,
    stratify=wine["quality"]
)

print(train.shape)
print(test.shape)

(4547, 13)
(1950, 13)
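
To illustrate what stratification buys us, the sketch below runs the same kind of split on a synthetic table and compares the share of each quality score in the two parts. The data is randomly generated for illustration only (the class probabilities and the names `toy`, `toy_train`, and `toy_test` are assumptions, not values from the wine data).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the combined wine table (not the real scores).
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "alcohol": rng.normal(10.5, 1.0, size=1000),
    "quality": rng.choice([4, 5, 6, 7], size=1000, p=[0.1, 0.4, 0.35, 0.15]),
})

toy_train, toy_test = train_test_split(
    toy, test_size=0.30, random_state=42, stratify=toy["quality"]
)

# Share of each quality score in the two parts of the split.
proportions = pd.DataFrame({
    "train": toy_train["quality"].value_counts(normalize=True),
    "test": toy_test["quality"].value_counts(normalize=True),
}).sort_index()

# Largest absolute difference in class share between train and test.
max_gap = (proportions["train"] - proportions["test"]).abs().max()
print(proportions.round(3))
print(f"largest gap in class share: {max_gap:.4f}")
```

Running the same comparison on `train["quality"]` and `test["quality"]` is a simple way to confirm that the real split preserved the distribution of quality scores.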
