# White Wine Quality Dataset Analysis

This notebook focuses on analyzing the **White Wine Quality** dataset.

We will use the following tools and libraries:
- **`pandas`** for data manipulation and exploratory data analysis.
- **`matplotlib`** and **`seaborn`** for creating visualizations to understand the data better.
- **`scikit-learn`** for preprocessing, splitting the dataset into training and testing subsets, and building machine learning models.


### Step 0: Install Necessary Libraries

Before running the notebook, make sure the required Python libraries are installed:
- **ucimlrepo**: Fetches datasets from the UCI Machine Learning Repository.
- **pandas**: Handles data manipulation and analysis.
- **matplotlib** and **seaborn**: Create visualizations.
- **scikit-learn**: Provides tools for splitting datasets and building machine learning models.
- **graphviz**: Renders the exported decision trees; it also requires the Graphviz system binaries.

Run the following code to install them if they are not already installed.


In [8]:
# Install required libraries
# The graphviz Python package also needs the Graphviz binaries (from graphviz.org) installed and on PATH
%pip install ucimlrepo pandas matplotlib seaborn scikit-learn graphviz

Note: you may need to restart the kernel to use updated packages.
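
To confirm the installation, a quick check along the following lines may help. It is only a sketch: it assumes the import names listed in `required` (note that scikit-learn is imported as `sklearn`) and that the Graphviz executable is named `dot`.

```python
import importlib.util
import shutil

# Import names for the packages installed above (scikit-learn imports as "sklearn")
required = ["ucimlrepo", "pandas", "matplotlib", "seaborn", "sklearn", "graphviz"]

# find_spec returns None when a top-level package cannot be located
missing = [name for name in required if importlib.util.find_spec(name) is None]

# shutil.which returns None if the Graphviz binaries are not on PATH
dot_path = shutil.which("dot")

print("Missing Python packages:", missing or "none")
print("Graphviz 'dot' executable:", dot_path or "not found on PATH")
```

If `dot` is not found, the DOT export still works, but rendering the tree to an image will fail.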


### Step 1: Import Libraries

We begin by importing the necessary libraries for data loading, manipulation, and visualization.

In [154]:
# Import libraries for data manipulation and visualization
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import graphviz

# Import libraries for machine learning and data splitting
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


## Data Preprocessing

#### Dataset Metadata

We will load the `winequality.names` file to gain detailed information about the dataset.


In [158]:
# Load and display the winequality.names file
names_file_path = "winequality.names"  # Update with your file path
with open(names_file_path, "r") as f:
    wine_names_content = f.read()

# Display the content
print(wine_names_content)


Citation Request:
  This dataset is public available for research. The details are described in [Cortez et al., 2009]. 
  Please include this citation if you plan to use this database:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

  Available at: [@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
                [Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
                [bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib

1. Title: Wine Quality 

2. Sources
   Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV) @ 2009
   
3. Past Usage:

  P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. 
  Modeling wine preferences by data mining from physicochemical properties.
  In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 016

Next, we load the **White Wine Quality** dataset with `pandas` and preview the first rows to understand its structure.

In [161]:
# Load the White Wine Quality dataset
file_path = "winequality-white.csv"  # Update with your dataset path
data = pd.read_csv(file_path, sep=";")  # Dataset uses semicolon as the delimiter

# Display the first few rows of the dataset
data.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 | 6 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 | 6 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 | 6 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 | 6 |


#### Extract Features and Labels

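After loading and previewing the dataset, we will extract:
- **Features**: The numerical columns describing the physicochemical properties of each wine.
- **Labels**: The target column, `quality`, grouped into three categories: Low (score ≤ 4), Standard (5 to 6), and High (≥ 7).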

In [164]:
# Extract features
features = data.drop(columns=['quality'])

# Group labels into 3 categories: Low, Standard, High
labels = data['quality'].apply(lambda x: 
    'Low Quality' if x <= 4 else 
    'Standard Quality' if x <= 6 else 
    'High Quality'
)
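
The same grouping can also be written with `pd.cut`, which makes the bin edges explicit. The sketch below is an alternative to the lambda above, not the notebook's method, and it uses a small hypothetical series of scores rather than the real dataset.

```python
import pandas as pd

# Hypothetical sample of raw quality scores (not taken from the dataset)
quality = pd.Series([3, 4, 5, 6, 7, 9], name="quality")

# Right-inclusive bins: (-inf, 4] -> Low, (4, 6] -> Standard, (6, inf) -> High
binned = pd.cut(
    quality,
    bins=[float("-inf"), 4, 6, float("inf")],
    labels=["Low Quality", "Standard Quality", "High Quality"],
)

# Count how many scores fall into each category
label_counts = binned.value_counts()
print(label_counts)
```

Unlike the lambda, `pd.cut` returns an ordered categorical, so convert it with `.astype(str)` if plain string labels are needed.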


Preview the head of `features`:

In [167]:
features.head()

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.0 | 0.27 | 0.36 | 20.7 | 0.045 | 45.0 | 170.0 | 1.001 | 3.0 | 0.45 | 8.8 |
| 1 | 6.3 | 0.3 | 0.34 | 1.6 | 0.049 | 14.0 | 132.0 | 0.994 | 3.3 | 0.49 | 9.5 |
| 2 | 8.1 | 0.28 | 0.4 | 6.9 | 0.05 | 30.0 | 97.0 | 0.9951 | 3.26 | 0.44 | 10.1 |
| 3 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 |
| 4 | 7.2 | 0.23 | 0.32 | 8.5 | 0.058 | 47.0 | 186.0 | 0.9956 | 3.19 | 0.4 | 9.9 |


Preview the head of `labels`:

In [170]:
labels.to_frame().head()

|   | quality |
|---|---|
| 0 | Standard Quality |
| 1 | Standard Quality |
| 2 | Standard Quality |
| 3 | Standard Quality |
| 4 | Standard Quality |


## 2.1. Prepare the dataset for training

We will split the dataset into training and testing subsets with the following proportions:
- **40% Training / 60% Testing**
- **60% Training / 40% Testing**
- **80% Training / 20% Testing**
- **90% Training / 10% Testing**

In [173]:
# Define train-test proportions
train_sizes = [0.4, 0.6, 0.8, 0.9]

# Create a dictionary to store the splits
splits = {}

### Step 1: Perform Stratified Shuffle Split

Using the defined train-test proportions, we will:
1. Split the dataset into `feature_train`, `feature_test`, `label_train`, and `label_test` subsets.
2. Store the splits in a dictionary, where each key corresponds to the train-test proportion.


In [176]:
# Import StratifiedShuffleSplit to split the dataset into training and testing subsets
from sklearn.model_selection import StratifiedShuffleSplit


# Perform stratified shuffle split for each proportion
for train_size in train_sizes:
    sss = StratifiedShuffleSplit(n_splits=1, train_size=train_size, random_state=42)
    for train_index, test_index in sss.split(features, labels):
        feature_train = features.iloc[train_index]
        feature_test = features.iloc[test_index]
        label_train = labels.iloc[train_index]
        label_test = labels.iloc[test_index]
    
    # Store the splits in the dictionary
    splits[f'{int(train_size * 100)}/{int(round((1 - train_size) * 100))}'] = {
        'feature_train': feature_train,
        'feature_test': feature_test,
        'label_train': label_train,
        'label_test': label_test
    }


### Step 2: Confirm Splits

We will now verify that the splits were created correctly by checking the keys of the dictionary where the splits are stored; there should be one key per train-test proportion.

In [179]:
# Display the keys of the splits dictionary 
splits.keys()

dict_keys(['40/60', '60/40', '80/20', '90/10'])
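
Beyond checking the keys, it can be useful to confirm that stratification preserved the class proportions in both subsets. The sketch below illustrates the idea on synthetic, imbalanced labels (hypothetical data, not the wine dataset); the same `value_counts(normalize=True)` comparison could be applied to each entry of `splits`.

```python
import pandas as pd
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic, imbalanced labels standing in for the grouped quality classes
toy_labels = pd.Series(
    ["Standard Quality"] * 70 + ["High Quality"] * 20 + ["Low Quality"] * 10
)
toy_features = pd.DataFrame({"x": range(len(toy_labels))})

# One stratified 80/20 split, mirroring the settings used above
sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(sss.split(toy_features, toy_labels))

# Compare the share of each class in the two subsets
train_share = toy_labels.iloc[train_idx].value_counts(normalize=True)
test_share = toy_labels.iloc[test_idx].value_counts(normalize=True)
comparison = pd.DataFrame({"train": train_share, "test": test_share})
print(comparison)
```

With a stratified split, the two columns should be close to each other for every class.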

## 2.2. Build the decision tree classifiers

We will now train a **Decision Tree Classifier** on the **training data** of each split.

### Parameters:
- **criterion='entropy'**: Splits are chosen by information gain (entropy).
- **random_state=42**: Ensures reproducibility of the training process.
- **max_depth=5**: Limits the depth of each tree, as set in the training code below.

This classifier will learn to distinguish low-, standard-, and high-quality wines based on the training data.
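
To make the entropy criterion concrete, here is a minimal sketch of how entropy and information gain are computed for a single candidate split. The helper names `entropy` and `information_gain` and the toy labels are illustrative assumptions; scikit-learn performs an equivalent calculation internally.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy of the parent minus the size-weighted entropy of the two children."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted

# Hypothetical node: 4 "Standard" and 4 "High" wines, separated perfectly by the split
parent = ["Standard"] * 4 + ["High"] * 4
gain = information_gain(parent, parent[:4], parent[4:])
print(entropy(parent), gain)
```

A split that separates the classes perfectly recovers all of the parent's entropy as gain, while a split that leaves both children as mixed as the parent yields a gain of zero.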

---

### Step 1: Training

We initialize a dictionary to store the classifiers and a list to store the shapes of the training and testing subsets for each proportion. The loop iterates over all splits, trains one classifier per split on `feature_train` and `label_train`, and records the shapes of `feature_train`, `feature_test`, `label_train`, and `label_test`. The shapes are then collected in a table for easier inspection.


In [182]:
from sklearn.tree import DecisionTreeClassifier

# Initialize a dictionary to store classifiers
classifiers = {}

# Initialize a list to store the shapes 
dataset_shapes = []

# Train Decision Tree Classifier for each proportion
for proportion, split in splits.items():  # `split` avoids overwriting the `data` DataFrame
    
    # Extract data
    feature_train = split['feature_train']
    feature_test = split['feature_test']
    label_train = split['label_train']
    label_test = split['label_test']
    
    # Initialize the Decision Tree Classifier
    clf = DecisionTreeClassifier(criterion='entropy', random_state=42, max_depth=5)
    
    # Train the classifier
    clf.fit(feature_train, label_train)
    
    # Store the classifier and split details
    classifiers[proportion] = {
        'classifier': clf,
        'feature_train': feature_train,
        'feature_test': feature_test,
        'label_train': label_train,
        'label_test': label_test
    }
    # Append details to the list
    dataset_shapes.append({
        'Proportion': proportion,
        'Feature Train Shape': feature_train.shape,
        'Feature Test Shape': feature_test.shape,
        'Label Train Shape': label_train.shape,
        'Label Test Shape': label_test.shape
    })


shapes_df = pd.DataFrame(dataset_shapes)

# Display the DataFrame
shapes_df


|   | Proportion | Feature Train Shape | Feature Test Shape | Label Train Shape | Label Test Shape |
|---|---|---|---|---|---|
| 0 | 40/60 | (1959, 11) | (2939, 11) | (1959,) | (2939,) |
| 1 | 60/40 | (2938, 11) | (1960, 11) | (2938,) | (1960,) |
| 2 | 80/20 | (3918, 11) | (980, 11) | (3918,) | (980,) |
| 3 | 90/10 | (4408, 11) | (490, 11) | (4408,) | (490,) |


### Step 2: Export the Decision Tree to Graphviz Format
We export the trained decision trees to Graphviz's DOT format, which allows us to visualize the tree structure.


In [185]:
from sklearn.tree import export_graphviz

# Initialize a dictionary to store DOT data for each proportion
dot_files = {}

# Loop through all trained classifiers to export decision trees
for proportion, entry in classifiers.items():  # `entry` avoids overwriting the `data` DataFrame
    # Extract the classifier and its training data
    clf = entry['classifier']
    feature_train = entry['feature_train']
    
    # Export the decision tree to DOT format
    dot_data = export_graphviz(
        clf,
        out_file=None,  # Do not write to a file
        feature_names=list(feature_train.columns),  # Feature names
        class_names=list(clf.classes_),  # Must follow the sorted order of clf.classes_, not Low/Standard/High
        filled=True,  # Fill nodes with colors
        rounded=True,  # Use rounded edges
        special_characters=True,  # Allow special characters
        max_depth=5  # Adjust the depth of the tree for visualization
    )
    
    # Store the DOT data
    dot_files[proportion] = dot_data
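
`export_graphviz` expects `class_names` in ascending order of the target classes. For string labels this is the alphabetical order stored in `clf.classes_`, not the Low/Standard/High order in which the labels were defined. The sketch below illustrates this on hypothetical toy data with a single feature:

```python
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Toy data (hypothetical): one feature, three string classes
X = [[8.0], [9.0], [10.0], [11.0], [12.0], [13.0]]
y = [
    "Low Quality", "Low Quality",
    "Standard Quality", "Standard Quality",
    "High Quality", "High Quality",
]

toy_clf = DecisionTreeClassifier(criterion="entropy", random_state=42).fit(X, y)

# classes_ is sorted, so it does not follow the Low/Standard/High order
print(toy_clf.classes_)

# Passing classes_ directly keeps node labels aligned with the model's class indices
toy_dot = export_graphviz(
    toy_clf,
    out_file=None,
    feature_names=["alcohol"],
    class_names=list(toy_clf.classes_),
)
```

Passing a hand-written list in a different order would not raise an error, but it would attach the wrong class name to each node.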


### Step 3: Visualize the Decision Tree Using Graphviz

The DOT data is converted to an image using Graphviz. This visualization provides a graphical representation of the decision tree:
- Splits based on feature thresholds.
- The entropy at each node.
- Sample counts, per-class counts, and the majority class at each node.

---

**Decision Tree of 40/60:**

In [188]:
# Render and display using Graphviz
from graphviz import Source  # `import graphviz` alone does not expose the name `Source`

graph = Source(dot_files["40/60"])
graph.render("decision_tree_graphviz", format="png", cleanup=True)
graph