Data Mining
Author: Nathaly Ingol
Toolkit: Orange 3
This project explores classification techniques using the White Wine Quality dataset (4,898 instances) from the UCI Machine Learning Repository.
The dataset contains:
- 11 numerical physicochemical features:
  - Fixed acidity
  - Volatile acidity
  - Citric acid
  - Residual sugar
  - Chlorides
  - Free sulfur dioxide
  - Total sulfur dioxide
  - Density
  - pH
  - Sulphates
  - Alcohol
- Target variable:
  - Wine quality score (3–9)
The analysis was performed using Orange 3, a visual drag-and-drop data mining toolkit built on Python and scikit-learn.
Time spent on this assignment: ~6–7 hours
- Loaded the dataset using the File widget.
- Used the Edit Domain widget to:
  - Convert `quality` from Numeric → Categorical
  - Set `quality` as the Target variable
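The same target conversion can be sketched in pandas as a stand-in for the Edit Domain widget (the tiny DataFrame below is hypothetical; the real table is loaded from the UCI CSV):

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the UCI CSV file.
df = pd.DataFrame({"alcohol": [9.0, 10.5, 12.1], "quality": [5, 6, 7]})

# Equivalent of Edit Domain's Numeric -> Categorical conversion of the target:
df["quality"] = df["quality"].astype("category")

assert str(df["quality"].dtype) == "category"
```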
- 4,898 instances
- No missing values
- Class imbalance present
  - Classes 5 and 6 account for ~74% of the data
  - Classes 3 and 9 are rare (20 and 5 samples, respectively)
Because of this imbalance, accuracy alone is not a reliable metric.
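The imbalance is easy to quantify. The per-class counts below are the figures commonly cited for this UCI dataset (the 20 and 5 samples for classes 3 and 9 match the numbers above; treat the rest as approximate):

```python
import pandas as pd

# Per-class counts for winequality-white (classes 3 and 9 match the
# 20 and 5 samples noted above; the other counts are the usual UCI figures).
counts = pd.Series({3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5})

assert counts.sum() == 4898
share_5_6 = counts.loc[[5, 6]].sum() / counts.sum()
print(f"Classes 5 and 6: {share_5_6:.1%} of all samples")  # ~74.6%
```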
A Decision Tree learner was connected to a Tree Viewer widget for visualization.
- The root split was on alcohol content
- Alcohol is the most important predictor of wine quality
- Unlimited tree depth → Overfitting
- Max depth = 5 → Better interpretability and generalization
The Tree Viewer clearly showed the branching logic of the model.
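The depth effect can be reproduced with scikit-learn, which Orange is built on (synthetic features stand in for the wine data; `max_depth=5` mirrors the setting above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features, like the wine data; flip_y adds
# label noise, so an unrestricted tree must memorize to reach purity.
X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           n_classes=3, n_clusters_per_class=1,
                           flip_y=0.1, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)          # unlimited depth
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

assert deep.get_depth() > shallow.get_depth()  # the unrestricted tree overfits
assert shallow.get_depth() <= 5
```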
Six models were evaluated using the Test & Score widget.
| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| Neural Network | 0.766 | 55.3% | 0.538 | 0.540 | 0.553 | 0.309 |
| CN2 Rule Induction | 0.736 | 61.1% | 0.611 | 0.610 | 0.611 | 0.423 |
| Decision Tree | 0.713 | 57.7% | 0.576 | 0.577 | 0.577 | 0.374 |
| Naive Bayes | 0.690 | 44.2% | 0.446 | 0.464 | 0.442 | 0.210 |
| SVM | 0.689 | 42.5% | 0.413 | 0.420 | 0.425 | 0.161 |
| kNN | 0.669 | 47.5% | 0.462 | 0.460 | 0.475 | 0.193 |
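Outside Orange, the same comparison can be sketched with scikit-learn's stratified 5-fold cross-validation (synthetic data stands in for the wine table, and only two of the six learners are shown):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; Test & Score ran 5-fold stratified CV on the real data.
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```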
- Neural Network
  - Highest AUC (0.766)
  - Best overall class discrimination
  - Statistically significantly better than the other models
- CN2 Rule Induction
  - Highest raw accuracy (61.1%)
  - Produces human-readable IF–THEN rules
  - Strong balance of interpretability and performance
- Decision Tree
  - Good performance (57.7% accuracy)
  - Fully interpretable
- SVM & kNN
  - Underperformed, likely due to the lack of feature normalization
  - Distance-based models are sensitive to feature scale
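A quick sketch of why scaling matters for kNN: in the synthetic example below, one uninformative feature is blown up to dominate the Euclidean distances, much as total sulfur dioxide (hundreds) dwarfs density (~1.0) in the raw wine table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps informative columns first, so the last column is noise.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           shuffle=False, random_state=0)
X[:, -1] *= 100.0  # an uninformative feature now dominates the distances

raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()

assert scaled > raw  # standardizing restores the informative features' influence
```

The `StandardScaler` pipeline plays the same role as Orange's Normalize widget placed before the learner.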
- Scatter plot: Alcohol vs. Volatile Acidity
  - Clusters loosely corresponded to:
    - Low quality
    - Medium quality
    - High quality
  - Significant overlap was observed
- Hierarchical clustering (average linkage method)
  - The dendrogram showed two main branches
  - The primary split was based on alcohol content
Unsupervised results reinforced findings from supervised learning.
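The clustering step can be sketched with SciPy's average-linkage implementation; the two hypothetical groups below are separated mainly along an alcohol-like axis, mirroring the two main dendrogram branches:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical 2-D stand-in: (alcohol, volatile acidity) for two wine groups.
rng = np.random.default_rng(0)
low = rng.normal(loc=[9.0, 0.35], scale=0.2, size=(20, 2))    # lower alcohol
high = rng.normal(loc=[12.5, 0.25], scale=0.2, size=(20, 2))  # higher alcohol
X = np.vstack([low, high])

Z = linkage(X, method="average")                 # same linkage as in Orange
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into the two main branches

assert len(set(labels.tolist())) == 2
assert len(set(labels[:20].tolist())) == 1  # low-alcohol points share one branch
```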
- Edit Domain confusion:
  - Without converting the target to categorical:
    - Regression metrics (MSE/RMSE) appeared
    - Naive Bayes raised the error “Categorical class variable expected”
- SVM training was slow on ~5,000 instances.
- Class imbalance made accuracy misleading.
- Alcohol content is consistently the strongest predictor.
- Orange’s visual workflow makes model comparison simple.
- CN2 Rule Induction performed surprisingly well.
- AUC is more reliable than accuracy for this dataset.
- Add a Normalize widget before SVM and kNN → likely to significantly improve their performance.
- Reduce class imbalance by binning quality:
  - Low: 3–4
  - Medium: 5–6
  - High: 7–9
- Add Random Forest to the comparison (ensemble methods typically outperform single trees).
- Use AUC or F1 score as the primary evaluation metric instead of accuracy.
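The quality binning suggested above is a one-liner with pandas `cut` (the thresholds follow the Low/Medium/High ranges listed earlier):

```python
import pandas as pd

# Map the 3-9 quality score onto three coarser classes: (2,4], (4,6], (6,9].
quality = pd.Series([3, 4, 5, 6, 7, 8, 9])
binned = pd.cut(quality, bins=[2, 4, 6, 9], labels=["Low", "Medium", "High"])

assert list(binned) == ["Low", "Low", "Medium", "Medium", "High", "High", "High"]
```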
- Orange 3
- UCI White Wine Quality Dataset
- 5-Fold Stratified Cross-Validation