
🍷 White Wine Quality Classification

Data Mining

Author: Nathaly Ingol

Toolkit: Orange 3


📌 Project Overview

This project explores classification techniques using the White Wine Quality dataset (4,898 instances) from the UCI Machine Learning Repository.

The dataset contains:

  • 11 numerical physicochemical features:

    • Fixed acidity
    • Volatile acidity
    • Citric acid
    • Residual sugar
    • Chlorides
    • Free sulfur dioxide
    • Total sulfur dioxide
    • Density
    • pH
    • Sulphates
    • Alcohol
  • Target variable:

    • Wine quality score (3–9)

The analysis was performed using Orange 3, a visual drag-and-drop data mining toolkit built on Python and scikit-learn.

Time spent on this assignment: ~6–7 hours


🧹 Dataset Preparation

  1. Loaded dataset using the File widget.

  2. Used the Edit Domain widget to:

    • Convert quality from Numeric → Categorical
    • Set quality as the Target variable

⚠️ This step was essential. Without converting the target to categorical, Orange treated the problem as regression and produced errors (e.g., Naive Bayes requires a categorical class).
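Outside Orange, the same preparation step can be sketched in pandas (Orange itself is built on Python). The rows below are illustrative stand-ins, not the real file, which is semicolon-separated and would be loaded with `pd.read_csv("winequality-white.csv", sep=";")`:

```python
import pandas as pd

# Illustrative rows in the shape of the UCI white wine file.
df = pd.DataFrame({
    "alcohol": [8.8, 9.5, 10.1],
    "volatile acidity": [0.27, 0.30, 0.22],
    "quality": [6, 6, 5],
})

# Equivalent of Orange's Edit Domain step: convert quality from numeric to
# categorical so downstream learners see a classification problem, not regression.
df["quality"] = df["quality"].astype("category")

assert str(df["quality"].dtype) == "category"
```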

📊 Dataset Characteristics

  • 4,898 instances

  • No missing values

  • Class imbalance present

    • Classes 5 and 6 account for ~74% of the data
    • Classes 3 and 9 are rare (20 and 5 samples, respectively)

Because of this imbalance, accuracy alone is not a reliable metric.
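A quick back-of-the-envelope check shows why. The per-class counts below are the commonly reported distribution for this dataset (treat them as approximate):

```python
# Approximate per-class counts for the white wine dataset (quality 3–9).
counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}
total = sum(counts.values())                      # 4898 instances

share_5_6 = (counts[5] + counts[6]) / total       # classes 5 and 6 dominate
majority_baseline = max(counts.values()) / total  # always predicting class 6

print(f"classes 5+6: {share_5_6:.1%}")                    # → 74.6%
print(f"majority-class accuracy: {majority_baseline:.1%}")  # → 44.9%
```

A model that learns nothing beyond "predict class 6" already scores ~45% accuracy, which is why AUC and F1 are more informative here.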


🌳 Decision Tree Analysis

A Decision Tree learner was connected to a Tree Viewer widget for visualization.

Key Findings:

  • The root split was on alcohol content
  • Alcohol is the most important predictor of wine quality
  • Unlimited tree depth → Overfitting
  • Max depth = 5 → Better interpretability and generalization

The Tree Viewer clearly showed the branching logic of the model.
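As a rough scikit-learn analogue of the depth cap (the real model was built in Orange; the data below is a synthetic stand-in for the 11 features):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the 11 physicochemical features and quality labels 3–9.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 11))
y = rng.integers(3, 10, size=500)

# Unlimited depth lets the tree memorize the training data; capping depth at 5
# trades a little training fit for interpretability and generalization.
tree = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
assert tree.get_depth() <= 5

# feature_importances_ ranks the predictors; in the report, alcohol dominated.
print(tree.feature_importances_)
```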


📈 Model Comparison (5-Fold Stratified Cross-Validation)

Six models were evaluated using the Test & Score widget.

| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| Neural Network | 0.766 | 55.3% | 0.538 | 0.540 | 0.553 | 0.309 |
| CN2 Rule Induction | 0.736 | 61.1% | 0.611 | 0.610 | 0.611 | 0.423 |
| Decision Tree | 0.713 | 57.7% | 0.576 | 0.577 | 0.577 | 0.374 |
| Naive Bayes | 0.690 | 44.2% | 0.446 | 0.464 | 0.442 | 0.210 |
| SVM | 0.689 | 42.5% | 0.413 | 0.420 | 0.425 | 0.161 |
| kNN | 0.669 | 47.5% | 0.462 | 0.460 | 0.475 | 0.193 |
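Orange's Test & Score widget corresponds to stratified k-fold cross-validation in scikit-learn. A minimal sketch of the same evaluation loop, with two of the six learners and synthetic stand-in data:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in; in the report X, y are the 11 features and the
# categorical quality target.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 11))
y = rng.integers(0, 3, size=300)

# Stratified folds keep each fold's class proportions close to the full set,
# which matters given the class imbalance.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [
    ("Decision Tree", DecisionTreeClassifier(max_depth=5, random_state=0)),
    ("kNN", KNeighborsClassifier()),
]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```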

🔎 Interpretation

  • Neural Network

    • Highest AUC (0.766)
    • Best overall class discrimination
    • Statistically significantly better than other models
  • CN2 Rule Induction

    • Highest raw accuracy (61.1%)
    • Produces human-readable IF–THEN rules
    • Strong balance of interpretability and performance
  • Decision Tree

    • Good performance (57.7%)
    • Fully interpretable
  • SVM & kNN

    • Underperformed
    • Likely due to lack of feature normalization
    • Distance-based models are sensitive to feature scale

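The scale-sensitivity explanation is easy to demonstrate: when one feature's range dwarfs another's, unscaled distances ignore the smaller feature. The sketch below (synthetic data, mimicking e.g. total sulfur dioxide in the hundreds vs. density near 1.0) is the scikit-learn analogue of putting a Normalize widget before kNN:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# Two features on wildly different scales; the label depends only on the
# small-scale one, so unscaled distance-based models can barely see it.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(120, 40, 400), rng.normal(1.0, 0.003, 400)])
y = (X[:, 1] > 1.0).astype(int)

raw = KNeighborsClassifier().fit(X[:300], y[:300]).score(X[300:], y[300:])
scaled = make_pipeline(StandardScaler(), KNeighborsClassifier()) \
    .fit(X[:300], y[:300]).score(X[300:], y[300:])

print(raw, scaled)  # the scaled pipeline should score noticeably higher
```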


🔬 Unsupervised Learning (Bonus)

k-Means Clustering (k = 3)

  • Scatter plot: Alcohol vs. Volatile Acidity

  • Clusters loosely corresponded to:

    • Low quality
    • Medium quality
    • High quality
  • Significant overlap observed
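A minimal sketch of the k = 3 step (stand-in data for the two plotted features; the real clustering ran in Orange):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for the two plotted features: alcohol and volatile acidity.
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(8, 14, 400), rng.uniform(0.1, 0.7, 400)])

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
assert len(set(labels)) == 3

# k-means always partitions the plane into 3 regions, even where quality
# levels overlap, which is why the clusters only loosely tracked quality.
```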

Hierarchical Clustering

  • Average linkage method
  • Dendrogram showed two main branches
  • Primary split based on alcohol content
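In scikit-learn terms, cutting the dendrogram at its two main branches corresponds to average-linkage agglomerative clustering with two clusters (stand-in data below):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Stand-in for the 11 wine features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 11))

# Average linkage, as in the report; n_clusters=2 is the cut at the
# dendrogram's two main branches.
labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
assert len(set(labels)) == 2
```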

Unsupervised results reinforced findings from supervised learning.


⚠️ Obstacles & Observations

Obstacles

  1. Edit Domain confusion:

    • Without converting the target to categorical:

      • Regression metrics (MSE/RMSE) appeared
      • Naive Bayes error: “Categorical class variable expected”
  2. SVM training was slow on ~5,000 instances.

  3. Class imbalance made accuracy misleading.


Key Observations

  • Alcohol content is consistently the strongest predictor.
  • Orange’s visual workflow makes model comparison simple.
  • CN2 Rule Induction performed surprisingly well.
  • AUC is more reliable than accuracy for this dataset.

🚀 Suggestions for Improvement

  1. Add a Normalize widget before SVM and kNN → Likely to significantly improve their performance.

  2. Reduce class imbalance by binning quality:

    • Low: 3–4
    • Medium: 5–6
    • High: 7–9
  3. Add Random Forest to the comparison (ensemble methods typically outperform single trees).

  4. Use AUC or F1 score as the primary evaluation metric instead of accuracy.
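The proposed binning can be written as a small helper (the label names are illustrative):

```python
def bin_quality(score: int) -> str:
    """Collapse the 3–9 quality scale into three broader classes."""
    if score <= 4:
        return "low"
    if score <= 6:
        return "medium"
    return "high"

# The dominant classes 5 and 6 both fall into "medium", so the three-class
# problem is still imbalanced, but the extremes are far less sparse.
print([bin_quality(q) for q in (3, 5, 7, 9)])  # → ['low', 'medium', 'high', 'high']
```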


📚 Tools Used

  • Orange 3
  • UCI White Wine Quality Dataset
  • 5-Fold Stratified Cross-Validation
