Data Mining
Author: Nathaly Ingol
Toolkit: Orange 3
This project explores classification techniques using the White Wine Quality dataset (4,898 instances) from the UCI Machine Learning Repository.
The dataset contains:
- 11 numerical physicochemical features:
  - Fixed acidity
  - Volatile acidity
  - Citric acid
  - Residual sugar
  - Chlorides
  - Free sulfur dioxide
  - Total sulfur dioxide
  - Density
  - pH
  - Sulphates
  - Alcohol
- Target variable:
  - Wine quality score (3–9)
The analysis was performed using Orange 3, a visual drag-and-drop data mining toolkit built on Python and scikit-learn.
Time spent on this assignment: ~6–7 hours
- Loaded the dataset using the File widget.
- Used the Edit Domain widget to:
  - Convert `quality` from Numeric → Categorical
  - Set `quality` as the Target variable
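The same target conversion can be sketched in pandas as a stand-in for the Edit Domain widget (the tiny DataFrame below is hypothetical; the real table is loaded from the UCI CSV):

```python
import pandas as pd

# Hypothetical stand-in rows; the real data comes from the UCI CSV file.
df = pd.DataFrame({"alcohol": [9.0, 10.5, 12.1], "quality": [5, 6, 7]})

# Equivalent of Edit Domain's Numeric -> Categorical conversion of the target:
df["quality"] = df["quality"].astype("category")

assert str(df["quality"].dtype) == "category"
```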
- 4,898 instances
- No missing values
- Class imbalance present
  - Classes 5 and 6 account for ~74% of the data
  - Classes 3 and 9 are rare (20 and 5 samples, respectively)
Because of this imbalance, accuracy alone is not a reliable metric.
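The imbalance is easy to quantify. The per-class counts below are the figures commonly cited for this UCI dataset (the 20 and 5 samples for classes 3 and 9 match the numbers above; treat the rest as approximate):

```python
import pandas as pd

# Per-class counts for winequality-white (classes 3 and 9 match the
# 20 and 5 samples noted above; the other counts are the usual UCI figures).
counts = pd.Series({3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5})

assert counts.sum() == 4898
share_5_6 = counts.loc[[5, 6]].sum() / counts.sum()
print(f"Classes 5 and 6: {share_5_6:.1%} of all samples")  # ~74.6%
```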
A Decision Tree learner was connected to a Tree Viewer widget for visualization.
- The root split was on alcohol content
- Alcohol is the most important predictor of wine quality
- Unlimited tree depth → Overfitting
- Max depth = 5 → Better interpretability and generalization
The Tree Viewer clearly showed the branching logic of the model.
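The depth effect can be reproduced with scikit-learn, which Orange is built on (synthetic features stand in for the wine data; `max_depth=5` mirrors the setting above):

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in with 11 features, like the wine data; flip_y adds
# label noise, so an unrestricted tree must memorize to reach purity.
X, y = make_classification(n_samples=500, n_features=11, n_informative=5,
                           n_classes=3, n_clusters_per_class=1,
                           flip_y=0.1, random_state=0)

deep = DecisionTreeClassifier(random_state=0).fit(X, y)          # unlimited depth
shallow = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)

assert deep.get_depth() > shallow.get_depth()  # the unrestricted tree overfits
assert shallow.get_depth() <= 5
```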
Six models were evaluated using the Test & Score widget.
| Model | AUC | Accuracy | F1 | Precision | Recall | MCC |
|---|---|---|---|---|---|---|
| Neural Network | 0.766 | 55.3% | 0.538 | 0.540 | 0.553 | 0.309 |
| CN2 Rule Induction | 0.736 | 61.1% | 0.611 | 0.610 | 0.611 | 0.423 |
| Decision Tree | 0.713 | 57.7% | 0.576 | 0.577 | 0.577 | 0.374 |
| Naive Bayes | 0.690 | 44.2% | 0.446 | 0.464 | 0.442 | 0.210 |
| SVM | 0.689 | 42.5% | 0.413 | 0.420 | 0.425 | 0.161 |
| kNN | 0.669 | 47.5% | 0.462 | 0.460 | 0.475 | 0.193 |
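Outside Orange, the same comparison can be sketched with scikit-learn's stratified 5-fold cross-validation (synthetic data stands in for the wine table, and only two of the six learners are shown):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in; Test & Score ran 5-fold stratified CV on the real data.
X, y = make_classification(n_samples=600, n_features=11, n_informative=6,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for name, model in [("Decision Tree", DecisionTreeClassifier(random_state=0)),
                    ("Naive Bayes", GaussianNB())]:
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```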
- Neural Network
  - Highest AUC (0.766)
  - Best overall class discrimination
  - Statistically significantly better than the other models
- CN2 Rule Induction
  - Highest raw accuracy (61.1%)
  - Produces human-readable IF–THEN rules
  - Strong balance of interpretability and performance
- Decision Tree
  - Good performance (57.7% accuracy)
  - Fully interpretable
- SVM & kNN
  - Underperformed, likely due to the lack of feature normalization
  - Distance-based models are sensitive to feature scale
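A quick sketch of why scaling matters for kNN: in the synthetic example below, one uninformative feature is blown up to dominate the Euclidean distances, much as total sulfur dioxide (hundreds) dwarfs density (~1.0) in the raw wine table:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# shuffle=False keeps informative columns first, so the last column is noise.
X, y = make_classification(n_samples=400, n_features=11, n_informative=5,
                           shuffle=False, random_state=0)
X[:, -1] *= 100.0  # an uninformative feature now dominates the distances

raw = cross_val_score(KNeighborsClassifier(), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), KNeighborsClassifier()), X, y, cv=5).mean()

assert scaled > raw  # standardizing restores the informative features' influence
```

The `StandardScaler` pipeline plays the same role as Orange's Normalize widget placed before the learner.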
- Scatter plot: Alcohol vs. Volatile Acidity
  - Clusters loosely corresponded to:
    - Low quality
    - Medium quality
    - High quality
  - Significant overlap was observed
- Hierarchical clustering (average linkage method)
  - The dendrogram showed two main branches
  - The primary split was based on alcohol content
Unsupervised results reinforced findings from supervised learning.
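The clustering step can be sketched with SciPy's average-linkage implementation; the two hypothetical groups below are separated mainly along an alcohol-like axis, mirroring the two main dendrogram branches:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

# Hypothetical 2-D stand-in: (alcohol, volatile acidity) for two wine groups.
rng = np.random.default_rng(0)
low = rng.normal(loc=[9.0, 0.35], scale=0.2, size=(20, 2))    # lower alcohol
high = rng.normal(loc=[12.5, 0.25], scale=0.2, size=(20, 2))  # higher alcohol
X = np.vstack([low, high])

Z = linkage(X, method="average")                 # same linkage as in Orange
labels = fcluster(Z, t=2, criterion="maxclust")  # cut into the two main branches

assert len(set(labels.tolist())) == 2
assert len(set(labels[:20].tolist())) == 1  # low-alcohol points share one branch
```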
- Edit Domain confusion:
  - Without converting the target to categorical:
    - Regression metrics (MSE/RMSE) appeared
    - Naive Bayes raised the error “Categorical class variable expected”
- SVM training was slow on ~5,000 instances.
- Class imbalance made accuracy misleading.
- Alcohol content is consistently the strongest predictor.
- Orange’s visual workflow makes model comparison simple.
- CN2 Rule Induction performed surprisingly well.
- AUC is more reliable than accuracy for this dataset.
- Add a Normalize widget before SVM and kNN → likely to significantly improve their performance.
- Reduce class imbalance by binning quality:
  - Low: 3–4
  - Medium: 5–6
  - High: 7–9
- Add Random Forest to the comparison (ensemble methods typically outperform single trees).
- Use AUC or F1 score as the primary evaluation metric instead of accuracy.
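The quality binning suggested above is a one-liner with pandas `cut` (the thresholds follow the Low/Medium/High ranges listed earlier):

```python
import pandas as pd

# Map the 3-9 quality score onto three coarser classes: (2,4], (4,6], (6,9].
quality = pd.Series([3, 4, 5, 6, 7, 8, 9])
binned = pd.cut(quality, bins=[2, 4, 6, 9], labels=["Low", "Medium", "High"])

assert list(binned) == ["Low", "Low", "Medium", "Medium", "High", "High", "High"]
```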
- Orange 3
- UCI White Wine Quality Dataset
- 5-Fold Stratified Cross-Validation