# Random Forest Classification on UCI Wine Dataset

This notebook demonstrates:

1. Loading the UCI Wine dataset (via `sklearn`).
2. Evaluating a Random Forest classifier using K-Fold Cross-Validation.
3. Performing hyperparameter tuning with `GridSearchCV`.
4. Comparing pre- and post-tuning performance and reflecting on results.

Dataset source: UCI ML Repository — Wine dataset (loaded through `sklearn.datasets.load_wine`).

In [None]:
# Imports and data loading
import numpy as np
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split, KFold, cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Load dataset
data = load_wine(as_frame=True)
X = data.data
y = data.target
df = data.frame

print('Features shape:', X.shape)
print('Target distribution:\n', y.value_counts())

In [None]:
# Baseline Random Forest with K-Fold Cross-Validation
kf = KFold(n_splits=5, shuffle=True, random_state=42)
rf = RandomForestClassifier(random_state=42, n_jobs=-1)

cv_scores = cross_val_score(rf, X, y, cv=kf, scoring='accuracy', n_jobs=-1)
print('CV Accuracy scores:', np.round(cv_scores, 4))
print('Mean CV Accuracy: {:.4f}'.format(cv_scores.mean()))
print('Std CV Accuracy: {:.4f}'.format(cv_scores.std()))
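
Plain `KFold` does not keep class proportions equal across folds, and accuracy is only one view of performance. As a minimal sketch, assuming the same `load_wine` data, the baseline could also be scored with `StratifiedKFold` and two metrics at once through `cross_validate`. The names `skf`, `rf_baseline`, and `scores` are illustrative and do not appear in the cells above.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

# Reload the data so the sketch runs on its own
X, y = load_wine(return_X_y=True, as_frame=True)

# Stratified folds preserve the class ratio in every split
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
rf_baseline = RandomForestClassifier(random_state=42)

# Score each fold on accuracy and macro-averaged F1
scores = cross_validate(rf_baseline, X, y, cv=skf, scoring=['accuracy', 'f1_macro'])

for metric in ('accuracy', 'f1_macro'):
    fold_scores = scores[f'test_{metric}']
    print(f'{metric}: mean={fold_scores.mean():.4f}, std={fold_scores.std():.4f}')
```

If the two metrics diverge noticeably, that would suggest one class is being handled worse than the others.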

In [None]:
# Train-test split for detailed metrics (baseline)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)

print('Test Accuracy (baseline): {:.4f}'.format(accuracy_score(y_test, y_pred)))
print('\nClassification Report (baseline):\n', classification_report(y_test, y_pred))
cm = confusion_matrix(y_test, y_pred)
print('\nConfusion Matrix:\n', cm)

# Plot confusion matrix
plt.figure(figsize=(6,5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', xticklabels=data.target_names, yticklabels=data.target_names)
plt.title('Baseline Random Forest Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# Hyperparameter tuning using GridSearchCV
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': [2, 4, 6],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42, n_jobs=-1),
                    param_grid,
                    cv=5,
                    scoring='accuracy',
                    n_jobs=-1,
                    verbose=1)

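# Fit the search on the training split only, so the hold-out test set stays unseen during tuning
grid.fit(X_train, y_train)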
print('Best params:', grid.best_params_)
print('Best CV score:', grid.best_score_)
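
The grid above contains 3 x 4 x 3 x 3 x 2 = 216 parameter combinations, which means 1,080 model fits under 5-fold cross-validation. A hedged alternative is to sample a fixed number of candidates with `RandomizedSearchCV`. In the sketch below, the budget of `n_iter=10` and the sampling ranges are assumptions chosen to mirror the grid, and `X_tr`, `X_te`, `param_dist`, and `rand_search` are illustrative names.

```python
from scipy.stats import randint
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, train_test_split

# Reload and split so the sketch runs on its own
X_all, y_all = load_wine(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_all, y_all, test_size=0.2, random_state=42, stratify=y_all
)

# Distributions (assumed ranges) covering the same region as the grid
param_dist = {
    'n_estimators': randint(50, 201),      # integers 50..200
    'max_depth': [None, 5, 10, 20],
    'min_samples_split': randint(2, 7),    # integers 2..6
    'min_samples_leaf': randint(1, 5),     # integers 1..4
    'bootstrap': [True, False],
}

rand_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_dist,
    n_iter=10,          # assumed budget: 10 sampled candidates
    cv=5,
    scoring='accuracy',
    random_state=42,
)

# Search on the training split only; the test split stays untouched
rand_search.fit(X_tr, y_tr)
print('Best params (randomized):', rand_search.best_params_)
print('Best CV score (randomized): {:.4f}'.format(rand_search.best_score_))
```

With this budget the search runs 50 fits instead of 1,080, at the cost of possibly missing the best grid point.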

In [None]:
# Evaluate the best estimator on the hold-out test set
best_rf = grid.best_estimator_

# Refit on the training split only, so no test rows are used in the final fit
best_rf.fit(X_train, y_train)
y_pred_tuned = best_rf.predict(X_test)

print('Test Accuracy (tuned): {:.4f}'.format(accuracy_score(y_test, y_pred_tuned)))
print('\nClassification Report (tuned):\n', classification_report(y_test, y_pred_tuned))
cm2 = confusion_matrix(y_test, y_pred_tuned)
print('\nConfusion Matrix (tuned):\n', cm2)

plt.figure(figsize=(6,5))
sns.heatmap(cm2, annot=True, fmt='d', cmap='Greens', xticklabels=data.target_names, yticklabels=data.target_names)
plt.title('Tuned Random Forest Confusion Matrix')
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.show()

In [None]:
# Compare pre- and post-tuning
baseline_acc = accuracy_score(y_test, y_pred)
tuned_acc = accuracy_score(y_test, y_pred_tuned)

comparison = pd.DataFrame({
    'Metric': ['Test Accuracy'],
    'Baseline': [baseline_acc],
    'Tuned': [tuned_acc],
    'Absolute Improvement': [tuned_acc - baseline_acc]
})
comparison.style.format({'Baseline':'{:.4f}','Tuned':'{:.4f}','Absolute Improvement':'{:.4f}'})

## Reflection

- **What I did:** Loaded the UCI Wine dataset, trained a Random Forest classifier, and evaluated it with 5-fold cross-validation to check how stable accuracy is across folds. I then ran `GridSearchCV` to search for better hyperparameters.

- **Pre-tuning performance:** The baseline Random Forest (default hyperparameters) is scored two ways: the mean and standard deviation of 5-fold cross-validation accuracy, and accuracy on a 20% stratified hold-out test set. Both values are printed by the cells above.

- **Post-tuning performance:** `GridSearchCV` reports the best parameter combination and its cross-validated score. The tuned model is then scored on the hold-out test set, where its accuracy may or may not exceed the baseline. The confusion matrices and classification reports for both models show which classes improved or declined.

- **How tuning impacted performance:** `GridSearchCV` evaluates every combination in the grid and can improve generalization by finding a better bias-variance tradeoff (e.g., by adjusting tree depth, number of trees, and leaf sizes). A positive absolute improvement in test accuracy suggests tuning helped, but the test set holds only 36 samples, so a single changed prediction moves accuracy by about 2.8 percentage points. The comparison is also only meaningful when the test rows play no part in the search. If the improvement is small or zero, the defaults may already suit this dataset, or a broader search (for example with `RandomizedSearchCV`) may be needed.

- **Next steps / suggestions:**
  1. Try `RandomizedSearchCV` for larger parameter spaces to save time.
  2. Evaluate with other metrics (F1 macro/weighted) if class imbalance is a concern.
  3. Consider using `XGBoost` or `LightGBM` (often stronger on tabular data) if you can install them.
  4. Use a nested cross-validation loop if you want an unbiased estimate of generalization performance after tuning.
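
The nested cross-validation suggestion can be sketched as follows. This is a minimal illustration, not the notebook's own method: the reduced `small_grid`, the fold counts, and the names `inner_cv`, `outer_cv`, `inner_search`, and `nested_scores` are all assumptions chosen to keep the run short.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Reload the data so the sketch runs on its own
X_nested, y_nested = load_wine(return_X_y=True, as_frame=True)

# Deliberately small grid (assumption) to keep runtime low
small_grid = {'n_estimators': [50, 100], 'max_depth': [None, 5]}

# Inner loop selects hyperparameters; outer loop estimates generalization
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

inner_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    small_grid,
    cv=inner_cv,
    scoring='accuracy',
)

# Each outer fold reruns the whole search on its training part,
# then scores the selected model on rows the search never saw
nested_scores = cross_val_score(
    inner_search, X_nested, y_nested, cv=outer_cv, scoring='accuracy'
)
print('Nested CV accuracy per outer fold:', nested_scores.round(4))
print('Mean nested CV accuracy: {:.4f}'.format(nested_scores.mean()))
```

Because model selection happens inside every outer fold, the mean of `nested_scores` estimates the performance of the whole tuning procedure rather than of one fixed parameter set.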

