# Wine Classification — Random Forest + K-Fold CV + GridSearchCV

This notebook covers: dataset loading, preprocessing, K-Fold cross-validation, hyperparameter tuning with `GridSearchCV`, model comparison, saving the best model, and notes for deployment in Streamlit.

**Files included in the project:**
- `app.py` — Streamlit app for real-time predictions
- `train_model.py` — script to train and save the model
- `model.joblib` — trained model (created when running the notebook)
- `requirements.txt`, `README.md`, `medium_article.md`, `reflection.md`

---

In [None]:
# Imports
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score
import joblib
print("Libraries imported.")

In [None]:
# Load dataset (uploaded as wine_dataset.csv)
df = pd.read_csv('/mnt/data/wine_dataset.csv')
print('Shape:', df.shape)
df.head()

In [None]:
# Basic preprocessing
# Use 'target', 'class', or 'Type' as the label column if present; otherwise assume the last column is the target.
if 'target' in df.columns:
    y = df['target']
elif 'class' in df.columns:
    y = df['class']
elif 'Type' in df.columns:
    y = df['Type']
else:
    y = df.iloc[:, -1]

X = df.drop(y.name, axis=1)
X = X.select_dtypes(include=[np.number])  # keep numeric features
print('Features shape:', X.shape)
print('Target distribution:\n', y.value_counts())

In [None]:
# Train/test split (for final holdout evaluation)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)
print(X_train.shape, X_test.shape)

## 1) Baseline Random Forest with 5-Fold Cross-Validation

In [None]:
# Baseline model and CV
rf = RandomForestClassifier(random_state=42)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(rf, X_train, y_train, cv=cv, scoring='accuracy')
print('CV Accuracy scores:', scores)
print('Mean CV accuracy: {:.4f} ± {:.4f}'.format(scores.mean(), scores.std()))

## 2) Hyperparameter Tuning with GridSearchCV

In [None]:
# GridSearchCV for RandomForest (small grid for speed; extend as needed)
param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2]
}

grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=cv, scoring='accuracy', n_jobs=-1, verbose=1)
grid.fit(X_train, y_train)
print('Best params:', grid.best_params_)
print('Best CV score:', grid.best_score_)
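
As a rough check on search cost, the number of candidates in a grid like the one above can be counted with `ParameterGrid` before fitting. The sketch below repeats the same grid values and assumes the 5-fold splitter used in this notebook; `demo_param_grid`, `n_folds`, `n_candidates`, and `total_fits` are illustrative names, not part of the project files.

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as in the tuning cell, repeated so the sketch is self-contained.
demo_param_grid = {
    'n_estimators': [50, 100, 200],
    'max_depth': [None, 5, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
}

n_folds = 5  # assumption: matches StratifiedKFold(n_splits=5)
n_candidates = len(ParameterGrid(demo_param_grid))  # every combination of the listed values
total_fits = n_candidates * n_folds

print(f"{n_candidates} candidates x {n_folds} folds = {total_fits} fits")
```

With the default `refit=True`, `GridSearchCV` also performs one additional fit of the best candidate on the full training set after the search.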

In [None]:
# Evaluate best estimator on holdout test set
best_rf = grid.best_estimator_
y_pred = best_rf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred, average='weighted')
print('Test Accuracy: {:.4f}'.format(acc))
print('Test F1 (weighted): {:.4f}'.format(f1))
print('\nClassification report:\n', classification_report(y_test, y_pred))

In [None]:
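# Save the trained model (create the output directory first if it is missing)
import os
model_path = '/mnt/data/wine_streamlit_project/model.joblib'
os.makedirs(os.path.dirname(model_path), exist_ok=True)
joblib.dump(best_rf, model_path)
print('Saved model to', model_path)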
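
Before relying on a saved file, it can help to confirm that the model survives a save/load round trip. The following is a minimal sketch, not the project's own code: it assumes scikit-learn's bundled wine data as a stand-in for `wine_dataset.csv`, writes to a temporary directory rather than the project path, and uses illustrative names (`demo_model`, `reloaded`).

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled wine data stands in for wine_dataset.csv.
X_demo, y_demo = load_wine(return_X_y=True, as_frame=True)
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Dump to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "model.joblib")
    joblib.dump(demo_model, demo_path)
    reloaded = joblib.load(demo_path)

# The reloaded model should reproduce the original predictions exactly.
same_predictions = np.array_equal(demo_model.predict(X_demo), reloaded.predict(X_demo))
print("Reloaded model reproduces predictions:", same_predictions)
print("Number of features the model expects:", reloaded.n_features_in_)
```

A file written by `joblib.dump` is generally only safe to load with compatible scikit-learn versions, so pinning the version in `requirements.txt` is a reasonable precaution.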

## 3) Short comparison: Pre-tuning vs Post-tuning (based on CV)

In [None]:
# Re-evaluate the baseline RF with the same CV splits for comparison.
# Note: grid.best_score_ is the maximum over all grid candidates, so it is an optimistic estimate.
base_rf = RandomForestClassifier(random_state=42)
base_scores = cross_val_score(base_rf, X_train, y_train, cv=cv, scoring='accuracy')
print('Baseline mean CV accuracy: {:.4f} ± {:.4f}'.format(base_scores.mean(), base_scores.std()))
print('Tuned (GridSearch) CV accuracy: {:.4f} (from grid.best_score_)'.format(grid.best_score_))
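
One caveat when comparing on CV alone: if the default hyperparameters are themselves one of the grid candidates, the tuned CV score cannot come out below the baseline when the splits and seeds match. The sketch below illustrates this on a reduced grid. It assumes scikit-learn's bundled wine data and a version where the default `n_estimators` is 100; the names `small_grid`, `demo_grid`, and `default_params` are illustrative.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Assumption: the bundled wine data stands in for wine_dataset.csv.
X_demo, y_demo = load_wine(return_X_y=True)
demo_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Baseline: default hyperparameters, scored with the same splitter.
baseline_mean = cross_val_score(
    RandomForestClassifier(random_state=42), X_demo, y_demo, cv=demo_cv, scoring="accuracy"
).mean()

# Reduced grid (for speed) that still contains the default settings.
small_grid = {"n_estimators": [50, 100], "max_depth": [None, 5]}
demo_grid = GridSearchCV(
    RandomForestClassifier(random_state=42), small_grid, cv=demo_cv, scoring="accuracy"
)
demo_grid.fit(X_demo, y_demo)

# Look up the candidate that matches the defaults.
default_params = {"max_depth": None, "n_estimators": 100}
idx = demo_grid.cv_results_["params"].index(default_params)
default_in_grid = demo_grid.cv_results_["mean_test_score"][idx]

print("Baseline mean CV accuracy:  {:.4f}".format(baseline_mean))
print("Default candidate in grid:  {:.4f}".format(default_in_grid))
print("Best candidate (best_score_): {:.4f}".format(demo_grid.best_score_))
```

Because the best score is a maximum over candidates that include the baseline, a held-out test set gives the less biased comparison.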

## 4) Notes for Streamlit deployment

The project's `app.py` loads the saved `model.joblib`. Run the app with:

```
streamlit run app.py
```

The app displays one slider per numeric feature and shows the predicted class and the class probabilities in real time.
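
The prediction step behind such sliders can be sketched without Streamlit. This is an assumption about how an app of this kind might work, not a copy of `app.py`: it uses scikit-learn's bundled wine data in place of `wine_dataset.csv`, derives slider bounds from the training data, and uses illustrative names (`slider_specs`, `inputs`, `row`).

```python
import pandas as pd
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Assumption: the bundled wine data stands in for wine_dataset.csv.
wine = load_wine(as_frame=True)
X_demo, y_demo = wine.data, wine.target
demo_model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_demo, y_demo)

# Slider bounds and default value (min, max, mean) for each numeric feature.
slider_specs = {
    name: (float(X_demo[name].min()), float(X_demo[name].max()), float(X_demo[name].mean()))
    for name in X_demo.columns
}

# In a Streamlit app each value would come from a widget, for example:
#   st.slider(name, min_value=lo, max_value=hi, value=default)
# Here the defaults are used directly so the sketch runs without Streamlit.
inputs = {name: default for name, (lo, hi, default) in slider_specs.items()}

# Build a single-row frame with the same column order used for training.
row = pd.DataFrame([inputs], columns=X_demo.columns)
predicted_class = demo_model.predict(row)[0]
probabilities = demo_model.predict_proba(row)[0]

print("Predicted class:", predicted_class)
for cls, prob in zip(demo_model.classes_, probabilities):
    print(f"  P(class={cls}) = {prob:.3f}")
```

Keeping the column order identical to the training frame matters, since the model matches inputs to features by position and, for DataFrame inputs, checks the names as well.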