# Practice Assignment: Wine Classification (Starter Notebook)

This notebook is a **starter template** with **TODOs**.  
Complete each section and keep your answers in markdown cells.

**Dataset:** `sklearn.datasets.load_wine`  
**Goal:** Predict the wine class (3 classes) from 13 numeric features.

---

## Rules
- Do **not** change the dataset.
- Use a fixed random seed where requested.
- Write short answers where asked.
- Keep your code clean and reproducible.


In [1]:
# TODO: Run this cell.
import numpy as np
import pandas as pd

from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, f1_score

# Models you may use (you will pick at least 3 in Part 5)
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

import matplotlib.pyplot as plt


## Part 0 — Load the dataset

**TODOs**
1. Load the dataset using `load_wine()`
2. Create a DataFrame `df` containing the features
3. Create `X` (features) and `y` (target)
4. Print:
   - dataset shape
   - feature names
   - class distribution


In [2]:
# TODO: Load wine dataset
wine = load_wine()

# TODO: Create DataFrame with features
# Hint: wine.data is a (n_samples, n_features) numpy array
# Hint: wine.feature_names is the list of column names
df = pd.DataFrame(wine.data, columns=wine.feature_names)

# TODO: Create X and y
X = df
y = wine.target  # 0,1,2 correspond to classes

# TODO: Print dataset shape
print("X shape:", X.shape)
print("y shape:", np.shape(y))

# TODO: Print feature names
print("Features:", list(X.columns))

# TODO: Print class distribution
# Hint: np.bincount(y)
print("Class counts:", np.bincount(y))


X shape: (178, 13)
y shape: (178,)
Features: ['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
Class counts: [59 71 48]
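
As a quick sanity check on the counts printed above, the class proportions and the accuracy of always predicting the largest class can be worked out by hand. This is a minimal sketch; the names `class_counts`, `proportions`, and `majority_baseline` are illustrative and not part of the assignment.

```python
# Class counts copied from the output above (classes 0, 1, 2).
class_counts = [59, 71, 48]
total = sum(class_counts)

# Share of each class in the full dataset.
proportions = [count / total for count in class_counts]

# Accuracy of a trivial model that always predicts the most frequent class.
majority_baseline = max(class_counts) / total

for cls, share in enumerate(proportions):
    print(f"class {cls}: {share:.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```

Any model trained later should beat the majority-class figure to be worth keeping.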


In [3]:
df.head()

,alcohol,malic_acid,ash,alcalinity_of_ash,magnesium,total_phenols,flavanoids,nonflavanoid_phenols,proanthocyanins,color_intensity,hue,od280/od315_of_diluted_wines,proline
0,14.23,1.71,2.43,15.6,127.0,2.8,3.06,0.28,2.29,5.64,1.04,3.92,1065.0
1,13.2,1.78,2.14,11.2,100.0,2.65,2.76,0.26,1.28,4.38,1.05,3.4,1050.0
2,13.16,2.36,2.67,18.6,101.0,2.8,3.24,0.3,2.81,5.68,1.03,3.17,1185.0
3,14.37,1.95,2.5,16.8,113.0,3.85,3.49,0.24,2.18,7.8,0.86,3.45,1480.0
4,13.24,2.59,2.87,21.0,118.0,2.8,2.69,0.39,1.82,4.32,1.04,2.93,735.0


In [4]:
# Inspect the class names
wine.target_names
# TODO: find out what these classes actually represent


array(['class_0', 'class_1', 'class_2'], dtype='<U7')

## Part 1 — Quick EDA (minimal but required)

**TODOs**
1. Check for missing values
2. Show basic descriptive statistics (mean, std, min, max)
3. Plot at least **one** feature distribution grouped by class

**Answer in markdown:**
1. Are all features on similar numeric scales?
2. Do any features appear clearly class-separating?


In [None]:
# TODO: Missing values check
print("Missing values per column:")
print(df.isna().sum())

# TODO: Basic descriptive statistics
# Hint: df.describe().T
display(df.describe().T)


In [None]:
# TODO: Plot one feature distribution grouped by class
# Pick ONE feature name from df.columns, e.g. 'alcohol' or 'color_intensity'
feature = "alcohol"  # TODO: change if you want

plt.figure()
for cls in np.unique(y):
    plt.hist(df.loc[y == cls, feature], alpha=0.5, bins=15, label=f"class {cls}")
plt.xlabel(feature)
plt.ylabel("count")
plt.title(f"Distribution of {feature} by class")
plt.legend()
plt.show()


**Your answers (TODO):**

1. Feature scales:  
   - TODO

2. Clear class separation:  
   - TODO


## Part 2 — Train/test split

Use an **80/20** split with `random_state=42`, stratified on `y` so that both splits keep the class proportions.

**TODOs**
- Create `X_train, X_test, y_train, y_test`
- Print their shapes


In [None]:
# TODO: Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("X_train:", X_train.shape)
print("X_test :", X_test.shape)
print("y_train:", y_train.shape)
print("y_test :", y_test.shape)
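
The effect of `stratify` is easiest to see on a small imbalanced example. The sketch below uses made-up labels (`toy_X` and `toy_y` are hypothetical, not the wine data) and checks that the 60/40 class ratio survives the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (hypothetical): 60 samples of class 0 and 40 of class 1.
toy_y = np.array([0] * 60 + [1] * 40)
toy_X = np.arange(len(toy_y)).reshape(-1, 1)  # placeholder feature column

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=42, stratify=toy_y
)

# With stratify, both splits keep the 60/40 class ratio.
print("train counts:", np.bincount(toy_y_train))
print("test counts :", np.bincount(toy_y_test))
```

Without `stratify`, a small test set can end up with noticeably different class shares, which makes per-class metrics noisier.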


## Part 3 — Baseline model (Logistic Regression, **no scaling**)

Train Logistic Regression without scaling and evaluate.

**TODOs**
1. Train Logistic Regression
2. Predict on test set
3. Report:
   - Accuracy
   - Confusion matrix
   - Classification report

**Reflection (markdown):**
- Why might this baseline be suboptimal even if accuracy looks OK?


In [None]:
# TODO: Baseline logistic regression WITHOUT scaling
# Hint: increase max_iter to avoid convergence warnings
baseline_clf = LogisticRegression(max_iter=5000)

baseline_clf.fit(X_train, y_train)
y_pred_base = baseline_clf.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_base))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred_base))
print("\nClassification report:\n", classification_report(y_test, y_pred_base))
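
When reading the confusion matrix, it helps to know how it is built. The sketch below builds one by hand for six made-up predictions (`y_true_toy` and `y_pred_toy` are hypothetical), using the same layout as `confusion_matrix`: rows are true classes, columns are predicted classes.

```python
# Hypothetical labels for six samples and three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]

n_classes = 3
# matrix[i][j] counts samples whose true class is i and predicted class is j.
matrix = [[0] * n_classes for _ in range(n_classes)]
for true_label, pred_label in zip(y_true_toy, y_pred_toy):
    matrix[true_label][pred_label] += 1

for row in matrix:
    print(row)

# Correct predictions sit on the diagonal.
correct = sum(matrix[i][i] for i in range(n_classes))
print("accuracy:", correct / len(y_true_toy))
```

Off-diagonal entries show which pairs of classes the model confuses, which accuracy alone does not reveal.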


**Reflection (TODO):**  
- TODO


## Part 4 — Scaling + Logistic Regression (core lesson)

Now apply `StandardScaler` and retrain Logistic Regression using a Pipeline.

**TODOs**
1. Create a pipeline: `StandardScaler()` -> `LogisticRegression()`
2. Fit on training set
3. Evaluate on test set
4. Fill the comparison table (Accuracy and Macro F1)

**Required written explanation (markdown):**
- Explain *why* scaling changed (or did not change) the results.
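
To make concrete what `StandardScaler` does to each column, here is a hand computation on a single made-up feature (the list `values` is hypothetical). It subtracts the column mean and divides by the population standard deviation, which is the convention `StandardScaler` uses.

```python
from statistics import fmean, pstdev

# One hypothetical feature column.
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = fmean(values)   # column mean
std = pstdev(values)   # population standard deviation (divides by n)

# z-score: centre on the mean, then express in units of standard deviation.
scaled = [(v - mean) / std for v in values]

print("mean:", mean, "std:", std)
print("scaled:", scaled)
print("scaled mean:", fmean(scaled), "scaled std:", pstdev(scaled))
```

After this transformation every feature has mean 0 and standard deviation 1, so no single feature dominates purely because of its units.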


In [None]:
# TODO: Logistic regression WITH scaling using a pipeline
scaled_lr = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000))
])

scaled_lr.fit(X_train, y_train)
y_pred_scaled = scaled_lr.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred_scaled))
print("\nConfusion matrix:\n", confusion_matrix(y_test, y_pred_scaled))
print("\nClassification report:\n", classification_report(y_test, y_pred_scaled))


In [None]:
# TODO: Compute Macro F1 for baseline and scaled versions and fill the table below
from sklearn.metrics import f1_score

acc_base = accuracy_score(y_test, y_pred_base)
f1_base = f1_score(y_test, y_pred_base, average="macro")

acc_scaled = accuracy_score(y_test, y_pred_scaled)
f1_scaled = f1_score(y_test, y_pred_scaled, average="macro")

acc_base, f1_base, acc_scaled, f1_scaled
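
Macro F1 is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of its size. The sketch below computes it by hand on made-up labels; `y_true_toy`, `y_pred_toy`, and the helper `f1_for_class` are hypothetical names introduced only for illustration.

```python
# Hypothetical labels for six samples and three classes.
y_true_toy = [0, 0, 1, 1, 2, 2]
y_pred_toy = [0, 1, 1, 1, 2, 0]


def f1_for_class(y_true, y_pred, cls):
    """F1 for one class, treating it as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == cls and p == cls for t, p in pairs)
    fp = sum(t != cls and p == cls for t, p in pairs)
    fn = sum(t == cls and p != cls for t, p in pairs)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


labels = sorted(set(y_true_toy) | set(y_pred_toy))
per_class_f1 = [f1_for_class(y_true_toy, y_pred_toy, c) for c in labels]

# Macro average: plain mean over classes, no weighting by class size.
macro_f1 = sum(per_class_f1) / len(per_class_f1)

print("per-class F1:", [round(f, 3) for f in per_class_f1])
print("macro F1:", round(macro_f1, 3))
```

Because the three wine classes are not equally frequent, macro F1 can differ from accuracy when the model does worse on one class.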


### Comparison table (fill in the numbers)

| Metric | No Scaling (baseline LR) | With Scaling (LR) |
|---|---:|---:|
| Accuracy | TODO | TODO |
| Macro F1 | TODO | TODO |

**Explanation (TODO):**  
- TODO


## Part 5 — Model comparison (train at least 3 models)

Train and evaluate **at least 3** models from the list below:

- KNN
- SVM (linear or RBF)
- Decision Tree
- Random Forest

**Rules**
- Use the same train/test split.
- Use scaling for models that need it (KNN, SVM, Logistic Regression).
- Keep default hyperparameters (you may set `random_state` where available).

**TODOs**
1. Train your chosen models
2. Compute accuracy + macro F1 for each
3. Build a comparison table (DataFrame)


In [None]:
# TODO: Define candidate models
# Tip: use Pipelines for models that need scaling
models = {
    "KNN (scaled)": Pipeline([("scaler", StandardScaler()), ("clf", KNeighborsClassifier())]),
    "SVM RBF (scaled)": Pipeline([("scaler", StandardScaler()), ("clf", SVC())]),
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# TODO: Choose at least 3 models to evaluate (uncomment the names you want)
chosen_model_names = [
    # "KNN (scaled)",
    # "SVM RBF (scaled)",
    # "Decision Tree",
    # "Random Forest",
]

# TODO: Evaluate chosen models and store results
results = []

for name in chosen_model_names:
    clf = models[name]
    clf.fit(X_train, y_train)
    preds = clf.predict(X_test)
    results.append({
        "model": name,
        "accuracy": accuracy_score(y_test, preds),
        "macro_f1": f1_score(y_test, preds, average="macro"),
    })

results_df = pd.DataFrame(results).sort_values(by="macro_f1", ascending=False)
results_df


**TODO:** Briefly comment (1–3 sentences) on which model performed best and any surprises.

- TODO


## Part 6 — Overfitting check

Pick your best-performing model from Part 5 and compare:

- Training accuracy
- Test accuracy

**TODOs**
1. Select the best model name from `results_df`
2. Refit it on training set (if needed)
3. Compute training and test accuracy
4. Answer: Is there evidence of overfitting?


In [None]:
# TODO: Pick best model from results_df
# Example:
# best_model_name = results_df.iloc[0]["model"]
best_model_name = None  # TODO

best_clf = models[best_model_name]
best_clf.fit(X_train, y_train)

train_acc = accuracy_score(y_train, best_clf.predict(X_train))
test_acc = accuracy_score(y_test, best_clf.predict(X_test))

print("Best model:", best_model_name)
print("Training accuracy:", train_acc)
print("Test accuracy    :", test_acc)
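
For reference, here is what a train/test gap can look like. The sketch below is self-contained and uses an unrestricted decision tree purely as an illustration, since such trees tend to fit their training data very closely; it is not a suggestion about which model is best. All names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X_demo, y_demo = load_wine(return_X_y=True)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# No depth limit: the tree keeps splitting until the leaves are pure.
tree_demo = DecisionTreeClassifier(random_state=42).fit(X_tr_demo, y_tr_demo)

train_acc_demo = tree_demo.score(X_tr_demo, y_tr_demo)
test_acc_demo = tree_demo.score(X_te_demo, y_te_demo)

# A large positive gap is one sign of overfitting.
gap_demo = train_acc_demo - test_acc_demo
print(f"train={train_acc_demo:.3f} test={test_acc_demo:.3f} gap={gap_demo:.3f}")
```

With only about 36 test samples, a single misclassified wine shifts test accuracy by roughly 0.03, so small gaps should be read with caution.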


**Overfitting judgment (TODO):**  
- TODO


## Part 7 — Final reflection (required)

Answer briefly:

1. Which model performed best and why (in your view)?
2. Which model would you **not** trust for this dataset?
3. One common beginner mistake you made or almost made.

**Your answers (TODO):**
1. TODO  
2. TODO  
3. TODO  


## Optional stretch goals (extra)

Pick any ONE:

- Add 5-fold cross-validation and compare to your test-set result
- Make a PCA 2D plot to visualize class separation
- Try GridSearchCV on KNN (`n_neighbors`) or SVM (`C`, `gamma`) *carefully* (small grid)

*(Optional work should be clearly separated from the main assignment.)*
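
For the cross-validation option, one possible starting point is sketched below. It assumes a scaled Logistic Regression pipeline as the model; the names `X_cv`, `y_cv`, `cv_pipeline`, and `cv_scores` are illustrative. Keeping the scaler inside the pipeline means it is refit on the training part of each fold, so no information from the held-out fold leaks into the scaling.

```python
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_cv, y_cv = load_wine(return_X_y=True)

cv_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("clf", LogisticRegression(max_iter=5000)),
])

# cv=5 with a classifier uses stratified folds by default.
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5, scoring="accuracy")

print("fold accuracies:", cv_scores.round(3))
print(f"mean={cv_scores.mean():.3f} std={cv_scores.std():.3f}")
```

The spread across folds gives a rough sense of how much a single 80/20 test score could vary by chance.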
