# Lab 1b: Linear classification

The [Fashion-MNIST dataset](https://www.openml.org/d/40996) contains 70,000 images of Zalando fashion products, classified into 10 types of clothing, each represented by 28 by 28 pixel values. We'll see how well we can classify these with linear models. Let's start by looking at our data:

In [None]:
# Auto-setup when running on Google Colab
if 'google.colab' in str(get_ipython()):
    !pip install openml

# General imports
%matplotlib inline
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import openml as oml
from matplotlib import cm

# Hide convergence warning for now
import warnings
from sklearn.exceptions import ConvergenceWarning
warnings.simplefilter(action="ignore", category=ConvergenceWarning)

In [None]:
# Download FMNIST data. Takes a while the first time.
fmnist = oml.datasets.get_dataset(40996)
X, y, _, _ = fmnist.get_data(target=fmnist.default_target_attribute);
fmnist_classes = {0:"T-shirt/top", 1: "Trouser", 2: "Pullover", 3: "Dress", 4: "Coat", 5: "Sandal",
                  6: "Shirt", 7: "Sneaker", 8: "Bag", 9: "Ankle boot"}

In [None]:
# Take some random examples, reshape to a 28x28 image and plot
from random import randint
fig, axes = plt.subplots(1, 5,  figsize=(10, 5))
for i in range(5):
    n = randint(0, len(X) - 1)  # randint includes both endpoints
    axes[i].imshow(X.values[n].reshape(28, 28), cmap=plt.cm.gray_r)
    axes[i].set_xlabel((fmnist_classes[int(y.values[n])]))
    axes[i].set_xticks(()), axes[i].set_yticks(())
plt.show();

## Exercise 1: A quick benchmark
First, we'll try the default [Logistic Regression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) and [Linear SVMs](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html?highlight=linearsvc#sklearn.svm.LinearSVC). Click the links to read the documentation. We'll also compare them to [k-Nearest Neighbors](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) as a point of reference. To see whether our models are overfitting, we also evaluate the training set error. This can be done using [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) instead of [`cross_val_score`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html#sklearn.model_selection.cross_val_score).

For now we are just interested in a quick approximation, so we don't use the full dataset for our experiments. Instead, we use 10% of our samples:

In [None]:
# Import the required libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_validate, train_test_split, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

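# Load dataset (Fashion-MNIST) as NumPy arrays and scale pixels to [0, 1]
X, y = fetch_openml('Fashion-MNIST', version=1, return_X_y=True, as_frame=False)
X = X / 255.0
y = y.astype(int)

# Take a 10% stratified subsample to speed up the experiments
Xs, _, ys, _ = train_test_split(X, y, train_size=0.1, stratify=y, random_state=42)

# Train/test split of the subsample
X_train, X_test, y_train, y_test = train_test_split(
    Xs, ys, test_size=0.2, stratify=ys, random_state=42
)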



With this small sample of our data we can now train and evaluate the three classifiers.

### Exercise 1.1
Implement a function below which evaluates each classifier passed into it on the given data, and then returns both the train and test scores of each as a list. You are allowed to import additional functions from whichever module you like, but you should be able to complete the function with the [`cross_validate`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) function and standard Python built-ins. Below the function you will find example output.

In [None]:
def evaluate_learners(classifiers, X, y):
    train_scores, test_scores = [], []
    for clf in classifiers:
        cv_results = cross_validate(clf, X, y, return_train_score=True, cv=5)
        train_scores.append(cv_results['train_score'])
        test_scores.append(cv_results['test_score'])
    return train_scores, test_scores


### Exercise 1.2
Call the function you created with a Logistic Regression, Linear SVM, and k-Nearest Neighbors Classifier.
Store the return values in the variables `train_scores` and `test_scores`. Then, run the code given below to produce a plot visualizing the scores.

In [None]:
classifiers = [
    LogisticRegression(max_iter=1000),
    LinearSVC(max_iter=1000),
    KNeighborsClassifier()
]

train_scores, test_scores = evaluate_learners(classifiers, X_train, y_train)

In [None]:
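# Use the class name of each classifier as its label on the x-axis
labels = [clf.__class__.__name__ for clf in classifiers]
x = np.arange(len(labels))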
width = 0.35

fig, ax = plt.subplots()
ax.bar(x - width/2, [np.mean(ts) for ts in train_scores], width, label='Train')
ax.bar(x + width/2, [np.mean(ts) for ts in test_scores], width, label='Test')

ax.set_ylabel("Accuracy")
ax.set_title("Train vs Test Accuracy")
ax.set_xticks(x)
ax.set_xticklabels(labels)
ax.legend()
plt.show()

### Exercise 1.3
Interpret the plot. Which is the best classifier? Are any of the models overfitting? If so, what can we do to solve this? Is there a lot of variance in the results?
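
One way to answer the variance question is to look at the spread of the per-fold scores rather than only their mean. The sketch below assumes that `train_scores` and `test_scores` are lists of per-fold score arrays, as returned by `evaluate_learners`. The helper name `summarize_scores` and the numbers in the usage example are made up for illustration and are not real results.

```python
import numpy as np

def summarize_scores(names, train_scores, test_scores):
    """Return mean/std of the test scores and the mean train-test gap per model."""
    summary = {}
    for name, tr, te in zip(names, train_scores, test_scores):
        tr = np.asarray(tr, dtype=float)
        te = np.asarray(te, dtype=float)
        summary[name] = {
            "test_mean": te.mean(),
            "test_std": te.std(),           # spread across the folds
            "gap": tr.mean() - te.mean(),   # a large gap hints at overfitting
        }
    return summary

# Made-up per-fold scores, for illustration only
demo = summarize_scores(
    ["model_a", "model_b"],
    train_scores=[[0.90, 0.92, 0.91], [1.00, 1.00, 1.00]],
    test_scores=[[0.85, 0.87, 0.86], [0.80, 0.84, 0.82]],
)
for name, stats in demo.items():
    print(f"{name}: test {stats['test_mean']:.3f} +/- {stats['test_std']:.3f}, "
          f"gap {stats['gap']:.3f}")
```

A small standard deviation across folds together with a small gap suggests a stable model that generalizes well. A large gap suggests overfitting.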

## Exercise 2: Regularization
We will now tune each algorithm's main regularization hyperparameter: the misclassification cost in SVMs (C), the regularization parameter in logistic regression (C), and the number of neighbors (n_neighbors) in kNN. We expect the optimum for the C parameters to lie in $[10^{-12},10^{12}]$ and for n_neighbors between 1 and 50. C should be varied on a log scale (e.g. [0.01, 0.1, 1, 10, 100]) and k should be varied uniformly (e.g. [1,2,3,4]).
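
As a minimal sketch of what such grids could look like, the snippet below builds a log-spaced grid for C and a uniformly spaced grid for k. The grid sizes (13 values of C, every second k) are assumptions chosen to keep the search cheap, not values prescribed by the exercise.

```python
import numpy as np

# 13 values of C, two orders of magnitude apart, spanning 1e-12 to 1e12
C_grid = np.logspace(-12, 12, num=13)

# Every second value of k between 1 and 49
k_grid = np.arange(1, 51, 2)

print(C_grid[:3], C_grid[-1])
print(k_grid[:5], k_grid[-1])
```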

### Exercise 2.1
Vary the regularization parameters in the range given above and, for each classifier, create a line plot that plots both the training and test score for every value of the regularization hyperparameter. Hence, you should produce 3 plots, one for each classifier. Use the default 5-fold cross validation for all scores, but only plot the means.

Hints:
* Think about the time complexity of these models. Trying too many hyperparameter values may take too much time.
* You can make use of numpy's [logspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.logspace.html), [geomspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.geomspace.html?highlight=geomspace#numpy.geomspace), and [linspace](https://docs.scipy.org/doc/numpy/reference/generated/numpy.linspace.html#numpy.linspace) functions.
* You can use matplotlib's default [plot](https://matplotlib.org/tutorials/introductory/pyplot.html) function to plot the train and test scores.
* You can manually loop over the hyperparameter ranges, or you can already check out scikit-learn's [GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html) function to save some programming. We'll see it again later in the course.
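
The manual loop mentioned in the last hint could look roughly like the sketch below. To keep it quick and self-contained, it uses scikit-learn's small bundled digits dataset instead of Fashion-MNIST and only three values of `n_neighbors`. Both choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_validate
from sklearn.neighbors import KNeighborsClassifier

# Small bundled dataset, used here only so the sketch runs quickly
X_demo, y_demo = load_digits(return_X_y=True)

k_values = [1, 5, 15]
mean_train, mean_test = [], []
for k in k_values:
    res = cross_validate(KNeighborsClassifier(n_neighbors=k), X_demo, y_demo,
                         cv=5, return_train_score=True)
    # Keep only the mean over the 5 folds, as the exercise asks
    mean_train.append(res["train_score"].mean())
    mean_test.append(res["test_score"].mean())

for k, tr, te in zip(k_values, mean_train, mean_test):
    print(f"k={k}: train={tr:.3f}, test={te:.3f}")
```

The two lists can then be passed to `plt.plot` against `k_values` to draw the train and test curves.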


#### Solution

In [None]:
def plot_tuning(grid_search, param_name, ax):
    ax.plot(grid_search.param_grid[param_name], grid_search.cv_results_['mean_test_score'], marker='o')
    ax.plot(grid_search.param_grid[param_name], grid_search.cv_results_['mean_train_score'], marker='x')
    ax.set_xlabel(param_name)
    ax.set_ylabel("Accuracy")
    if param_name == "C":  # C is varied on a log scale, n_neighbors uniformly
        ax.set_xscale("log")
    ax.legend(["Test", "Train"])
    ax.set_title(grid_search.best_estimator_.__class__.__name__)
    print(f"Best {param_name}: {grid_search.best_params_[param_name]}, ACC: {grid_search.best_score_:.4f}")

# Logistic Regression (C)
param_grid_lr = {"C": np.logspace(-2, 2, 10)}
grid_lr = GridSearchCV(LogisticRegression(max_iter=1000), param_grid_lr, cv=5, return_train_score=True)
grid_lr.fit(X_train, y_train)

# Linear SVM (C)
param_grid_svm = {"C": np.logspace(-2, 2, 10)}
grid_svm = GridSearchCV(LinearSVC(max_iter=1000), param_grid_svm, cv=5, return_train_score=True)
grid_svm.fit(X_train, y_train)

# KNN (n_neighbors)
param_grid_knn = {"n_neighbors": np.arange(1, 21)}
grid_knn = GridSearchCV(KNeighborsClassifier(), param_grid_knn, cv=5, return_train_score=True)
grid_knn.fit(X_train, y_train)

fig, axes = plt.subplots(1, 3, figsize=(15, 4))
plot_tuning(grid_lr, "C", axes[0])
plot_tuning(grid_svm, "C", axes[1])
plot_tuning(grid_knn, "n_neighbors", axes[2])
plt.show()


### Exercise 2.2
Interpret the plots. When are the methods underfitting? When are they overfitting? How sensitive are they to the regularization hyperparameter?

Underfitting:

Both the test score and the train score are low, which means the model is too simple or the regularization is too strong.

Example: Logistic Regression with a very small C (≈1e-12).

Overfitting:

The train score is very high but the test score drops, which means the model is too complex or the regularization is too weak.

Example: Logistic Regression or SVM with a very large C.

Most stable:

The point where the test score is highest and the train-test gap is smallest.

For kNN this is usually around n_neighbors ≈ 3–10.

For LogReg and SVM, C usually lies between 1e-4 and 10.

## Exercise 3: Interpreting misclassifications
Chances are that your models are not yet perfect. It is important to understand what kinds of errors they still make. Let's take a closer look at which instances are misclassified and which classes are often confused.
Train the logistic regression model with `C=1e-7`. Train the model on a training set, and make predictions for a test set (both sets should be sampled from our 10% subsample).

In [None]:
# Create a stratified train-test split on a sample
X_train, X_test, y_train, y_test = train_test_split(Xs,ys, stratify=ys, random_state=0)

### Exercise 3.1
Train the classifier as described above, obtain the predictions `y_pred` on the test set, and identify all the misclassified samples `misclassified_samples`. Then, run the visualization code below to study the misclassifications.

In [None]:
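# Use Logistic Regression with the best C found by the grid search
clf = LogisticRegression(C=grid_lr.best_params_["C"], max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

# Indices of the test samples whose prediction differs from the true label
misclassified_samples = np.nonzero(y_pred != y_test)[0]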



In [None]:
# Plot the first five misclassified test images with predicted (P) and true (T) labels
misclassified_idx = np.where(y_pred != y_test)[0][:5]

fig, axes = plt.subplots(1, 5, figsize=(12, 3))
for i, ax in enumerate(axes):
    ax.imshow(X_test[misclassified_idx[i]].reshape(28, 28), cmap="gray")
    ax.set_title(f"P:{y_pred[misclassified_idx[i]]}\nT:{y_test[misclassified_idx[i]]}")
    ax.axis("off")
plt.show()


### Exercise 3.2
Interpret the results. Are these misclassifications to be expected?


Some labels are easily confused (e.g. Shirt vs T-shirt/top, or Coat vs Pullover) because their shapes are similar.

These misclassifications are understandable and fairly "expected".

If many of the errors look unreasonable, the model may be underfitting (regularization too strong with C=1e-7).

### Exercise 3.3
Run the code below on your results to draw the complete confusion matrix and get more insight into the systematic misclassifications
of your model. A confusion matrix shows the number of examples for each pair of true and predicted classes. Interpret the results.
Does your model produce certain types of error more often than other types?

In [None]:
# Named conf_mat to avoid shadowing matplotlib's `cm` module imported earlier
conf_mat = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=conf_mat)
disp.plot(cmap="plasma", xticks_rotation=45)
plt.show()
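
To read the systematic errors off a confusion matrix without scanning the whole plot, one option is to list its largest off-diagonal entries. The sketch below assumes a square array with true classes as rows and predicted classes as columns, which is the layout `confusion_matrix` returns. The helper name `top_confusions` and the 3-class toy matrix are hypothetical and only serve as an illustration.

```python
import numpy as np

def top_confusions(conf, class_names, k=3):
    """Return the k largest off-diagonal entries as (true, predicted, count)."""
    conf = np.asarray(conf).copy()
    np.fill_diagonal(conf, 0)                       # ignore correct predictions
    order = np.argsort(conf, axis=None)[::-1][:k]   # flat indices, largest first
    rows, cols = np.unravel_index(order, conf.shape)
    return [(class_names[r], class_names[c], int(conf[r, c]))
            for r, c in zip(rows, cols)]

# Hypothetical 3-class matrix: rows = true class, columns = predicted class
toy = np.array([[50,  2,  8],
                [ 1, 55,  4],
                [12,  3, 45]])
pairs = top_confusions(toy, ["Shirt", "Trouser", "T-shirt/top"], k=2)
print(pairs)
```

For Fashion-MNIST, the class names could be taken from the `fmnist_classes` dictionary in label order.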


## Exercise 4: Interpreting model parameters
Finally, we'll take a closer look at the model parameters, i.e. the coefficients of our linear models. Since we are dealing with 28x28 pixel images, we have to learn 784 coefficients. What do these coefficients mean? We'll start by plotting them as 28x28 pixel images.

### Exercise 4.1
Train a Logistic Regression model and a Linear SVM using their tuned hyperparameters from exercise 2.
When in doubt, use `C=1e-7` for LogReg and `C=1e-8` for the SVM.
Pass the trained model to the provided plotting function. Interpret the results in detail.
Why do you get multiple plots per model? What do the features represent in your data?
Does it seem like the models pay attention to the right features?
Do your models seem to ignore certain features? Do you observe differences in quality between the different classes? Do you observe any differences between the models?

In [None]:
# Logistic Regression coefficients
clf_lr = LogisticRegression(C=grid_lr.best_params_["C"], max_iter=1000)
clf_lr.fit(X_train, y_train)

# SVM coefficients
clf_svm = LinearSVC(C=grid_svm.best_params_["C"], max_iter=1000)
clf_svm.fit(X_train, y_train)

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
for ax, clf, title in zip(axes, [clf_lr, clf_svm], ["Logistic Regression", "Linear SVM"]):
    # coef_ has one row per class; row 0 holds the weights for class 0 (T-shirt/top)
    coef = clf.coef_[0].reshape(28, 28)
    ax.imshow(coef, cmap="seismic", interpolation="nearest")
    ax.set_title(f"{title} (class 0)")
    ax.axis("off")
plt.show()
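
The reason a multiclass linear model yields several coefficient images is that it learns one weight vector per class. The sketch below illustrates this on scikit-learn's bundled 8x8 digits dataset, which stands in for Fashion-MNIST here only to keep the example fast. With 28x28 images the same reshape would use `(10, 28, 28)`.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0   # digits pixels range from 0 to 16

model = LogisticRegression(max_iter=2000).fit(X_demo, y_demo)

# One row of coefficients per class, one column per pixel
print(model.coef_.shape)

# Reshape each row back into an image: one coefficient image per class
coef_images = model.coef_.reshape(len(model.classes_), 8, 8)
print(coef_images.shape)
```

Each slice `coef_images[i]` can be passed to `imshow` to see which pixels push a prediction towards or away from class `i`.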


### Exercise 4.2
Repeat the previous exercise, but now only with logistic regression. In addition to a tuned version, also add a model that overfits a lot and one that underfits a lot. Interpret and explain the results.
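
One way to see what strongly and weakly regularized models do to the coefficients is to compare their overall magnitude. The sketch below is an illustration under assumptions: it uses the bundled digits dataset rather than Fashion-MNIST, and the three values of `C` are arbitrary picks for "underfit", "default", and "overfit", not tuned values.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = load_digits(return_X_y=True)
X_demo = X_demo / 16.0   # digits pixels range from 0 to 16

norms = {}
for C in [1e-4, 1.0, 1e2]:
    model = LogisticRegression(C=C, max_iter=5000).fit(X_demo, y_demo)
    # Small C = strong L2 penalty = coefficients shrunk towards zero
    norms[C] = np.linalg.norm(model.coef_)
    print(f"C={C:g}: coefficient norm = {norms[C]:.3f}")
```

Plotted as images, heavily shrunk coefficients tend to look smooth and faint, while weakly regularized ones tend to look noisy because individual pixels receive large weights.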