# Classification

In the previous chapter, we learned about logistic regression—a natural starting point for classification due to its close connection with linear regression. Here, we revisit the movement dataset and show how logistic regression can be applied using a single feature. 

We'll use the acceleration along the Z-axis (`z_acc`) to predict the type of exercise, encoded as 0 or 1.


In [None]:
import pandas as pd
import seaborn as sns
import numpy as np
import matplotlib.pyplot as plt

%matplotlib widget

movement = pd.read_csv(
    "https://raw.githubusercontent.com/digital-sustainability/SAI3-2025/refs/heads/main/datasets/movement.csv"
)
movement.head()

We start by fitting a logistic regression model using only one feature: `z_acc`. The target variable is `move_type`, which indicates the class.

In [None]:
from sklearn import linear_model

X = movement[["z_acc"]]
y = movement["move_type"]

log_model = linear_model.LogisticRegression()
log_model.fit(X=X, y=y)

X_pred = pd.DataFrame(np.arange(-10, 50, 0.1), columns=["z_acc"])
pred_prob = log_model.predict_proba(X_pred)
pred = log_model.predict(X_pred)

We visualize the decision boundary of the logistic regression model. While the model predicts a category based on `z_acc`, we will see that it cannot perfectly separate the two classes with this feature alone.

In [None]:
ax = sns.scatterplot(data=movement, x="z_acc", y="move_type")
ax.plot(X_pred["z_acc"], pred, "g", label="Predicted class")
ax.legend();
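The step-shaped prediction comes from thresholding a sigmoid of a linear score. The following minimal sketch uses synthetic one-feature data (`x_demo` and `y_demo` are made up for illustration and are not the movement dataset). It recomputes the probabilities by hand from `coef_` and `intercept_`, then locates the feature value where the predicted class flips.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic, illustrative data: class 1 tends to have larger feature values
rng = np.random.default_rng(0)
x_demo = np.concatenate([rng.normal(0.0, 1.0, 50), rng.normal(3.0, 1.0, 50)]).reshape(-1, 1)
y_demo = np.array([0] * 50 + [1] * 50)

demo_model = LogisticRegression().fit(x_demo, y_demo)
w = demo_model.coef_[0, 0]
b = demo_model.intercept_[0]

# Probability of class 1 is the sigmoid of the linear score w * x + b
manual_prob = 1.0 / (1.0 + np.exp(-(w * x_demo[:, 0] + b)))
sklearn_prob = demo_model.predict_proba(x_demo)[:, 1]

# The predicted class flips where the score is zero, i.e. at x = -b / w
threshold = -b / w
prob_at_threshold = demo_model.predict_proba([[threshold]])[0, 1]
```

With a single feature the decision boundary is therefore one threshold value, which is why overlapping classes cannot be separated perfectly.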

## Logistic Regression with Two Features

With only one feature (`z_acc`), we saw that the logistic regression model couldn't perfectly separate the two exercise types. What if we include an additional feature?

We'll now use both `z_acc` and `y_acc` to fit a logistic regression model. To visualize the data in this higher-dimensional feature space, we'll plot it in 3D.


In [None]:
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1, projection="3d")
ax2 = fig.add_subplot(1, 2, 2, projection="3d")

axes = [ax1, ax2]
for ax in axes:
    scatter = ax.scatter(
        movement["z_acc"].values,
        movement["y_acc"].values,
        movement["move_type"].values,
        c=movement["move_type"].values,
        cmap="Set1",
    )
    ax.set_xlabel("z_acc", fontsize=12)
    ax.set_ylabel("y_acc", fontsize=12)
    ax.set_zlabel("move_type", fontsize=12)

# 3D View (angled and top-down)
ax1.view_init(elev=30.0, azim=20)
ax2.view_init(elev=90.0, azim=-90)
ax2.set_proj_type("ortho")

fig.tight_layout()
plt.show()

These 3D plots show how adding `y_acc` as a second feature helps distinguish the two classes more effectively. While neither feature alone provides a clear separation, their combination makes a linear decision boundary more feasible.

Let’s now fit a logistic regression model using both features.

In [None]:
X = movement[["z_acc", "y_acc"]]
y = movement["move_type"]

log_model = linear_model.LogisticRegression()
log_model.fit(X=X, y=y)

## Visualizing the Sigmoid Decision Surface

When using two features, the logistic regression model defines a **sigmoid-shaped surface** in 3D space (two feature axes plus the predicted probability). The decision boundary (i.e., where the predicted probability is 0.5) becomes a **straight line** in the feature plane.
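A predicted probability of 0.5 corresponds to a linear score of zero, $w_1 x_1 + w_2 x_2 + b = 0$, which can be solved for one feature in terms of the other. The sketch below checks this on synthetic two-feature data (`X_demo`, `y_demo`, and the chosen range of `x1_line` are illustrative assumptions, not taken from the movement dataset).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic, overlapping point clouds (illustrative only)
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal([0, 0], 1.0, (60, 2)), rng.normal([3, 3], 1.0, (60, 2))])
y_demo = np.array([0] * 60 + [1] * 60)

demo_model = LogisticRegression().fit(X_demo, y_demo)
w1, w2 = demo_model.coef_[0]
b = demo_model.intercept_[0]

# Points on the line w1 * x1 + w2 * x2 + b = 0, solved for x2
x1_line = np.linspace(-2.0, 5.0, 20)
x2_line = -(w1 * x1_line + b) / w2

# Every point on that line should receive a probability of 0.5
boundary_prob = demo_model.predict_proba(np.column_stack([x1_line, x2_line]))[:, 1]
```

Computing the line from the coefficients gives the exact boundary, whereas selecting grid points with probabilities close to 0.5 only approximates it.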

Let’s plot this surface and highlight the region where the model is uncertain (around $p=0.5$).


In [None]:
# Create grid of points spanning the input space
X_1_grid, X_2_grid = np.meshgrid(
    np.linspace(movement["z_acc"].min(), movement["z_acc"].max(), 1000),
    np.linspace(movement["y_acc"].min(), movement["y_acc"].max(), 1000),
)

# Note the correct column order: ['z_acc', 'y_acc']
grid_df = pd.DataFrame(np.c_[X_1_grid.ravel(), X_2_grid.ravel()], columns=["z_acc", "y_acc"])

# Predict probabilities over the grid
pred_prob = log_model.predict_proba(grid_df)

# Select points near decision boundary (probability ≈ 0.5)
mask = (pred_prob[:, 1] >= 0.49) & (pred_prob[:, 1] <= 0.51)
sel_X_1 = X_1_grid.ravel()[mask]
sel_X_2 = X_2_grid.ravel()[mask]
sel_line = pred_prob[:, 1][mask]

In [None]:
# Plot the sigmoid surface and decision boundary
fig = plt.figure(figsize=(10, 5))
ax1 = fig.add_subplot(1, 2, 1, projection="3d", computed_zorder=False)
ax2 = fig.add_subplot(1, 2, 2, projection="3d", computed_zorder=False)

axes = [ax1, ax2]
for ax in axes:
    surf = ax.plot_surface(
        X_1_grid,
        X_2_grid,
        pred_prob[:, 1].reshape(X_1_grid.shape),
        cmap="cool",
        antialiased=True,
        vmin=0,
        vmax=1,
        alpha=0.7,
    )

    ax.scatter(
        movement["z_acc"], movement["y_acc"], movement["move_type"], c=movement["move_type"], cmap="Set1", alpha=0.8
    )

    ax.scatter(sel_X_1, sel_X_2, sel_line, c="black", label="~0.5 prob")

# Label both views, as in the earlier 3D scatter plot
for ax in axes:
    ax.set_xlabel("z_acc", fontsize=12)
    ax.set_ylabel("y_acc", fontsize=12)
    ax.set_zlabel("Predicted probability", fontsize=12)

ax1.view_init(elev=30.0, azim=70)
ax2.view_init(elev=90.0, azim=-90)
ax2.set_proj_type("ortho")

fig.tight_layout()
plt.show()

From the top-down view (right plot), the points with a predicted probability near 0.5 trace a straight line: the **linear decision boundary** in the feature plane. Because this line can tilt along both `z_acc` and `y_acc`, it can follow the gap between the classes more closely than a single threshold on `z_acc`.

## Support Vector Machines (SVM)

Logistic regression aims to separate classes by estimating probabilities and fitting a sigmoid function. However, it doesn’t explicitly maximize the distance between classes.

Support Vector Machines (SVMs), on the other hand, aim to find the **decision boundary that maximizes the margin** between the classes. Only the data points on or inside the margin, called **support vectors**, influence where the boundary lies.
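For a linear SVM with weight vector $w$, the margin lines lie where the decision function equals $-1$ and $+1$, so the margin is $2 / \lVert w \rVert$ wide. The following sketch illustrates this on a tiny, hand-made, linearly separable data set (the points and the value `C=100` are assumptions chosen for illustration).

```python
import numpy as np
from sklearn import svm

# Tiny, hand-made, linearly separable data set (illustrative only)
X_demo = np.array([[0, 0], [0, 1], [1, 0], [3, 3], [3, 4], [4, 3]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

demo_clf = svm.SVC(kernel="linear", C=100).fit(X_demo, y_demo)
w = demo_clf.coef_[0]

# Distance between the two margin lines
margin_width = 2.0 / np.linalg.norm(w)

# On separable data, support vectors sit on the margin (score close to -1 or +1)
sv_scores = demo_clf.decision_function(demo_clf.support_vectors_)

# Dividing the decision function by ||w|| gives the signed geometric distance
signed_dist = demo_clf.decision_function(X_demo) / np.linalg.norm(w)
```

Points with a score beyond ±1 lie outside the margin; moving or removing them leaves the fitted boundary unchanged.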

Let’s train an SVM with a linear kernel on the same two features: `z_acc` and `y_acc`.


In [None]:
from sklearn import svm

# Define and fit SVM model with a linear kernel
clf = svm.SVC(kernel="linear", C=1)
clf.fit(X, y)

We visualize the decision boundary learned by the SVM, including:
- The boundary itself (solid line)
- The margin (dashed lines)
- The support vectors (highlighted as large empty circles)

The decision function returns a signed score for each point that is proportional to its distance from the boundary: 0 on the boundary itself and ±1 on the margin lines.

In [None]:
# Prepare grid again for contour plot
xx = np.linspace(movement["z_acc"].min(), movement["z_acc"].max(), 1000)
yy = np.linspace(movement["y_acc"].min(), movement["y_acc"].max(), 1000)
XX, YY = np.meshgrid(xx, yy)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

# Evaluate the decision function on the grid (signed score, proportional to distance from the boundary)
Z = clf.decision_function(pd.DataFrame(xy, columns=["z_acc", "y_acc"])).reshape(XX.shape)

# Create a new 2D plot (not 3D!)
fig, ax2d = plt.subplots(figsize=(10, 6))

# Plot data and decision boundary
sns.scatterplot(data=movement, x="z_acc", y="y_acc", hue="move_type", palette="Set1", ax=ax2d)

# Plot the decision boundary and margins
CS = ax2d.contour(XX, YY, Z, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"])

# Highlight support vectors
ax2d.scatter(
    clf.support_vectors_[:, 0],
    clf.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
    label="support",
)

ax2d.clabel(CS, inline=1, fontsize=12)
ax2d.legend()
plt.show()

## Non-Linear SVMs with Kernels

Sometimes the data cannot be separated with a straight line (or a plane in higher dimensions). This is where **kernel methods** come in.

Support Vector Machines allow for **non-linear decision boundaries** by implicitly mapping the input features into a higher-dimensional space using a **kernel function**. One common example is the **polynomial kernel**.
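To make "implicitly mapping" concrete, the sketch below compares a degree-2 polynomial kernel with an explicit feature map for two-dimensional inputs. It uses the simplest homogeneous form, $(x \cdot z)^2$; scikit-learn's `poly` kernel additionally has a `gamma` scale and a `coef0` offset, which are left out here. The helper names `poly2_kernel` and `poly2_features` are hypothetical.

```python
import numpy as np


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: (x . z) ** 2."""
    return np.dot(x, z) ** 2


def poly2_features(x):
    """Explicit feature map whose inner product reproduces poly2_kernel."""
    x1, x2 = x
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])


a = np.array([1.0, 2.0])
c = np.array([3.0, -1.0])

# Kernel evaluated directly in the original 2D space
kernel_value = poly2_kernel(a, c)

# Same quantity as an ordinary inner product in the mapped 3D space
explicit_value = np.dot(poly2_features(a), poly2_features(c))
```

Both values agree, so a linear boundary in the mapped space corresponds to a quadratic boundary in the original features, without the mapped features ever being computed during training.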

Let’s try this on the same dataset using a polynomial kernel of degree 2.

In [None]:
# Define and train SVM with a non-linear (polynomial) kernel
clf_nonlin = svm.SVC(kernel="poly", degree=2, C=10)
clf_nonlin.fit(X, y)

We now visualize the non-linear decision boundary and margins. Compared to the linear model, this approach can better handle curved class boundaries.

In [None]:
# Rebuild the same mesh grid as in the linear case so this cell runs on its own
xx = np.linspace(movement["z_acc"].min(), movement["z_acc"].max(), 1000)
yy = np.linspace(movement["y_acc"].min(), movement["y_acc"].max(), 1000)
XX, YY = np.meshgrid(xx, yy)
xy = np.vstack([XX.ravel(), YY.ravel()]).T

# Compute decision function values
Z_nonlin = clf_nonlin.decision_function(pd.DataFrame(xy, columns=["z_acc", "y_acc"])).reshape(XX.shape)

# Plot
fig, ax2d_nonlin = plt.subplots(figsize=(10, 6))

sns.scatterplot(data=movement, x="z_acc", y="y_acc", hue="move_type", palette="Set1", ax=ax2d_nonlin)

CS = ax2d_nonlin.contour(XX, YY, Z_nonlin, colors="k", levels=[-1, 0, 1], alpha=0.5, linestyles=["--", "-", "--"])

ax2d_nonlin.scatter(
    clf_nonlin.support_vectors_[:, 0],
    clf_nonlin.support_vectors_[:, 1],
    s=100,
    linewidth=1,
    facecolors="none",
    edgecolors="k",
    label="support",
)

ax2d_nonlin.clabel(CS, inline=1, fontsize=12)
ax2d_nonlin.legend()
plt.show()

## K-Nearest Neighbors (KNN)

Unlike models that learn decision boundaries during training, **K-Nearest Neighbors (KNN)** makes decisions at prediction time. 

For any new data point, the algorithm:
1. Finds the $k$ closest points in the training set,
2. Assigns the most common class label among those neighbors.

This is a **non-parametric**, instance-based method. Here, we use the default setting of $k=5$.
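The two steps above can be sketched in a few lines of NumPy. This is a simplified illustration using Euclidean distance, not scikit-learn's implementation; the function `knn_predict` and the toy arrays are hypothetical.

```python
from collections import Counter

import numpy as np


def knn_predict(X_train, y_train, x_new, k=5):
    """Classify x_new by majority vote among its k nearest training points."""
    # Euclidean distance from x_new to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # Most common label among those neighbors
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]


X_toy = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_low = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]), k=3)
pred_high = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]), k=3)
```

There is no training step beyond storing the data. With two classes, an odd $k$ avoids tied votes; in this sketch a tie would be broken by whichever label is counted first.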


In [None]:
from sklearn import neighbors

# Train KNN classifier
kn_model = neighbors.KNeighborsClassifier()
kn_model.fit(X=X, y=y)

We now visualize the decision regions predicted by KNN. You'll notice that the decision boundary may look more irregular, especially when $k$ is small or outliers are present.

In [None]:
# Predict class labels across the grid
Z_kn = kn_model.predict(pd.DataFrame(xy, columns=["z_acc", "y_acc"])).reshape(XX.shape)

# Plot KNN decision boundary
fig, ax_knn = plt.subplots(figsize=(10, 8))

# Fill contour for classification regions
CS = ax_knn.contourf(XX, YY, Z_kn, cmap="Set2")

# Plot data points with both hue and style
sns.scatterplot(
    data=movement, x="z_acc", y="y_acc", hue="move_type", style="move_type", s=100, ax=ax_knn, palette="Set1"
)

ax_knn.set_title("KNN Decision Regions (k=5)")
ax_knn.legend()
plt.show()

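💡 **Observation:** With KNN, the decision boundary is sensitive to noise and outliers in the dataset. The choice of $k$ balances bias and variance: a small $k$ follows individual points closely (low bias, high variance), while a large $k$ smooths the boundary (higher bias, lower variance).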
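One way to see the effect of $k$ is to compare training and test accuracy for several values. The sketch below does this on synthetic, overlapping classes; the data, the split ratio, and the values of `k` are assumptions for illustration, not results for the movement dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Two synthetic, overlapping point clouds (illustrative only)
rng = np.random.default_rng(2)
X_demo = np.vstack([rng.normal([0, 0], 1.5, (100, 2)), rng.normal([2, 2], 1.5, (100, 2))])
y_demo = np.array([0] * 100 + [1] * 100)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.3, random_state=0)

# Store (training accuracy, test accuracy) for each k
scores = {}
for k in (1, 5, 25):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    scores[k] = (model.score(X_tr, y_tr), model.score(X_te, y_te))

for k, (train_acc, test_acc) in scores.items():
    print(f"k={k:>2}: train={train_acc:.2f}, test={test_acc:.2f}")
```

With $k=1$ every training point is its own nearest neighbor, so the training accuracy is perfect as long as no identical points carry different labels. A large gap between training and test accuracy is a sign of high variance.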

## Exercises: Seed Classification

We’ll now apply what we've learned to a new dataset: the **Seeds dataset** from the UCI Machine Learning Repository. This dataset contains measurements of various wheat seeds and their corresponding types.

We'll focus on classifying two of the three seed types (classes 1 and 3).

---

### 1. Load and Preprocess the Data

Load the dataset and keep only the two desired classes. This starting step is already provided for you. Make sure that you understand how the dataframe is filtered.

Then, split your dataset into a training set (80%) and a test set (20%).

💡 Tip: Use `train_test_split` from `sklearn.model_selection`. You can split an entire dataframe without selecting specific columns first.
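As a hedged illustration of the tip, the sketch below splits a small made-up dataframe (`toy`, with hypothetical columns `feature` and `label`) rather than the seeds data. The `random_state` and `stratify` arguments are optional choices made here for reproducibility and balanced classes.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up dataframe with one feature column and one label column
toy = pd.DataFrame({"feature": range(10), "label": [0, 1] * 5})

# Split whole rows: both returned objects are dataframes with all columns
train_df, test_df = train_test_split(
    toy, test_size=0.2, random_state=42, stratify=toy["label"]
)

print(train_df.shape, test_df.shape)
```

Features and target can then be selected from `train_df` and `test_df` separately for each model.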

---

### 2. Logistic Regression

Use logistic regression to classify the seeds based on:
- First only `length_groove`
- Then both `length_groove` and `perimeter`

For each case:
- Fit the model on the train set and make predictions on the test set
- Evaluate the model using a classification report

Which model performs better?

💡 Tip: Use `classification_report` from `sklearn.metrics`.
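The following sketch shows the shape of a classification report on four made-up labels (`y_true_toy` and `y_pred_toy` are illustrative, not model output). Passing `output_dict=True` returns the same numbers as a dictionary, which is convenient when comparing models programmatically.

```python
from sklearn.metrics import classification_report

# Made-up true and predicted labels for two classes (1 and 3)
y_true_toy = [1, 1, 3, 3]
y_pred_toy = [1, 3, 3, 3]

# Human-readable table with precision, recall and F1-score per class
print(classification_report(y_true_toy, y_pred_toy))

# Same metrics as a nested dictionary, keyed by the class label as a string
report = classification_report(y_true_toy, y_pred_toy, output_dict=True)
f1_class_3 = report["3"]["f1-score"]
```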

---

### 3. K-Nearest Neighbors

Train a KNN classifier on the same features. Compare its performance to logistic regression in terms of F1-score. Try different values for $k$ (e.g., 3, 5, 7) and observe how the results change.

---

Feel free to experiment and visualize your results!

In [None]:
# Step 1
seeds = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/00236/seeds_dataset.txt",
    sep="\t",
    on_bad_lines="skip",
    names=["area", "perimeter", "compactness", "length", "width", "symmetry_coef", "length_groove", "seed_type"],
)

# Keep only seed types 1 and 3
seeds = seeds[(seeds.seed_type == 1) | (seeds.seed_type == 3)]