<a href="https://colab.research.google.com/github/gustafbjurstam/ML-retreat-tekmek-2025/blob/main/supervised_classification_methods_revised.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Table of Contents
1. Supervised Classification Methods  
    1.1 Introduction  
    1.2 What does "Supervised Classification" mean?  
    1.3 Roadmap  
2. Motivating Example: Fluid Flow Classification  
    2.1 Problem setup  
    2.2 Geometry and normalization  
    2.3 Mathematical models  
3. Logistic Regression  
    3.1 Intuition and how it works  
    3.2 Worked-out example  
    3.3 Fluid velocity profile application  
4. k-Nearest Neighbors (k-NN)  
    4.1 Intuition and how it works  
    4.2 Distance metrics  
    4.3 Advantages and disadvantages  
5. Support Vector Machine (SVM)  
    5.1 Hard and Soft Margins  
    5.2 Non-linear data  
6. Multi-Layer Perceptron (MLP)  
7. Classification methods comparison

---

# Supervised Classification Methods

Our goal is to understand what a "supervised" machine learning method is and explore how different machine learning techniques can be used to classify data when the class labels are known in advance.

## Introduction

This notebook introduces four supervised classification methods: **logistic regression**, **k-Nearest Neighbors (k-NN)**, **Support Vector Machines (SVM)**, and the **Multi-Layer Perceptron (MLP)**. Using practical engineering-inspired examples, we will explore how these models work, how to fit them to data, and how to interpret their results. We will also discuss their limitations and how they can be extended to more complex problems.

---

## What does "Supervised Classification" mean?

In **supervised learning**, we train models on data where each example comes with a known **label**. The model's job is to learn a function that maps inputs to their correct outputs. Once trained, the model can be used to predict labels for new, unseen data.

In **classification**, the labels are **discrete categories**. For example:
- Email: *spam* or *not spam*
- Medical image: *benign* or *malignant*
- Fluid flow: *laminar* or *turbulent*

In this notebook, we focus on **binary classification**, where there are only two possible classes. However, there are methods that can handle multiple categories.
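
As a minimal sketch of the supervised workflow described above (fit on labeled examples, then predict labels for unseen inputs), the snippet below trains a classifier on a made-up one-dimensional dataset. The data values and variable names are illustrative assumptions and are not part of the fluid-flow example used later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one feature, two well-separated classes
X_toy = np.array([[0.0], [1.0], [2.0], [8.0], [9.0], [10.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])  # known labels (the "supervision")

# Training: learn a mapping from inputs to labels
toy_model = LogisticRegression()
toy_model.fit(X_toy, y_toy)

# Prediction: assign labels to new, unseen inputs
X_new = np.array([[1.5], [9.5]])
y_new = toy_model.predict(X_new)
print(y_new)
```

The same `fit` / `predict` pattern is used by every scikit-learn classifier in this notebook.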

---

## Roadmap

For each classification method, we'll follow a consistent structure:
1. **Intuition and how it works** — Understand the basic idea behind the method, key mechanics and parameters.
2. **Visual example or demonstration** — Visual illustration in low dimensions.
3. **Fluid velocity profile application** — The setup for this example will be defined in the next section. We will test the performance of different methods using this example.

By the end of this notebook, you'll have an intuitive and practical understanding of:
- Logistic Regression  
- k-Nearest Neighbors (k-NN)
- Support Vector Machine (SVM)
- Multi-Layer Perceptron (MLP)
- Classification methods comparison

---

Let's get started!


In [None]:
#@title Setup: Import all required libraries

# Core scientific computing
import numpy as np
import matplotlib.pyplot as plt

# Interactive widgets
import ipywidgets as widgets
from ipywidgets import interact, FloatSlider, FloatLogSlider
from IPython.display import display

# Machine learning - datasets
from sklearn.datasets import make_blobs

# Machine learning - preprocessing
from sklearn.preprocessing import PolynomialFeatures

# Machine learning - pipeline
from sklearn.pipeline import make_pipeline

# Machine learning - models
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Machine learning - model selection
from sklearn.model_selection import train_test_split

# Machine learning - metrics
from sklearn.metrics import accuracy_score, confusion_matrix

# Motivating Example: Fluid Flow Classification

To make our exploration concrete, we'll use a **fluid mechanics** example: classifying velocity profiles in pipe flow.

## Problem Setup

**Physical system**: Fluid flowing through a circular pipe (like water through a garden hose).

When we look at a **cross-section** of the pipe, the velocity varies from the center (fastest) to the walls (zero due to friction). We measure the velocity at 51 points across the diameter, creating a **51-dimensional feature vector** for each flow observation.

**Two flow regimes**:
- **Laminar flow**: Smooth, parabolic velocity profile (low flow rates)
- **Turbulent flow**: Flatter, more irregular profile (high flow rates)

## Geometry and Normalization

**Radial position $ y $**: We normalize the pipe radius so $ y \in [-1, 1] $, where:
- $ y = -1 $: left pipe wall
- $ y = 0 $: pipe centerline
- $ y = +1 $: right pipe wall

This normalization makes the problem **scale-independent** — the same model works whether we're analyzing a millimeter capillary or a meter-wide industrial pipe.

**Velocity $ v(y) $**: Normalized by the centerline velocity $ u_{\infty} $ (the maximum speed at the pipe center).

## Mathematical Models

- **Laminar flow** (parabolic profile):  
  $$v(y) = u_{\infty} - y^2 + \varepsilon$$

- **Turbulent flow** (power-law profile):  
  $$v(y) = u_{\infty} - C |y|^{7} + \varepsilon$$

Where $ \varepsilon \sim \mathcal{N}(0, \sigma^2) $ represents measurement noise.

The plot below shows example profiles. Notice how the laminar profile curves smoothly across the whole diameter, while the turbulent profile is flatter in the center due to enhanced mixing and drops off steeply near the walls.

In [None]:
#@title Generating laminar and turbulent velocity profiles

def generate_velocity_profile(flow_type='laminar', u_infty=1.0, C=1.0, sigma=0.01, n_points=51, seed=None):
    """
    Generate a velocity profile across a pipe diameter.

    Parameters:
    - flow_type (str): 'laminar' or 'turbulent'
    - u_infty (float): centerline velocity (maximum speed)
    - C (float): turbulent profile shape constant
    - sigma (float): measurement noise level
    - n_points (int): measurement points across diameter
    - seed (int or None): random seed for reproducibility

    Returns:
    - y (np.ndarray): normalized radial positions [-1, 1]
    - v (np.ndarray): velocity at each position
    """
    if seed is not None:
        np.random.seed(seed)

    # Normalized positions: -1 (left wall) to +1 (right wall)
    y = np.linspace(-1, 1, n_points)

    # Measurement noise
    noise = np.random.normal(loc=0.0, scale=sigma, size=n_points)

    if flow_type == 'laminar':
        v = u_infty - y**2 + noise
    elif flow_type == 'turbulent':
        v = u_infty - C * (np.abs(y))**(7) + noise
    else:
        raise ValueError("flow_type must be either 'laminar' or 'turbulent'")

    return y, v

# Generate both profiles
y, v_laminar = generate_velocity_profile('laminar', sigma=0.02, seed=42)
_, v_turbulent = generate_velocity_profile('turbulent', C=1, sigma=0.02, seed=42)

# Plot
plt.figure(figsize=(8, 4))
plt.plot(y, v_laminar, label='Laminar', linewidth=2)
plt.plot(y, v_turbulent, label='Turbulent', linewidth=2)
plt.xlabel("Normalized radial position y (pipe wall to wall)")
plt.ylabel("Normalized velocity v(y) / u∞")
plt.title("Velocity Profiles Across Pipe Diameter")
plt.axvline(x=-1, color='k', linestyle='--', alpha=0.3, linewidth=1)
plt.axvline(x=1, color='k', linestyle='--', alpha=0.3, linewidth=1)
plt.text(-1, 0.05, 'Left\nwall', ha='center', fontsize=9, alpha=0.6)
plt.text(1, 0.05, 'Right\nwall', ha='center', fontsize=9, alpha=0.6)
plt.text(0, 1.05, 'Center', ha='center', fontsize=9, alpha=0.6)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

# Logistic Regression


## Intuition and how it works

Logistic regression, introduced in the previous workshop, is one of the most basic classification techniques. It fits a sigmoid function to the data by choosing coefficients that maximize the likelihood of the observed labels. Data points can be multi-dimensional, but the input to the sigmoid is always a scalar. For example, assume a 4-dimensional data point $(x_1, x_2, x_3, x_4)$. The decision function used for the classification problem is defined as $d(\vec{x}) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \theta_4 x_4$. The scalar value $d(\vec{x})$ is then passed to the sigmoid $\sigma(d) = 1/(1 + e^{-d})$, which produces the estimated probability of the positive class, $h_{\theta}(\vec{x}) = \sigma(d(\vec{x})) \in (0, 1)$. The decision boundary is the set of points where $d(\vec{x}) = 0$, i.e. where $h_{\theta}(\vec{x}) = 0.5$. In our example, logistic regression relies on a mapping from $\mathbb{R}^4$ to $\mathbb{R}$ when determining the optimal decision boundary (i.e. the coefficients $\theta_i, i \in \{0, 1, 2, 3, 4\}$).
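
To make the two steps concrete (linear decision value, then sigmoid), here is a minimal NumPy sketch. The coefficient values and the data point are made-up assumptions chosen only for illustration; in practice the coefficients are found by fitting the model.

```python
import numpy as np

def sigmoid(d):
    """Logistic function: maps any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-d))

# Hypothetical coefficients (theta[0] is the intercept) and one 4-D data point
theta = np.array([-1.0, 0.5, -0.25, 2.0, 0.1])
x = np.array([1.0, 2.0, 0.5, -3.0])

d = theta[0] + theta[1:] @ x      # scalar decision value d(x)
h = sigmoid(d)                    # estimated probability of class 1
predicted_class = int(h >= 0.5)   # threshold at 0.5, i.e. at d(x) = 0

print(f"d(x) = {d:.3f}, h(x) = {h:.3f}, class = {predicted_class}")
```

Whatever the dimension of the input, the classifier only ever thresholds the single number `d`.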

## Worked-out example

**Example 1** uses logistic regression to fit a decision boundary in the form of:

$$d(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$

However, logistic regression can be used in combination with other kinds of decision boundaries as long as they are linear with respect to their coefficients. In our case, **Example 2** is able to fit a circular decision boundary because it is defined as:

$$d(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1 x_2 + \theta_4 x_1^2 + \theta_5 x_2^2$$

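This form of $d(x_1, x_2)$ gives more flexibility, because the decision boundary $d(x_1, x_2) = 0$ becomes a polynomial curve, here of degree $n=2$. A circle is one such curve: for example, $\theta_0 = -1$, $\theta_4 = \theta_5 = 1$ and all other coefficients set to zero gives $x_1^2 + x_2^2 = 1$.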

In [None]:
#@title Logistic regression Example 1

# Generate linear data
n_samples = 200

# Class 0
x0 = np.random.normal(loc=[1, 1], scale=0.5, size=(n_samples // 2, 2))
# Class 1
x1 = np.random.normal(loc=[3, 3], scale=0.5, size=(n_samples // 2, 2))

X_linear = np.vstack((x0, x1))
y_linear = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Fit logistic regression
model_linear = LogisticRegression()
model_linear.fit(X_linear, y_linear)

# Plot decision boundary
xx, yy = np.meshgrid(np.linspace(-1, 4.5, 300), np.linspace(-1, 4.5, 300))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = model_linear.predict_proba(grid_points)[:, 1].reshape(xx.shape)

plt.figure(figsize=(6, 6))
plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.2, colors=["blue", "red"])
plt.scatter(X_linear[y_linear == 0, 0], X_linear[y_linear == 0, 1], label="Class 0", alpha=0.6)
plt.scatter(X_linear[y_linear == 1, 0], X_linear[y_linear == 1, 1], label="Class 1", alpha=0.6)
plt.title("Linear Boundary: Logistic Regression")
plt.xlabel("x0")
plt.ylabel("x1")
plt.legend()
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()


In [None]:
#@title Logistic regression Example 2
# Generate circular data
np.random.seed(0)
n_samples = 200
radius = 1.0

# Inner circle (class 0)
r0 = radius * np.sqrt(np.random.rand(n_samples // 2))
theta0 = 2 * np.pi * np.random.rand(n_samples // 2)
x0 = np.stack((r0 * np.cos(theta0), r0 * np.sin(theta0)), axis=1)

# Outer ring (class 1)
r1 = radius + 0.5 * np.random.rand(n_samples // 2)
theta1 = 2 * np.pi * np.random.rand(n_samples // 2)
x1 = np.stack((r1 * np.cos(theta1), r1 * np.sin(theta1)), axis=1)

# Combine and label
X_circle = np.vstack((x0, x1))
y_circle = np.array([0] * (n_samples // 2) + [1] * (n_samples // 2))

# Fit logistic regression with polynomial features
model_circle = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
model_circle.fit(X_circle, y_circle)

# Plotting
xx, yy = np.meshgrid(np.linspace(-2, 2, 300), np.linspace(-2, 2, 300))
grid_points = np.c_[xx.ravel(), yy.ravel()]
Z = model_circle.predict_proba(grid_points)[:, 1].reshape(xx.shape)

plt.figure(figsize=(6, 6))
plt.contourf(xx, yy, Z, levels=[0, 0.5, 1], alpha=0.2, colors=["blue", "red"])
plt.scatter(X_circle[y_circle == 0, 0], X_circle[y_circle == 0, 1], label="Class 0", alpha=0.6)
plt.scatter(X_circle[y_circle == 1, 0], X_circle[y_circle == 1, 1], label="Class 1", alpha=0.6)
plt.title("Nonlinear Boundary: Logistic Regression with Polynomial Features")
plt.xlabel("x0")
plt.ylabel("x1")
plt.legend()
plt.grid(True)
plt.axis("equal")
plt.tight_layout()
plt.show()


## Fluid velocity profile application

As mentioned before, logistic regression is not confined to the 2D and 3D cases that we can visualize. The question is: how well can it distinguish between laminar and turbulent velocity profiles, each of which is a 51-dimensional data point?

In [None]:
#@title Flow velocity profile application

# --- Generate labeled dataset ---
n_samples_per_class = 200
n_points = 51
sigma = 0.05  # increased noise to make the task more realistic

X = []
y = []

for i in range(n_samples_per_class):
    _, v_laminar = generate_velocity_profile('laminar', sigma=sigma)
    _, v_turbulent = generate_velocity_profile('turbulent', sigma=sigma)
    X.append(v_laminar)
    y.append(0)
    X.append(v_turbulent)
    y.append(1)

X = np.array(X)
y = np.array(y)

# --- Split into train/test sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Fit logistic regression model ---
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred = clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")

# --- Optional: show confusion matrix ---
# print("Confusion matrix:")
# print(confusion_matrix(y_test, y_pred))

# --- Visualize a few examples ---
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
y_vals = np.linspace(-1, 1, n_points)
for i, ax in enumerate(axes.ravel()):
    idx = i
    ax.plot(y_vals, X_test[idx], label="Velocity profile")
    pred = clf.predict(X_test[idx].reshape(1, -1))[0]
    true = y_test[idx]
    ax.set_title(f"True: {'Laminar' if true == 0 else 'Turbulent'} | Pred: {'Laminar' if pred == 0 else 'Turbulent'}")
    ax.set_xlabel("y")
    ax.set_ylabel("v(y)")
    ax.grid(True)

plt.tight_layout()
plt.show()


As we can see, logistic regression handles this task well: it classifies every flow profile in the test set correctly (100% accuracy).
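
One possible way to see why a linear model is enough here: even a very simple linear feature, the average velocity across the 51 points, already differs clearly between the two regimes. The sketch below computes it for noise-free profiles, assuming $u_{\infty} = 1$, $C = 1$ and the noise level $\sigma = 0.05$ used above. It is an illustration, not a description of what the fitted model actually learned.

```python
import numpy as np

# Noise-free profiles on the same 51-point grid, with u_infty = 1 and C = 1
y_grid = np.linspace(-1, 1, 51)
v_lam = 1.0 - y_grid**2
v_turb = 1.0 - np.abs(y_grid)**7

mean_lam = v_lam.mean()
mean_turb = v_turb.mean()

# Standard deviation of the average of 51 independent noise samples (sigma = 0.05)
noise_on_mean = 0.05 / np.sqrt(51)

print(f"mean laminar   = {mean_lam:.3f}")
print(f"mean turbulent = {mean_turb:.3f}")
print(f"gap = {mean_turb - mean_lam:.3f}, noise on mean = {noise_on_mean:.3f}")
```

The gap between the two averages is many times larger than the noise on the average, so the classes are easy to separate with a linear decision function.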

# k-Nearest Neighbors (k-NN)

## Intuition and how it works

k-Nearest Neighbors (k-NN) works on a very intuitive principle; in fact, its name already explains it. Unlike basic (binary) logistic regression, k-NN handles multi-class classification "out of the box". Here is how k-NN works:

1. Pick a data point you want to classify.
2. Find its k nearest neighbors, i.e. the k training points closest to it.
3. Assign the point the majority class among those k neighbors.

And that is all — nice and simple! The only parameters we need to decide on are **k** and the way that the **distance between points is calculated**.
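
The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration with made-up data and a hypothetical helper name (`knn_predict`), using Euclidean distance. The demos in this notebook use scikit-learn's `KNeighborsClassifier` instead.

```python
import numpy as np

def knn_predict(X_train, y_train, query, k=3):
    """Classify one query point by majority vote among its k nearest neighbors."""
    # Step 2: Euclidean distance from the query to every training point
    distances = np.linalg.norm(X_train - query, axis=1)
    nearest = np.argsort(distances)[:k]
    # Step 3: majority vote (ties go to the smallest label, because of argmax)
    votes = np.bincount(y_train[nearest])
    return int(np.argmax(votes))

# Hypothetical toy data: class 0 near the origin, class 1 near (5, 5)
X_demo = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]], dtype=float)
y_demo = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_demo, y_demo, np.array([0.5, 0.5]), k=3))
print(knn_predict(X_demo, y_demo, np.array([5.5, 5.5]), k=3))
```

Note that there is no fitting step: all the work happens when a query arrives.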

## Distance metrics

Different datasets and problem domains can benefit from different distance metrics. Here are some common options:

- **Euclidean distance** (default): Good for continuous, geometrically well-behaved data.
- **Manhattan distance**: More robust when the data contains outliers or if dimensions are not directly comparable.
- **Minkowski distance**: A generalization of both Euclidean and Manhattan (you can tune its power parameter).
- **Cosine similarity**: Often used for text or high-dimensional sparse data, where direction matters more than magnitude.

Choosing the right distance metric helps k-NN reflect the true "closeness" between data points in a meaningful way for your specific problem.
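
For reference, the metrics listed above can be computed directly with NumPy. The two vectors are arbitrary example values, and `minkowski` is a hypothetical helper written for this sketch.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))
manhattan = np.sum(np.abs(a - b))

def minkowski(u, v, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return np.sum(np.abs(u - v) ** p) ** (1.0 / p)

# Cosine similarity compares directions; 1 - similarity is used as a distance
cosine_similarity = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
cosine_distance = 1.0 - cosine_similarity

print(euclidean, manhattan, minkowski(a, b, 3), cosine_distance)
```

In scikit-learn, the metric is selected through the `metric` (and, for Minkowski, `p`) parameters of `KNeighborsClassifier`.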

## Advantages and disadvantages

**Advantages:**

- Very simple and intuitive.
- Naturally handles multi-class classification.
- No training phase — the model just stores the data.

**Disadvantages:**

- Computationally expensive at prediction time (especially for large datasets).
- Performance can degrade in high-dimensional spaces ("curse of dimensionality").
- Sensitive to irrelevant or redundant features and scaling of data.
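
The sensitivity to scaling can be shown with a deliberately tiny, made-up example in which the two features have very different numeric ranges. The data values are assumptions chosen so that the nearest neighbor changes once the features are standardized. Which answer is "right" depends on the problem.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical data: feature 1 spans ~1 unit, feature 2 spans ~100 units
X_small = np.array([[0.0, 100000.0],   # class 0
                    [1.0, 100100.0]])  # class 1
y_small = np.array([0, 1])
query = np.array([[0.1, 100060.0]])

# Without scaling, the second feature dominates the Euclidean distance
raw_knn = KNeighborsClassifier(n_neighbors=1).fit(X_small, y_small)
raw_pred = raw_knn.predict(query)[0]

# With standardization, both features contribute on a comparable scale
scaled_knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=1))
scaled_knn.fit(X_small, y_small)
scaled_pred = scaled_knn.predict(query)[0]

print(f"raw: {raw_pred}, scaled: {scaled_pred}")
```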

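Play around with the demo below and try to recreate different behaviors by changing **k** and moving the query point between the differently sized clusters. With small values of k, only a few neighbors decide the outcome, so the classification can be noisy (overfitting). With high values of k, the model smooths over finer details and can outright reject classes that have only a few points in the data set.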


In [None]:
#@title Interactive k-NN demo
%matplotlib inline


# === Generate dataset ===
centers = [(-2, -2), (2, 0), (-1, 4), (3, 7)]
cluster_std = [1.0, 1.5, 0.5, 1.0]
X, y_labels = make_blobs(n_samples=[100, 200, 5, 100], centers=centers, cluster_std=cluster_std, random_state=42)

# === Fit once globally ===
knn_global = KNeighborsClassifier()
knn_global.fit(X, y_labels)

# === Define plotting function ===
def plot_knn_decision_boundary(x=0.0, y=0.0, k=5, show_lines=True):
    # Convert slider inputs to native float (in case they are numpy scalars)
    x = float(x)
    y = float(y)

    # Refit classifier with selected k
    knn_global.set_params(n_neighbors=k)
    knn_global.fit(X, y_labels)

    # Mesh grid for decision boundary
    h = 0.1
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = knn_global.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Query point and neighbors
    query_point = np.array([[x, y]])
    query_class = knn_global.predict(query_point)[0]
    print(f"Predicted class: {query_class}")
    distances, indices = knn_global.kneighbors(query_point)
    neighbor_pts = X[indices[0]]

    # Define class colors
    color_map = {
    0: 'blue',
    1: 'red',
    2: 'pink',
    3: 'cyan',
    }

    # Plotting
    fig, ax = plt.subplots(figsize=(8, 7))
    ax.contourf(xx, yy, Z, alpha=0.3, cmap=plt.cm.tab10)
    ax.scatter(X[:, 0], X[:, 1], c=y_labels, cmap=plt.cm.tab10, edgecolor='k', s=50)

    if show_lines:
        for pt in neighbor_pts:
            ax.plot([x, pt[0]], [y, pt[1]], 'k--', linewidth=1)

    ax.scatter(neighbor_pts[:, 0], neighbor_pts[:, 1],
               facecolors='none', edgecolors='black', s=150, linewidths=2, label='Neighbors')

    ax.scatter(x, y, c=color_map[query_class], edgecolor='k', s=200, marker='o', label='Query Point')

    ax.set_title(f"Interactive k-NN Classification (k={k})", fontsize=14)
    ax.set_xlabel("x")
    ax.set_ylabel("y")
    ax.legend()
    ax.grid(True)
    plt.tight_layout()
    plt.show()


# === Sliders ===
x_slider = widgets.FloatSlider(value=2.0, min=-4.0, max=6.0, step=0.1, description='x:')
y_slider = widgets.FloatSlider(value=6.0, min=-4.0, max=9.0, step=0.1, description='y:')
k_slider = widgets.IntSlider(value=5, min=1, max=20, step=1, description='k:')
lines_toggle = widgets.Checkbox(value=True, description='Show lines to neighbors')

interactive_plot = widgets.interactive_output(
    plot_knn_decision_boundary,
    {'x': x_slider, 'y': y_slider, 'k': k_slider, 'show_lines': lines_toggle}
)

ui = widgets.VBox([x_slider, y_slider, k_slider, lines_toggle])
display(ui, interactive_plot)


In [None]:
#@title Flow velocity profile application

# --- Generate labeled dataset ---
n_samples_per_class = 200
n_points = 51
sigma = 0.05  # noise level

X = []
y = []

for i in range(n_samples_per_class):
    _, v_laminar = generate_velocity_profile('laminar', sigma=sigma)
    _, v_turbulent = generate_velocity_profile('turbulent', sigma=sigma)
    X.append(v_laminar)
    y.append(0)
    X.append(v_turbulent)
    y.append(1)

X = np.array(X)
y = np.array(y)

# --- Split into train/test sets ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Fit k-NN model ---
k = 5
knn = KNeighborsClassifier(n_neighbors=k)
knn.fit(X_train, y_train)

# --- Evaluate the model ---
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"Accuracy: {acc:.2f}")

# --- Optional: show confusion matrix ---
# print("Confusion matrix:")
# print(confusion_matrix(y_test, y_pred))

# --- Visualize a few examples ---
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
y_vals = np.linspace(-1, 1, n_points)
for i, ax in enumerate(axes.ravel()):
    idx = i
    ax.plot(y_vals, X_test[idx], label="Velocity profile")
    pred = knn.predict(X_test[idx].reshape(1, -1))[0]
    true = y_test[idx]
    ax.set_title(f"True: {'Laminar' if true == 0 else 'Turbulent'} | Pred: {'Laminar' if pred == 0 else 'Turbulent'}")
    ax.set_xlabel("y")
    ax.set_ylabel("v(y)")
    ax.grid(True)

plt.tight_layout()
plt.show()


Once again, k-NN manages to classify the flows just fine.

# Support Vector Machine (SVM)

Another supervised machine learning classification method is the Support Vector Machine (SVM). The main idea behind it is to separate two classes using a hyperplane. Examples of hyperplanes are:

- 1D: point $h_{\theta}(x_1) = \theta_0 + \theta_1 x_1 = 0$
- 2D: line $h_{\theta}(x_1, x_2) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 = 0$
- 3D: plane $h_{\theta}(x_1, x_2, x_3) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 = 0$
- ... and so on
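
As a small illustration of how a hyperplane splits the space, the sign of $h_{\theta}$ tells us on which side a point lies, and the hyperplane itself is where $h_{\theta}$ equals zero. The coefficients and points below are made-up assumptions, and `hyperplane_value` is a hypothetical helper.

```python
import numpy as np

def hyperplane_value(theta, x):
    """Evaluate h_theta(x) = theta_0 + theta_1*x_1 + ... for one point x."""
    return theta[0] + np.dot(theta[1:], x)

# Hypothetical 2D hyperplane (a line): -1 + x1 + x2 = 0
theta_line = np.array([-1.0, 1.0, 1.0])

for pt in [np.array([2.0, 2.0]), np.array([0.0, 0.0]), np.array([0.5, 0.5])]:
    value = hyperplane_value(theta_line, pt)
    # sign > 0: one side, sign < 0: other side, sign == 0: on the hyperplane
    print(pt, value, np.sign(value))
```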

One advantage of SVM is that the optimization problem it poses is a convex **quadratic programming** problem, which has a single global optimum and can be solved efficiently with well-established solvers. Contrast this with models trained by gradient descent on a non-convex objective, such as neural networks, which need many small updates and may end up in different solutions depending on initialization. Because of this, SVM is very efficient for classifying **linearly separable** data.

Even though the restriction to linearly separable data seems like a limitation, we will later show how this approach can still be applied to data that is not linearly separable, through kernel methods. However, we will not cover the mathematical details of SVMs, as that is beyond the scope of this workshop.

---

Below is an example of two clusters that need to be separated with a line. What makes one line better than another?

In [None]:
#@title Figure 1
from IPython.display import display, Image, SVG, Video
display(SVG(filename='/content/decision-boundaries.svg'))

Typically, the line on the left in Figure 1 appears to divide the clusters in the most natural way. This is also how SVM attempts to separate classes: by finding the hyperplane that maximizes the **margin** between the two classes (see Figure 2).

SVM’s objective is to find a decision boundary that:

- maximizes the margin between the nearest points of the two classes,
- and still correctly classifies the training examples.

The points that lie closest to the hyperplane and define the margin are called **support vectors** (see Figure 2). They are critical because the margin is "pushed" up against them from both sides.
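
The margin and the support vectors can also be read off a fitted scikit-learn model. The sketch below uses a made-up, perfectly separable dataset and a very large `C` so that no training point is allowed inside the margin. With the decision function scaled so that the margin lines sit at $\pm 1$, the margin width is $2 / \lVert w \rVert$.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, perfectly separable toy data
X_sv = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_sv = np.array([-1, -1, 1, 1])

# A very large C is assumed here so that margin violations are not tolerated
svm_toy = SVC(kernel="linear", C=1e6).fit(X_sv, y_sv)

w = svm_toy.coef_[0]                     # normal vector of the hyperplane
b = svm_toy.intercept_[0]
margin_width = 2.0 / np.linalg.norm(w)   # distance between the two margin lines

print("w =", w, "b =", b)
print("margin width =", margin_width)
print("support vectors:\n", svm_toy.support_vectors_)
```

For this data the two classes are 2 units apart along the first axis, so the computed margin width should come out close to 2.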

In [None]:
#@title Figure 2
display(SVG(filename='/content/terminology.svg'))

The simple demo below presents the core idea behind SVM. Try to manually find a line that separates the classes while maximizing the margin.

In [None]:
#@title SVM margin demo

# --- Generate two clusters ---
np.random.seed(1)
n_points = 10
cluster_1 = np.random.randn(n_points, 2) + np.array([-2, 4])
cluster_2 = np.random.randn(n_points, 2) + np.array([2, -4])
X = np.vstack((cluster_1, cluster_2))
y = np.array([1]*n_points + [-1]*n_points)

def svm_margin_plot(slope):
    w = -np.array([slope, -1])  # Normal vector to decision boundary
    w = w / np.linalg.norm(w)  # Normalize

    # Project points onto normal vector
    projections = X @ w
    margin_pos = projections[y == 1].min()
    margin_neg = projections[y == -1].max()
    margin = 0.5 * (margin_pos - margin_neg)
    decision_offset = 0.5 * (margin_pos + margin_neg)

    def plot_margin_line(ax, offset, style='k--', label=None):
        midpoint = offset * w
        direction = np.array([-w[1], w[0]])  # Perpendicular direction
        line_points = np.array([midpoint + t * direction for t in [-10, 10]])
        ax.plot(line_points[:, 0], line_points[:, 1], style, label=label)

    # Plotting
    fig, ax = plt.subplots(figsize=(8, 6))
    ax.scatter(cluster_1[:, 0], cluster_1[:, 1], c='red', label='Class +1')
    ax.scatter(cluster_2[:, 0], cluster_2[:, 1], c='blue', label='Class -1')

    plot_margin_line(ax, decision_offset, style='k-', label='Decision boundary')
    plot_margin_line(ax, margin_pos, style='k--', label='Margin boundaries')
    plot_margin_line(ax, margin_neg, style='k--')

    # Highlight support vectors
    sv_pos = X[y == 1][np.argmin(projections[y == 1])]
    sv_neg = X[y == -1][np.argmax(projections[y == -1])]
    ax.scatter(*sv_pos, s=150, facecolors='none', edgecolors='red', linewidths=2, label='Support vectors')
    ax.scatter(*sv_neg, s=150, facecolors='none', edgecolors='blue', linewidths=2)

    ax.set_xlim(-10, 10)
    ax.set_ylim(-10, 15)
    ax.set_aspect('equal', adjustable='box')  # Equal axis scaling so right angles are not distorted
    ax.set_title(f"SVM Margin and Decision Boundary (slope = {slope:.2f})")
    ax.set_xlabel("x₁")
    ax.set_ylabel("x₂")
    ax.grid(True)
    ax.legend()
    plt.tight_layout()
    plt.show()

# Create interactive slider
interact(svm_margin_plot, slope=FloatSlider(value=-0.5, min=-1, max=20, step=0.1));

## Hard and Soft Margins

So far, we have looked at the case where two clusters can be perfectly separated by a line. However, in many real-world datasets the classes overlap, so misclassifications are inevitable and a perfect separation is not possible. In such cases, the SVM formulation described so far has no valid solution, because no line classifies every training point correctly.

What we need is some flexibility in the margin — a way to tolerate a small number of misclassified points when determining the separating hyperplane. This approach is known as the **soft margin**, as opposed to the **hard margin**, which assumes that the data is linearly separable with no exceptions.

The **soft margin** SVM introduces a mechanism to allow certain violations of the margin — that is, to allow some points to fall on the wrong side of the boundary or within the margin. This makes it more robust and better suited for real-world data.

The trade-off between maximizing the margin and minimizing classification errors is controlled by the parameter $C$. This parameter determines how much **penalty** is assigned to misclassified points:

- A **large $C$** forces the model to classify all training examples correctly (hard margin behavior), possibly at the cost of a smaller margin.
- A **small $C$** allows for more margin violations (soft margin), leading to a wider margin and potentially better generalization.

Soft-margin SVM is the **default** behavior in most machine learning libraries and toolboxes.

Below is a plot that shows the effects of different $C$ values. A hard margin can be approximated by using a very large $C$.
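
To put numbers on this trade-off, the following sketch fits a linear SVM for several values of $C$ on a small made-up dataset and reports the margin width $2 / \lVert w \rVert$. The data points are assumptions chosen so that the closest points of opposite classes are exactly 1 unit apart.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data on a line; closest opposite-class points are 1 unit apart
X_c = np.array([[-3.0, 0.0], [-2.0, 0.0], [-0.5, 0.0],
                [0.5, 0.0], [2.0, 0.0], [3.0, 0.0]])
y_c = np.array([-1, -1, -1, 1, 1, 1])

margin_widths = {}
for C in [0.01, 1.0, 100.0]:
    model = SVC(kernel="linear", C=C).fit(X_c, y_c)
    # Margin width for a decision function scaled to +/-1 at the margin lines
    margin_widths[C] = 2.0 / np.linalg.norm(model.coef_[0])
    print(f"C = {C:>6}: margin width = {margin_widths[C]:.2f}")
```

With the largest $C$ the margin shrinks to the gap between the closest points, while small values of $C$ give a much wider margin that several training points fall inside.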


In [None]:
#@title Hard vs soft margin plot
import numpy as np
import matplotlib.pyplot as plt
from sklearn.svm import SVC

# --- Generate linearly separable data ---
np.random.seed(0)
n_samples = 20
cluster_1 = np.random.randn(n_samples, 2) + np.array([2, 2])
cluster_2 = np.random.randn(n_samples, 2) + np.array([-2, -2])
X = np.vstack((cluster_1, cluster_2))
y = np.array([1]*n_samples + [-1]*n_samples)

# --- Fit hard margin and soft margin SVM ---
svm_hard = SVC(kernel='linear', C=1e10)
svm_soft = SVC(kernel='linear', C=0.1)
svm_hard.fit(X, y)
svm_soft.fit(X, y)

# --- Create mesh for decision boundary visualization ---
h = 0.05
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]

Z_hard = svm_hard.decision_function(grid).reshape(xx.shape)
Z_soft = svm_soft.decision_function(grid).reshape(xx.shape)

# --- Plotting ---
fig, axs = plt.subplots(1, 2, figsize=(12, 5))

titles = ['Hard Margin SVM (C=1e10)', 'Soft Margin SVM (C=0.1)']
svms = [svm_hard, svm_soft]
Zs = [Z_hard, Z_soft]

for ax, title, svm, Z in zip(axs, titles, svms, Zs):
    ax.contourf(xx, yy, Z > 0, alpha=0.2, cmap=plt.cm.coolwarm)
    ax.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

    ax.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    ax.scatter(svm.support_vectors_[:, 0],
               svm.support_vectors_[:, 1],
               s=150, facecolors='none', edgecolors='k', linewidths=1.5, label='Support Vectors')

    ax.set_title(title)
    ax.set_xlabel("x₁")
    ax.set_ylabel("x₂")
    ax.grid(True)
    ax.legend()

plt.tight_layout()
plt.show()


Now, what happens if the data is not fully separable because of noise and outliers? In that case the hard-margin version of SVM has no solution, whereas the soft-margin SVM can still handle the data. It is left to the user, however, to choose a value of $C$ that yields a decision boundary which generalizes to the 'real' data distribution.

In [None]:
#@title Non-linearly separable data with soft margin SVM demo

# --- Generate non-linearly separable data ---
np.random.seed(1)
n_samples = 20
cluster_1 = 1.5 * np.random.randn(n_samples, 2) + np.array([1, 1])
cluster_2 = np.random.randn(n_samples, 2) + np.array([-1, -1])
X = np.vstack((cluster_1, cluster_2))
y = np.array([1] * n_samples + [-1] * n_samples)

# --- Create mesh grid for decision boundaries ---
h = 0.05
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]

def plot_svm(C):
    clf = SVC(kernel='linear', C=C)
    clf.fit(X, y)
    Z = clf.decision_function(grid).reshape(xx.shape)

    plt.figure(figsize=(6, 5))
    plt.contourf(xx, yy, Z > 0, alpha=0.2, cmap=plt.cm.coolwarm)
    plt.contour(xx, yy, Z, colors='k', levels=[-1, 0, 1], linestyles=['--', '-', '--'])

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolors='k')
    plt.scatter(clf.support_vectors_[:, 0],
                clf.support_vectors_[:, 1],
                s=150, facecolors='none', edgecolors='k', linewidths=1.5, label='Support Vectors')

    plt.title(f"Soft Margin SVM (C={C:.3f})")
    plt.xlabel("x₁")
    plt.ylabel("x₂")
    plt.grid(True)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Create interactive slider (log scale for better exploration)
interact(plot_svm, C=FloatLogSlider(value=1.0, base=10, min=-2, max=3, step=0.1, description='C'));


## Non-linear data

At this point, it might seem like we’ve put a lot of effort into a method that can only separate data with a straight line. Fortunately, SVM has a powerful technique to handle more complex scenarios. When the raw data is not linearly separable, SVM can map it into a higher-dimensional space where a hyperplane *can* separate the clusters.

This idea is closely tied to the "kernel trick", one of the key strengths of SVM: a kernel function lets the SVM work in the higher-dimensional space without computing the mapping explicitly. The figure below illustrates how mapping to a higher-dimensional space can make separation possible even when it looks impossible in the original feature space. The original data set $(x_1, x_2)$ is mapped into 3D space via $(x_1, x_2) \mapsto (z_1, z_2, z_3) := (x_1, x_2, x_1^2 + x_2^2)$.
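
The mapping above can be tried out directly. The sketch below builds two concentric circles (a made-up dataset, not the one in the figure), applies the mapping by hand, and compares a linear SVM before and after. Building the third coordinate explicitly is done here only for illustration.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical data: two concentric circles (inner = class 0, outer = class 1)
angles = np.linspace(0, 2 * np.pi, 50, endpoint=False)
inner = 0.5 * np.column_stack((np.cos(angles), np.sin(angles)))
outer = 1.5 * np.column_stack((np.cos(angles), np.sin(angles)))
X_2d = np.vstack((inner, outer))
y_rings = np.array([0] * 50 + [1] * 50)

# Explicit mapping (x1, x2) -> (x1, x2, x1^2 + x2^2)
z3 = (X_2d ** 2).sum(axis=1, keepdims=True)
X_3d = np.hstack((X_2d, z3))

# Training accuracy of a linear SVM in the original and in the mapped space
acc_2d = SVC(kernel="linear").fit(X_2d, y_rings).score(X_2d, y_rings)
acc_3d = SVC(kernel="linear").fit(X_3d, y_rings).score(X_3d, y_rings)
print(f"linear SVM in 2D: {acc_2d:.2f}, after mapping to 3D: {acc_3d:.2f}")
```

In the mapped space the third coordinate alone separates the circles, so a flat plane is enough. In practice one would usually pass a kernel such as `kernel='rbf'` to `SVC` rather than construct the mapping by hand.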



In [None]:
#@title Figure 3
display(Image(filename='/content/higher-space-mapping.png'))

Source: *Data-Driven Science and Engineering: Machine Learning, Dynamical Systems, and Control*, S. L. Brunton and J. N. Kutz, p. 236.

In [None]:
#@title Flow velocity profile classification using SVM

# --- Generate labeled dataset ---
n_samples_per_class = 200
n_points = 51
sigma = 0.05  # same noise level

X = []
y = []

for _ in range(n_samples_per_class):
    _, v_laminar = generate_velocity_profile('laminar', sigma=sigma)
    _, v_turbulent = generate_velocity_profile('turbulent', sigma=sigma)
    X.append(v_laminar)
    y.append(0)
    X.append(v_turbulent)
    y.append(1)

X = np.array(X)
y = np.array(y)

# --- Train-test split ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Fit SVM classifier ---
svm_clf = SVC(kernel='linear')  # or try 'rbf' for more flexibility
svm_clf.fit(X_train, y_train)

# --- Evaluate ---
y_pred = svm_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"SVM Accuracy: {acc:.2f}")

# print("Confusion matrix:")
# print(confusion_matrix(y_test, y_pred))

# --- Visualize a few test samples ---
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
y_vals = np.linspace(-1, 1, n_points)

for i, ax in enumerate(axes.ravel()):
    idx = i
    ax.plot(y_vals, X_test[idx], label="Velocity profile")
    pred = svm_clf.predict(X_test[idx].reshape(1, -1))[0]
    true = y_test[idx]
    ax.set_title(f"True: {'Laminar' if true == 0 else 'Turbulent'} | Pred: {'Laminar' if pred == 0 else 'Turbulent'}")
    ax.set_xlabel("y")
    ax.set_ylabel("v(y)")
    ax.grid(True)

plt.tight_layout()
plt.show()


The linear SVM reaches 100% accuracy on the test set for the laminar vs. turbulent flow classification.

# Multi-Layer Perceptron (MLP)

**Multi-Layer Perceptron (MLP)** is by far the most versatile and complex technique covered in this session. An MLP consists of many interconnected nodes (neurons) organized into layers: an input layer, one or more hidden layers, and an output layer. Thanks to this layered structure, MLPs can be used for a wide range of problems, including **regression**, **classification**, and more advanced tasks like **time series prediction** or **image recognition**. Most importantly, MLPs can learn complex non-linear functions that linear models cannot capture.
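
To make the layered structure concrete, the sketch below trains a small `MLPClassifier` and then recomputes its output by hand from the learned weights. It is a minimal illustration under stated assumptions: `make_moons` is a stand-in dataset, the layer sizes `(16, 8)` are arbitrary, and `forward` is a hypothetical helper written for this example. It relies on scikit-learn using a logistic output unit for binary classification.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neural_network import MLPClassifier

# Toy binary dataset (assumption: stand-in for real data)
X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)

mlp_toy = MLPClassifier(hidden_layer_sizes=(16, 8), activation='relu',
                        max_iter=2000, random_state=0)
mlp_toy.fit(X_toy, y_toy)

def forward(X, weights, biases):
    """Hypothetical helper: manual forward pass through a ReLU MLP
    with a single logistic output unit (binary classification)."""
    a = X
    # Hidden layers: affine map followed by the ReLU non-linearity
    for W, b in zip(weights[:-1], biases[:-1]):
        a = np.maximum(0.0, a @ W + b)
    # Output layer: affine map followed by the logistic (sigmoid) function
    z = (a @ weights[-1] + biases[-1]).ravel()
    return 1.0 / (1.0 + np.exp(-z))

p_manual = forward(X_toy, mlp_toy.coefs_, mlp_toy.intercepts_)
p_sklearn = mlp_toy.predict_proba(X_toy)[:, 1]  # probability of class 1

print("Layer weight shapes:", [W.shape for W in mlp_toy.coefs_])
print("Manual forward pass matches predict_proba:", np.allclose(p_manual, p_sklearn))
```

Each entry of `coefs_` connects one layer to the next, and the non-linearity applied between the layers is what lets the network represent curved decision boundaries.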

However, this flexibility comes with the challenge of selecting appropriate hyperparameters, which significantly affect performance:

- **Hidden Layer Size**: The number of neurons in each hidden layer (e.g. one layer `(100,)` vs. two layers `(100, 50)`). More neurons and more layers increase model capacity but also the risk of overfitting.

- **Number of Hidden Layers**: Increasing the depth (i.e. number of layers) enables learning more complex patterns, but also makes training harder.

- **Activation Function**: Determines the non-linearity at each neuron. Common choices include `'relu'`, `'tanh'`, and `'logistic'`.

- **Learning Rate**: Controls how fast the model updates its weights during training. Too high can make learning unstable; too low can slow convergence.

- **Regularization (alpha)**: Prevents overfitting by penalizing large weights. A higher alpha applies stronger regularization.

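- **Max Iterations**: The maximum number of training iterations. For the stochastic solvers (`'sgd'`, `'adam'`) this is the number of epochs, i.e. passes over the training data. If the model doesn't converge, you might need to increase this.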

- **Solver**: The optimization algorithm used (e.g., `'adam'`, `'sgd'`, `'lbfgs'`). `'adam'` is often a good default for most tasks.

Choosing good hyperparameters often requires experimentation and sometimes cross-validation. In practice, using tools like `GridSearchCV` or `RandomizedSearchCV` can help automate this process. We will cover MLPs in more detail in the upcoming sessions.
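
As a rough illustration of the search described above, the sketch below runs `GridSearchCV` over two of the listed hyperparameters. The dataset (`make_moons`), the candidate values in `param_grid`, and the 3-fold setting are assumptions picked to keep the example fast. They are not recommendations.

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neural_network import MLPClassifier

# Toy data (assumption: stand-in for a real dataset)
X_toy, y_toy = make_moons(n_samples=200, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)

# Illustrative candidate values for layer layout and regularization strength
param_grid = {
    'hidden_layer_sizes': [(20,), (20, 20)],
    'alpha': [1e-4, 1e-2],
}

# Every combination is scored with 3-fold cross-validation on the training set
search = GridSearchCV(
    MLPClassifier(activation='relu', solver='adam', max_iter=1000, random_state=0),
    param_grid,
    cv=3,
    scoring='accuracy',
)
search.fit(X_tr, y_tr)

# The best combination is refit on the full training set and scored on held-out data
test_acc = search.score(X_te, y_te)
print("Best parameters:", search.best_params_)
print(f"Best cross-validated accuracy: {search.best_score_:.2f}")
print(f"Test accuracy of the refit model: {test_acc:.2f}")
```

The cost grows with the number of combinations times the number of folds, so for larger grids `RandomizedSearchCV` samples a fixed number of candidates instead of trying them all.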

In the following example, we see how an MLP can classify a highly non-linear dataset, something that linear models such as plain logistic regression cannot do.

In [None]:
#@title MLP example

# --- Generate two spiral dataset ---
def generate_spiral(n_points, noise=0.2):
    theta = np.sqrt(np.random.rand(n_points)) * 4 * np.pi  # Angle
    r = theta
    x1 = r * np.cos(theta) + np.random.randn(n_points) * noise
    y1 = r * np.sin(theta) + np.random.randn(n_points) * noise

    x2 = -r * np.cos(theta) + np.random.randn(n_points) * noise
    y2 = -r * np.sin(theta) + np.random.randn(n_points) * noise

    X = np.vstack((np.column_stack((x1, y1)), np.column_stack((x2, y2))))
    y = np.array([0]*n_points + [1]*n_points)
    return X, y

np.random.seed(42)  # fix the seed so the generated spirals are reproducible
X, y = generate_spiral(500, noise=0.15)

# --- Split data ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# --- Train MLP ---
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), activation='relu', max_iter=8000, random_state=42)
mlp.fit(X_train, y_train)

# --- Accuracy ---
y_pred = mlp.predict(X_test)
print(f"Test Accuracy: {accuracy_score(y_test, y_pred):.2f}")

# --- Plot decision boundary ---
def plot_decision_boundary(model, X, y):
    h = 0.05
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    plt.figure(figsize=(8, 6))
    plt.contourf(xx, yy, Z, cmap=plt.cm.Spectral, alpha=0.4)
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Spectral, edgecolors='k')
    plt.title("MLP Classification of Two Spirals")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.grid(True)
    plt.show()

plot_decision_boundary(mlp, X, y)


In [None]:
#@title Flow velocity profile classification using MLP

# --- Generate labeled dataset ---
n_samples_per_class = 200
n_points = 51
sigma = 0.05  # same noise level

X = []
y = []

for _ in range(n_samples_per_class):
    _, v_laminar = generate_velocity_profile('laminar', sigma=sigma)
    _, v_turbulent = generate_velocity_profile('turbulent', sigma=sigma)
    X.append(v_laminar)
    y.append(0)
    X.append(v_turbulent)
    y.append(1)

X = np.array(X)
y = np.array(y)

# --- Train-test split ---
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# --- Fit MLP classifier ---
mlp_clf = MLPClassifier(hidden_layer_sizes=(100, 50), activation='relu', max_iter=3000, random_state=42)
mlp_clf.fit(X_train, y_train)

# --- Evaluate ---
y_pred = mlp_clf.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f"MLP Accuracy: {acc:.2f}")

# print("Confusion matrix:")
# print(confusion_matrix(y_test, y_pred))

# --- Visualize a few test samples ---
fig, axes = plt.subplots(2, 3, figsize=(12, 6))
y_vals = np.linspace(-1, 1, n_points)

for i, ax in enumerate(axes.ravel()):
    idx = i
    ax.plot(y_vals, X_test[idx], label="Velocity profile")
    pred = mlp_clf.predict(X_test[idx].reshape(1, -1))[0]
    true = y_test[idx]
    ax.set_title(f"True: {'Laminar' if true == 0 else 'Turbulent'} | Pred: {'Laminar' if pred == 0 else 'Turbulent'}")
    ax.set_xlabel("y")
    ax.set_ylabel("v(y)")
    ax.grid(True)

plt.tight_layout()
plt.show()


Finally, MLP also achieves 100% accuracy on the fluid flow example.

# Classification methods comparison

Looking at the velocity profile classification shown throughout this notebook, we see that all of the methods achieved 100% accuracy. It is often a good idea to start with the simplest and most efficient method. If the results are not satisfactory for a given application, try another method. Some hyperparameter tinkering is often required before results improve, and this is especially true for MLPs. Hence, give new methods some time when testing them, as they tend to yield better results after some tuning.


Below, you can find decision boundaries obtained with logistic regression (on degree-5 polynomial features), k-NN, SVM (RBF kernel) and MLP. Take a moment to compare the plots and judge each method's classification capacity and the overall quality of its decision boundary.

In [None]:
#@title Classification methods visualization with Accuracy

# --- Generate two spiral dataset ---
def generate_spiral(n_points, noise=0.2):
    theta = np.sqrt(np.random.rand(n_points)) * 4 * np.pi  # Angle
    r = theta
    x1 = r * np.cos(theta) + np.random.randn(n_points) * noise
    y1 = r * np.sin(theta) + np.random.randn(n_points) * noise

    x2 = -r * np.cos(theta) + np.random.randn(n_points) * noise
    y2 = -r * np.sin(theta) + np.random.randn(n_points) * noise

    X = np.vstack((np.column_stack((x1, y1)), np.column_stack((x2, y2))))
    y = np.array([0]*n_points + [1]*n_points)
    return X, y

# --- Generate dataset ---
np.random.seed(42)
X, y = generate_spiral(300, noise=0.25)

# --- Create mesh grid for plotting ---
h = 0.1
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]

# --- Logistic Regression with Polynomial Features ---
logreg = make_pipeline(PolynomialFeatures(degree=5), LogisticRegression(max_iter=15000))
logreg.fit(X, y)
Z_logreg = logreg.predict(grid).reshape(xx.shape)
acc_logreg = accuracy_score(y, logreg.predict(X))

# --- k-Nearest Neighbors ---
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X, y)
Z_knn = knn.predict(grid).reshape(xx.shape)
acc_knn = accuracy_score(y, knn.predict(X))

# --- Multi-Layer Perceptron ---
mlp = MLPClassifier(hidden_layer_sizes=(100, 100), activation='relu', max_iter=8000, random_state=42)
mlp.fit(X, y)
Z_mlp = mlp.predict(grid).reshape(xx.shape)
acc_mlp = accuracy_score(y, mlp.predict(X))

# --- SVM with RBF kernel ---
svm = SVC(kernel='rbf', gamma='scale', C=100000.0)
svm.fit(X, y)
Z_svm = svm.predict(grid).reshape(xx.shape)
acc_svm = accuracy_score(y, svm.predict(X))

# --- Plot all models ---
fig, axs = plt.subplots(2, 2, figsize=(12, 10))

# Logistic Regression
axs[0, 0].contourf(xx, yy, Z_logreg, alpha=0.3, cmap=plt.cm.coolwarm)
axs[0, 0].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k', s=30)
axs[0, 0].set_title(f"Logistic Regression\nAcc: {acc_logreg:.2f}")
axs[0, 0].set_xlabel("x")
axs[0, 0].set_ylabel("y")
axs[0, 0].grid(True)
axs[0, 0].set_aspect('equal')

# k-NN
axs[0, 1].contourf(xx, yy, Z_knn, alpha=0.3, cmap=plt.cm.coolwarm)
axs[0, 1].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k', s=30)
axs[0, 1].set_title(f"k-NN (k=5)\nAcc: {acc_knn:.2f}")
axs[0, 1].set_xlabel("x")
axs[0, 1].set_ylabel("y")
axs[0, 1].grid(True)
axs[0, 1].set_aspect('equal')

# SVM
axs[1, 0].contourf(xx, yy, Z_svm, alpha=0.3, cmap=plt.cm.coolwarm)
axs[1, 0].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k', s=30)
axs[1, 0].set_title(f"SVM (RBF kernel)\nAcc: {acc_svm:.2f}")
axs[1, 0].set_xlabel("x")
axs[1, 0].set_ylabel("y")
axs[1, 0].grid(True)
axs[1, 0].set_aspect('equal')

# MLP
axs[1, 1].contourf(xx, yy, Z_mlp, alpha=0.3, cmap=plt.cm.coolwarm)
axs[1, 1].scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.coolwarm, edgecolor='k', s=30)
axs[1, 1].set_title(f"MLP (100,100) relu\nAcc: {acc_mlp:.2f}")
axs[1, 1].set_xlabel("x")
axs[1, 1].set_ylabel("y")
axs[1, 1].grid(True)
axs[1, 1].set_aspect('equal')

plt.suptitle("Classification Comparison on Two-Spiral Dataset", fontsize=16)
plt.tight_layout(rect=[0, 0, 1, 0.96])
plt.show()


At first glance, we can see that logistic regression has the lowest accuracy score, which is reflected in a boundary that fits the training data poorly. Even with degree-5 polynomial features, its decision boundary is too simple to follow the spirals. On the other hand, k-NN, SVM and MLP all achieve 100% accuracy. Note that these scores are computed on the same data the models were fitted to, so they show how well each model fits the data rather than how well it generalizes. The three methods also differ in the way new points get classified. If we wanted to sample a new point and check which class it belongs to, k-NN in its basic form would need to check that point against every point in the data set in order to find the k nearest neighbors. SVM, in contrast, relies on a model derived from the data, which allows for low computational overhead when classifying new data. Similarly, MLP uses its network of weights to define the decision boundary, which makes it a lot easier and faster to classify data after the network is trained.
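
The difference in how new points are classified can be sketched in code. The example below is a minimal illustration under assumptions: `make_moons` stands in for the spiral data, `x_new` is an arbitrary query point, and the brute-force search is written by hand to show what basic k-NN does at prediction time. For the SVM it only reports how many training points the fitted model retains as support vectors.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Toy data and query point (assumptions for illustration)
X_toy, y_toy = make_moons(n_samples=300, noise=0.2, random_state=0)
x_new = np.array([0.5, 0.25])
k = 5

# Brute-force k-NN: distance from the query to EVERY stored training point
dists = np.linalg.norm(X_toy - x_new, axis=1)
nearest = np.argsort(dists)[:k]                 # indices of the k closest points
knn_manual = np.bincount(y_toy[nearest]).argmax()  # majority vote among neighbors

# Same prediction through scikit-learn, forcing the brute-force search
knn = KNeighborsClassifier(n_neighbors=k, algorithm='brute').fit(X_toy, y_toy)
knn_sklearn = knn.predict(x_new.reshape(1, -1))[0]

# SVM: after training, prediction only involves the support vectors
svm_toy = SVC(kernel='rbf', gamma='scale').fit(X_toy, y_toy)
n_sv = svm_toy.support_vectors_.shape[0]

print(f"k-NN computed {dists.size} distances for one query point")
print(f"Manual k-NN prediction: {knn_manual}, scikit-learn prediction: {knn_sklearn}")
print(f"SVM keeps {n_sv} support vectors out of {len(X_toy)} training points")
```

k-NN has no training step to speak of but must keep the whole data set around, whereas SVM and MLP pay their cost during training and then classify new points from a fixed set of learned parameters.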