<a href="https://colab.research.google.com/github/adamd1985/lectutures_on_AI/blob/main/Introduction_to_AI_and_Machine_Learning_Models.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Introduction to AI and Machine Learning - Models

All datasets used within our modules are available from: https://scikit-learn.org/stable/datasets/toy_dataset.html

## Supervised Models

### Linear Regression


We’re going to break down a simple code example, understand the imports, and dive into the purpose of each line.

Let's go through the code step-by-step.



These libraries will be your main tools for the lessons:

- numpy: NumPy is a library for handling numerical data in Python.
- pandas: pandas provides the `DataFrame`, a tabular structure we use to hold and inspect the datasets.
- matplotlib: Matplotlib is a plotting library that provides a simple interface for creating charts.
- sklearn: Scikit-Learn is a machine learning library for Python that contains the algorithms we will need.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score

We fetch a public dataset, in this case the California housing data, which has these attributes:
- **MedInc**: median income in block group
- **HouseAge**: median house age in block group
- **AveRooms**: average number of rooms per household
- **AveBedrms**: average number of bedrooms per household
- **Population**: block group population
- **AveOccup**: average number of household members
- **Latitude**: block group latitude
- **Longitude**: block group longitude
- **MedHouseVal**: the median house value for each block group, expressed in units of 100,000 US dollars. This is our target value.

In [None]:
cali_housing = fetch_california_housing(as_frame=True)
df = cali_housing.frame
df.sample(3)

We want to predict the **MedHouseVal**.

To train the linear regression model, we first create a feature matrix from all attributes except `MedHouseVal`. This is called `X`.
The vector with our target values is called `y`.

`train_test_split` is a utility function commonly used to split the data into a training set and a test set. Here we hold out 20% of the data, which the model never sees during training, so that we can evaluate on it later.


In [None]:
X = cali_housing.data
y = cali_housing.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Here we fit the model on the training split created above, predict on the held-out test split, and report the coefficients, the MSE and the R².

In [None]:
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

coefficients = pd.DataFrame({
    "Feature": X_train.columns,
    "Coefficient": model.coef_
})

print("Coefficients (Beta1, Beta2,...BetaN):")
print(coefficients)
print("Intercept (Beta0):", model.intercept_)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R-squared (R^2): {r2}")
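The two metrics printed above can be reproduced by hand, which helps to see what they measure. The sketch below is a minimal NumPy version on small made-up arrays (not the housing data); `mse_manual` and `r2_manual` are hypothetical helper names, not Scikit-Learn functions.

```python
import numpy as np

def mse_manual(y_true, y_pred):
    """Mean of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r2_manual(y_true, y_pred):
    """1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Hypothetical toy values, chosen only to keep the arithmetic easy to follow.
toy_true = [3.0, 5.0, 7.0, 9.0]
toy_pred = [2.5, 5.5, 7.0, 8.0]

print("MSE:", mse_manual(toy_true, toy_pred))
print("R^2:", r2_manual(toy_true, toy_pred))
```

An R² of 1 means the predictions match the targets exactly, while an R² of 0 means the model does no better than always predicting the mean of the targets.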

Let's test on some known samples from the dataset:

| MedInc | HouseAge | AveRooms  | AveBedrms  | Population | AveOccup | Latitude | Longitude | MedHouseVal |
|--------|----------|-----------|------------|------------|----------|----------|-----------|-------------|
| 8.3252 | 41.0     | 6.984127  | 1.023810   | 322.0      | 2.555556 | 37.88    | -122.23   | 4.526       |
| 8.3014 | 21.0     | 6.238137  | 0.971880   | 2401.0     | 2.109842 | 37.86    | -122.22   | 3.585       |
| 7.2574 | 52.0     | 8.288136  | 1.073446   | 496.0      | 2.802260 | 37.85    | -122.24   | 3.521       |

As covered in the lesson, we also compute the **MSE** with Scikit-Learn's `mean_squared_error` API and use it to validate the model's performance on these samples.

In [None]:
sample_data = {
    "MedInc": [8.3252, 8.3014, 7.2574],
    "HouseAge": [41.0, 21.0, 52.0],
    "AveRooms": [6.984127, 6.238137, 8.288136],
    "AveBedrms": [1.023810, 0.971880, 1.073446],
    "Population": [322.0, 2401.0, 496.0],
    "AveOccup": [2.555556, 2.109842, 2.802260],
    "Latitude": [37.88, 37.86, 37.85],
    "Longitude": [-122.23, -122.22, -122.24],
    "MedHouseVal": [4.526, 3.585, 3.521]
}

sample_df = pd.DataFrame(sample_data)
sample_y = sample_df["MedHouseVal"]
sample_X = sample_df.drop(columns=["MedHouseVal"])
y_sample_pred = model.predict(sample_X)


sample_df["Predicted (Y)"] = y_sample_pred
sample_df["Error (E)"] = sample_y - y_sample_pred

mse = mean_squared_error(sample_y, y_sample_pred)
print(f"\nMean Squared Error: {mse}")
print("Predicted vs Actual Values:")
sample_df

Finally, we visualize the result of our model on the test data, which it has never seen. In general the predictions are close to the actual data points, with the exception of extreme values.

To avoid issues caused by extreme values, we do feature engineering and data cleaning. More on that in the lessons within this module.


In [None]:
plt.figure(figsize=(8, 6))

plt.scatter(y_test, y_pred, alpha=0.6, label="Predicted vs Actual")
plt.plot([y_test.min(), y_test.max()], [y_test.min(), y_test.max()], color='red', linestyle='--', linewidth=2, label="Perfect Prediction")

plt.title('Predicted VS Actual Median House Values', fontsize=16)
plt.xlabel('Actual Median House Values', fontsize=14)
plt.ylabel('Predicted Median House Values', fontsize=14)
plt.grid(color='gray', linestyle='--', linewidth=0.5, alpha=0.7)
plt.legend(fontsize=12, loc='upper left')

plt.tight_layout()
plt.show()


### Logistic Regression

In this example, we will use logistic regression and its evaluation metrics. The imports are almost the same as in the linear regression section, with these additions:
- **linear_model**: To include the logistic regression model.
- **metrics**: For classification metrics.
- **inspection**: For boundary plotting APIs.
- **pipeline**: To chain preprocessing and the model into a single estimator.

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, accuracy_score
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.pipeline import make_pipeline

The breast cancer dataset is commonly used for binary classification, and is a good example for logistic regression. The dataset has 30 features, though we list only the most used ones here:

- **mean radius**: Mean of distances from center to points on the perimeter
- **mean texture**: Standard deviation of gray-scale values
- **mean perimeter**: Mean size of the core tumor perimeter
- **mean area**: Mean size of the core tumor area
- **mean smoothness**: Mean of local variation in radius lengths
- Target: **0**: Malignant (Cancerous) or **1**: Benign (Non-Cancerous)


In [None]:
breast_cancer = load_breast_cancer(as_frame=True)
X = breast_cancer.data
y = breast_cancer.target

breast_cancer.frame.sample(3)

As with linear regression, we collect the features in `X` and the target in `y`, then split them into training and testing datasets.
Note that for this particular model we choose to scale the data, fitting the scaler on the training split only and reusing it on the test split. More on this feature engineering later.

In [None]:
from sklearn.preprocessing import StandardScaler

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
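To make the scaling step concrete, here is a minimal sketch of standardization done by hand with NumPy. The arrays are hypothetical toy values; the point is that the mean and standard deviation come from the training rows only and are then reused on the test rows.

```python
import numpy as np

# Hypothetical toy feature matrices (rows = samples, columns = features).
train = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test = np.array([[2.0, 400.0]])

# Statistics are computed on the training split only.
mean = train.mean(axis=0)
std = train.std(axis=0)  # population standard deviation (ddof=0)

train_scaled = (train - mean) / std
test_scaled = (test - mean) / std  # reuse the training statistics

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(train_scaled.std(axis=0))   # approximately 1 for each column
print(test_scaled)
```

Computing the statistics on the full dataset would leak information about the test rows into training, which is why the cell above calls `fit_transform` on the training split and only `transform` on the test split.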

We train a logistic regression model here. Unlike linear regression, we evaluate the model using classification-specific metrics:

- **Accuracy**: The percentage of correctly predicted instances out of the total instances. For this dataset, accuracy is quite high, reflecting the model's effectiveness.
- **Precision**: The proportion of correctly predicted positive instances (**True Positives, TP**) out of all predicted positive instances, which also include the **False Positives (FP)**:
  $$
  \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}
  $$
- **Recall**: The proportion of correctly predicted positive instances (**TP**) out of all actual positive instances:
  $$
  \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}}
  $$
- **F1-Score**: The harmonic mean of precision and recall, providing a balanced measure.


In [None]:
model = LogisticRegression()
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)
y_prob = model.predict_proba(X_test_scaled)[:, 1]

coefficients = pd.DataFrame({
    "Feature": X.columns,
    "Coefficient": model.coef_[0]
})

print("Coefficients (Beta1, Beta2,...BetaN):")
print(coefficients)
print("Intercept (Beta0):", model.intercept_[0])
print(f"\nAccuracy: {accuracy_score(y_test, y_pred)*100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))
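The numbers in the classification report come from four simple counts. As a minimal sketch, assuming made-up labels and a hypothetical helper named `binary_metrics`, they can be computed like this:

```python
import numpy as np

def binary_metrics(y_true, y_pred, positive=1):
    """Count TP/TN/FP/FN and derive the usual classification metrics.

    Assumes at least one predicted positive and one actual positive,
    otherwise precision or recall would divide by zero.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_pred == positive) & (y_true == positive)))
    tn = int(np.sum((y_pred != positive) & (y_true != positive)))
    fp = int(np.sum((y_pred == positive) & (y_true != positive)))
    fn = int(np.sum((y_pred != positive) & (y_true == positive)))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "tp": tp, "tn": tn, "fp": fp, "fn": fn,
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }

# Hypothetical labels: 4 actual positives and 4 actual negatives.
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 1, 0]
toy_metrics = binary_metrics(toy_true, toy_pred)
print(toy_metrics)
```

Keep in mind that in this dataset class `1` is benign, so which class you treat as "positive" changes what precision and recall tell you.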

Using the **confusion matrix**, we analyze the following:
- **True Positives (TP)**: Correctly predicted positive cases.
- **True Negatives (TN)**: Correctly predicted negative cases.
- **False Positives (FP)**: Incorrectly predicted positive cases (actual negative).
- **False Negatives (FN)**: Incorrectly predicted negative cases (actual positive).

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay

ConfusionMatrixDisplay.from_estimator(
    model, X_test_scaled, y_test, display_labels=breast_cancer.target_names, cmap="Blues", values_format="d"
)
plt.title("Confusion Matrix")
plt.show()

Logistic regression predicts probabilities for binary outcomes (0 or 1), so there is a threshold at which the model switches from predicting class 0 to predicting class 1. We can visualize this decision boundary using the following [API](https://scikit-learn.org/stable/modules/generated/sklearn.inspection.DecisionBoundaryDisplay.html) from Scikit-Learn.

Note that for linear models such as logistic regression, you can also plot the boundary from the model's own equation. With two features, setting the input of the logistic function to zero gives $X_2 = -\frac{(w_1 \cdot X_1 + b)}{w_2}$, which is the line where the model predicts equal probabilities (0.5) for both classes.
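As a quick check of that formula, the sketch below uses hypothetical weights `w1`, `w2` and bias `b` (not values fitted on the breast cancer data), computes points on the boundary, and confirms that the logistic function returns 0.5 on each of them.

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical weights and bias for a two-feature model.
w1, w2, b = 2.0, -1.0, 0.5

def boundary_x2(x1):
    """Value of X2 on the decision boundary for a given X1."""
    return -(w1 * x1 + b) / w2

x1 = np.array([-1.0, 0.0, 1.0])
x2 = boundary_x2(x1)

# On the boundary the linear term is zero, so the probability is 0.5.
probs = sigmoid(w1 * x1 + w2 * x2 + b)
print(x2)
print(probs)
```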

Don't worry about the pipeline, we use it to include data scaling in the model's operations, but keep the data raw for visualization.

In [None]:
def plot_reg_decision_boundary(
    X, y, ax, feature_names, class_names
):
  reg_clf = make_pipeline(StandardScaler(), LogisticRegression()).fit(X, y)
  common_params = {"estimator": reg_clf, "X": X, "ax": ax}

  # Shows the regions where the classifier assigns each class.
  DecisionBoundaryDisplay.from_estimator(
      **common_params,
      response_method="predict",
      plot_method="contourf",
      cmap=plt.cm.Paired,
      alpha=0.3,
      eps=0.5,
  )
  scatter = ax.scatter(
      X.iloc[:, 0],
      X.iloc[:, 1],
      c=y,
      cmap=plt.cm.Paired,
      edgecolors="k",
      s=50,
  )
  legend_labels = [class_names[int(label)] for label in np.unique(y)]
  ax.legend(
      scatter.legend_elements()[0],
      legend_labels,
      loc="upper right",
      title="Classes",
  )

  ax.set_title(f"Decision boundaries of Logistic Regression\n({feature_names[0]} vs {feature_names[1]})")


feature_names = ["worst concave points", "worst radius", "mean concave points"]
feature_pairs = [
    (0, 1, ["worst concave points", "worst radius"]),
    (0, 2, ["worst concave points", "mean concave points"]),
    (1, 2, ["worst radius", "mean concave points"]),
]

target_names = breast_cancer.target_names
fig, axes = plt.subplots(1, 3, figsize=(21, 7), tight_layout=True)
for ax, (idx1, idx2, names) in zip(axes, feature_pairs):
    X_pair = X_test[names]  # Select the pair by column name; positions 0-2 are other columns
    y_pair = y_test
    plot_reg_decision_boundary(
        X_pair, y_pair, ax, names, target_names
    )
plt.show()

Given the number of features (30), it is not practical to construct a new row by hand, so we sample some test rows and check their predicted class instead.

In [None]:
sample_indices = np.random.choice(len(X_test), 3, replace=False)
X_sample = X_test_scaled[sample_indices]
y_actual = y_test.iloc[sample_indices]
y_pred_sample = model.predict(X_sample)

print("\nSample Predictions:")
for i, idx in enumerate(sample_indices):
    print(f"Actual Class: {y_actual.iloc[i]}, Predicted Class: {y_pred_sample[i]}")
X_test.iloc[sample_indices]

### Decision Trees

For a decision tree, we import the APIs from Scikit-Learn. Note that the tree comes as both a classifier and a regressor.

In [None]:
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor, plot_tree

We load the breast cancer dataset again. Note that this time we don't need to scale it, as decision trees split on one feature at a time and can work with raw data.

In [None]:
breast_cancer = load_breast_cancer(as_frame=True)
X = breast_cancer.data
y = breast_cancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
breast_cancer.frame.head(3)

Train the classifier, and compare with the logistic regression metrics, which were:

```bash
Accuracy: 0.97

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```

In [None]:
tree_model = DecisionTreeClassifier(random_state=42, max_depth=5, min_samples_leaf=3)
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict(X_test)

print(f"\nAccuracy: {accuracy_score(y_test, y_pred):.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred))

As we said, trees are among the most interpretable of the ML algorithms: you can easily follow the decision path for any sample you choose. You can also identify the most important features for the decision, starting with the root node, `mean concave points`.

In [None]:
plt.figure(figsize=(20, 10))
plot_tree(
    tree_model,
    feature_names=breast_cancer.feature_names,
    class_names=breast_cancer.target_names,
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title("Decision Tree Visualization")
plt.show()

The algorithm also exposes `feature_importances_`, which lets us rank the most important features:

In [None]:
feature_importance = pd.DataFrame({
    "Feature": breast_cancer.feature_names,
    "Importance": tree_model.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("Feature Importance:")
print(feature_importance)

Let's try it with a regression on the **California housing dataset** we previously used in our linear regression problem.

In [None]:
cali_housing = fetch_california_housing(as_frame=True)
X = cali_housing.data
y = cali_housing.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train a tree regressor. Compare its R² and MSE with those of the linear regression, which were:

```bash
Mean Squared Error (MSE): 0.5558915986952444
R-squared (R^2): 0.5757877060324508
```

In [None]:
tree_regressor = DecisionTreeRegressor(random_state=42, max_depth=4, min_samples_leaf=6)
tree_regressor.fit(X_train, y_train)
y_pred = tree_regressor.predict(X_test)

mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"\nMean Squared Error (MSE): {mse}")
print(f"R-squared (R^2): {r2}")

This tree is a bit trickier to interpret, since we are working with a continuous dependent variable as the target.

In [None]:
plt.figure(figsize=(20, 10))
plot_tree(
    tree_regressor,
    feature_names=X.columns,
    filled=True,
    rounded=True,
    fontsize=10
)
plt.title("Decision Tree Visualization for California Housing Data")
plt.show()
feature_importance = pd.DataFrame({
    "Feature": X.columns,
    "Importance": tree_regressor.feature_importances_
}).sort_values(by="Importance", ascending=False)

print("\nFeature Importance:")
print(feature_importance)

Note the clustering of the regression output. Every prediction must come from one of the leaf nodes, and there is a finite number of those, so the predictions form horizontal bands rather than a continuous spread like our result from the linear regression.

In [None]:
# Scatter plot to compare predictions vs actual values
plt.figure(figsize=(8, 8))
plt.scatter(y_test, y_pred, alpha=0.6)
plt.plot([0, 5], [0, 5], '--r', label="Perfect Prediction")
plt.xlabel("Actual Median House Value")
plt.ylabel("Predicted Median House Value")
plt.title("Predicted vs Actual House Values")
plt.legend()
plt.grid(True)
plt.show()
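The bands in the plot follow from how a regression tree predicts: every sample that lands in a leaf receives the mean target of the training samples in that leaf. The sketch below is a hand-written depth-1 tree (a "stump") on hypothetical toy data, using a squared-error split search; it illustrates the idea and is not Scikit-Learn's implementation. `fit_stump` and `predict_stump` are made-up names.

```python
import numpy as np

def fit_stump(x, y):
    """Find the threshold on one feature that minimizes the summed squared error.

    Returns (threshold, left_mean, right_mean).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(x)
    x, y = x[order], y[order]
    best = None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue  # no valid threshold between identical values
        left, right = y[:i], y[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if best is None or sse < best[0]:
            best = (sse, (x[i - 1] + x[i]) / 2.0, left.mean(), right.mean())
    _, threshold, left_mean, right_mean = best
    return threshold, left_mean, right_mean

def predict_stump(x, threshold, left_mean, right_mean):
    """Each sample gets the mean of the leaf it falls into."""
    return np.where(np.asarray(x, dtype=float) <= threshold, left_mean, right_mean)

# Hypothetical toy data with two obvious groups.
x_toy = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
y_toy = [1.0, 1.2, 0.8, 4.0, 4.2, 3.8]

threshold, left_mean, right_mean = fit_stump(x_toy, y_toy)
preds = predict_stump([0.0, 2.5, 9.0, 20.0], threshold, left_mean, right_mean)
print(threshold, left_mean, right_mean)
print(np.unique(preds))  # only two distinct predictions, one per leaf
```

A stump has two leaves and therefore two possible outputs. A binary tree with `max_depth=4`, as used above, has at most 16 leaves, which is why the scatter plot shows a limited number of prediction levels.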


### SVM

Let's import the SVM API. We will use `SVC`, the support vector classifier, with both a linear and a nonlinear kernel; read about it [here](https://scikit-learn.org/1.5/modules/generated/sklearn.svm.SVC.html).

In [None]:
from sklearn.svm import SVC

Get the breast cancer dataset, and split for training.

In [None]:
breast_cancer = load_breast_cancer(as_frame=True)
X = breast_cancer.data
y = breast_cancer.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

breast_cancer.frame.sample(3)

Scale it (more on data engineering in later lectures) and fit the SVM model. Note the `linear` kernel used for this fit.

You can compare this model's performance with that of the logistic regression, which was:

```bash
Accuracy: 0.97

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```

In [None]:
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

svm_model = SVC(kernel='linear', C=1.0, random_state=42)
svm_model.fit(X_train_scaled, y_train)

y_pred = svm_model.predict(X_test_scaled)
accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy*100:.2f}%")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

Whenever you evaluate a model, always pass it data engineered in the same way as the training data, in this case the scaled data. Otherwise you will get confusing results.

In [None]:
ConfusionMatrixDisplay.from_estimator(
    svm_model, X_test_scaled, y_test, display_labels=breast_cancer.target_names, cmap="Blues", values_format="d"
)
plt.title("Confusion Matrix")
plt.show()

As we did for logistic regression, we can plot the decision boundaries for the SVM using a similar function and the Scikit-Learn API.

In [None]:
def plot_svm_decision_boundary(kernel, X, y, ax, feature_names, class_names):
    # We create a pipeline to show the unscaled data in the plots, but model the scaled data.
    svc_clf = make_pipeline(StandardScaler(), SVC(kernel=kernel, C=1)).fit(X, y)
    common_params = {"estimator": svc_clf, "X": X, "ax": ax}

    # Plot decision regions and boundaries
    DecisionBoundaryDisplay.from_estimator(
        **common_params,
        response_method="predict",
        plot_method="pcolormesh",
        alpha=0.3,
    )
    DecisionBoundaryDisplay.from_estimator(
        **common_params,
        response_method="decision_function",
        plot_method="contour",
        levels=[-1, 0, 1],
        colors=["k", "k", "k"],
        linestyles=["--", "-", "--"],
    )

    # Highlight support vectors
    support_vectors = svc_clf.named_steps['standardscaler'].inverse_transform(svc_clf.named_steps['svc'].support_vectors_)
    ax.scatter(
        support_vectors[:, 0],
        support_vectors[:, 1],
        s=150,
        facecolors="none",
        edgecolors="k",
        label="Support Vectors",
    )

    # plot the data points
    scatter = ax.scatter(
        X.iloc[:, 0],
        X.iloc[:, 1],
        c=y,
        cmap=plt.cm.Paired,
        edgecolors="k",
        s=50,
    )

    legend_labels = [class_names[int(label)] for label in np.unique(y)]
    ax.legend(
        scatter.legend_elements()[0],
        legend_labels,
        loc="upper right",
        title="Classes",
    )

    ax.set_title(f"Decision boundaries of {kernel} kernel in SVC\n({feature_names[0]} vs {feature_names[1]})")

feature_names = ["worst concave points", "worst radius", "mean concave points"]
feature_pairs = [
    (0, 1, ["worst concave points", "worst radius"]),
    (0, 2, ["worst concave points", "mean concave points"]),
    (1, 2, ["worst radius", "mean concave points"]),
]


fig, axes = plt.subplots(1, 3, figsize=(21, 7), tight_layout=True)
for ax, (idx1, idx2, names) in zip(axes, feature_pairs):
    X_pair = X_test[names]  # Select the feature pair by column name
    y_pair = y_test  # Labels remain the same
    plot_svm_decision_boundary(
        "linear", X_pair, y_pair, ax, names, target_names
    )
plt.show()

The strength of SVM comes from its ability to learn nonlinear boundaries through the kernel trick, which implicitly maps the data into a higher-dimensional space where a linear separator can be found. Using the same breast cancer dataset, we can see how it performs (you can compare it to the models above).
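To give a feel for what the kernel computes, here is a minimal NumPy sketch of the RBF kernel, $K(x, z) = \exp(-\gamma \lVert x - z \rVert^2)$, on three hypothetical points. The helper name `rbf_kernel_matrix` and the value `gamma=0.5` are illustrative assumptions. The kernel returns a similarity for every pair of points without ever building the higher-dimensional features explicitly.

```python
import numpy as np

def rbf_kernel_matrix(A, B, gamma):
    """K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)."""
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    sq_dists = np.maximum(sq_dists, 0.0)  # guard against tiny negative rounding errors
    return np.exp(-gamma * sq_dists)

# Hypothetical 2D points: two close together, one far away.
points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 4.0]])
K = rbf_kernel_matrix(points, points, gamma=0.5)
print(np.round(K, 4))
```

Nearby points get a similarity close to 1 and distant points a similarity close to 0. A larger `gamma` makes the similarity fall off faster, which tends to give more tightly curved boundaries.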

In [None]:
svm_model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=42)
svm_model.fit(X_train_scaled, y_train)  # Fit on the scaled features, as with the linear kernel
y_pred = svm_model.predict(X_test_scaled)

accuracy = accuracy_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=breast_cancer.target_names))

In [None]:
ConfusionMatrixDisplay.from_estimator(
    svm_model, X_test_scaled, y_test, display_labels=breast_cancer.target_names, cmap="Blues", values_format="d"
)
plt.title("Confusion Matrix")
plt.show()

For visualizing boundaries, note that now the boundaries are nonlinear.

In [None]:
feature_names = ["worst concave points", "worst radius", "mean concave points"]
feature_pairs = [
    (0, 1, ["worst concave points", "worst radius"]),
    (0, 2, ["worst concave points", "mean concave points"]),
    (1, 2, ["worst radius", "mean concave points"]),
]

target_names = breast_cancer.target_names
fig, axes = plt.subplots(1, 3, figsize=(21, 7), tight_layout=True)
for ax, (idx1, idx2, names) in zip(axes, feature_pairs):
    X_pair = X_test[names]  # Select the feature pair by column name
    y_pair = y_test  # Labels remain the same
    plot_svm_decision_boundary(
        "rbf", X_pair, y_pair, ax, names, target_names
    )
plt.show()

## Unsupervised Learning Models

### KMeans

For K-Means we will load its [clustering APIs](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans) and the [Iris dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html#sklearn.datasets.load_iris), a well-known toy dataset used to teach ML. It contains measurements of 150 iris flowers from three species: Setosa, Versicolor, and Virginica. Each flower is described by four features:

- Sepal Length (cm)
- Sepal Width (cm)
- Petal Length (cm)
- Petal Width (cm)


In [None]:
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

iris = load_iris(as_frame=True)
X = iris.data.values[:,:3] # We are taking the first 3 columns only!
y = iris.target

iris.frame.sample(3)

We fit a K-Means model with 3 clusters, matching the three species in the target label.

Note that because the model is unsupervised, it never sees the labels while fitting, so we don't do the usual train-test split. K-Means assigns arbitrary cluster IDs (e.g., 0, 1, 2) that do not correspond to the actual class labels (e.g., `Setosa`, `Versicolor`, `Virginica`). To get an 'accuracy', we take the **mode** of the true labels within each cluster as that cluster's label, and then compare against the true labels.

In [None]:
from scipy.stats import mode

kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X)

def map_clusters_to_labels(y_true, y_pred):
  "Map arbitrary cluster IDs to the most common true label within each cluster (its `mode`)."
  labels = np.zeros_like(y_pred)
  for cluster in np.unique(y_pred):
      mask = y_pred == cluster
      labels[mask] = mode(y_true[mask])[0]
  return labels

y_mapped = map_clusters_to_labels(y, y_kmeans)
accuracy = accuracy_score(y, y_mapped)
print(f"Clustering accuracy: {accuracy*100:.2f}%")
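To see the mapping at work on something small, here is a NumPy-only sketch on hypothetical labels. `map_clusters_by_majority` is a made-up variant of the helper above that uses `np.unique` with counts instead of `scipy.stats.mode`.

```python
import numpy as np

def map_clusters_by_majority(y_true, cluster_ids):
    """Relabel each cluster with the most frequent true label among its members."""
    y_true = np.asarray(y_true)
    cluster_ids = np.asarray(cluster_ids)
    mapped = np.empty_like(y_true)
    for cluster in np.unique(cluster_ids):
        mask = cluster_ids == cluster
        values, counts = np.unique(y_true[mask], return_counts=True)
        mapped[mask] = values[np.argmax(counts)]
    return mapped

# Hypothetical example: cluster IDs are arbitrary, and one sample of class 1
# ended up in the cluster dominated by class 2.
toy_true = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
toy_clusters = np.array([2, 2, 2, 0, 0, 1, 1, 1, 1])

toy_mapped = map_clusters_by_majority(toy_true, toy_clusters)
print(toy_mapped)
print(np.mean(toy_mapped == toy_true))
```

Cluster `2` becomes class `0`, cluster `0` becomes class `1`, and cluster `1` becomes class `2`, so only the one misplaced sample counts as an error.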

Now we plot them on a 3D graph, since we are using 3 of the 4 features. We use the mode again to find which cluster ID maps to which label.

In [None]:
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for cluster in range(3):
    ax.scatter(
        X[y_kmeans == cluster, 0],
        X[y_kmeans == cluster, 1],
        X[y_kmeans == cluster, 2],
        label=f'{iris.target_names[mode(y[y_kmeans == cluster])[0]]}',
        s=30,
        alpha=0.6,
        zorder=1
    )

centroids = kmeans.cluster_centers_
ax.scatter(
    centroids[:, 0],
    centroids[:, 1],
    centroids[:, 2],
    c='red',
    marker='X',
    s=100,
    label='Centroids',
    alpha=1.0,
    zorder=5
)

ax.set_box_aspect(None, zoom=0.90)
ax.set_title('K-Means with Iris Dataset', fontsize=14)
ax.set_xlabel('Sepal Length', fontsize=12)
ax.set_ylabel('Sepal Width', fontsize=12)
ax.set_zlabel('Petal Length', fontsize=12)
ax.legend()
plt.show()

### DBSCAN

Load the DBSCAN API.

In [None]:
from sklearn.cluster import DBSCAN

Load Iris again with 3 of its 4 features, so that we can visualize the clusters.

In [None]:
iris = load_iris(as_frame=True)
X = iris.data.values[:,:3] # We are taking the first 3 columns only!
y = iris.target

iris.frame.sample(3)

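Train the algorithm. Unlike K-Means, DBSCAN does not take a number of clusters; it groups points using `eps` (the neighbourhood radius) and `min_samples`, and labels points that belong to no cluster as noise (`-1`).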

In [None]:
dbscan = DBSCAN(eps=0.5, min_samples=6)
y_dbscan = dbscan.fit_predict(X)

y_mapped = map_clusters_to_labels(y, y_dbscan)
accuracy = accuracy_score(y, y_mapped)
print(f"Clustering accuracy: {accuracy*100:.2f}%")
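When reading the accuracy above, it helps to check how many clusters DBSCAN found and how many points it marked as noise, because `map_clusters_to_labels` treats the noise group (`-1`) like any other cluster and gives it a majority label. The sketch below shows the counting on a hypothetical label array; in the notebook you would pass `y_dbscan` instead.

```python
import numpy as np

# Hypothetical DBSCAN output: cluster IDs 0, 1, 2 and noise marked as -1.
toy_labels = np.array([0, 0, 1, -1, 1, 0, -1, 2, 2])

# The noise label is not a cluster, so exclude it from the count.
n_clusters = len(set(toy_labels.tolist()) - {-1})
n_noise = int(np.sum(toy_labels == -1))
noise_fraction = n_noise / len(toy_labels)

print(f"Clusters found: {n_clusters}")
print(f"Noise points: {n_noise} ({noise_fraction:.1%})")
```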

And plot. Can you see the difference from K-Means?

In [None]:
fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')
unique_labels = set(y_dbscan)

for label in unique_labels:
    class_member_mask = y_dbscan == label
    core_points = X[class_member_mask]

    if len(core_points) == 0:
        continue

    if label != -1:
        ax.scatter(
            core_points[:, 0],
            core_points[:, 1],
            core_points[:, 2],
            s=50,
            alpha=0.6,
            zorder=1,
            label=f'{iris.target_names[mode(y[class_member_mask])[0]]}'
        )
    else:
        ax.scatter(
            core_points[:, 0],
            core_points[:, 1],
            core_points[:, 2],
            s=50,
            alpha=0.6,
            zorder=1,
            label='Noise',
            color='red',
            marker='x'
        )

ax.set_box_aspect(None, zoom=0.90)
ax.set_title('DBSCAN Clustering on Iris Dataset', fontsize=14)
ax.set_xlabel('Sepal Length', fontsize=12)
ax.set_ylabel('Sepal Width', fontsize=12)
ax.set_zlabel('Petal Length', fontsize=12)
ax.legend()
plt.show()

### PCA

Let's load the PCA [API](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html), and the Iris dataset again.

In [None]:
from sklearn.decomposition import PCA

iris = load_iris(as_frame=True)
X = iris.data.values  # This time we load all 4 features.
y = iris.target

iris.frame.sample(3)

To use PCA, you generally need to standardize the data first, as PCA is sensitive to the scale of the features. Here we want to reduce the dimensions from 4 to 2!

In [None]:
scaler = StandardScaler()
normalized_data = scaler.fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(normalized_data)

explained_variance = pca.explained_variance_ratio_
print(f"Explained variance of Principal Component 1: {explained_variance[0]*100:0.2f}%")
print(f"Explained variance of Principal Component 2: {explained_variance[1]*100:0.2f}%")
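The explained variance ratios can also be derived by hand, which shows what PCA is doing underneath. The sketch below is a minimal NumPy version on hypothetical synthetic data (not Iris): standardize, take the eigen-decomposition of the covariance matrix, sort by eigenvalue, and project onto the top two eigenvectors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic data: 200 samples, 4 features, two of them correlated.
base = rng.normal(size=(200, 3))
data = np.column_stack([
    base[:, 0],
    2.0 * base[:, 0] + 0.1 * base[:, 1],
    base[:, 1],
    base[:, 2],
])

# 1. Standardize each feature to zero mean and unit variance.
standardized = (data - data.mean(axis=0)) / data.std(axis=0)

# 2. Eigen-decomposition of the covariance matrix (eigh: symmetric input).
cov = np.cov(standardized, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 3. Sort components from largest to smallest variance.
order = np.argsort(eigenvalues)[::-1]
eigenvalues, eigenvectors = eigenvalues[order], eigenvectors[:, order]

# 4. Share of total variance per component, and the 2D projection.
explained_ratio = eigenvalues / eigenvalues.sum()
projected = standardized @ eigenvectors[:, :2]

print(np.round(explained_ratio, 4))
print(projected.shape)
```

The signs of the projected coordinates may differ from a library implementation, since an eigenvector multiplied by -1 is still an eigenvector; the explained variance is unaffected.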

In [None]:
pca = PCA(n_components=2)
X_pca = pca.fit_transform(normalized_data)  # Use the standardized data, as in the previous cell

# Plot the PCA-reduced data
plt.figure(figsize=(10, 7))
for i, target_name in enumerate(iris.target_names):
    plt.scatter(X_pca[y == i, 0], X_pca[y == i, 1], label=target_name, alpha=0.7)

plt.xlabel("Principal Component 1")
plt.ylabel("Principal Component 2")
plt.title("PCA of Iris Dataset")
plt.legend()
plt.grid(alpha=0.3)
plt.show()

If we cut the number of features in half, would our model still be able to cluster the samples?

K-Means previously had a clustering accuracy of `88.67%`; let's see how it performs now.

In [None]:
kmeans = KMeans(n_clusters=3, random_state=42)
y_kmeans = kmeans.fit_predict(X_pca)
y_mapped = map_clusters_to_labels(y, y_kmeans)

accuracy = accuracy_score(y, y_mapped)
print(f"Clustering accuracy: {accuracy * 100:.2f}%")


fig, ax = plt.subplots(figsize=(10, 8))
for cluster in np.unique(y_mapped):
    ax.scatter(
        X_pca[y_mapped == cluster, 0],
        X_pca[y_mapped == cluster, 1],
        label=f'{iris.target_names[cluster]}',
        s=30,
        alpha=0.6,
        zorder=1
    )

centroids = kmeans.cluster_centers_
ax.scatter(
    centroids[:, 0],
    centroids[:, 1],
    c='red',
    marker='X',
    s=100,
    label='Centroids',
    alpha=1.0,
    zorder=5
)
ax.set_title('K-Means after PCA with Iris Dataset', fontsize=14)
ax.set_xlabel('Principal Component 1', fontsize=12)
ax.set_ylabel('Principal Component 2', fontsize=12)
ax.legend()
plt.show()

## Semi-Supervised Learning

With SSL (Semi-Supervised Learning), we can train classifiers on data that is only partially labelled.

First we import the SSL [APIs](https://scikit-learn.org/1.5/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html#sklearn.semi_supervised.SelfTrainingClassifier), and we reduce the features to 2 using PCA.

Then we remove some of the labels, creating a partially unlabelled dataset that the SSL models need to learn from, using LabelSpreading and SelfTraining.

In [None]:
from sklearn.semi_supervised import LabelSpreading, SelfTrainingClassifier

iris = load_iris()
X, y = iris.data, iris.target

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)
pca_components = [f"Principal Component {i+1}" for i in range(2)]

# Let's mask roughly 50% of the labels for demonstration (-1 marks an unlabelled sample)
rng = np.random.RandomState(0)
y_masked = np.copy(y)
mask_indices = rng.rand(y.shape[0]) < 0.5
y_masked[mask_indices] = -1


We train all models. Note the use of a baseline that will work directly with the fully labelled data. The use of a baseline is always recommended as it gives you a model to compare to.

Also note that the self-trained SVM is created with `probability=True`. The SelfTrainingClassifier works by pseudo-labelling its most confident predictions on the unlabelled data, and it uses the predicted probabilities of the underlying SVC to determine that confidence.
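The pseudo-labelling step can be sketched in a few lines. The example below is a simplified illustration, not Scikit-Learn's implementation: `select_confident` is a hypothetical helper, the probabilities are made up, and the threshold of 0.75 is an illustrative value.

```python
import numpy as np

def select_confident(probabilities, threshold=0.75):
    """Return (indices, pseudo_labels) for rows whose top class probability
    reaches the threshold. Rows below the threshold stay unlabelled."""
    probabilities = np.asarray(probabilities, dtype=float)
    confidence = probabilities.max(axis=1)
    indices = np.flatnonzero(confidence >= threshold)
    pseudo_labels = probabilities[indices].argmax(axis=1)
    return indices, pseudo_labels

# Hypothetical class probabilities for four unlabelled samples (3 classes).
toy_proba = np.array([
    [0.90, 0.05, 0.05],  # confident: class 0
    [0.40, 0.35, 0.25],  # not confident
    [0.10, 0.80, 0.10],  # confident: class 1
    [0.34, 0.33, 0.33],  # not confident
])

confident_idx, pseudo_labels = select_confident(toy_proba)
print(confident_idx)
print(pseudo_labels)
```

In a self-training loop, the selected samples would be added to the labelled set with their pseudo-labels, the classifier would be refitted, and the process would repeat until no further samples pass the threshold or an iteration limit is reached.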

In [None]:
models = {
    "Supervised SVM (Baseline)": SVC(kernel="rbf").fit(X_pca, y),
    "Label Spreading (50% unlabeled)": LabelSpreading().fit(X_pca, y_masked),
    "Self-Training (50% unlabeled)": SelfTrainingClassifier(SVC(kernel="rbf", probability=True)).fit(X_pca, y_masked),
}
predictions = {name: model.predict(X_pca) for name, model in models.items()}
accuracies = {name: f"{accuracy_score(y, pred)*100.:0.2f}%" for name, pred in predictions.items()}
print(accuracies)

Now let's plot the results.

In [None]:
fig, axes = plt.subplots(1, 3, figsize=(20, 6), sharex=True, sharey=True)
colors = ['red', 'green', 'blue']
markers = ['o', 's', '^']

def plot_model(ax, model, title, accuracy, mask_indices=None, class_names = iris.target_names):
    DecisionBoundaryDisplay.from_estimator(
        model, X_pca, response_method="predict", ax=ax, alpha=0.5
    )
    for i, class_name in enumerate(class_names):
        ax.scatter(
            X_pca[y == i, 0], X_pca[y == i, 1],
            color=colors[i], marker=markers[i], edgecolor="k", label=class_name
        )

    if mask_indices is not None:
        ax.scatter(
            X_pca[mask_indices, 0], X_pca[mask_indices, 1],
            facecolors='none', edgecolors='black', s=100, label="Unlabeled"
        )

    ax.set_title(f"{title}\nAccuracy: {accuracy}")  # the accuracy string already ends with "%"
    ax.set_xlabel(pca_components[0])
    ax.set_ylabel(pca_components[1])
    ax.legend(loc="upper right")

for ax, (name, model) in zip(axes, models.items()):
    plot_model(ax, model, name, accuracies[name], mask_indices=mask_indices if "SVM" not in name else None)
plt.suptitle(
    "SVM Baseline vs. Label Spreading vs. Self-Training",
    fontsize=16
)
plt.tight_layout()
plt.show()


## Neural Networks

We will design a simple feed-forward neural network. We will use a framework called **PyTorch** for this; read its documentation [here](https://pytorch.org/docs/stable/index.html).

- **torch**: The base PyTorch module provides core functionalities such as tensor operations, which are the building blocks of neural networks.
- **nn**: A module in PyTorch that provides classes and functions for building neural network layers.
- **optim**: Provides optimizers for gradient-based optimization to update the model's weights to minimize the loss during training.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

We load and prepare the breast cancer dataset again. Note that we standardize the features: neural networks tend to train poorly on unscaled data.

In [None]:
data = load_breast_cancer()
X = data.data
y = data.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

We convert the feature vectors to a structure called `tensor`. Tensors are multi-dimensional arrays (like NumPy arrays) used to store data and perform computations efficiently, with support for GPUs and automatic differentiation.

They are **the core building blocks** for deep learning models, enabling operations like addition, multiplication, and gradient computation.
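
As a small illustration of these operations, the sketch below adds and multiplies two toy tensors and then asks PyTorch for a gradient. The values and variable names are made up for the example.

```python
import torch

a = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
b = torch.tensor([4.0, 5.0, 6.0])

added = a + b              # element-wise addition
multiplied = a * b         # element-wise multiplication
total = multiplied.sum()   # a scalar we can differentiate

# d(total)/da_i = b_i, so the gradient stored in a.grad equals b.
total.backward()
print(added, multiplied, a.grad)
```

Setting `requires_grad=True` is what tells PyTorch to track operations on `a` so that `backward()` can compute the gradient automatically.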

In [None]:
X_train = torch.tensor(X_train, dtype=torch.float32)
X_test = torch.tensor(X_test, dtype=torch.float32)
y_train = torch.tensor(y_train, dtype=torch.float32).view(-1, 1)
y_test = torch.tensor(y_test, dtype=torch.float32).view(-1, 1)

Now we build our simple NN.

Since this is a binary classification problem, the output layer uses a `sigmoid` function to squash the result into a probability between 0 and 1. Note how we default to `ReLU` in the hidden layers. FC means fully connected.
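
To see what the two activation functions do to raw values, here is a short sketch on a toy input. The input values are arbitrary and only serve the illustration.

```python
import torch

z_demo = torch.tensor([-2.0, 0.0, 2.0])

# ReLU zeroes out negative values and passes positive ones through unchanged.
relu_out = torch.relu(z_demo)

# Sigmoid squashes any real number into the open interval (0, 1);
# an input of 0 maps to exactly 0.5.
sigmoid_out = torch.sigmoid(z_demo)

print(relu_out, sigmoid_out)
```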

In [None]:
class SimpleNN(nn.Module):
    def __init__(self, input_size):
        super(SimpleNN, self).__init__()
        self.fc1 = nn.Linear(input_size, 16)  # First layer (16 neurons)
        self.fc2 = nn.Linear(16, 8)           # Second layer (8 neurons)
        self.fc3 = nn.Linear(8, 1)            # Output layer (1 neuron for binary classification)
        self.sigmoid = nn.Sigmoid()           # Sigmoid maps the output to a probability in (0, 1).

    def forward(self, x):
        x = torch.relu(self.fc1(x))           # ReLU for hidden layers
        x = torch.relu(self.fc2(x))
        x = self.sigmoid(self.fc3(x))
        return x

Let's create this NN.
Note the loss function: we spoke a lot about MSE in our regressions, but since this is a binary classification, we use binary cross-entropy instead.

Adam is a widely used optimizer that adapts the gradient descent step size for each parameter.
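
For intuition about the loss, the sketch below computes binary cross-entropy by hand and compares it with `nn.BCELoss`. The probabilities and labels are made-up toy values.

```python
import torch
import torch.nn as nn

probs = torch.tensor([0.9, 0.2, 0.6])     # predicted probabilities
targets = torch.tensor([1.0, 0.0, 1.0])   # true labels

# BCE = -mean( y * log(p) + (1 - y) * log(1 - p) )
manual_bce = -(targets * torch.log(probs)
               + (1 - targets) * torch.log(1 - probs)).mean()
builtin_bce = nn.BCELoss()(probs, targets)

print(manual_bce.item(), builtin_bce.item())
```

Confident and correct predictions (0.9 for a true 1) contribute little to the loss, while hesitant ones (0.6 for a true 1) contribute more.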

In [None]:
input_size = X_train.shape[1]  # Number of features, 30 in the dataset.
model = SimpleNN(input_size)
criterion = nn.BCELoss()  # Binary Cross-Entropy Loss, specific for classification.
optimizer = optim.Adam(model.parameters(), lr=0.01)  # Adam optimizer with a learning rate of 0.01.

Finally we run the training loop.

Remember that we have to do the gradient descent for the model to update its weights and therefore **learn**.
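
As a reminder of what a gradient descent step looks like without an optimizer object, here is a minimal sketch on a toy problem. The one-parameter loss `(w - 3)^2`, the learning rate and the number of steps are assumptions chosen only to keep the example short.

```python
import torch

# One learnable weight; the toy loss (w - 3)^2 has its minimum at w = 3.
w = torch.tensor(0.0, requires_grad=True)
learning_rate = 0.1

for step in range(100):
    loss_demo = (w - 3.0) ** 2
    loss_demo.backward()               # compute d(loss)/dw
    with torch.no_grad():
        w -= learning_rate * w.grad    # move against the gradient
    w.grad.zero_()                     # reset, as optimizer.zero_grad() does

print(w.item())
```

`optimizer.step()` in the training loop plays the role of the manual update here, with Adam adjusting the step size instead of using a fixed learning rate.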

As we are on Google Colab, we can use **TensorBoard**, a tool for inspecting and visualizing training runs. We also include `tqdm`, a well-known progress bar library for long-running training loops.

In [None]:
from torch.utils.tensorboard import SummaryWriter
from tqdm.notebook import tqdm

writer = SummaryWriter(log_dir="runs/breast_cancer_nn")
writer.add_graph(model, X_train)

activations = {}
def activation_hook(name):
    def hook(model, input, output):
        activations[name] = output
    return hook
for name, layer in model.named_modules():
    if isinstance(layer, nn.Linear):
        layer.register_forward_hook(activation_hook(name))

# Training loop
EPOCHS = 50
for epoch in tqdm(range(EPOCHS)):

    # Forward pass
    y_pred = model(X_train)
    loss = criterion(y_pred, y_train)

    # Backward pass
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Metrics
    with torch.no_grad():
        y_pred_labels = (y_pred > 0.5).int()
        train_accuracy = accuracy_score(y_train.numpy(), y_pred_labels.numpy())
    writer.add_scalar("Loss/train", loss.item(), epoch)
    writer.add_scalar("Accuracy/train", train_accuracy, epoch)
    lr = optimizer.param_groups[0]['lr']
    writer.add_scalar("Learning Rate", lr, epoch)
    for name, param in model.named_parameters():
        writer.add_histogram(f"Weights/{name}", param, epoch)
        if param.grad is not None:
            writer.add_histogram(f"Gradients/{name}", param.grad, epoch)
    for name, activation in activations.items():
        writer.add_histogram(f"Activations/{name}", activation, epoch)

We run TensorBoard in Colab. In case it's already running, we kill it and restart it.


In [None]:
!kill -9 $(ps aux | grep '[t]ensorboard' | awk '{print $2}')

%load_ext tensorboard
%tensorboard --logdir runs

Here we evaluate the model. Note three things happening in the code:
1. `eval()` switches the model to evaluation mode, which mostly affects layers such as dropout and batch normalization.
2. `no_grad()` disables gradient tracking: since we are no longer learning but only using the model, we save the resources that gradient computation would need.
3. `(y_pred_test > 0.5).int()` converts probabilities to labels. Remember the decision boundary from SVM?
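
The first two points can be observed on a stand-alone layer. The sketch below uses a dropout layer purely for illustration; `SimpleNN` itself has no dropout, and the tensors here are toy values.

```python
import torch
import torch.nn as nn

layer = nn.Dropout(p=0.5)
x_demo = torch.ones(1000)

layer.train()                  # training mode: roughly half the values are zeroed
train_out = layer(x_demo)

layer.eval()                   # evaluation mode: dropout becomes the identity
eval_out = layer(x_demo)

# Inside no_grad, results are not tracked for backpropagation.
w_demo = torch.ones(3, requires_grad=True)
with torch.no_grad():
    untracked = w_demo * 2

print((train_out == 0).sum().item(), torch.equal(eval_out, x_demo), untracked.requires_grad)
```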

Compare it with the performance of our best logistic regression model:
```bash
Accuracy: 97.37%

Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.95      0.96        43
           1       0.97      0.99      0.98        71

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
```

In [None]:
model.eval()
with torch.no_grad():
    y_pred_test = model(X_test)
    y_pred_labels = (y_pred_test > 0.5).int()
    accuracy = accuracy_score(y_test, y_pred_labels)
    print(f"Test Accuracy: {accuracy*100:.2f}%")
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_labels))

    # In this dataset, label 0 is malignant and label 1 is benign.
    ConfusionMatrixDisplay.from_predictions(y_test, y_pred_labels, display_labels=['Malignant', 'Benign'], cmap='Blues')
    plt.title('Confusion Matrix')
    plt.show()


## Generative Models

Generative models usually have quite large architectures. For this introduction we will build a small variational autoencoder (VAE), which consists of an encoder that outputs the parameters of a probability distribution (mean and log-variance) and a decoder that reconstructs the input from a sampled latent vector.

Let's import the necessary libraries:
- **DataLoader**: As we start working with large datasets, it won't be feasible to load everything in memory as we were doing with the scikit-learn datasets. PyTorch comes equipped with loaders that serve such datasets in batches.
- **datasets**: Like the scikit-learn module of the same name, it provides educational datasets.
- **transforms**: Functions that transform the data as it is loaded, for example converting everything to a tensor.
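
To show how a `DataLoader` hands out data in batches without downloading anything, here is a small sketch on synthetic tensors. The `TensorDataset` wrapper, the fake data and the batch size of 4 are assumptions made for the example.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# 10 fake samples with 4 features each, plus integer labels.
features = torch.arange(40, dtype=torch.float32).view(10, 4)
labels_demo = torch.arange(10)

demo_loader = DataLoader(TensorDataset(features, labels_demo), batch_size=4, shuffle=False)

# Iterating over the loader yields one (features, labels) batch at a time.
batch_shapes = [tuple(xb.shape) for xb, yb in demo_loader]
print(batch_shapes)
```

The last batch is smaller than the others because 10 samples do not divide evenly into batches of 4.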

In [None]:
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, utils

We build the autoencoder here. It is made of an encoding process and a decoding process, on top of the network setup we used earlier.

In [None]:
class VAE(nn.Module):
    def __init__(self):
        super(VAE, self).__init__()

        # Encoder layers
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2_mu = nn.Linear(128, 3)      # Mean of latent space
        self.fc2_logvar = nn.Linear(128, 3)  # Log-variance of latent space

        # Decoder layers
        self.fc3 = nn.Linear(3, 128)
        self.fc4 = nn.Linear(128, 28 * 28)

    def encode(self, x):
        h1 = torch.relu(self.fc1(x))
        mu = self.fc2_mu(h1)
        logvar = self.fc2_logvar(h1)
        return mu, logvar

    def reparameterize(self, mu, logvar):
        # The reparameterization trick.
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def decode(self, z):
        h3 = torch.relu(self.fc3(z))
        return torch.sigmoid(self.fc4(h3))

    def forward(self, x):
        mu, logvar = self.encode(x.view(-1, 28 * 28))
        z = self.reparameterize(mu, logvar)
        return self.decode(z), mu, logvar
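
The reason for the reparameterization step in the class above is that it keeps the sampled latent vector differentiable. The sketch below repeats the same three lines on toy tensors (the `_demo` names and zero-valued inputs are assumptions for illustration) and checks that gradients reach both `mu` and `logvar`.

```python
import torch

mu_demo = torch.zeros(3, requires_grad=True)
logvar_demo = torch.zeros(3, requires_grad=True)

# Same steps as VAE.reparameterize: z = mu + eps * std, with eps ~ N(0, 1).
std_demo = torch.exp(0.5 * logvar_demo)
eps_demo = torch.randn_like(std_demo)
z_demo = mu_demo + eps_demo * std_demo

# The randomness lives in eps, so z stays differentiable w.r.t. mu and logvar.
z_demo.sum().backward()
print(mu_demo.grad, logvar_demo.grad)
```

Sampling `z` directly from a distribution would cut the gradient path; moving the randomness into `eps` is what lets the encoder be trained by backpropagation.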


The loss function, as defined in the lesson, combines a reconstruction term with the Kullback-Leibler divergence.

In [None]:
def loss_function(recon_x, x, mu, logvar):
    # The reconstruction loss, 1st term of the ELBO
    BCE = nn.functional.binary_cross_entropy(recon_x, x.view(-1, 28 * 28), reduction='sum')

    # Closed-form KL divergence between the Gaussian N(mu, sigma^2) and the standard normal N(0, 1).
    KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Combined, this is the VAE loss function
    return BCE + KLD
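
As a sanity check on the KL term in the loss function above, the sketch below compares the closed-form expression with the value computed by `torch.distributions`. The means and log-variances are arbitrary toy values.

```python
import torch
from torch.distributions import Normal, kl_divergence

mu_demo = torch.tensor([0.5, -1.0, 0.0])
logvar_demo = torch.tensor([0.0, 0.4, -0.3])

# Closed form, written exactly as in loss_function.
closed_form = -0.5 * torch.sum(1 + logvar_demo - mu_demo.pow(2) - logvar_demo.exp())

# Reference: KL( N(mu, sigma^2) || N(0, 1) ), summed over the latent dimensions.
posterior = Normal(mu_demo, torch.exp(0.5 * logvar_demo))
prior = Normal(torch.zeros(3), torch.ones(3))
reference = kl_divergence(posterior, prior).sum()

print(closed_form.item(), reference.item())
```

The two values agree, which confirms that the KL term measures how far the encoder's distribution is from a standard normal prior.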


Load and prepare the dataset. Images can be plotted directly with matplotlib. Note how we use an iterator that serves the data in batches rather than all at once, which matters for such a large dataset.

The chosen **MNIST dataset** (Modified National Institute of Standards and Technology) is a widely-used dataset in the machine learning and computer vision community:
- **Dataset Size**: 70,000 images
  - 60,000 images for training
  - 10,000 images for testing
- **Image Size**: 28x28 pixels (grayscale)
- **Labels**: 10 (digits 0 through 9)
- **Pixel Values**: Range from 0 (black) to 255 (white); `ToTensor` scales them to [0, 1].

In [None]:
transform = transforms.Compose([transforms.ToTensor()])

train_dataset = datasets.MNIST(root='./data', train=True, transform=transform, download=True)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)

data_iter = iter(train_loader)
images, labels = next(data_iter)
fig, axs = plt.subplots(1, 2, figsize=(8, 4))
for i in range(2):
    axs[i].imshow(images[i].squeeze(), cmap="gray")
    axs[i].set_title(f"Label: {labels[i].item()}")
    axs[i].axis('off')
plt.tight_layout()
plt.show()

Train the VAE. This will take some time due to the dataset's size.

In [None]:
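model = VAE()
optimizer = optim.Adam(model.parameters(), lr=0.001)
EPOCHS = 5
writer = SummaryWriter(log_dir="runs/vae_experiment")

# Test loader for the reconstruction previews logged after each epoch.
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=64, shuffle=False)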
for epoch in range(EPOCHS):
    model.train()
    train_loss = 0
    for batch_idx, (data, _) in enumerate(train_loader):
        optimizer.zero_grad()
        recon_batch, mu, logvar = model(data)
        loss = loss_function(recon_batch, data, mu, logvar)
        loss.backward()
        train_loss += loss.item()
        optimizer.step()

        if batch_idx % 100 == 0:
            print(f'Epoch [{epoch+1}/{EPOCHS}] Batch [{batch_idx}/{len(train_loader)}] Loss: {loss.item() / len(data):.6f}')

    avg_train_loss = train_loss / len(train_loader.dataset)
    print(f'====> Epoch: {epoch+1} Average loss: {avg_train_loss:.4f}')
    writer.add_scalar('Loss/train', avg_train_loss, epoch)

    # Write sample reconstructions to TensorBoard.
    model.eval()
    with torch.no_grad():
        test_data, _ = next(iter(test_loader))
        recon_batch, _, _ = model(test_data)
        recon_batch = recon_batch.view(-1, 1, 28, 28)
        comparison = torch.cat([test_data[:8], recon_batch[:8]])
        img_grid = utils.make_grid(comparison, nrow=8)
        writer.add_image('Reconstructed Images', img_grid, epoch)

writer.close()

In [None]:
%tensorboard --logdir=runs/vae_experiment

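Remember, we have encoded an image and the NN tries to reconstruct it. Because the decoder works from a sampled latent vector, the output is a newly generated image that resembles the input rather than an exact copy. This is the starting point of many GenAI models.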

In [None]:
def plot_image_grid(original_images, reconstructed_images, n=2):
    plt.figure(figsize=(10, 4))
    for i in range(n):
        # Original images
        ax = plt.subplot(2, n, i + 1)
        plt.imshow(original_images[i].view(28, 28).cpu().numpy(), cmap='gray')
        ax.axis('off')
        if i == n // 2:
            ax.set_title('Original Images')

        # Reconstructed images
        ax = plt.subplot(2, n, i + 1 + n)
        plt.imshow(reconstructed_images[i].view(28, 28).cpu().numpy(), cmap='gray')
        ax.axis('off')
        if i == n // 2:
            ax.set_title('Reconstructed Images')
    plt.show()

BATCH_SIZE = 10

model.eval()
test_dataset = datasets.MNIST(root='./data', train=False, transform=transform, download=True)
test_loader = DataLoader(test_dataset, batch_size=BATCH_SIZE, shuffle=True)
dataiter = iter(test_loader)
images, _ = next(dataiter)

with torch.no_grad():
    reconstructed, _, _ = model(images)
plot_image_grid(images, reconstructed, n=BATCH_SIZE)
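
Beyond reconstruction, a VAE can also generate images without any input by sampling latent vectors from the prior and decoding them. The sketch below is self-contained and therefore uses a hypothetical, untrained stand-in decoder with the same layer sizes as the VAE above; with the trained model you would call `model.decode(z)` instead.

```python
import torch
import torch.nn as nn

# Stand-in decoder with the same layer sizes as the VAE's decoder (3 -> 128 -> 784).
# It is untrained here, so its output is noise rather than digits.
demo_decoder = nn.Sequential(
    nn.Linear(3, 128), nn.ReLU(),
    nn.Linear(128, 28 * 28), nn.Sigmoid(),
)

with torch.no_grad():
    z_samples = torch.randn(4, 3)    # 4 latent vectors drawn from N(0, I)
    generated = demo_decoder(z_samples).view(-1, 1, 28, 28)

print(generated.shape)
```

Each row of `z_samples` is one point in the 3-dimensional latent space, and the decoder turns it into one 28x28 image with pixel values between 0 and 1.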

# Conclusion

That was the last model. The next notebook will be about ML and data engineering. A model is only as good as the data you give it.