### Project Number 5

##### Harry Denell (hdenell@uwaterloo.ca) and Evan St. Pierre (e3stpier@uwaterloo.ca)




#### Abstract

#### Team Members and Contributions

#### Code Libraries


| **Libraries**      | **Explanation**                                                                                  |
|---------------------|--------------------------------------------------------------------------------------------------|
| **NumPy**           | Numerical computing with arrays; used here to concatenate feature arrays before clustering. |
| **Pillow (PIL)**    | Image loading and basic image manipulation. |
| **Matplotlib**      | A plotting library for creating static visualizations in Python. |
| **tqdm**            | Progress bars for loops. |
| **PyTorch**         | Deep learning framework used to define and train the CNN (`torch`, `torch.nn`, `torch.optim`). |
| **TorchVision**     | Provides the CIFAR-10 dataset and the image transformations. |
| **Scikit-learn**    | Machine learning library; `KMeans` is used to cluster the features of the unlabeled data. |

In [63]:
# Python Libraries
import random
import math
import numbers
import platform
import copy

# Importing essential libraries for basic image manipulations.
import numpy as np
import PIL
from PIL import Image, ImageOps
import matplotlib.pyplot as plt
from tqdm import tqdm

# We import some of the main PyTorch and TorchVision libraries used for H
# Detailed installation instructions are here: https://pytorch.org/get-st
# That web site should help you to select the right 'conda install' comma
# In particular, select the right version of CUDA. Note that prior to ins
# install the latest driver for your GPU and CUDA (9.2 or 10.1), assuming
# For more information about pytorch refer to
# https://pytorch.org/docs/stable/nn.functional.html
# https://pytorch.org/docs/stable/data.html.
# and https://pytorch.org/docs/stable/torchvision/transforms.html
import torch
import torch.nn.functional as F
from torch import nn
from torch.utils.data import DataLoader
import torchvision.transforms as transforms
import torchvision.transforms.functional as tF

In [64]:
# It is best to start with USE_GPU = False (implying CPU). Switch USE_GPU
# we strongly recommend to wait until you are absolutely sure your CPU-ba
USE_GPU = False

if USE_GPU:
    device = torch.device("mps")
else:
    # Fall back to the CPU so that `device` is always defined
    device = torch.device("cpu")

### PART 1: 

In [77]:
import torch
import torchvision
import torchvision.transforms as transforms
from torch.utils.data import random_split, DataLoader
import torch.optim as optim


In [78]:
    
# Define device
device = torch.device("mps" if torch.backends.mps.is_available() else 
                      ("cuda" if torch.cuda.is_available() else "cpu"))

Step 1: Data Preparation
  - Load CIFAR-10 dataset with transformations.
  - Split training data into labeled and unlabeled subsets.
  - Create dataloaders for labeled, unlabeled, and test data.

Step 2: Model Definition
  - Define a CNN with:
    - Feature extraction layers.
    - Fully connected classification head.
  - Ensure the model outputs:
    - Features for clustering.
    - Logits for classification.

Step 3: Loss Functions
  - Supervised Loss: Cross-entropy on labeled data.
  - Unsupervised Loss:
    - Perform K-Means on features of unlabeled data.
    - Compute clustering loss (e.g., mean distance to cluster centers).
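
As a rough illustration of the clustering loss in Step 3, the sketch below assigns each feature vector to its nearest center and averages the squared distances. The function name `clustering_loss` and the toy tensors are assumptions made for this example; they are not taken from the notebook.

```python
import torch

def clustering_loss(features, centers):
    """Mean squared distance from each feature to its nearest center (illustrative)."""
    # Assignments are computed without gradients; only the distance term is differentiated.
    with torch.no_grad():
        assignments = torch.cdist(features, centers).argmin(dim=1)
    return ((features - centers[assignments]) ** 2).mean()

# Toy data chosen by hand: two 2-D features and two centers.
features = torch.tensor([[0.0, 0.0], [4.0, 4.0]], requires_grad=True)
centers = torch.tensor([[1.0, 0.0], [4.0, 3.0]])

loss = clustering_loss(features, centers)
loss.backward()  # gradients flow into the features, not into the fixed centers
```

Each feature here sits one unit from its nearest center, so the squared differences sum to 2 over 4 elements and the loss is 0.5.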

Step 4: Training Loop
  For each epoch:
    - Train on Labeled Data (Supervised Step):
      - Load labeled batch.
      - Pass batch through the model.
      - Compute cross-entropy loss.
      - Backpropagate and update the model.

    - Train on Unlabeled Data (Unsupervised Step):
      - Extract features for all unlabeled data.
      - Perform K-Means on features.
      - For each batch of unlabeled data:
        - Compute clustering loss.
        - Backpropagate and update the model.

    - Log losses and progress.

Step 5: Testing and Evaluation
  - Evaluate on test set:
    - Compute accuracy using predicted labels.
  - Analyze performance at different labeled/unlabeled splits.

Step 6: Results Analysis
  - Plot training losses.
  - Visualize feature clustering with t-SNE.
  - Discuss how unlabeled data affected the model's performance.


Step 1: Data Preparation
  - Load the CIFAR-10 dataset using the pre-defined transformations.
  - Split training data into labeled and unlabeled subsets.
  - Create dataloaders for labeled, unlabeled, and test data.

In [79]:
# Define transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))  # Normalize RGB channels
])

# Load CIFAR-10 dataset (original training set only)
full_trainset = torchvision.datasets.CIFAR10(
    root='./data', train=True, download=True, transform=transform
)

# Split into 80% train and 20% validation
train_size = int(0.8 * len(full_trainset))
val_size = len(full_trainset) - train_size
trainset, valset = random_split(full_trainset, [train_size, val_size])

# Create DataLoaders for training and validation
trainloader = DataLoader(trainset, batch_size=32, shuffle=True)
valloader = DataLoader(valset, batch_size=32, shuffle=False)

# Load CIFAR-10 test set (unchanged)
testset = torchvision.datasets.CIFAR10(
    root='./data', train=False, download=True, transform=transform
)
testloader = DataLoader(testset, batch_size=32, shuffle=False)

In [80]:
def split_by_ratio(data, labels, ratio, seed=42):
    M = int(len(data) * ratio)  # Calculate number of labeled samples
    return split_labeled_unlabeled(data, labels, M, seed)
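
`split_by_ratio` relies on `split_labeled_unlabeled`, which is not defined in the cells shown here. The sketch below is one possible version of that helper, assuming array-like inputs and assuming it returns the labeled data, the labeled targets, and the unlabeled data; both the body and the return convention are hypothetical.

```python
import numpy as np

def split_labeled_unlabeled(data, labels, M, seed=42):
    """Hypothetical helper: keep labels for M random samples, drop them for the rest."""
    data, labels = np.asarray(data), np.asarray(labels)
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(data))          # reproducible shuffle of sample indices
    labeled_idx, unlabeled_idx = perm[:M], perm[M:]
    return data[labeled_idx], labels[labeled_idx], data[unlabeled_idx]

def split_by_ratio(data, labels, ratio, seed=42):
    M = int(len(data) * ratio)  # number of labeled samples
    return split_labeled_unlabeled(data, labels, M, seed)

# Toy usage: 10 samples, 20% of them keep their labels.
data = np.arange(10).reshape(10, 1)
labels = np.arange(10) % 2
x_l, y_l, x_u = split_by_ratio(data, labels, ratio=0.2)
```

With a fixed `seed`, the same samples land in the labeled subset on every run, which makes comparisons across labeled/unlabeled ratios repeatable.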

In [81]:
# class CNN(nn.Module):
#     def __init__(self, num_classes=10):
#         super(CNN, self).__init__()
#         # Feature Extractor
#         self.feature_extractor = nn.Sequential(
#             nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1),
#             nn.ReLU(),
#             nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
#             nn.ReLU(),
#             nn.MaxPool2d(2, 2),  # Downsample
#             nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
#             nn.ReLU(),
#             nn.MaxPool2d(2, 2)  # Downsample
#         )
#         # Classifier
#         self.classifier = nn.Sequential(
#             nn.Flatten(),
#             nn.Linear(128 * 8 * 8, 256),
#             nn.ReLU(),
#             nn.Linear(256, num_classes)
#         )

#     def forward(self, x):
#         features = self.feature_extractor(x)
#         logits = self.classifier(features)
#         return features, logits

class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), 
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
            
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1,1))  # [N,128,1,1]
        )
        self.classifier_head = nn.Linear(128, num_classes)
        
    def feature_extractor(self, x):
        x = self.conv(x)           # shape: [N, 128, 1, 1]
        x = x.view(x.size(0), -1)  # Flatten to [N, 128]
        return x
    
    def classifier(self, features):
        return self.classifier_head(features)
    
    def forward(self, x):
        f = self.feature_extractor(x)
        logits = self.classifier(f)
        return f, logits
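
Because the last layer of `self.conv` is an adaptive average pool to 1×1, the feature vector has 128 dimensions regardless of the input resolution. A quick standalone shape check, rebuilding the same layer stack outside the class (the names `conv` and `head` are illustrative):

```python
import torch
from torch import nn

# Same layer stack as `self.conv` in the CNN above, rebuilt standalone for the check.
conv = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d((1, 1)),
)
head = nn.Linear(128, 10)

x = torch.randn(4, 3, 32, 32)             # a fake batch of four CIFAR-sized images
features = conv(x).flatten(start_dim=1)   # shape [4, 128], used for clustering
logits = head(features)                   # shape [4, 10], used for classification
```

The compact 128-dimensional feature also keeps K-Means far cheaper than clustering the 128 × 8 × 8 activations of the earlier, commented-out architecture.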

Step 2: Model Definition
  - Define a CNN with:
    - Feature extraction layers.
    - Fully connected classification head.
  - Ensure the model outputs:
    - Features for clustering.
    - Logits for classification.

In [70]:
# # first implementation of simple CNN for 
# class CNN(nn.Module):
#     def __init__(self, num_classes=10):
#         super(CNN, self).__init__()

#         # Feature Extractor?
        
#         # Output: 32x32x32
#         self.conv1 = nn.Conv2d(3, 32, kernel_size=3, stride=1, padding=1)  

#         # Output: 64x32x32
#         self.conv2 = nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1)  

#         # Output: Downsample to 64x16x16
#         self.pool = nn.MaxPool2d(2, 2)  

#         # Output: 128x16x16
#         self.conv3 = nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1) 

#         # Output: Downsample to 128x8x8
#         self.pool2 = nn.MaxPool2d(2, 2)
        
#         # Fully Connected Layers (Classification Head)
#         self.fc1 = nn.Linear(128 * 8 * 8, 256)  # Flattened size depends on input resolution
#         self.fc2 = nn.Linear(256, num_classes)  # Final classification layer

#     def forward(self, x):

#         # Convolutional Layers (Feature Extractor)

#         x = F.relu(self.conv1(x))  # Conv1 + ReLU

#         x = F.relu(self.conv2(x))  # Conv2 + ReLU

#         x = self.pool(x)  # MaxPooling

#         x = F.relu(self.conv3(x))  # Conv3 + ReLU
        
#         x = self.pool2(x)  # MaxPooling

#         # Flatten for Fully Connected Layers
#         features = x.view(x.size(0), -1)  # Flatten features for clustering
#         x = F.relu(self.fc1(features))  # Fully Connected Layer 1
#         logits = self.fc2(x)  # Fully Connected Layer 2 (Logits)

#         # Return both features and logits
#         return features, logits

Train on the dataset with all labels present (fully supervised baseline)

In [83]:
# Instantiate labeled_model, define the loss, and create the optimizer
labeled_model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(labeled_model.parameters(), lr=0.001)

# Training loop
num_epochs = 10
for epoch in range(num_epochs):
    labeled_model.train()  # Set labeled_model to training mode
    running_loss = 0.0
    for inputs, labels in trainloader:
        inputs, labels = inputs.to(device), labels.to(device)  # Move data to device

        # Forward pass
        features, logits = labeled_model(inputs)  # Unpack the tuple
        loss = criterion(logits, labels)  # Use logits for loss computation
        
        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {running_loss / len(trainloader):.4f}")

# Evaluation loop
labeled_model.eval()  # Set labeled_model to evaluation mode
correct = 0
total = 0

with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        
        # Forward pass
        features, logits = labeled_model(inputs)  # Unpack the tuple returned by the labeled_model
        _, predicted = torch.max(logits, 1)  # Use logits for prediction

        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")

mps
Epoch 1/10, Loss: 1.6729
Epoch 2/10, Loss: 1.3444
Epoch 3/10, Loss: 1.1914
Epoch 4/10, Loss: 1.0838
Epoch 5/10, Loss: 1.0181
Epoch 6/10, Loss: 0.9594
Epoch 7/10, Loss: 0.9121
Epoch 8/10, Loss: 0.8671
Epoch 9/10, Loss: 0.8327
Epoch 10/10, Loss: 0.7939
Test Accuracy: 71.04%
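
The per-epoch losses printed above can be turned into the training-loss plot planned in Step 6. This is a minimal sketch, assuming the values are copied by hand from the log; in practice they would be appended to a list inside the training loop. The variable names are illustrative.

```python
import matplotlib.pyplot as plt

# Average training loss per epoch, copied from the printed log above.
epoch_losses = [1.6729, 1.3444, 1.1914, 1.0838, 1.0181,
                0.9594, 0.9121, 0.8671, 0.8327, 0.7939]
epochs = range(1, len(epoch_losses) + 1)

fig, ax = plt.subplots(figsize=(6, 4))
ax.plot(epochs, epoch_losses, marker="o")
ax.set_xlabel("Epoch")
ax.set_ylabel("Average cross-entropy loss")
ax.set_title("Fully supervised baseline: training loss")
ax.grid(True)
```

The loss is still decreasing at epoch 10, so the baseline has probably not converged within this budget.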


Use t-SNE or PCA to show clusters in the feature space?
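
One way this could look with t-SNE is sketched below. It assumes a `[num_samples, 128]` feature matrix; random placeholder data is used here so that the block runs on its own, and in the notebook the features would come from `labeled_model.feature_extractor` on test images.

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder features: 200 random 128-D vectors standing in for CNN features.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 128)).astype(np.float32)
labels = rng.integers(0, 10, size=200)  # placeholder class ids, used only for coloring

# Project to 2-D; perplexity must be smaller than the number of samples.
embedding = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(features)
```

The result can then be drawn with `plt.scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap="tab10", s=8)`. With real features, well-separated colors would suggest that the classes form distinct clusters in feature space.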

In [5]:
# Supervised loss. The loss type could later be passed in as an argument; cross-entropy is used for now.
def supervised_loss(logits, labels):
    return F.cross_entropy(logits, labels)


In [84]:
import torch
import torchvision
import torchvision.transforms as transforms
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, random_split
from sklearn.cluster import KMeans
import numpy as np

# Define transformations
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))
])

# Load CIFAR-10 dataset
full_trainset = torchvision.datasets.CIFAR10(root='./data', train=True, download=True, transform=transform)

LABEL_SIZE = 0.2

# Split into labeled and unlabeled datasets (20% labeled)
labeled_size = int(LABEL_SIZE * len(full_trainset))
unlabeled_size = len(full_trainset) - labeled_size
labeled_set, unlabeled_set = random_split(full_trainset, [labeled_size, unlabeled_size])

print(f"Labeled size: {labeled_size}, Unlabeled size: {unlabeled_size}")


# Compute ratio
ratio = unlabeled_size / labeled_size

# A simple heuristic: if ratio is large, update K-Means less often
# For instance, ratio=1 means K-Means runs every epoch, and ratio=5 means it runs every 5 epochs
kmeans_update_frequency = max(1, int(ratio))

print(f"K-Means will be updated every {kmeans_update_frequency} epochs based on ratio={ratio:.2f}.")

# Create DataLoaders for labeled and unlabeled datasets
labeled_loader = DataLoader(labeled_set, batch_size=32, shuffle=True)
unlabeled_loader = DataLoader(unlabeled_set, batch_size=32, shuffle=True)

# Load CIFAR-10 test set
testset = torchvision.datasets.CIFAR10(root='./data', train=False, download=True, transform=transform)
testloader = DataLoader(testset, batch_size=32, shuffle=False)

# Define device
device = torch.device("mps" if torch.backends.mps.is_available() else 
                      ("cuda" if torch.cuda.is_available() else "cpu"))
print(f"Using device: {device}")


model = CNN().to(device)
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 10
K = 10  # number of clusters
lambda_cluster = 0.5

cluster_centers_torch = None  # will store cluster centers

for epoch in range(num_epochs):
    model.eval()
    
    # Update K-Means only every kmeans_update_frequency epochs or if we have no cluster centers yet
    if epoch % kmeans_update_frequency == 0 or cluster_centers_torch is None:
        # ----- Run K-Means Clustering on Unlabeled Data -----
        all_features = []
        with torch.no_grad():
            for unlabeled_batch in unlabeled_loader:
                u_inputs = unlabeled_batch[0].to(device)
                u_features, _ = model(u_inputs)
                all_features.append(u_features.cpu().numpy())
        
        all_features = np.concatenate(all_features, axis=0)  # shape: [num_unlabeled_samples, feature_dim]

        # Fit KMeans
        kmeans = KMeans(n_clusters=K, random_state=42)
        kmeans.fit(all_features)
        cluster_centers = kmeans.cluster_centers_
        cluster_centers_torch = torch.tensor(cluster_centers, dtype=torch.float32, device=device)
        print(f"Updated K-Means at epoch {epoch+1}")

    # Reset unlabeled loader iterator for this epoch
    unlabeled_iter = iter(unlabeled_loader)

    model.train()
    running_supervised_loss = 0.0
    running_cluster_loss = 0.0
    
    for x_l, y_l in labeled_loader:
        x_l, y_l = x_l.to(device), y_l.to(device)
        
        # Get an unlabeled batch
        try:
            x_u, _ = next(unlabeled_iter)
        except StopIteration:
            unlabeled_iter = iter(unlabeled_loader)
            x_u, _ = next(unlabeled_iter)
        
        x_u = x_u.to(device)
        
        # Compute supervised loss
        # (named sup_loss so it does not shadow the supervised_loss() function defined earlier)
        f_l, logits_l = model(x_l)
        sup_loss = criterion(logits_l, y_l)
        
        # Compute cluster loss for the unlabeled batch:
        f_u, _ = model(x_u)
        
        # Assign clusters by finding nearest center
        with torch.no_grad():
            dists = torch.cdist(f_u, cluster_centers_torch)
            assignments = dists.argmin(dim=1)
        
        assigned_centers = cluster_centers_torch[assignments]
        cluster_loss = ((f_u - assigned_centers)**2).mean()
        
        # Combine losses and backprop
        total_loss = sup_loss + lambda_cluster * cluster_loss
        optimizer.zero_grad()
        total_loss.backward()
        optimizer.step()
        
        running_supervised_loss += sup_loss.item()
        running_cluster_loss += cluster_loss.item()
    
    avg_supervised_loss = running_supervised_loss / len(labeled_loader)
    avg_cluster_loss = running_cluster_loss / len(labeled_loader)
    
    print(f"Epoch {epoch+1}/{num_epochs}: "
          f"Supervised Loss = {avg_supervised_loss:.4f}, "
          f"Cluster Loss = {avg_cluster_loss:.4f}")

# ----- Evaluation -----
model.eval()
correct = 0
total = 0
with torch.no_grad():
    for inputs, labels in testloader:
        inputs, labels = inputs.to(device), labels.to(device)
        _, logits = model(inputs)
        _, predicted = torch.max(logits, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print(f"Test Accuracy: {100 * correct / total:.2f}%")

Labeled size: 10000, Unlabeled size: 40000
K-Means will be updated every 4 epochs based on ratio=4.00.
Using device: mps
Updated K-Means at epoch 1
Epoch 1/10: Supervised Loss = 1.9655, Cluster Loss = 0.0798
Epoch 2/10: Supervised Loss = 1.6614, Cluster Loss = 0.1003
Epoch 3/10: Supervised Loss = 1.5418, Cluster Loss = 0.0972
Epoch 4/10: Supervised Loss = 1.4420, Cluster Loss = 0.0944
Updated K-Means at epoch 5
Epoch 5/10: Supervised Loss = 1.3515, Cluster Loss = 0.0353
Epoch 6/10: Supervised Loss = 1.2956, Cluster Loss = 0.0383
Epoch 7/10: Supervised Loss = 1.2390, Cluster Loss = 0.0427
Epoch 8/10: Supervised Loss = 1.1889, Cluster Loss = 0.0439
Updated K-Means at epoch 9
Epoch 9/10: Supervised Loss = 1.1464, Cluster Loss = 0.0411
Epoch 10/10: Supervised Loss = 1.0963, Cluster Loss = 0.0437
Test Accuracy: 60.01%
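
The quantity minimized in each step of the training loop above can be written compactly. The symbols are notation introduced here for illustration: $f$ is the feature extractor, $c_k$ are the K-Means centers, $B$ is the unlabeled batch size, and $D = 128$ is the feature dimension. The $1/(BD)$ factor reflects that `.mean()` averages over every element of the squared difference.

$$
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{CE}}(x^l, y^l) + \lambda \, \mathcal{L}_{\text{cluster}}, \qquad
\mathcal{L}_{\text{cluster}} = \frac{1}{B D} \sum_{i=1}^{B} \lVert f(x^u_i) - c_{a_i} \rVert_2^2, \qquad
a_i = \arg\min_{k} \lVert f(x^u_i) - c_k \rVert_2
$$

Here $\lambda = 0.5$ (`lambda_cluster`) and there are $K = 10$ centers. The centers stay fixed between K-Means updates, so gradients flow only through $f$.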


For each epoch:
    1. Supervised Phase:
        - For each batch in labeled_loader:
            a. Pass images through the model to get predictions (logits).
            b. Compute supervised loss (cross-entropy).
            c. Backpropagate and update the model weights.

    2. Unsupervised Phase:
        a. Extract features for all images in unlabeled_loader.
        b. Perform clustering (e.g., K-Means) on the features.
        c. For each batch in unlabeled_loader:
            - Compute clustering loss (distance to cluster centers).
            - Backpropagate and update the feature extractor.

    Log both supervised and unsupervised losses.
    Evaluate on the test set to track performance.
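
To track test performance after every epoch, the evaluation code could be wrapped in a small function. This is a sketch, assuming a model whose forward pass returns `(features, logits)` as the CNN above does; the `evaluate` name and the toy `IdentityModel` are illustrative, not taken from the notebook.

```python
import torch
from torch import nn

def evaluate(model, loader, device="cpu"):
    """Return top-1 accuracy (%) for a model whose forward returns (features, logits)."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            _, logits = model(inputs)
            correct += (logits.argmax(dim=1) == labels).sum().item()
            total += labels.size(0)
    return 100.0 * correct / total

class IdentityModel(nn.Module):
    """Toy stand-in: treats the input itself as both features and logits."""
    def forward(self, x):
        return x, x

# Two hand-made batches of (inputs, labels); three of the four predictions are correct.
batches = [(torch.tensor([[2.0, 1.0], [0.0, 3.0]]), torch.tensor([0, 1])),
           (torch.tensor([[1.0, 5.0], [4.0, 0.0]]), torch.tensor([0, 0]))]
accuracy = evaluate(IdentityModel(), batches)
```

In the notebook, a call such as `evaluate(model, testloader, device)` at the end of each epoch would give an accuracy curve alongside the two losses. Remember to call `model.train()` again before the next epoch, since `evaluate` switches the model to evaluation mode.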


#### Conclusions