## Overview

Hello! Welcome to the architecture setup notebook, where we install all requirements and outline the basic architecture of our AlexNet model (whose performance will be compared to our custom model, EfficientNet, and ConvNeXt).


The cell below handles our initial requirements installation:

In [6]:
!pip3 install -r ../../requirements.txt



## Data Preprocessing

As part of our data preprocessing, we will split the down-scaled lung dataset into a training set and a test set (an 80/20 split).

Note that we will be using five-fold cross-validation on the training set later, hence we will not be partitioning an additional validation set.
The outputs recorded in this notebook come from runs with `num_folds = 2` rather than the five folds configured in the cells.

After splitting our data, we will then feed the training set into our models. Here, we will specifically feed it into the AlexNet model. 
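
To make the k-fold scheme concrete, here is a minimal pure-Python sketch of how a shuffled k-fold split partitions sample indices so that every sample is used for validation exactly once. The helper `k_fold_indices` and the toy size of 10 samples are illustrative assumptions; the notebook itself relies on scikit-learn's `KFold` for this step.

```python
import random

def k_fold_indices(n_samples, n_splits, seed=0):
    """Yield (train_indices, val_indices) pairs for a shuffled k-fold split.

    Hypothetical helper for illustration only, not scikit-learn's code.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Distribute samples as evenly as possible: the first
    # (n_samples % n_splits) folds receive one extra sample.
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        size = base + (1 if fold < extra else 0)
        val = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, val
        start += size

# Assumption: 10 toy samples split into 5 folds
folds = list(k_fold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(folds, 1):
    print(f"Fold {fold}: {len(train_idx)} train, {len(val_idx)} validation")
```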

In [7]:
import os

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.utils.data
from torch.utils.data import DataLoader, Subset, SubsetRandomSampler
from torchvision import datasets, models, transforms
from torchvision.datasets import ImageFolder

from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, recall_score, precision_score, f1_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

The code below extracts images from our dataset, resizes each one (768 -> 224; for now we use the recommended input size of 224 rather than a fourth of the original size, 192), and converts them into PyTorch tensors. The ImageFolder class loads images lazily, reading each file only when it is requested, which keeps memory usage low.

In [8]:
# Path to our lung_image_sets
data_dir = "../../lung_colon_image_set/lung_image_sets"

# Define resized size of images (224 is the recommended input size;
# put this back to 192, a fourth of the original size, later)
resized_size = 224

# Resize images and convert them into Tensors
tensor_data = transforms.Compose([
  transforms.Resize((resized_size, resized_size)),   # Down-scale from the original 768 x 768
  transforms.ToTensor()
])

# Load the dataset using ImageFolder
data = ImageFolder(root=data_dir, transform=tensor_data)

# Split the dataset into train and test sets
train_size = int(0.8 * len(data))
test_size = len(data) - train_size
train, test = torch.utils.data.random_split(data, [train_size, test_size])

# Create data loaders for training and testing
load_train = DataLoader(train, batch_size=32, shuffle=True)
load_test = DataLoader(test, batch_size=32, shuffle=False)
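
The lazy loading mentioned above can be pictured with a small pure-Python sketch: the dataset keeps only (path, label) pairs and calls a loader when an item is indexed. `LazyImageList` and `fake_loader` are hypothetical names that stand in for `ImageFolder` and its image loader, and the file paths are made up.

```python
class LazyImageList:
    """Toy dataset that stores paths and loads a sample only when indexed.

    Hypothetical stand-in for ImageFolder, for illustration only.
    """

    def __init__(self, samples, loader, transform=None):
        self.samples = samples        # list of (path, class_index) pairs
        self.loader = loader          # callable that turns a path into an image
        self.transform = transform    # optional callable applied after loading
        self.load_count = 0           # counts how many loads actually happened

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        path, label = self.samples[index]
        self.load_count += 1
        image = self.loader(path)
        if self.transform is not None:
            image = self.transform(image)
        return image, label


def fake_loader(path):
    # Stand-in for decoding an image file from disk
    return f"pixels of {path}"


# Assumption: made-up paths, one per class folder
dataset = LazyImageList(
    samples=[("class_a/img1.jpeg", 0), ("class_b/img2.jpeg", 1)],
    loader=fake_loader,
)
print(len(dataset), dataset.load_count)   # nothing has been loaded yet
image, label = dataset[1]                 # loading happens here
print(image, label, dataset.load_count)
```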

## Model Initialization
We will initialize the AlexNet model using PyTorch's pretrained AlexNet and remove its classifier head, so that the remaining convolutional and pooling layers perform feature extraction on our data.

In [14]:
# Initialize AlexNet Model 
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
alexnet = models.alexnet(pretrained=True)

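# Modify AlexNet to extract features
# Note: alexnet.children() yields (features, avgpool, classifier), so dropping
# the last child removes the entire classifier head, not only its final layer
model = torch.nn.Sequential(*list(alexnet.children())[:-1])
model.eval()
model = model.to(device)
# Input size of AlexNet's original final layer (4096); the extractor above
# instead outputs 256 * 6 * 6 = 9216 values per image once flattened
num_features = alexnet.classifier[6].in_features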

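# Define hyperparameters (the loss and optimizer below are not used in this
# notebook, since AlexNet stays frozen as a feature extractor)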
learning_rate = 5e-4
momentum = 0.9

# Define our loss function and optimizer
loss_function = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
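
To see what the truncated network outputs, the sketch below traces the spatial size of a square image through the convolution and pooling layers of torchvision's AlexNet with the usual formula floor((n + 2p - k) / s) + 1. The layer hyperparameters are transcribed from the torchvision implementation as we understand it and are worth checking against `print(alexnet)`; `conv_out` and `alexnet_feature_size` are hypothetical helpers.

```python
def conv_out(n, kernel, stride=1, padding=0):
    """Output width of a convolution or pooling layer on an n-wide input."""
    return (n + 2 * padding - kernel) // stride + 1

# Assumption: (kernel, stride, padding) of each conv / max-pool layer in
# alexnet.features, transcribed by hand; verify with print(alexnet).
ALEXNET_FEATURE_LAYERS = [
    (11, 4, 2),  # conv1
    (3, 2, 0),   # maxpool
    (5, 1, 2),   # conv2
    (3, 2, 0),   # maxpool
    (3, 1, 1),   # conv3
    (3, 1, 1),   # conv4
    (3, 1, 1),   # conv5
    (3, 2, 0),   # maxpool
]

def alexnet_feature_size(input_size):
    """Spatial size after alexnet.features for a square input."""
    n = input_size
    for kernel, stride, padding in ALEXNET_FEATURE_LAYERS:
        n = conv_out(n, kernel, stride, padding)
    return n

for size in (224, 192):
    print(size, "->", alexnet_feature_size(size))

# The adaptive average pool then resizes any map to 6 x 6, and conv5 has
# 256 output channels, so the flattened feature vector has this length:
flattened_features = 256 * 6 * 6
print(flattened_features)
```

Because of the adaptive pooling, both 192 x 192 and 224 x 224 inputs should yield 9216 features per image, which differs from the 4096 stored in `num_features`.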

## AlexNet + SVM Classifier Training and Testing
We will perform k-fold cross-validation on the SVM classifier, which is trained on the features extracted by our AlexNet model.

In [15]:
# Run every batch of a loader through the model and return the flattened
# features and their labels as NumPy arrays
def extract_features(loader, model):
    features_list, labels_list = [], []
    with torch.no_grad():
        for inputs, labels in loader:
            inputs = inputs.to(device)
            labels = labels.to(device)
            features = model(inputs)
            features = features.view(features.size(0), -1)
            features_list.append(features.cpu().numpy())
            labels_list.append(labels.cpu().numpy())
    return np.concatenate(features_list), np.concatenate(labels_list)
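
The reshaping and concatenation inside `extract_features` can be checked in isolation. This sketch uses NumPy arrays of random numbers in place of real network outputs; the batch sizes (32 and 8) and the 256 x 6 x 6 map shape are assumptions chosen to mimic AlexNet's feature maps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumption: two fake batches of feature maps shaped
# (batch, channels, height, width), filled with random values
batches = [rng.normal(size=(32, 256, 6, 6)), rng.normal(size=(8, 256, 6, 6))]

flattened = []
for feature_maps in batches:
    # Same idea as features.view(features.size(0), -1): keep the batch
    # dimension and flatten everything else into one vector per image.
    flattened.append(feature_maps.reshape(feature_maps.shape[0], -1))

# Stack the per-batch arrays into one (n_images, n_features) matrix
features = np.concatenate(flattened)
print(features.shape)
```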

In [19]:
# Configure k-fold cross-validation
num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=231)

# Store the results of each fold
results = {}

# Create the SVM classifier
svm_model = make_pipeline(StandardScaler(), SVC(kernel='linear'))

# K-Fold Cross Validation
for fold, (train_indices, val_indices) in enumerate(kfold.split(train), 1):
    print(f'Fold {fold}')

    # Create data samplers for train and validation sets
    train_sampler = SubsetRandomSampler(train_indices)
    val_sampler = SubsetRandomSampler(val_indices)

    # Create data loaders for train and validation sets
    train_loader = DataLoader(train, batch_size=32, sampler=train_sampler)
    val_loader = DataLoader(train, batch_size=32, sampler=val_sampler)
    
    # Extract features for training and validation sets
    train_features, train_labels = extract_features(train_loader, model)
    val_features, val_labels = extract_features(val_loader, model)

    # Train the SVM classifier
    svm_model.fit(train_features, train_labels)

    # Evaluate the classifier on the validation set
    val_predictions = svm_model.predict(val_features)
    accuracy = accuracy_score(val_labels, val_predictions)
    results[fold] = accuracy
    print(f'Fold {fold} Accuracy: {accuracy:.4f}')

# Print the average accuracy across all folds
average_accuracy = np.mean(list(results.values()))
print(f'K-FOLD CROSS VALIDATION RESULTS FOR {num_folds} FOLDS')
print('--------------------------------')
for fold in results:
    print(f'Fold {fold}: {results[fold]:.4f}')
print(f'Average: {average_accuracy:.4f}')

Fold 1
Fold 1 Accuracy: 0.9173
Fold 2
Fold 2 Accuracy: 0.9173
K-FOLD CROSS VALIDATION RESULTS FOR 2 FOLDS
--------------------------------
Fold 1: 0.9173
Fold 2: 0.9173
Average: 0.9173
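
Since the AlexNet extractor is frozen, its features do not depend on the fold, so one alternative is to extract them once and let scikit-learn handle the folds. The sketch below shows that pattern with `cross_val_score` on synthetic features from `make_classification`; the sample count, feature count, and three classes are assumptions standing in for the real extracted features, so the scores it prints say nothing about our dataset.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Assumption: synthetic data stands in for features extracted once
# from the training set
X, y = make_classification(
    n_samples=300, n_features=50, n_informative=10, n_classes=3, random_state=231
)

svm_sketch = make_pipeline(StandardScaler(), SVC(kernel="linear"))
cv = KFold(n_splits=5, shuffle=True, random_state=231)

# The pipeline is cloned and refit for each fold, so the scaler only ever
# sees that fold's training portion.
scores = cross_val_score(svm_sketch, X, y, cv=cv, scoring="accuracy")
print(scores, scores.mean())
```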


## AlexNet + Softmax Classifier Training and Testing
We will perform k-fold cross-validation on the Softmax classifier, which is trained on the features extracted by our AlexNet model.

In [17]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

num_folds = 5

softmax_classifier = make_pipeline(
    StandardScaler(),
    LogisticRegression(multi_class='multinomial', solver='lbfgs', C=1.0, random_state=231, max_iter=400)
)

# Store the results of each fold
results = {}

# K-Fold Cross Validation
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=231)

for fold, (train_indices, val_indices) in enumerate(kfold.split(train), 1):
    print(f'Fold {fold}')

    # Create data samplers for train and validation sets
    train_sampler = SubsetRandomSampler(train_indices)
    val_sampler = SubsetRandomSampler(val_indices)

    # Create data loaders for train and validation sets
    train_loader = DataLoader(train, batch_size=32, sampler=train_sampler)
    val_loader = DataLoader(train, batch_size=32, sampler=val_sampler)
    
    # Extract features for training and validation sets
    train_features, train_labels = extract_features(train_loader, model)
    val_features, val_labels = extract_features(val_loader, model)

    # Train the Softmax classifier
    softmax_classifier.fit(train_features, train_labels)
    
    # Evaluate the classifier on the validation set and extract metrics
    val_predictions = softmax_classifier.predict(val_features)
    accuracy = accuracy_score(val_labels, val_predictions)
    
    results[fold] = accuracy
    print(f'Fold {fold} Accuracy: {accuracy:.4f}')

# Print the average accuracy across all folds
average_accuracy = np.mean(list(results.values()))
print(f'K-FOLD CROSS VALIDATION RESULTS FOR {num_folds} FOLDS')
print('--------------------------------')
for fold in results:
    print(f'Fold {fold}: {results[fold]:.4f}')
print(f'Average: {average_accuracy:.4f}')

Fold 1


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 1 Accuracy: 0.9160
Fold 2


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(


Fold 2 Accuracy: 0.9125
K-FOLD CROSS VALIDATION RESULTS FOR 2 FOLDS
--------------------------------
Fold 1: 0.9160
Fold 2: 0.9125
Average: 0.9143
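
For reference, a multinomial logistic regression turns per-class linear scores into probabilities with the softmax function and predicts the class with the largest probability. A minimal NumPy sketch follows; the score values are made up for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax of a (n_samples, n_classes) score matrix."""
    # Subtracting the row maximum leaves the result unchanged but avoids overflow
    shifted = scores - scores.max(axis=1, keepdims=True)
    exp_scores = np.exp(shifted)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Assumption: made-up scores for two samples and three classes
scores = np.array([[2.0, 1.0, 0.1],
                   [0.5, 0.5, 3.0]])
probabilities = softmax(scores)
predictions = probabilities.argmax(axis=1)
print(probabilities.round(3))
print(predictions)
```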


## AlexNet + PCA + SVM Classifier Training and Testing
As an extension to our SVM implementation, the paper suggests that applying PCA to the extracted features before passing them to the SVM classifier yields higher accuracy. We implement this approach below, performing k-fold cross-validation on the PCA + SVM classifier, which is trained on the features extracted by our AlexNet model.

In [22]:
# Configure k-fold cross-validation
num_folds = 5
kfold = KFold(n_splits=num_folds, shuffle=True, random_state=231)

# Store the results of each fold
results = {}

# Apply PCA to reduce dimensionality
n_components = 24  # Set the number of principal components you want to keep
pca = PCA(n_components=n_components)

# Create the SVM classifier
svm_pca_classifier = make_pipeline(StandardScaler(), pca, SVC(kernel='linear'))

# K-Fold Cross Validation
for fold, (train_indices, val_indices) in enumerate(kfold.split(train), 1):
    print(f'Fold {fold}')

    # Create data samplers for train and validation sets
    train_sampler = SubsetRandomSampler(train_indices)
    val_sampler = SubsetRandomSampler(val_indices)

    # Create data loaders for train and validation sets
    train_loader = DataLoader(train, batch_size=32, sampler=train_sampler)
    val_loader = DataLoader(train, batch_size=32, sampler=val_sampler)

    # Extract features for training and validation sets
    train_features, train_labels = extract_features(train_loader, model)
    val_features, val_labels = extract_features(val_loader, model)

    # Train the SVM classifier with PCA
    svm_pca_classifier.fit(train_features, train_labels)

    # Evaluate the classifier on the validation set
    val_predictions = svm_pca_classifier.predict(val_features)
    accuracy = accuracy_score(val_labels, val_predictions)
    results[fold] = accuracy
    print(f'Fold {fold} Accuracy: {accuracy:.4f}')

# Print the average accuracy across all folds
average_accuracy = np.mean(list(results.values()))
print(f'K-FOLD CROSS VALIDATION RESULTS FOR {num_folds} FOLDS')
print('--------------------------------')
for fold in results:
    print(f'Fold {fold}: {results[fold]:.4f}')
print(f'Average: {average_accuracy:.4f}')

Fold 1
Fold 1 Accuracy: 0.8878
Fold 2
Fold 2 Accuracy: 0.8823
K-FOLD CROSS VALIDATION RESULTS FOR 2 FOLDS
--------------------------------
Fold 1: 0.8878
Fold 2: 0.8823
Average: 0.8851
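
As a reminder of what the PCA step does, the sketch below projects centered data onto its top principal components using NumPy's SVD. It is an illustration on random data, not scikit-learn's implementation; `pca_project` is a hypothetical helper, and the 100 x 50 matrix is an arbitrary stand-in for the feature matrix.

```python
import numpy as np

def pca_project(X, n_components):
    """Project X onto its first n_components principal components.

    Hypothetical helper for illustration only.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    projected = X_centered @ Vt[:n_components].T
    # Fraction of total variance captured by each kept component
    explained_ratio = (S ** 2)[:n_components] / np.sum(S ** 2)
    return projected, explained_ratio

# Assumption: random data in place of the extracted AlexNet features
rng = np.random.default_rng(231)
X = rng.normal(size=(100, 50))
X_reduced, explained_ratio = pca_project(X, n_components=24)
print(X_reduced.shape)
print(explained_ratio.sum())
```

Inspecting the summed explained variance in this way is one option for judging whether 24 components retain enough of the information in the features.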


### Testing and Metrics

With our trained models, we will now evaluate on our held-out test set and store metrics for each model. The metrics that we will store are the following:
- Accuracy
- Precision
- Recall
- F1

The metrics are defined more precisely in our paper; each of them is derived from the following counts:
- True Positive (TP)
- False Positive (FP)
- True Negative (TN)
- False Negative (FN)

We compute the metrics below with scikit-learn, averaging the per-class values weighted by class frequency:

In [23]:
# Extract Features
test_features, test_labels = extract_features(load_test, model)

# Classifiers trained above (each was last fit during the final cross-validation fold)
classifiers = {
    'SVM': svm_model,
    'Softmax': softmax_classifier,
    'SVM+PCA': svm_pca_classifier
}

# Dictionary to store results
results = {clf_name: {} for clf_name in classifiers}

# Evaluate each classifier
for clf_name, clf in classifiers.items():
    # Predict using the classifier (all three are scikit-learn pipelines)
    test_predictions = clf.predict(test_features)
    
    # Calculate metrics
    accuracy = accuracy_score(test_labels, test_predictions)
    precision = precision_score(test_labels, test_predictions, average='weighted')
    recall = recall_score(test_labels, test_predictions, average='weighted')
    f1 = f1_score(test_labels, test_predictions, average='weighted')
    
    # Store the results
    results[clf_name]['accuracy'] = accuracy
    results[clf_name]['precision'] = precision
    results[clf_name]['recall'] = recall
    results[clf_name]['f1'] = f1

    # Print the results
    print(f'{clf_name} - Accuracy: {accuracy:.4f}, Precision: {precision:.4f}, Recall: {recall:.4f}, F1 Score: {f1:.4f}')

# Print a summary of the results
print('Comparison of Classifiers on Test Set:')
for clf_name in results:
    print(f'{clf_name}: Accuracy={results[clf_name]["accuracy"]:.4f}, Precision={results[clf_name]["precision"]:.4f}, Recall={results[clf_name]["recall"]:.4f}, F1 Score={results[clf_name]["f1"]:.4f}')

SVM - Accuracy: 0.9230, Precision: 0.9234, Recall: 0.9230, F1 Score: 0.9229
Softmax - Accuracy: 0.9183, Precision: 0.9186, Recall: 0.9183, F1 Score: 0.9181
SVM+PCA - Accuracy: 0.8797, Precision: 0.8802, Recall: 0.8797, F1 Score: 0.8798
Comparison of Classifiers on Test Set:
SVM: Accuracy=0.9230, Precision=0.9234, Recall=0.9230, F1 Score=0.9229
Softmax: Accuracy=0.9183, Precision=0.9186, Recall=0.9183, F1 Score=0.9181
SVM+PCA: Accuracy=0.8797, Precision=0.8802, Recall=0.8797, F1 Score=0.8798
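
The relationship between the TP/FP/TN/FN counts and the metrics can be sketched for a single class treated as positive (one-vs-rest). The labels below are made up and `binary_metrics` is a hypothetical helper. The cell above instead uses scikit-learn with `average='weighted'`, which computes the per-class values and averages them weighted by class frequency.

```python
def binary_metrics(y_true, y_pred, positive):
    """One-vs-rest accuracy, precision, recall and F1 for one positive class.

    Hypothetical helper for illustration only.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Assumption: made-up labels for three classes
y_true = [0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 2, 2, 0]
metrics = binary_metrics(y_true, y_pred, positive=1)
print(metrics)
```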
