Welcome to assignment 1.                                                       

We are using pathology images for our first assignment. Please download the data from this link: https://drive.google.com/drive/folders/10dUOzcPR-PQwfFYcHk5gsLjIjSorQ32Q?usp=sharing



# Task 1: Feature Generation (15%)
# Use and run the following code (a deep network) to generate features from a set of training images. For this assignment, you do not need to know how the deep network extracts the features.
# This code extracts the features of image T4.tif (in the T folder of the dataset). Modify the code so that it iterates over all images of the dataset and extracts their features.
# Allocate 10% of the data for validation.

# Insert your code here for Task 1





In [6]:
import os
import random
import torch
import torchvision.transforms as transforms
from torchvision.models import densenet121
from torch.autograd import Variable
from PIL import Image
from sklearn.model_selection import train_test_split

# Set the path to the dataset folder; modify it to match the local download location of the dataset on your machine.
dataset_path = "C:\\Users\\brian\\Downloads\\SYDE 522\\train-20240221T231820Z-001\\train"

# List to store image paths and labels
all_image_paths = []
all_labels = []

# Iterate over labeled folders (A to T)
for label in os.listdir(dataset_path):
    label_folder = os.path.join(dataset_path, label)
    
    # Iterate over images in each labeled folder
    for image_name in os.listdir(label_folder):
        image_path = os.path.join(label_folder, image_name)
        all_image_paths.append(image_path)
        all_labels.append(label)

# Split the data into training and validation sets (90% training, 10% validation)
train_image_paths, val_image_paths, train_labels, val_labels = train_test_split(
    all_image_paths, all_labels, test_size=0.1, random_state=42
)

# Load pre-trained DenseNet model
model = densenet121(pretrained=True)

# Remove the classification layer (last fully connected layer)
model = torch.nn.Sequential(*list(model.children())[:-1])

# Add a global average pooling layer
model.add_module('global_avg_pool', torch.nn.AdaptiveAvgPool2d(1))

# Set the model to evaluation mode
model.eval()

# Define the image preprocessing pipeline
preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

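# Function to extract features for a given image path
def extract_features(image_path):
    # Force three channels: TIFF files may be grayscale, palettized, or RGBA,
    # and the normalization above expects exactly three.
    image = Image.open(image_path).convert("RGB")
    input_batch = preprocess(image).unsqueeze(0)  # add a batch dimension
    # Inference only, so skip gradient tracking (replaces the deprecated Variable wrapper)
    with torch.no_grad():
        features = model(input_batch)
    # Flatten the (1, 1024, 1, 1) output to a 1024-dimensional vector
    return features.squeeze().numpy()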


# Extract features for training set
train_features = [extract_features(image_path) for image_path in train_image_paths]

# Extract features for validation set
val_features = [extract_features(image_path) for image_path in val_image_paths]

# Now 'train_features' and 'val_features' contain the globally average-pooled outputs of DenseNet's final convolutional block (the classification layer was removed above)
print("Training set feature vectors shape:", len(train_features))
print("Validation set feature vectors shape:", len(val_features))

# Print the first few feature vectors and labels for training set
print("Training set feature vectors:")
for i in range(min(5, len(train_features))):
    print(f"Instance {i+1}: {train_features[i]} - Label: {train_labels[i]}")

# Print the first few feature vectors and labels for validation set
print("\nValidation set feature vectors:")
for i in range(min(5, len(val_features))):
    print(f"Instance {i+1}: {val_features[i]} - Label: {val_labels[i]}")

Training set feature vectors shape: 702
Validation set feature vectors shape: 78
Training set feature vectors:
Instance 1: [ 3.6497071e-04  5.1469303e-04  2.3895486e-03 ...  1.3170384e+00
 -4.5840797e-01 -3.8190687e-01] - Label: N
Instance 2: [ 2.9535004e-04  1.8173460e-03 -6.7726779e-04 ...  3.1749940e-01
 -3.9225417e-01  3.6234949e-02] - Label: G
Instance 3: [ 1.6402092e-04  1.0557906e-02 -1.6252657e-03 ...  7.0413810e-01
  2.2983617e-01 -5.0651515e-01] - Label: B
Instance 4: [ 0.00031655  0.00800259  0.00266579 ...  0.14289752 -0.08327961
 -0.10008601] - Label: S
Instance 5: [ 3.0409099e-04  7.3866742e-03  5.1932072e-04 ...  1.0816039e+00
 -3.3889934e-01  4.2816776e-01] - Label: P

Validation set feature vectors:
Instance 1: [ 3.6549233e-04  6.7210454e-03  8.5626799e-04 ...  7.1736246e-01
 -2.8375876e-01 -6.8495661e-01] - Label: P
Instance 2: [-1.0506088e-04  3.5598530e-03  3.3325888e-03 ...  9.3878144e-03
 -5.0184292e-01 -2.4772014e-01] - Label: P
Instance 3: [ 0.00043349  0.004557

# Task 2: High Bias Classification Method (5%)
# Choose a classification method and let it have a high bias.
# Train it on the generated features and discuss why it is underfitting.

# Insert your code here for Task 2




In [10]:
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Assuming 'train_features' and 'train_labels' are the features and labels for training set
# Assuming 'val_features' and 'val_labels' are the features and labels for validation set

# Convert labels to integers (assuming labels are strings)
label_to_int = {label: idx for idx, label in enumerate(set(train_labels))}
train_labels_int = [label_to_int[label] for label in train_labels]
val_labels_int = [label_to_int[label] for label in val_labels]

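# Use K-Means (an unsupervised clustering method) as the classifier, with one cluster per class
# n_init is set explicitly to avoid the FutureWarning about its changing default
kmeans = KMeans(n_clusters=len(set(train_labels)), n_init=10, random_state=42)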
kmeans.fit(train_features)

# Predict cluster assignments for validation set
val_predictions = kmeans.predict(val_features)

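# Invert label_to_int to recover label names from integer indices.
# K-Means numbers its clusters arbitrarily, so cluster i does not necessarily correspond to class i.
cluster_to_label = {cluster: label for label, cluster in label_to_int.items()}
val_labels_pred = [cluster_to_label[cluster] for cluster in val_predictions]

#print(val_labels_int, val_predictions)

# Calculate accuracy by comparing raw cluster indices against integer class labels
accuracy = accuracy_score(val_labels_int, val_predictions)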

# Discuss why it might be underfitting
print(f"Accuracy: {accuracy:.2%}")
print("K-Means is a simple algorithm that assumes spherical clusters with equal variance.")
print("It may underfit when the underlying data distribution is non-linear or has varying cluster shapes.")




[12, 12, 3, 15, 18, 9, 17, 3, 5, 14, 17, 15, 13, 13, 13, 11, 3, 10, 4, 2, 10, 0, 2, 3, 8, 14, 1, 7, 16, 10, 7, 2, 11, 12, 13, 4, 1, 3, 19, 14, 19, 13, 11, 15, 9, 13, 19, 16, 10, 6, 18, 6, 10, 15, 9, 19, 17, 12, 1, 14, 4, 8, 15, 8, 8, 18, 5, 4, 18, 12, 5, 6, 5, 9, 19, 3, 5, 13] [17  9 11  8  1  5  2 11 16  4  2 16 12 12 12 18 11 17 13  1 17 14 15 16
  3  4  6 19 14 17 19 15  3 19 13 13  6 11 17  4  0 12 18  8  5  9 17 17
 17  7  1  7 17  8  5 17  2  0  6  4 13  3  8  3  3  1 16 19  1 17 16  7
 16  5 17 11 16 12]
Accuracy: 1.28%
K-Means is a simple algorithm that assumes spherical clusters with equal variance.
It may underfit when the underlying data distribution is non-linear or has varying cluster shapes.
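
An accuracy close to chance (1 in 20 classes is 5%) is what one would expect when raw cluster indices are compared directly with class indices, because K-Means assigns its cluster numbers arbitrarily. One common remedy is to map each cluster to the majority label of its training members before scoring. The sketch below illustrates that idea under stated assumptions: the data is a synthetic stand-in generated with make_blobs (not the pathology features), and fit_cluster_label_map is a hypothetical helper name introduced here.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def fit_cluster_label_map(cluster_ids, labels):
    """Map each cluster index to the most common true label among its members."""
    mapping = {}
    for cluster in np.unique(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, labels) if c == cluster]
        mapping[cluster] = Counter(members).most_common(1)[0][0]
    return mapping


# Synthetic stand-in for the DenseNet features (assumption, not the assignment data).
X, y = make_blobs(n_samples=400, n_features=10, centers=4, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)

# Learn the cluster-to-label mapping from the training set only.
cluster_to_label = fit_cluster_label_map(kmeans.labels_, y_train)

# Translate validation cluster indices into labels before scoring.
val_pred = np.array([cluster_to_label[c] for c in kmeans.predict(X_val)])
mapped_accuracy = accuracy_score(y_val, val_pred)
print(f"Mapped accuracy: {mapped_accuracy:.2%}")
```

With the notebook's own variables, the same helper could be called as fit_cluster_label_map(kmeans.labels_, train_labels_int), and the mapped predictions scored against val_labels_int.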


# Task 3: High Variance Classification Method (5%)
# Use the chosen classification method and let it have a high variance.
# Train it on the generated features and discuss why it is overfitting.

# Insert your code here for Task 3
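As one possible illustration of high variance, the sketch below fits a decision tree with no depth limit. This choice is an assumption, since the notebook does not fix a classifier for this task, and the data is a synthetic stand-in from make_classification with noisy labels rather than the extracted features. An unrestricted tree can memorize the training set, so the gap between training and validation accuracy is the signal to look for.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features (assumption, not the assignment data).
# flip_y assigns random labels to about 20% of the samples, giving the tree noise to memorize.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_classes=4, flip_y=0.2, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# No depth limit: the tree keeps splitting until every training leaf is pure.
tree = DecisionTreeClassifier(max_depth=None, random_state=0)
tree.fit(X_train, y_train)

train_acc = tree.score(X_train, y_train)
val_acc = tree.score(X_val, y_val)
print(f"Train accuracy: {train_acc:.2%}")
print(f"Validation accuracy: {val_acc:.2%}")
```

A large difference between the two printed numbers indicates that the model has fit noise in the training data instead of structure that generalizes.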




# Task 4: Balanced Classification Method (15%)
# Use the chosen classification method and let it balance the bias and variance.
# Train it on the generated features, possibly adjusting parameters.
# Discuss insights into achieving balance.

# Insert your code here for Task 4
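One way to look for a balance between bias and variance is to sweep a single complexity parameter and pick the value with the best cross-validated score. The sketch below does this for the max_depth of a decision tree using GridSearchCV. The classifier, the depth grid, and the synthetic data are all assumptions made for illustration, not the assignment's required setup.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the extracted features (assumption, not the assignment data).
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5,
    n_classes=4, flip_y=0.2, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Hypothetical grid: shallow trees underfit, unlimited depth (None) overfits.
depths = [1, 2, 4, 6, 8, 12, None]
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": depths},
    cv=5,
    return_train_score=True,
)
search.fit(X_train, y_train)

# Compare training and cross-validation accuracy at each depth.
results = search.cv_results_
for depth, train_s, cv_s in zip(
    depths, results["mean_train_score"], results["mean_test_score"]
):
    print(f"max_depth={depth}: train={train_s:.2f}, cv={cv_s:.2f}")

best_depth = search.best_params_["max_depth"]
val_acc = search.score(X_val, y_val)  # uses the model refit with the best depth
print(f"Selected max_depth={best_depth}, validation accuracy={val_acc:.2%}")
```

In the printed table, training accuracy should keep rising with depth while the cross-validation score levels off or drops, and the selected depth sits near that turning point.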




# Task 5: K-Means Clustering (20%)
# Apply K-Means clustering on the generated features.
# Test with available labels and report accuracy.
# Experiment with automated K and compare with manually set 20 clusters.

# Insert your code here for Task 5
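"Automated K" can be interpreted in several ways. The sketch below assumes one of them: choose the K that maximizes the silhouette score over a candidate range, then compare it with a manually fixed K. Agreement with the true labels is measured with the adjusted Rand index, which is swapped in here in place of accuracy because it does not require mapping clusters to classes. The data is synthetic (make_blobs with 5 centers), and manual_k is a hypothetical stand-in for the 20 clusters named in the task.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Synthetic stand-in for the extracted features (assumption, not the assignment data).
X, y = make_blobs(n_samples=500, n_features=10, centers=5, random_state=0)

# Score each candidate K by its silhouette coefficient (higher is better).
candidate_ks = range(2, 11)
scores = {}
for k in candidate_ks:
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, cluster_ids)

best_k = max(scores, key=scores.get)
print(f"Silhouette-selected K: {best_k}")

# Compare the automated choice with a manually fixed K (20 in the assignment).
manual_k = 5
ari = {}
for k in (best_k, manual_k):
    cluster_ids = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    ari[k] = adjusted_rand_score(y, cluster_ids)
    print(f"K={k}: adjusted Rand index = {ari[k]:.3f}")
```

On the real features, candidate_ks would need to extend past 20 for the comparison with the manual setting to be meaningful.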




# Task 6: Additional Clustering Algorithm (10%)
# Choose another clustering algorithm and apply it on the features.
# Test accuracy with available labels.

# Insert your code here for Task 6
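As a sketch of what a second clustering algorithm could look like, the example below uses agglomerative (hierarchical) clustering with Ward linkage. The choice of algorithm is an assumption, as the task leaves it open, and the data is again a synthetic stand-in. Agreement with the labels is reported as the adjusted Rand index, which is insensitive to how the clusters happen to be numbered.

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for the extracted features (assumption, not the assignment data).
X, y = make_blobs(n_samples=400, n_features=10, centers=4, random_state=0)

# Ward linkage merges the pair of clusters that least increases within-cluster variance.
agglo = AgglomerativeClustering(n_clusters=4, linkage="ward")
cluster_ids = agglo.fit_predict(X)

ari = adjusted_rand_score(y, cluster_ids)
print(f"Adjusted Rand index: {ari:.3f}")
```

AgglomerativeClustering has no predict method, so it labels only the data it was fit on. Scoring a held-out validation set would need an extra step, such as assigning each validation point to the nearest cluster centroid.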




# Task 7: PCA for Classification Improvement (20%)
# Apply PCA to the features, then feed the reduced features to the best classification method from the tasks above.
# Assess if PCA improves outcomes and discuss the results.

# Insert your code here for Task 7
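A minimal way to assess whether PCA helps is to train the same classifier with and without it and compare validation accuracy. The sketch below does so with scikit-learn pipelines, which fit the scaler and PCA on the training data only. The logistic regression classifier, the 95% explained-variance threshold, and the synthetic data are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the extracted features (assumption, not the assignment data).
X, y = make_classification(
    n_samples=600, n_features=50, n_informative=10,
    n_classes=4, random_state=0,
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Baseline: standardize, then classify on all features.
baseline = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# With PCA: keep enough components to explain 95% of the variance.
with_pca = make_pipeline(
    StandardScaler(), PCA(n_components=0.95), LogisticRegression(max_iter=1000)
)

baseline.fit(X_train, y_train)
with_pca.fit(X_train, y_train)

baseline_acc = baseline.score(X_val, y_val)
pca_acc = with_pca.score(X_val, y_val)
n_kept = with_pca.named_steps["pca"].n_components_

print(f"Without PCA: {baseline_acc:.2%}")
print(f"With PCA ({n_kept} components): {pca_acc:.2%}")
```

Fitting PCA inside the pipeline prevents validation data from influencing the learned projection, which would otherwise make the comparison optimistic.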




# Task 8: Visualization and Analysis (10%)
# Plot the features in a lower dimension using dimensionality reduction techniques.
# Analyze the visual representation, identifying patterns or insights.

# Insert your code here for Task 8
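One simple option for the visualization is to project the features onto their first two principal components and draw a scatter plot colored by label. The sketch below assumes PCA as the reduction technique (t-SNE or another method would fit the task equally well) and uses synthetic data. The helper names project_to_2d and plot_projection and the output file name are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA


def project_to_2d(features):
    """Project feature vectors onto their first two principal components."""
    return PCA(n_components=2).fit_transform(np.asarray(features))


def plot_projection(coords, labels, path="features_2d.png"):
    """Scatter-plot 2-D coordinates, one color per label, and save to `path`."""
    # Imported here so the projection can be used without a plotting backend.
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(6, 5))
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(coords[mask, 0], coords[mask, 1], s=10, label=str(label))
    ax.set_xlabel("PC 1")
    ax.set_ylabel("PC 2")
    ax.legend(title="Label", fontsize="small")
    fig.savefig(path, dpi=150)
    plt.close(fig)


# Synthetic stand-in for the extracted features (assumption, not the assignment data).
X, y = make_blobs(n_samples=300, n_features=10, centers=4, random_state=0)
coords = project_to_2d(X)
print(coords.shape)
```

Calling plot_projection(coords, y) writes the figure to disk. With the notebook's variables, the equivalent calls would be project_to_2d(train_features) followed by plot_projection(coords, train_labels).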