# Module 9 Problem Set
Problem Set for Module 9 - CNNs.

## Introduction
In this Problem Set, we will discuss the creation of Convolutional Neural Networks (CNNs). We will use the `tensorflow` library to build our CNN, though if you are familiar with PyTorch, you will find the methods used here readily applicable through `torch.nn.Sequential`.

## Data
As our models grow more complex, so do the datasets we apply them to. In your own endeavours (e.g. homework), you will notice that suitable data is harder to come by than it was for simpler models like KNN and Decision Trees. Suitable data for CNNs will, more often than not, consist of images. **This is because CNNs work best on data with spatial structure, such as images. Why?**

[Type here]

Often, especially with medical data, you will need to sign a licensing agreement to work with real data. Moreover, these datasets often require a lot of storage (more than GitHub would be happy with me uploading). Because of this limitation, this problem set tackles one of the more "typical" CNN datasets: the MNIST digits dataset.

The MNIST handwritten digits dataset is a widely used benchmark in machine learning, containing a collection of 28x28 pixel grayscale images of handwritten digits (0 through 9). It consists of 60,000 training images and 10,000 testing images, with each image depicting a single digit written by various individuals. 

Fortunately for us, TensorFlow has a copy of this dataset built in (as does PyTorch, through `torchvision`).

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import mnist

tf.random.set_seed(3621)

# Load the MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

# Exploring the dataset
print(f"Shape of training images: {x_train.shape}")
print(f"Shape of testing images: {x_test.shape}")
print(f"Number of classes: {len(set(y_train))}")  # The built-in set type works like a set in math: an unordered collection with no duplicates.

In [None]:
import matplotlib.pyplot as plt

# Create a dictionary to group images by their class (digit)
class_images = {i: [] for i in range(10)}

# Populate the dictionary with images
for i in range(len(x_train)):
    class_label = y_train[i]
    class_images[class_label].append(x_train[i])

# Plot a few examples from each class
plt.figure(figsize=(12, 8))
for i in range(10):
    plt.subplot(2, 5, i + 1)
    plt.imshow(class_images[i][0], cmap='gray')  # Display the first image of each class
    plt.title(f"Class {i}")
    plt.axis('off')

plt.show()

**Using the information and output from the cells above, describe the task of our CNN. What are we trying to do? Is this classification or regression? What are suitable loss functions/error metrics for training?**

[Type here]

## Model Creation

### Preprocessing

In [None]:
import tensorflow as tf

from tensorflow.keras.datasets import mnist
from tensorflow.keras.utils import to_categorical

# Preprocess the data
## Input data (images): reshape all data to 28x28x1 arrays with datatype float32 (32-bit float)
##             and divide by 255 (max value for any pixel) so that all inputs lie between 0 and 1.
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0 
## Targets (classes): one-hot encode by using the to_categorical function
y_train = to_categorical(y_train, 10)

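To make the two training-data steps above concrete, here is a minimal sketch that reproduces them with plain NumPy on a tiny made-up batch. The arrays `images` and `labels` are hypothetical stand-ins, not real MNIST samples, and the identity-matrix trick is only one way to mimic what `to_categorical` produces.

```python
import numpy as np

# Hypothetical stand-in for a tiny batch of MNIST-like data:
# two 28x28 images with integer pixel values in [0, 255], plus their labels.
images = np.zeros((2, 28, 28), dtype=np.uint8)
images[0, 0, 0] = 255
images[1, 5, 5] = 128
labels = np.array([3, 7])

# Add a trailing channel axis and rescale the pixels to the range [0, 1].
scaled = images.reshape(-1, 28, 28, 1).astype("float32") / 255.0

# One-hot encode: row i of the 10x10 identity matrix encodes class i.
one_hot = np.eye(10, dtype="float32")[labels]

print(scaled.shape, scaled.dtype, scaled.min(), scaled.max())
print(one_hot)
```

Each row of `one_hot` contains a single 1 at the position of the class label, which is the target format that a 10-unit softmax output layer is compared against.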

**Do we need to perform the same preprocessing steps on our testing data? Only some of the steps? Explain your answer. If preprocessing is necessary, use the following cell to preprocess the testing data.**

[Type here]

In [None]:
## Code here

### Defining the model

In [None]:
from tensorflow.keras import layers, models

# Create a simple CNN model
## Start with tensorflow.keras.models.Sequential()
model = models.Sequential()

# Proceed to add layers with model.add()
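## Unlike pytorch, tensorflow will infer each later layer's input shape from the layer before it, so only the first layer needs input_shape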
model.add(layers.Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))  # 32 filters, 3x3 kernels
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Conv2D(64, (3, 3), activation='relu'))
model.add(layers.MaxPooling2D((2, 2)))
model.add(layers.Flatten())
model.add(layers.Dense(64, activation='relu'))
model.add(layers.Dense(10, activation='softmax'))

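To see how the image shrinks as it passes through the layers above, the following sketch works out the spatial sizes and parameter counts by hand. It assumes the Keras defaults for these layers (unpadded "valid" convolutions with stride 1, and pooling with stride equal to the pool size); the helper names `conv_out` and `pool_out` are made up for illustration.

```python
def conv_out(size, kernel, stride=1):
    """Output size of an unpadded ('valid') convolution along one spatial axis."""
    return (size - kernel) // stride + 1


def pool_out(size, pool):
    """Output size of max pooling whose stride equals the pool size."""
    return size // pool


size = 28
size = conv_out(size, 3)  # after Conv2D(32, (3, 3))
size = pool_out(size, 2)  # after MaxPooling2D((2, 2))
size = conv_out(size, 3)  # after Conv2D(64, (3, 3))
size = pool_out(size, 2)  # after MaxPooling2D((2, 2))
flat_features = size * size * 64  # length of the vector produced by Flatten()

# Parameters per layer: weights plus one bias per filter/unit.
params = {
    "conv1": 3 * 3 * 1 * 32 + 32,
    "conv2": 3 * 3 * 32 * 64 + 64,
    "dense1": flat_features * 64 + 64,
    "dense2": 64 * 10 + 10,
}
total_params = sum(params.values())

print(f"Final feature map: {size}x{size}x64, flattened to {flat_features}")
print(f"Total trainable parameters: {total_params}")
```

You can compare these numbers against the output of `model.summary()` to check your understanding of each layer.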
### Training the model
The cell below compiles our model (in other words, configures its optimizer, loss function, and metrics) and then fits the model to our data over 5 epochs. **However, *something is wrong with the code*. What is it? Before running the code, implement a fix for this problem.**

In [None]:
model.compile(optimizer='adam',
              loss='categorical_crossentropy',
              metrics=['accuracy'])

# Train the model
model.fit(x_train, y_train, epochs=5, batch_size=64, validation_data=(x_test, y_test))

[Type here]

In [None]:
## Code here

### Evaluate the model

In [None]:
# Evaluate the model on the test data
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_accuracy*100:.2f}%')

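The accuracy reported above is the fraction of samples whose highest-probability class matches the true class. Here is a minimal NumPy sketch of that calculation on made-up values; `probs` and `one_hot_true` are hypothetical and are not outputs of our model.

```python
import numpy as np

# Hypothetical softmax outputs for 4 samples over 3 classes (each row sums to 1).
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.4, 0.3],
    [0.2, 0.2, 0.6],
])

# Hypothetical one-hot true labels for classes 0, 1, 0, 2.
one_hot_true = np.eye(3)[[0, 1, 0, 2]]

# The predicted class is the index of the largest probability in each row.
predicted = np.argmax(probs, axis=1)
actual = np.argmax(one_hot_true, axis=1)

# Accuracy is the fraction of samples where prediction and truth agree.
accuracy = np.mean(predicted == actual)
print(f"Accuracy: {accuracy * 100:.2f}%")
```

The same `argmax` comparison is what lets us pick out individual misclassified images, rather than only a single summary number.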
In [None]:
import numpy as np
# Use the trained model to make predictions on the test data
predictions = model.predict(x_test)

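# Convert probability vectors and one-hot targets to class labels
predicted_labels = np.argmax(predictions, axis=1)
actual_labels = np.argmax(y_test, axis=1)

# Find correctly classified and misclassified images
correct_indices = np.where(predicted_labels == actual_labels)[0]
misclassified_indices = np.where(predicted_labels != actual_labels)[0]

# Visualize 5 correct predictions if they exist
n_examples = min(5, len(correct_indices))

plt.figure(figsize=(12, 8))
for i in range(n_examples):
    index = correct_indices[i]
    plt.subplot(2, n_examples, i + 1)
    plt.imshow(x_test[index].reshape(28, 28), cmap='gray')
    actual_label = actual_labels[index]
    predicted_label = predicted_labels[index]
    plt.title(f"Actual: {actual_label}\nPredicted: {predicted_label}")
    plt.axis('off')

plt.show()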


# Visualize 5 incorrect predictions if they exist
n_examples = min(5, len(misclassified_indices)) 

plt.figure(figsize=(12, 8))
for i in range(n_examples):
    index = misclassified_indices[i]
    plt.subplot(2, n_examples, i + 1)
    plt.imshow(x_test[index].reshape(28, 28), cmap='gray')
    actual_label = np.argmax(y_test[index])
    predicted_label = np.argmax(predictions[index])
    plt.title(f"Actual: {actual_label}\nPredicted: {predicted_label}", color='red')  # Highlight misclassifications in red
    plt.axis('off')
plt.show()

**Evaluate our model. Does it perform well? Are there things we could do to improve its performance or tweak our model? Give some concrete examples.**

[Type here]