# Intermediate Machine Learning: Assignment 2

**Deadline**

Assignment 2 is due Wednesday, October 12 at 11:59pm. Late work will not be accepted, as per the course policies (see the Syllabus and Course policies on Canvas).

Directly sharing answers is not okay, but discussing problems with the course staff or with other students is encouraged.

You should start early so that you have time to get help if you're stuck. The drop-in office hours schedule can be found on Canvas. You can also post questions or start discussions on Ed Discussion. The assignment may look long at first glance, but the problems are broken up into steps that should help you to make steady progress.

**Submission**

Submit your assignment as a pdf file on Gradescope, and as a notebook (.ipynb) on Canvas. You can access Gradescope through Canvas on the left side of the class home page. The problems in each homework assignment are numbered. Note: When submitting on Gradescope, please select the pages of your pdf that correspond to each problem. This will allow graders to find your complete solution to each problem more easily.

To produce the .pdf, please do the following in order to preserve the cell structure of the notebook:

1. Go to "File" at the top-left of your Jupyter Notebook.
2. Under "Download as", select "HTML (.html)".
3. After the .html has downloaded, open it and then select "File" and "Print" (note you will not actually be printing).
4. From the print window, select the option to save as a .pdf.

**Topics**

 * Convolutional neural networks
 * Gaussian processes

This assignment will also help to solidify your Python and Jupyter notebook skills.


## Problem 1: Brain food (15 points)

This problem gives you some experience with convolutional neural networks for image classification using [TensorFlow](https://en.wikipedia.org/wiki/TensorFlow).  

The classification task is to discriminate real optical images of brain activity in mice from 
fake images that were constructed using a [generative adversarial network (GAN)](https://en.wikipedia.org/wiki/Generative_adversarial_network).
A paper on the underlying imaging technologies developed by Yale researchers is [here](https://www.nature.com/articles/s41592-020-00984-6).

For this problem we'll walk you through the following steps:
* Downloading the data
* Loading the data
* Displaying some sample images
* Building a classification model using a simple CNN

After this, your task will be to comment on the structure and the performance of the CNN. 



###  Downloading the data

The data are contained in a group of compressed files on AWS. There are 10 files of real images 
and 10 files of fake images; each file is roughly 100 MB in size, so the entire dataset is about 2 GB.
You should download the data to the computer you are running on, and place the files in a folder named "data".

*Important note:* If you do not have enough space to download all of the data, just download what you can;
there will be no penalty for running on less data. If you want assistance running in Google Colab, please let us know.

Here are URLs to access the 20 data files:


https://sds365.s3.amazonaws.com/calcium/real_0.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_1.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_2.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_3.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_4.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_5.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_6.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_7.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_8.gz <br>
https://sds365.s3.amazonaws.com/calcium/real_9.gz <br>


https://sds365.s3.amazonaws.com/calcium/fake_0.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_1.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_2.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_3.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_4.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_5.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_6.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_7.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_8.gz <br>
https://sds365.s3.amazonaws.com/calcium/fake_9.gz <br>
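
If you prefer to script the download, the following is a minimal sketch using only the Python standard library. The helper names `data_urls` and `download_all` are hypothetical (they are not part of the starter code); the URL pattern is taken from the list above.

```python
import os
import urllib.request

BASE_URL = "https://sds365.s3.amazonaws.com/calcium"

def data_urls(pieces=range(10)):
    """Return the URLs of the real and fake files for the given piece indices."""
    return [f"{BASE_URL}/{kind}_{i}.gz" for kind in ("real", "fake") for i in pieces]

def download_all(dest="data", pieces=range(10)):
    """Download each file into `dest`, skipping files that already exist."""
    os.makedirs(dest, exist_ok=True)
    for url in data_urls(pieces):
        path = os.path.join(dest, os.path.basename(url))
        if not os.path.exists(path):
            print("downloading", url)
            urllib.request.urlretrieve(url, path)

# Uncomment one of the calls below to fetch the data.
# download_all()                           # all 20 files, about 2 GB
# download_all(pieces=[0, 1, 2, 7, 8, 9])  # only the pieces loaded in Problem 1.1
```

Skipping files that already exist makes it safe to rerun the cell after an interrupted download, although a partially written file would need to be deleted by hand first.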




We import TensorFlow and Keras, along with a few other Python packages.

In [None]:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models
import numpy as np
import gzip
import matplotlib.pyplot as plt

In [None]:
tf.__version__

Here are some helper functions for reading the data and plotting images.

In [None]:
def plot_images(imgs, title):
    plt.figure(figsize=(10,10))
    for i in range(9):
        plt.subplot(3,3,i+1)
        plt.imshow(imgs[i], cmap='rainbow')
        plt.axis('off')
    plt.suptitle(title)
    
def read_gz(filedir, shape=[-1,128,128]):
    print('reading %s' % filedir)
    with gzip.open(filedir, 'rb') as f:
        content = f.read()
    imgs = np.frombuffer(content, dtype='float32').reshape(shape)
    return imgs

def load_data(pieces):
    data = []
    label = []
    print('Loading data:\n-------------')
    for i in pieces:
        real_img = read_gz('data/real_{:d}.gz'.format(i), shape=[-1,128,128, 1])
        fake_img = read_gz('data/fake_{:d}.gz'.format(i), shape=[-1,128,128, 1])
        real_label = np.zeros((real_img.shape[0],1))
        fake_label = np.zeros((fake_img.shape[0],1))
        real_label[:,0] = 0
        fake_label[:,0] = 1
        data.append(real_img)
        data.append(fake_img)
        label.append(real_label)
        label.append(fake_label)
    print()
    data_all = np.concatenate(data, axis=0)
    label_all = np.concatenate(label, axis=0)
    return data_all, label_all
    


### Loading the data 

Let's look at some images. 

In [None]:
# real_images are original data, fake_images are synthetic data generated using a GAN model
real_images = read_gz('data/real_0.gz')
fake_images = read_gz('data/fake_0.gz')

Each of the images is 128x128 pixels, and there are 2048 images in each file:

In [None]:
real_images.shape, fake_images.shape

### Displaying some sample images

Now we'll display some real and fake images. Can you spot any differences between the two? Do you think that you could learn to tell them apart? Just think about this; you do not need to write out an answer.

In [None]:
plot_images(real_images, 'real images')
plot_images(fake_images, 'fake images')

### 1.1 Building a CNN model

Your model should be trained on six of the data files (3 real and 3 fake, about 12,000 images total) and tested on six other data files. We begin by loading in the data.


In [None]:
train_images, train_labels = load_data([0,1,2])
test_images, test_labels = load_data([7,8,9])

Next, construct a convolutional neural network that contains at least four layers: a convolutional layer, a max pooling layer, a flatten layer, and a final dense layer with two terminal neurons.
The kernel size and the number of kernels are among the choices you can make. Similarly, you can 
choose your own pooling size.


In [None]:
# Your code here

Next, compile and train the model. Here you can choose how many epochs you use 
to train the model, where each epoch scans through the 
data in a random order, processing a batch of images in each stochastic gradient descent step. Your code should
train the model with the Adam optimizer and the binary cross-entropy loss.
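
To make the notion of an epoch concrete, here is a small NumPy sketch of how one epoch can be split into shuffled mini-batches. Keras does this bookkeeping for you inside `model.fit` through its `batch_size` and `epochs` arguments; the function name `minibatch_indices` and the batch size of 32 are assumptions chosen for illustration.

```python
import numpy as np

def minibatch_indices(n_samples, batch_size, rng):
    """Yield index arrays that together cover one epoch in random order."""
    order = rng.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]

rng = np.random.default_rng(0)
n_samples = 6 * 2048      # six training files with 2048 images each
batch_size = 32           # assumed value; this is a choice you can tune

batches = list(minibatch_indices(n_samples, batch_size, rng))
steps_per_epoch = len(batches)   # one gradient step per batch
print(steps_per_epoch)           # 384
```

Every image is visited exactly once per epoch, so training for several epochs means several full passes over the data, each in a different random order.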


In [None]:
# Your code here

Finally, evaluate the model on the test data (the last three segments of images: 7,8,9). Your code should evaluate the model performance on test images and print out the test accuracy.

In [None]:
# Your code here

### 1.2: Discuss your model

Before moving into the discussion, consider improving your model in various ways:

* Train on more data (but always test on the same test set)
* Add more convolutional layers 
* Add more dense layers, using an appropriate activation function
* Regularize using a dropout layer

For the final model that you are satisfied with, discuss how and why you built this model. For example, you can:

* Give a diagram showing the sequence of layers used;
* Explain your rationale for using each of the layers;
* Comment on the number of parameters used by the model, and which layers have the most parameters;
* Describe your findings on number of epochs and training data size;
* Comment on the models you experimented with but did not include and how your modifications changed the test accuracy.

Note: Use these suggestions as a guideline. When we evaluate your notebook, we will look for descriptions of your models that show understanding of how they work and why you chose a given architecture.
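
When commenting on parameter counts, it can help to check the numbers reported by `model.summary()` by hand. The sketch below does this for a hypothetical network that is not taken from the assignment: a 128x128x1 input, one `Conv2D` layer with 8 filters of size 3x3 (no padding, stride 1), a 2x2 max pooling layer, a flatten layer, and a dense layer with 2 units. The helper names are made up for illustration.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    """Each filter has kernel_h * kernel_w * in_channels weights plus one bias."""
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(in_features, units):
    """Each unit has one weight per input feature plus one bias."""
    return (in_features + 1) * units

conv = conv2d_params(3, 3, 1, 8)    # 80 parameters
conv_out = 128 - 3 + 1              # 126: spatial size after an unpadded 3x3 convolution
pooled = conv_out // 2              # 63: spatial size after 2x2 max pooling
flat = pooled * pooled * 8          # 31752 features after flattening
dense = dense_params(flat, 2)       # 63506 parameters

# Pooling and flatten layers have no trainable parameters.
print(conv, flat, dense, conv + dense)   # 80 31752 63506 63586
```

In this made-up example nearly all of the parameters sit in the dense layer, because it connects to every entry of the flattened feature maps.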


In [None]:
# Your discussions and perhaps some plots here

## Problem 2: It's not a bug, it's a feature! (20 points)

In this problem, we will ["open the black box"](https://news.yale.edu/2018/12/10/why-take-ydata-because-data-science-shouldnt-be-black-box) and inspect the filters and feature maps learned by a convolutional neural network trained to classify handwritten digits, using the MNIST database.

In [None]:
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers
import matplotlib.pyplot as plt
import random
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

### 2.1 Visualizing the filters

To begin, we load the dataset with 60,000 training images and 10,000 test images.

In [None]:
num_classes = 10
input_shape = (28, 28, 1)

# the data, split between train and test sets
(x_train, y_train), (x_test, y_test) = keras.datasets.mnist.load_data()

# Scale images to the [0, 1] range
x_train = x_train.astype("float32") / 255
x_test = x_test.astype("float32") / 255
# Make sure images have shape (28, 28, 1)
x_train = np.expand_dims(x_train, -1)
x_test = np.expand_dims(x_test, -1)
print("x_train shape:", x_train.shape)
print(x_train.shape[0], "train samples")
print(x_test.shape[0], "test samples")


# convert class vectors to binary class matrices
y_train_binary = keras.utils.to_categorical(y_train, num_classes)
y_test_binary = keras.utils.to_categorical(y_test, num_classes)

Next, we define a convolutional neural network similar to the one used in Problem 1, except that it now has a few more layers.

In [None]:
model = keras.Sequential(
    [
        keras.Input(shape=input_shape),
        layers.Conv2D(32, kernel_size=(5, 5), activation="relu", name='conv1'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Conv2D(32, kernel_size=(5, 5), activation="relu", name='conv2'),
        layers.MaxPooling2D(pool_size=(2, 2)),
        layers.Flatten(),
        layers.Dropout(0.5),
        layers.Dense(num_classes, activation="softmax"),
    ]
)

model.summary()

In [None]:
batch_size = 128
epochs = 1

model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])

model.fit(x_train, y_train_binary, batch_size=batch_size, epochs=epochs, validation_split=0.1)

In [None]:
score = model.evaluate(x_test, y_test_binary, verbose=0)
print("Test loss:", score[0])
print("Test accuracy:", score[1])

Now that we've trained and tested the model, let's look at the filters learned in the first convolutional layer.

In [None]:
filters_conv1 = model.get_layer(name='conv1').get_weights()[0]

fig, axs = plt.subplots(4, 8)
fig.set_figheight(10)
fig.set_figwidth(15)

for i in range(4):
    for j in range(8):
        f = filters_conv1[:, :, 0, 8*i+j]
        axs[i, j].imshow(f[:, :], cmap='gray')
        axs[i, j].axis('off')
        axs[i, j].set_title(8*i+j)

Describe what you see. Do (some of) the learned filters make sense to you?

Hint: Many filters have been designed and widely applied in image processing. [Here](http://www.theobjects.com/dragonfly/dfhelp/3-5/Content/05_Image%20Processing/Edge%20Detection%20Filters.htm) are some examples of edge detection filters and their effect on the image. You can find the details about each filter by clicking the links at the bottom.

In [None]:
# Your Markdown Here

### 2.2 Visualizing the feature maps

We can also look at the corresponding feature map for each filter. There are 32 kernels in the first convolutional layer, so there are 32 feature maps for each sample. `feature_map_conv1` is a 4D array where the first dimension is the index of the sample and the last dimension is the index of the corresponding filter.
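
As a rough illustration of how a single feature map arises from a single filter, the NumPy sketch below slides a 5x5 kernel over a 28x28 toy image and applies a ReLU, which is the operation a `Conv2D` layer with `activation="relu"` performs for each filter (Keras computes a cross-correlation, without flipping the kernel). The function name `feature_map`, the toy image, and the hand-made kernel are all assumptions for illustration; they are not learned filters from the model.

```python
import numpy as np

def feature_map(image, kernel, bias=0.0):
    """Unpadded cross-correlation of a 2D image with a 2D kernel, followed by ReLU."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w), dtype=float)
    for r in range(out_h):
        for c in range(out_w):
            out[r, c] = np.sum(image[r:r + kh, c:c + kw] * kernel) + bias
    return np.maximum(out, 0.0)

# Toy image: dark left half, bright right half (a single vertical edge).
image = np.zeros((28, 28))
image[:, 14:] = 1.0

# Hand-made filter that responds to dark-to-bright transitions from left to right.
kernel = np.tile(np.array([-1.0, -1.0, 0.0, 1.0, 1.0]), (5, 1))

fmap = feature_map(image, kernel)
print(fmap.shape)    # (24, 24)
print(fmap[0])       # nonzero only in the columns whose window straddles the edge
```

The feature map is zero wherever the image is flat and positive only near the edge, which is the kind of pattern to look for when comparing the learned filters with their feature maps.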

In [None]:
conv1_layer_model = keras.Model(inputs=model.input, outputs=model.get_layer('conv1').output)
feature_map_conv1 = conv1_layer_model(x_test)

Randomly draw 16 samples for visualization.

In [None]:
sample_index = random.sample(range(len(x_test)), 16)  # draw from all test indices, including 0

Choose two filters among all 32 filters from 2.1, and visualize their feature maps.

In [None]:
filter_n1 = #
filter_n2 = #

There is no need to modify the next code cells, just run the four cells below.

In [None]:
plt.imshow(filters_conv1[:, :, 0, filter_n1], cmap='gray')

In [None]:
fig, axs = plt.subplots(4, 8)
fig.set_figheight(10)
fig.set_figwidth(15)

for i in range(4):
    for j in range(4):
        axs[i, 2*j].imshow(x_test[sample_index[4*i+j], :, :, 0], cmap='gray')
        axs[i, 2*j].axis('off')
        axs[i, 2*j+1].imshow(feature_map_conv1[sample_index[4*i+j], :, :, filter_n1], cmap='gray')
        axs[i, 2*j+1].axis('off')

In [None]:
plt.imshow(filters_conv1[:, :, 0, filter_n2], cmap='gray')

In [None]:
fig, axs = plt.subplots(4, 8)
fig.set_figheight(10)
fig.set_figwidth(15)

for i in range(4):
    for j in range(4):
        axs[i, 2*j].imshow(x_test[sample_index[4*i+j], :, :, 0], cmap='gray')
        axs[i, 2*j].axis('off')
        axs[i, 2*j+1].imshow(feature_map_conv1[sample_index[4*i+j], :, :, filter_n2], cmap='gray')
        axs[i, 2*j+1].axis('off')

Comment on what you see in the feature maps.
* How do they correspond to the original images?
* How do they correspond to the filters?
* Why might the feature maps be helpful for classifying digits?

In [None]:
# Your markdown here

### 2.3 Fitting a logistic regression model on feature maps

The features of the images are further summarized after the second convolutional layer.

In [None]:
conv2_layer_model = keras.Model(inputs=model.input, outputs=model.get_layer('conv2').output)
feature_map_conv2 = conv2_layer_model(x_test)

fig, axs = plt.subplots(4, 8)
fig.set_figheight(10)
fig.set_figwidth(15)

for i in range(4):
    for j in range(4):
        axs[i, 2*j].imshow(x_test[sample_index[4*i+j], :, :, 0], cmap='gray')
        axs[i, 2*j].axis('off')
        axs[i, 2*j+1].imshow(feature_map_conv2[sample_index[4*i+j], :, :, 0], cmap='gray')
        axs[i, 2*j+1].axis('off')

Build and test a logistic regression model to classify two digits of your choice (i.e., a binary classification) using the feature maps at the second convolutional layer as the input. You may use logistic regression functions such as [LogisticRegression in sklearn](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html). Use 80% of the data for training and 20% for test.

* How many features are there in your input X? Show the derivation of this number based on the architecture of the convolutional neural network.

* How is your logistic regression model related to the fully connected layer and softmax layer in the convolutional neural network?

* What is the accuracy of your model? Is this expected, or surprising? 

* Comment on any other aspects of your findings that are interesting to you.
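
One possible way to restrict the data to two digits and make the 80/20 split is sketched below with plain NumPy on stand-in data. The random arrays, the digits 3 and 8, and the variable names are assumptions for illustration; in the notebook, `X_lr` and `y_lr` would presumably take the place of `X` and `y`, and `train_test_split` from sklearn with `test_size=0.2` could replace the manual split.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data (assumption): 1000 samples with 20 features and digit labels 0-9.
X = rng.normal(size=(1000, 20))
y = rng.integers(0, 10, size=1000)

digit_a, digit_b = 3, 8                      # hypothetical choice of digits
mask = (y == digit_a) | (y == digit_b)       # keep only the two chosen classes
X_two = X[mask]
y_two = (y[mask] == digit_b).astype(int)     # relabel: 0 for digit_a, 1 for digit_b

# Shuffled 80/20 split of the remaining samples.
order = rng.permutation(len(X_two))
n_train = int(0.8 * len(X_two))
X_tr, y_tr = X_two[order[:n_train]], y_two[order[:n_train]]
X_te, y_te = X_two[order[n_train:]], y_two[order[n_train:]]
print(X_tr.shape, X_te.shape)
```

Shuffling before splitting matters whenever the samples are stored in a meaningful order, since otherwise the training and test sets could differ systematically.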


In [None]:
X_lr = np.reshape(feature_map_conv2,(np.shape(feature_map_conv2)[0],-1))
y_lr = y_test

In [None]:
# Your code here

In [None]:
# Your markdown here

## Problem 3: All that glitters (20 points)

In this problem you will use Gaussian process regression to model the trends in gold medal performances of selected events in the summer Olympics. The objectives of this problem are for you to:

* Gain experience with Gaussian processes, to better understand how they work
* Explore how posterior inference depends on the properties of the prior mean and kernel
* Use Bayesian inference to identify unusual events
* Practice making your Python code modular and reusable

For this problem, the only starter code we provide is to read in the data and extract 
one event. You may write any GP code that you choose to, but please do not use any 
package for Gaussian processes; your code should be "np-complete" (using only 
basic `numpy` methods). You are encouraged to start from the [GP demo code](https://ydata123.org/sp22/interml/calendar.html) used in class.


When we ran the GP demo code from class on the marathon data, it generated the following plot:
<img src="https://github.com/YData123/sds365-fa22/raw/main/assignments/assn2/marathon.jpg" width="600">

Note several properties of this plot:
* It shows the Bayesian confidence of the regression as a shaded area. This is an approximate 95% confidence band because it extends $\pm 2 \sqrt{V}$ around the posterior mean, where $V$ is the estimated variance. The variance increases at the right side, for future years.

* The gold medal time for the 1904 marathon is outside of this confidence band. In fact, 
the 1904 marathon was an [unusual event](https://www.smithsonianmag.com/history/the-1904-olympic-marathon-may-have-been-the-strangest-ever-14910747/), and this is apparent from the model. 

* The plot shows the posterior mean, and also shows one random sample from the posterior distribution.
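
For reference, the quantities in the plot can be written down using the standard Gaussian process regression formulas. The notation here is an assumption and may differ from the demo code: $X$ and $y$ are the observed years and results, $m$ is the prior mean function, $k$ is the kernel, $K = k(X, X)$, and $\sigma^2$ is the noise variance. At a new input $x_*$,

$$
\begin{aligned}
\mu(x_*) &= m(x_*) + k(x_*, X)\,\bigl(K + \sigma^2 I\bigr)^{-1}\bigl(y - m(X)\bigr), \\
V(x_*) &= k(x_*, x_*) - k(x_*, X)\,\bigl(K + \sigma^2 I\bigr)^{-1} k(X, x_*),
\end{aligned}
$$

and the shaded band is $\mu(x_*) \pm 2\sqrt{V(x_*)}$. Here $V$ is the variance of the underlying function value; a band for a new noisy observation would use $V + \sigma^2$ instead, so it is worth checking which convention the demo code follows.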

Your task in this problem is to generate such a plot for six different Olympic events by writing a function

`def gp_olympic_event(year, result, kernel, mean, noise, event_name):
    ...`
    
 where the input variables are the following:
 
* `year`: a numpy array of years (integers)
* `result`: a numpy array of numerical results, for the gold medal performances in that event
* `kernel`: a kernel function 
* `mean`: a mean function 
* `noise`: a single float for the variance of the noise, $\sigma^2$
* `event_name`: a string used to label the y-axis, for example "marathon min/mile (men's event)"
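
The assignment leaves the form of `kernel` and `mean` up to you. One possible convention, sketched below, is to pass functions that act on NumPy arrays of years: `kernel(x1, x2)` returns a matrix of covariances and `mean(x)` returns an array of prior means. The factory names, the squared-exponential kernel, the constant mean, and all hyperparameter values are assumptions for illustration, not requirements.

```python
import numpy as np

def make_rbf_kernel(variance=1.0, length_scale=20.0):
    """Squared-exponential kernel; both hyperparameters are placeholder values."""
    def kernel(x1, x2):
        x1 = np.asarray(x1, dtype=float).reshape(-1, 1)
        x2 = np.asarray(x2, dtype=float).reshape(1, -1)
        return variance * np.exp(-0.5 * ((x1 - x2) / length_scale) ** 2)
    return kernel

def make_constant_mean(value):
    """Mean function that returns the same value for every input year."""
    def mean(x):
        return np.full(np.shape(x), float(value))
    return mean

years = np.array([1896, 1900, 1904, 1908])
kernel = make_rbf_kernel(variance=1.0, length_scale=20.0)
mean = make_constant_mean(6.0)

K = kernel(years, years)   # 4x4 symmetric matrix with `variance` on the diagonal
m = mean(years)            # array containing four copies of 6.0
print(K.shape, m)
```

Building the kernel and mean from small factory functions keeps `gp_olympic_event` unchanged across events: only the hyperparameters passed to the factories need to vary.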
 
Your function should compute the Gaussian process regression, and then display the resulting plot, analogous to the plot above for the men's marathon event.

You will then process **six** of the events, three men's events and three women's events, and call your function to generate the corresponding six plots.

For each event, you should create a markdown cell that describes the resulting model. Comment on such things as:

* How you chose the kernel, mean, and noise
* Why the plot does or doesn't look satisfactory to you
* Whether any events are notable, as the 1904 marathon was
* What happens to the posterior mean if there are gaps in the data (for example, during WWII)

Use your best judgement to describe your findings; post questions to Ed Discussion if things are unclear. And have fun!



------------------

In the remainder of this problem description, we recall how we processed the marathon data, as an example. The following cell reads in the data and displays the collection of events that are included in the dataset. 

In [None]:
import numpy as np
import pandas as pd

dat = pd.read_csv('https://raw.githubusercontent.com/YData123/sds365-sp22/main/demos/gaussian_processes/olympic_results.csv')
events = set(np.array(dat['Event']))
print(events)

We then process the time to compute the minutes per mile (without checking that the race was actually 26.2 miles!)

In [None]:
marathon = dat[dat['Event'] == 'Marathon Men']
marathon = marathon[marathon['Medal']=='G']
marathon = marathon.sort_values('Year')
time = np.array(marathon['Result'])
mpm = []
for tm in time:
    t = np.array(tm.split(':'), dtype=float)
    minutes_per_mile = (t[0]*60*60 + t[1]*60 + t[2])/(60*26.2)
    mpm.append(minutes_per_mile)
    
marathon['Minutes per Mile'] = np.round(mpm,2)
marathon = marathon.drop(columns=['Gender', 'Event'])
marathon.reset_index(drop=True, inplace=True)
year = np.array(marathon['Year'])
result = np.array(marathon['Minutes per Mile'])
marathon

Enter your code and markdown following this cell.