<a href="https://colab.research.google.com/github/dancopeland/creative-machine-learning-for-design/blob/master/4453x_homework1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Homework 1 | Classification 🐕🐱
4.453x Creative Machine Learning for Design<br/>
*Renaud Danhaive, Ous Abou Ras, Natasha Hirt, Caitlin Mueller*
<br/><br/>

---

👋👋 Welcome to the first homework of 4.453x!! 😃🎈
<br/><br/>
In this assignment, you will train a model to distinguish cats from dogs using a limited dataset. We pre-processed this dataset separately from the Kaggle Cats and Dogs dataset using TensorFlow's API (TensorFlow is another deep learning framework), which provides easy abstractions for downloading all kinds of datasets.

We worked through a very similar notebook in Lecture 1. As we saw in that first notebook on transfer learning with a tiny self-generated dataset, it's hard to make things work well if your dataset is too small. Here, we have a slightly more substantial dataset (still tiny by modern standards) that should help us achieve decent performance.

This notebook jumps ahead a bit in terms of machine learning concepts and implementation.  We will return to these concepts in a deeper, more systematic manner as we progress in the semester.  The purpose of this assignment and notebook is to help you build confidence programming in Colab notebook environments and get comfortable working with data and external libraries.

The code below provides a skeleton around the questions you are asked to complete. Questions and instructions are directly in the body of the notebook.
<br/><br/>

❗❗❗

The deliverable for this homework is a *viewable* link to your completed and runnable notebook that implements the requested code. For questions that require you to submit a textual or visual answer, please collect your results in a separate Google Docs file to be shared at the end of the notebook. For questions that request numerical results, you may simply print them with your code.

❗❗❗

---


Let's introduce some coding best practices. Why? Well, we need to read your code, but more importantly *you* may need to re-read your code at some point. Simple coding practices help you structure your thinking and write code faster.
We recommend the following practices:
- Explicit is better than implicit when naming objects. For example, a variable pointing to an array of dog images is better named as `dog_images_array` than as `d_array`.
- Don't repeat yourself (DRY). If you find yourself copy-pasting code, it's a good sign you should put that code in a function.
- Try to produce code that could almost be read as a sentence by somebody with some knowledge of what the code is trying to accomplish.
- More readable code is (almost always) better code.

---

# 0.0 | Imports 📦

In [None]:
import numpy as np
import torch
import torchvision
import matplotlib.pyplot as plt
import sklearn

# Question 1.0 | Download and inspect the dataset


### 1.1 | Download the dataset

We'll download the dataset to the local storage of this runtime. To do so easily, we'll use a command-line tool called `gdown`, specifically designed to download files from Google Drive. In Colab, if you start a line of code with `!`, that line is run in the command line. This allows you to install missing packages or download things easily from the web to the local storage. To install `gdown`, we just need to run `!pip install gdown`.

Note: you may have to run this cell twice to actually download the file to `processedDogCat.npz`, as Colab sometimes returns a warning first.

In [None]:
!pip install --upgrade --no-cache-dir gdown > /dev/null
!gdown https://drive.google.com/uc?id=1SnEDJg3DFsFvSU3VX1L3Dj3LrR2uBGuC

Downloading...
From: https://drive.google.com/uc?id=1SnEDJg3DFsFvSU3VX1L3Dj3LrR2uBGuC
To: /content/processedDogCat.npz
100% 540M/540M [00:03<00:00, 144MB/s]


### 1.2 | Load the dataset
OK, so we've got our dataset in the file `processedDogCat.npz`. `.npz` is a special format that can be read by `numpy`, so let's load the data contained in the file into a variable.

In [None]:
dataset = np.load("processedDogCat.npz", allow_pickle=True)

### 1.3 | Inspect this dataset

In [None]:
print(list(dataset.keys()))

['X', 'Y']


We see the dataset has two entries: `X` (the input features, i.e. the images) and `Y` (the target classes).

In [None]:
X = dataset["X"]
Y = dataset["Y"]

### 1.4 ❓ Print the shapes of the X and Y arrays. 
❗ How many samples are there in the dataset?  In other words, what are the dimensions of the X and Y arrays?  *Hint: the `np.shape` function may be useful here.*
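
As a minimal sketch of how shapes are read, the snippet below uses small stand-in arrays rather than the real `X` and `Y`. The channels-last layout `(samples, height, width, channels)` and the names `example_images` and `example_labels` are assumptions made for illustration only.

```python
import numpy as np

# Stand-in data (assumption): 8 RGB images of 150x150 pixels, channels last,
# with one integer label per image. The real X and Y come from the .npz file.
example_images = np.zeros((8, 150, 150, 3), dtype=np.uint8)
example_labels = np.zeros(8, dtype=np.int64)

# np.shape(array) and array.shape return the same tuple.
print("images shape:", np.shape(example_images))
print("labels shape:", example_labels.shape)

# The first axis indexes samples, so its length is the number of samples.
number_of_samples = example_images.shape[0]
print("number of samples:", number_of_samples)
```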


### 1.5 ❓ Plot 5 random images from X (input)

### 1.6 ❓ What are the target labels in Y and which one corresponds to the "Dog" category?




Each datapoint in the dataset is a 150-by-150 RGB image and its associated class. First, we'll shuffle the dataset because it is currently ordered by class (not something you usually want for training). Shuffling the dataset will also allow you to easily set aside a part of your dataset for validation/testing.

### 1.7 ❓ Shuffle the dataset 
You can use `sklearn.utils.shuffle`.  Remember that we want to keep X and Y together so that the class labels stay with their associated images.
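
To illustrate why both arrays should go through a single call, here is a small sketch on stand-in data. Each fake "image" is filled with its own label value (an assumption made purely so the pairing can be checked afterwards), and `random_state=0` is only there to make the run repeatable.

```python
import numpy as np
from sklearn.utils import shuffle

# Stand-in data (assumption): each "image" is filled with its own label value,
# so we can check that pairs stay together after shuffling.
example_labels = np.array([0, 0, 0, 1, 1, 1])
example_images = np.stack([np.full((2, 2), label) for label in example_labels])

# Passing both arrays in one call applies the same permutation to each of them.
shuffled_images, shuffled_labels = shuffle(example_images, example_labels, random_state=0)

# Every image should still sit next to the label it started with.
pairs_stay_together = all(
    image[0, 0] == label for image, label in zip(shuffled_images, shuffled_labels)
)
print(shuffled_labels)
print("pairs stay together:", pairs_stay_together)
```

Shuffling `X` and `Y` in two separate calls without a shared seed would apply two different permutations and silently break the image/label correspondence.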

### 1.8 ❓ What is the range of the pixel values?

# Question 2.0 | Processing the images into extracted features using VGG16
Take inspiration from the code in the [L01 notebook](https://colab.research.google.com/drive/1GTWORtS_bU__mH33_Mput_KYxYopqBuV?usp=sharing). Make sure to use the GPU!


You will need to resize the arrays/tensors from their current size (which is too small) to the minimum size accepted by VGG16 (224 by 224). To do so, we recommend you use `torchvision.transforms.Resize` (note this transformation can be applied on a batch of multiple images at once).

❗❗ Because we are handling many more images than we did in the lecture notebook, you may need to process them in batches (hint: you need to do it if you encounter a RAM or CUDA memory issue).
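
One possible way to organize batch processing is sketched below. The helper `iterate_in_batches` is hypothetical (it is not part of the lecture notebook), and the multiplication by 2 is just a placeholder for the real per-batch work.

```python
import numpy as np

def iterate_in_batches(array, batch_size):
    """Yield consecutive slices of `array` with at most `batch_size` rows."""
    for start_index in range(0, len(array), batch_size):
        yield array[start_index:start_index + batch_size]

# Stand-in data (assumption): 10 tiny "images" with 4 values each.
example_images = np.arange(10 * 4).reshape(10, 4)

processed_batches = []
for image_batch in iterate_in_batches(example_images, batch_size=4):
    # Placeholder for the real per-batch work (resize, normalize, forward pass).
    processed_batches.append(image_batch * 2)

# Stitch the per-batch results back into one array, in the original order.
processed_images = np.concatenate(processed_batches, axis=0)
print([len(batch) for batch in processed_batches])
print(processed_images.shape)
```

Only one batch needs to live on the GPU at a time, which is what keeps memory use bounded; the last batch may be smaller than the others.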

### 2.1 ❓ Convert the array X to a tensor

### 2.2 ❓ Resize the image tensors. 
Remember that we need to permute the array to format the tensors in the anticipated way.

### 2.3 ❓ Normalize the image tensors using the normalization constants expected by VGG16.

### 2.4 ❓ Project the image tensors to their bottleneck features

### 2.5 ❓ Flatten the resulting *feature maps* (not the batches!)
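
The difference between flattening the feature maps and flattening the batch can be seen on a stand-in tensor. The shape `(4, 512, 7, 7)` is an assumption for illustration; the actual shape depends on where the network is cut.

```python
import torch

# Stand-in feature maps (assumption): a batch of 4 images, each represented by
# 512 feature maps of size 7x7. The real shape depends on where VGG16 is cut.
feature_maps = torch.randn(4, 512, 7, 7)

# start_dim=1 keeps the batch axis and flattens every axis after it,
# giving one feature vector per image.
flattened_features = torch.flatten(feature_maps, start_dim=1)

print(flattened_features.shape)
```

Calling `torch.flatten(feature_maps)` without `start_dim=1` would merge all images into a single long vector, which is not what a per-image classifier needs.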

---



### 2.6 ❓ Convert the result back to numpy
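
A common pattern for this conversion is sketched below on a stand-in tensor (the shape is an assumption). The `.cpu()` call is a no-op when the tensor is already on the CPU, so the same line works with or without a GPU.

```python
import numpy as np
import torch

# Stand-in features (assumption): 4 flattened feature vectors.
flattened_features = torch.randn(4, 25088)

# detach() drops gradient tracking, cpu() moves the data off the GPU if needed,
# and numpy() exposes the values as a NumPy array.
features_array = flattened_features.detach().cpu().numpy()

print(type(features_array), features_array.shape)
```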

To be able to test how our model fares on unseen data, we need to set aside data that is not used during training. This data is our *validation* data and computing the accuracy of our model on this data gives us an idea of the *generalization error* of our model.

This is a really important process in ML, because in some sense predicting things perfectly on the training set does not matter (think of how stupidly replicating the data achieves perfect prediction). Generally, you would even want another test set, but we'll talk more about this in later weeks.

### 2.7 ❓ Separate the dataset into a training set (90% of the data) and a validation set (10% of the data)
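
One simple way to make such a split, assuming the data has already been shuffled, is to slice at the 90% mark. The arrays below are stand-ins, and all variable names are illustrative.

```python
import numpy as np

# Stand-in data (assumption): 20 already-shuffled feature vectors and labels.
example_features = np.arange(20 * 3).reshape(20, 3)
example_labels = np.arange(20) % 2

# Index separating the first 90% of the samples from the last 10%.
split_index = int(0.9 * len(example_features))

train_features = example_features[:split_index]
validation_features = example_features[split_index:]
train_labels = example_labels[:split_index]
validation_labels = example_labels[split_index:]

print(len(train_features), len(validation_features))
```

Slicing only gives a fair split if the samples are in random order; on data still sorted by class, the validation set would contain a single class.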

# Question 3.0 | Fit k-NN model
Like in the [L01 notebook](https://colab.research.google.com/drive/1GTWORtS_bU__mH33_Mput_KYxYopqBuV?usp=sharing), fit a k-nearest neighbors model with $k=10$ to your training set.

Remember, the task of the model is to try to predict whether a given image shows a dog or a cat. This is a VERY important model that could change the world as we know it. 🚨🚨🚨🚨

### 3.1 ❓ Fit a k-nearest neighbors model to your training set
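
For reference, this is roughly what fitting a k-NN classifier looks like with scikit-learn's `KNeighborsClassifier`, shown here on two synthetic 2D clusters rather than on the extracted image features. The data and variable names are assumptions for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-in features (assumption): two well-separated 2D clusters, 20 points each.
random_generator = np.random.default_rng(0)
class_zero_features = random_generator.normal(loc=0.0, scale=0.5, size=(20, 2))
class_one_features = random_generator.normal(loc=5.0, scale=0.5, size=(20, 2))
train_features = np.concatenate([class_zero_features, class_one_features])
train_labels = np.array([0] * 20 + [1] * 20)

# k=10: each prediction is a majority vote among the 10 closest training points.
knn_model = KNeighborsClassifier(n_neighbors=10)
knn_model.fit(train_features, train_labels)

# Query one point near the center of each cluster.
predicted_labels = knn_model.predict([[0.0, 0.0], [5.0, 5.0]])
print(predicted_labels)
```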

### 3.2 ❓ Compute the accuracy of the model on the training and validation sets (separately). Is the model doing well? 
❗ Record the values found and your observations in the Google Doc you link at the end of the notebook.

Hint: we measure accuracy as `number_correct_predictions/total_number_predictions`. 
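
The formula in the hint translates directly to NumPy. The predictions and labels below are made up for illustration.

```python
import numpy as np

# Stand-in predictions and ground truth (assumption: made-up values).
predicted_labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
true_labels = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Element-wise comparison gives True where the prediction is correct.
number_correct_predictions = np.sum(predicted_labels == true_labels)
total_number_predictions = len(true_labels)
accuracy = number_correct_predictions / total_number_predictions

print(accuracy)  # 6 of the 8 predictions match, so this prints 0.75
```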

### 3.3 ❓ Plot images of 5 good predictions and 5 bad predictions. Can you observe patterns in the bad predictions?
❗ Record these images and your observations in the Google Doc you link at the end of the notebook.

### 3.4 ❓ Test 5 different values of $k$ and compute the validation accuracy of each resulting model. What is the best value of $k$ that you found?
❗ Record your observations in the Google Doc you link at the end of the notebook.
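
A loop over candidate values of $k$ might be organized as in the sketch below. It runs on synthetic, overlapping 2D clusters, and the candidate list `[1, 3, 5, 10, 20]` is an arbitrary choice for illustration, not a recommendation. `model.score` returns the mean accuracy on the data it is given.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data (assumption): two overlapping 2D clusters, 100 points each.
random_generator = np.random.default_rng(0)
features = np.concatenate([
    random_generator.normal(loc=0.0, scale=1.5, size=(100, 2)),
    random_generator.normal(loc=2.0, scale=1.5, size=(100, 2)),
])
labels = np.array([0] * 100 + [1] * 100)

# Shuffle, then keep 90% for training and 10% for validation.
permutation = random_generator.permutation(len(features))
features, labels = features[permutation], labels[permutation]
split_index = int(0.9 * len(features))
train_features, validation_features = features[:split_index], features[split_index:]
train_labels, validation_labels = labels[:split_index], labels[split_index:]

# Fit one model per candidate k and record its validation accuracy.
validation_accuracy_by_k = {}
for number_of_neighbors in [1, 3, 5, 10, 20]:
    model = KNeighborsClassifier(n_neighbors=number_of_neighbors)
    model.fit(train_features, train_labels)
    validation_accuracy_by_k[number_of_neighbors] = model.score(
        validation_features, validation_labels
    )

best_k = max(validation_accuracy_by_k, key=validation_accuracy_by_k.get)
print(validation_accuracy_by_k)
print("best k:", best_k)
```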


### Submit a link to the document where you recorded the requested observations below. Make sure it is viewable by the teaching staff. 

Your link 💻 >

# Congratulations, you're done! 💪 There was a lot of brand new stuff here, you should be proud of yourself!! 👏⚡