<a href="https://colab.research.google.com/github/aniketsharma00411/ML-Zoomcamp/blob/main/Session%208/Session%208.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Session #8 Homework

https://github.com/alexeygrigorev/mlbookcamp-code/blob/master/course-zoomcamp/08-deep-learning/homework.md

### Initialization

In [None]:
import numpy as np
import pandas as pd

import os
import shutil

from tensorflow.keras import Input
from tensorflow.keras.layers import Conv2D
from tensorflow.keras.layers import MaxPooling2D
from tensorflow.keras.layers import Flatten
from tensorflow.keras.layers import Dense
from tensorflow.keras import Model
from tensorflow.keras.optimizers import SGD
from tensorflow.keras.preprocessing.image import ImageDataGenerator

from statistics import median
from statistics import stdev
from statistics import mean

## Homework

### Dataset

In this homework, we'll build a model for predicting if we have an image of a dog or a cat. For this,
we will use the "Dogs & Cats" dataset that can be downloaded from [Kaggle](https://www.kaggle.com/c/dogs-vs-cats/data). 

You need to download the `train.zip` file.

If you have trouble downloading from Kaggle, use [this link](https://github.com/alexeygrigorev/large-datasets/releases/download/dogs-cats/train.zip) instead:

```bash
wget https://github.com/alexeygrigorev/large-datasets/releases/download/dogs-cats/train.zip
```

In the lectures we saw how to use a pre-trained neural network. In the homework, we'll train a much smaller model from scratch. 

**Note:** You don't need a computer with a GPU for this homework. A laptop or any personal computer should be sufficient. 


In [None]:
! wget https://github.com/alexeygrigorev/large-datasets/releases/download/dogs-cats/train.zip


In [None]:
! unzip -q train.zip

### Data Preparation

The dataset contains 12,500 images of cats and 12,500 images of dogs. 

Now we need to split this data into train and validation sets:

* Create `train` and `validation` folders
* In each folder, create `cats` and `dogs` folders
* Move the first 10,000 images (from 0 to 9999) of both cats and dogs to the train folder, and put them in their respective folders
* Move the remaining 2,500 images (from 10000 to 12499) to the validation folder

You can do this manually or with Python (check `os` and `shutil` packages).

In [None]:
def criteria(image):
    return int(image.split('.')[1])
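
The numeric key matters because a plain string sort would place `cat.10.jpg` before `cat.2.jpg`. A minimal sketch of the difference, using hypothetical file names that follow the `<label>.<index>.jpg` pattern:

```python
def criteria(image):
    # Same key as in the cell above: 'cat.123.jpg' -> 123
    return int(image.split('.')[1])

# Hypothetical file names, for illustration only.
names = ['cat.10.jpg', 'cat.2.jpg', 'cat.1.jpg']

lexicographic = sorted(names)          # compares character by character
numeric = sorted(names, key=criteria)  # compares by the integer index

print(lexicographic)
print(numeric)
```

With the numeric key, slicing `[:10000]` and `[10000:12500]` selects images 0 to 9999 and 10000 to 12499, as the instructions ask.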

In [None]:
def cat_dog_lists(images):
    cat = []
    dog = []
    for image in images:
        if 'cat' in image:
            cat.append(image)
        elif 'dog' in image:
            dog.append(image)
        else:
            # Raising a bare string is a TypeError in Python 3; raise a real exception.
            raise ValueError(f'Unknown image found: {image}')

    cat.sort(key=criteria)
    dog.sort(key=criteria)

    return cat[:10000], cat[10000:12500], dog[:10000], dog[10000:12500]

In [None]:
cat_train, cat_val, dog_train, dog_val = cat_dog_lists(os.listdir('train'))

In [None]:
os.rename('train', 'original')
os.mkdir('train')
os.mkdir('train/cat')
os.mkdir('train/dog')
os.mkdir('validation')
os.mkdir('validation/cat')
os.mkdir('validation/dog')

In [None]:
for image in cat_train:
    shutil.move(f'original/{image}', f'train/cat/{image}')

for image in cat_val:
    shutil.move(f'original/{image}', f'validation/cat/{image}')

for image in dog_train:
    shutil.move(f'original/{image}', f'train/dog/{image}')

for image in dog_val:
    shutil.move(f'original/{image}', f'validation/dog/{image}')

In [None]:
os.rmdir('original')

### Model

For this homework we will use a Convolutional Neural Network (CNN). Like in the lectures, we'll use Keras.

You need to develop the model with the following structure:

* The shape for input should be `(150, 150, 3)`
* Next, create a convolutional layer ([`Conv2D`](https://keras.io/api/layers/convolution_layers/convolution2d/)):
    * Use 32 filters
    * Kernel size should be `(3, 3)` (that's the size of the filter)
    * Use `'relu'` as activation 
* Reduce the size of the feature map with max pooling ([`MaxPooling2D`](https://keras.io/api/layers/pooling_layers/max_pooling2d/))
    * Set the pooling size to `(2, 2)`
* Turn the multi-dimensional result into vectors using a [`Flatten`](https://keras.io/api/layers/reshaping_layers/flatten/) layer
* Next, add a `Dense` layer with 64 neurons and `'relu'` activation
* Finally, create the `Dense` layer with 1 neuron - this will be the output
    * The output layer should have an activation - use the appropriate activation for the binary classification case

As the optimizer, use [`SGD`](https://keras.io/api/optimizers/sgd/) with the following parameters:

* `SGD(lr=0.002, momentum=0.8)`


For clarification about kernel size and max pooling, check [Week #11 Office Hours](https://www.youtube.com/watch?v=1WRgdBTUaAc).

In [None]:
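# Functional API: each layer is called on the output of the previous one.
# `inputs` avoids shadowing Python's built-in `input`.
inputs = Input(shape=(150, 150, 3))
conv = Conv2D(32, kernel_size=(3, 3), activation='relu')(inputs)
pooled = MaxPooling2D(pool_size=(2, 2))(conv)
flat = Flatten()(pooled)
hidden = Dense(64, activation='relu')(flat)
# Sigmoid gives a single probability for the binary (cat vs. dog) output.
outputs = Dense(1, activation='sigmoid')(hidden)

model = Model(inputs=inputs, outputs=outputs)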
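
The shapes this architecture produces can be worked out by hand. The sketch below assumes the Keras defaults for these layers (stride 1 and `'valid'` padding for `Conv2D`, stride equal to the pool size for `MaxPooling2D`); the helper names are hypothetical and not part of Keras.

```python
def conv_output_size(size, kernel, stride=1):
    # 'valid' padding: the kernel never extends past the image border
    return (size - kernel) // stride + 1

def pool_output_size(size, pool):
    # non-overlapping windows: stride equals the pool size
    return size // pool

side = 150
filters = 32

after_conv = conv_output_size(side, kernel=3)       # spatial side after Conv2D
after_pool = pool_output_size(after_conv, pool=2)   # spatial side after MaxPooling2D
flattened = after_pool * after_pool * filters       # vector length after Flatten

print(after_conv, after_pool, flattened)
```

The flattened length is what the 64-neuron `Dense` layer receives as input, which is why that layer dominates the model's size.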

### Question 1

Since we have a binary classification problem, what is the best loss function for us?

Note: since we specify an activation for the output layer, we don't need to set `from_logits=True`

In [None]:
model.compile(optimizer=SGD(learning_rate=0.002, momentum=0.8), loss='binary_crossentropy', metrics=['accuracy'])

The best loss function for the binary classification problem is `binary_crossentropy`.
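
For intuition, binary cross-entropy for a single example can be written out in plain Python. This is a simplified sketch, not Keras's implementation (which also clips probabilities and averages over the batch); the function name is hypothetical.

```python
import math

def binary_crossentropy(y_true, p):
    # y_true is the label (0 or 1); p is the predicted probability of class 1,
    # i.e. the sigmoid output of the last layer.
    return -(y_true * math.log(p) + (1 - y_true) * math.log(1 - p))

confident_right = binary_crossentropy(1, 0.9)  # small loss
uncertain = binary_crossentropy(1, 0.5)        # equals log(2)
confident_wrong = binary_crossentropy(1, 0.1)  # large loss

print(confident_right, uncertain, confident_wrong)
```

The loss grows sharply as the model becomes confident in the wrong class, which is the behaviour we want for a probability output.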

### Question 2

What's the total number of parameters of the model? You can use the `summary` method for that.

In [None]:
model.summary()

Model: "model"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
 input_1 (InputLayer)        [(None, 150, 150, 3)]     0         
                                                                 
 conv2d (Conv2D)             (None, 148, 148, 32)      896       
                                                                 
 max_pooling2d (MaxPooling2D  (None, 74, 74, 32)       0         
 )                                                               
                                                                 
 flatten (Flatten)           (None, 175232)            0         
                                                                 
 dense (Dense)               (None, 64)                11214912  
                                                                 
 dense_1 (Dense)             (None, 1)                 65        
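
The per-layer counts in the summary can be reproduced by hand, and their sum gives the total. A small sketch with hypothetical helper names: each filter or neuron has one weight per input plus one bias.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # weights per filter = kernel_h * kernel_w * in_channels, plus one bias
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(n_inputs, units):
    # one weight per input plus one bias, for every unit
    return (n_inputs + 1) * units

conv = conv2d_params(3, 3, 3, 32)
hidden = dense_params(175232, 64)   # 175232 is the Flatten output size above
output = dense_params(64, 1)
total = conv + hidden + output

print(conv, hidden, output, total)
```

Pooling, flattening, and the input layer have no trainable weights, so they contribute 0.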
                                                             

### Generators and Training

For the next two questions, use the following data generator for both train and validation:

```python
ImageDataGenerator(rescale=1./255)
```

* We don't need to do any additional pre-processing for the images.
* When reading the data from train/val directories, check the `class_mode` parameter. Which value should it be for a binary classification problem?
* Use `batch_size=20`
* Use `shuffle=True` for both training and validation

For training use `.fit()` with the following params:

```python
model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    validation_data=validation_generator,
    validation_steps=50
)
```

Note `validation_steps=50` - this parameter says "run only 50 steps on the validation data for evaluating the results". 
This way we iterate a bit faster, but don't use the entire validation dataset.
That's why it's important to shuffle the validation dataset as well.
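
To make this concrete, the arithmetic below shows how many images one epoch touches, assuming the 20,000 / 5,000 train/validation split from the data preparation step. The variable names are for illustration only.

```python
batch_size = 20
steps_per_epoch = 100
validation_steps = 50

n_train = 20000  # images in the train folder
n_val = 5000     # images in the validation folder

train_seen = batch_size * steps_per_epoch   # training images per epoch
val_seen = batch_size * validation_steps    # validation images per evaluation

print(train_seen, train_seen / n_train)
print(val_seen, val_seen / n_val)
```

Since only a fraction of the validation images is used each time, an unshuffled generator would keep evaluating on the same leading files, which could all belong to one class.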

In [None]:
gen = ImageDataGenerator(rescale=1./255)

train_generator = gen.flow_from_directory(
    'train',
    target_size=(150, 150),
    class_mode='binary',
    batch_size=20,
    shuffle=True
)

validation_generator = gen.flow_from_directory(
    'validation',
    target_size=(150, 150),
    class_mode='binary',
    batch_size=20,
    shuffle=True
)

Found 20000 images belonging to 2 classes.
Found 5000 images belonging to 2 classes.


In [None]:
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    validation_data=validation_generator,
    validation_steps=50
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


### Question 3

What is the median of training accuracy for this model?

In [None]:
median(history.history['accuracy'])

0.5814999938011169

### Question 4

What is the standard deviation of training loss for this model?

In [None]:
stdev(history.history['loss'])

0.016509834863151104
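
Note that `statistics.stdev` computes the sample standard deviation (dividing by n - 1), while `statistics.pstdev` computes the population version (dividing by n), which is also what `numpy.std` returns by default. With only 10 epochs the two differ noticeably. A minimal sketch with made-up loss values:

```python
from statistics import stdev, pstdev

# Hypothetical per-epoch losses, not the values from the run above.
losses = [0.70, 0.68, 0.66, 0.64]

sample_sd = stdev(losses)       # divides the squared deviations by n - 1
population_sd = pstdev(losses)  # divides the squared deviations by n

print(sample_sd, population_sd)
```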

### Data Augmentation

For the next two questions, we'll generate more data using data augmentations. 

Add the following augmentations to your training data generator:

* `rotation_range=40,`
* `width_shift_range=0.2,`
* `height_shift_range=0.2,`
* `shear_range=0.2,`
* `zoom_range=0.2,`
* `horizontal_flip=True,`
* `fill_mode='nearest'`
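
The instruction mentions the training generator only. A minimal sketch, assuming the validation images should keep just the rescaling, is to build two separate sets of keyword arguments. The dictionary names are hypothetical, and the snippet only prepares arguments without importing TensorFlow.

```python
# Shared preprocessing: scale pixel values to [0, 1].
base_kwargs = {'rescale': 1. / 255}

# Augmentations from the list above, intended for training data.
augmentation_kwargs = {
    'rotation_range': 40,
    'width_shift_range': 0.2,
    'height_shift_range': 0.2,
    'shear_range': 0.2,
    'zoom_range': 0.2,
    'horizontal_flip': True,
    'fill_mode': 'nearest',
}

train_kwargs = {**base_kwargs, **augmentation_kwargs}
validation_kwargs = dict(base_kwargs)  # assumption: no augmentation here

# Assumed usage, with ImageDataGenerator imported as in the Initialization cell:
#   train_gen = ImageDataGenerator(**train_kwargs)
#   val_gen = ImageDataGenerator(**validation_kwargs)
print(sorted(train_kwargs))
print(validation_kwargs)
```

Keeping validation un-augmented means the validation metrics measure performance on unmodified images; whether the homework intends this is an assumption.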

In [None]:
gen = ImageDataGenerator(
    rescale=1./255,
    rotation_range=40,
    width_shift_range=0.2,
    height_shift_range=0.2,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True,
    fill_mode='nearest'
    )

train_generator = gen.flow_from_directory(
    'train',
    target_size=(150, 150),
    class_mode='binary',
    batch_size=20,
    shuffle=True
)

validation_generator = gen.flow_from_directory(
    'validation',
    target_size=(150, 150),
    class_mode='binary',
    batch_size=20,
    shuffle=True
)

Found 20000 images belonging to 2 classes.
Found 5000 images belonging to 2 classes.


### Question 5 

Let's train our model for 10 more epochs using the same code as previously.
Make sure you don't re-create the model - we want to continue training the model
we already started training.

What is the mean of validation loss for the model trained with augmentations?

In [None]:
history = model.fit(
    train_generator,
    steps_per_epoch=100,
    epochs=10,
    validation_data=validation_generator,
    validation_steps=50
)

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [None]:
mean(history.history['val_loss'])

0.6560043811798095

### Question 6

What's the average of validation accuracy for the last 5 epochs (from 6 to 10)
for the model trained with augmentations?

In [None]:
mean(history.history['val_accuracy'][5:])

0.6098000049591065
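
The slice `[5:]` works because epoch numbers start at 1 while list indices start at 0, so epochs 6 to 10 sit at indices 5 to 9. A short sketch with made-up accuracy values:

```python
from statistics import mean

# Hypothetical per-epoch validation accuracies (epoch 1 is index 0).
val_accuracy = [0.50, 0.52, 0.55, 0.57, 0.58, 0.60, 0.61, 0.60, 0.62, 0.63]

last_five = val_accuracy[5:]   # epochs 6..10
same_slice = val_accuracy[-5:] # equivalent when there are exactly 10 epochs

print(last_five)
print(mean(last_five))
```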