<a href="https://colab.research.google.com/github/chrisguarnold/civica_dl_class/blob/main/CIVICA_DL_Code_2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>


# Code Nr. 2: MNIST Data Set Revisited
## Introduction to Deep Learning
## Civica Data Science Summer School, 29.7.2021
Christian Arnold, Cardiff University




We are revisiting the MNIST code here. 
* First, we will understand what is going on.
* Second, you will gain some hands-on experience with hyperparameters.


## 1 Housekeeping

In [None]:
# libraries
install.packages('keras')

library(keras)

## 2 Preparing Data

In [None]:
# Data  
mnist <- dataset_mnist()

In [None]:
# This time, we split the data *properly* into a train, validation and test set

# Images
# train data
train_images <- mnist$train$x
train_images <- array_reshape(train_images, c(60000, 28*28))
train_images <- train_images/255
str(train_images)

# training and validation data 
val_indices <- 1:10000
val_images <- train_images[val_indices,]
train_images_partial <- train_images[-val_indices,]
str(val_images)
str(train_images_partial)

# test data 
test_images <- mnist$test$x
test_images <- array_reshape(test_images, c(10000, 28*28))
test_images <- test_images/255
str(test_images) 

# targets
train_labels <- mnist$train$y
val_labels <- train_labels[val_indices]
train_labels_partial <- train_labels[-val_indices]
test_labels <- mnist$test$y

train_labels_partial <- to_categorical(train_labels_partial)
val_labels <- to_categorical(val_labels)
test_labels <- to_categorical(test_labels)
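
To make the last step above less of a black box, here is a minimal base-R sketch of what one-hot encoding does to integer labels. The helper `one_hot` and the toy labels are assumptions made for illustration; the notebook itself relies on `to_categorical()` from keras.

```r
# Base-R sketch of one-hot encoding for integer class labels 0..(K-1).
# `one_hot` is a hypothetical helper, not part of keras.
one_hot <- function(labels, num_classes = max(labels) + 1) {
  encoded <- matrix(0, nrow = length(labels), ncol = num_classes)
  # Matrix indexing: row i gets a 1 in column labels[i] + 1 (R is 1-indexed)
  encoded[cbind(seq_along(labels), labels + 1)] <- 1
  encoded
}

toy_labels <- c(5, 0, 4, 1)   # made-up labels for illustration
toy_encoded <- one_hot(toy_labels, num_classes = 10)
print(toy_encoded)
```

Each row contains exactly one 1, placed in the column that belongs to the digit. This matches the 10-unit softmax output layer used below.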

## 3 This is How It Works


### 3.1 Model

In [None]:
# A Simple Network Architecture 
network <- keras_model_sequential() %>% 
    layer_dense(units = 32, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 10, activation = "softmax")
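
To see what these two dense layers compute, the forward pass can be sketched in base R. This is an illustration under stated assumptions: the weights are random rather than trained, the inputs are fake images, and `relu`, `softmax`, `W1`, `b1`, `W2` and `b2` are names invented here, not objects that keras exposes.

```r
set.seed(1)

relu <- function(z) pmax(z, 0)
softmax <- function(z) {
  # Subtract each row's maximum before exponentiating for numerical stability
  e <- exp(z - apply(z, 1, max))
  e / rowSums(e)
}

x  <- matrix(runif(2 * 784), nrow = 2)               # two fake flattened images
W1 <- matrix(rnorm(784 * 32, sd = 0.05), 784, 32)    # hidden layer weights
b1 <- rep(0, 32)                                     # hidden layer biases
W2 <- matrix(rnorm(32 * 10, sd = 0.05), 32, 10)      # output layer weights
b2 <- rep(0, 10)                                     # output layer biases

hidden <- relu(sweep(x %*% W1, 2, b1, "+"))          # 2 x 32 activations
probs  <- softmax(sweep(hidden %*% W2, 2, b2, "+"))  # 2 x 10 class probabilities
print(round(probs, 3))
```

Every row of `probs` sums to one, which is why the softmax output can be read as a probability for each digit.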

In [None]:
# Compile 
network %>% compile(
    optimizer = 'rmsprop',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)
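
The loss and metric named in `compile()` can be computed by hand on a tiny example. The sketch below uses made-up probabilities for three observations and three classes; it follows the textbook definitions of categorical cross-entropy and accuracy and is not a copy of the Keras internals.

```r
# Toy predicted probabilities (rows = observations, columns = classes 0, 1, 2)
toy_probs <- rbind(c(0.7, 0.2, 0.1),
                   c(0.1, 0.8, 0.1),
                   c(0.3, 0.4, 0.3))
true_class <- c(0, 1, 2)

# One-hot targets: pick the matching rows of an identity matrix
targets <- diag(3)[true_class + 1, ]

# Categorical cross-entropy: mean negative log-probability of the true class
cross_entropy <- -mean(rowSums(targets * log(toy_probs)))

# Accuracy: share of observations whose most probable class is the true one
predicted <- max.col(toy_probs, ties.method = "first") - 1
accuracy <- mean(predicted == true_class)

print(c(cross_entropy = cross_entropy, accuracy = accuracy))
```

The third observation is misclassified, so it lowers the accuracy and, because its true class only receives probability 0.3, contributes the largest share of the loss.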

### 3.2 Train

In [None]:
# Train and validate
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)

*Hyperparameter tuning happens here (see the exercises below).*

In [None]:
# Once you are happy with the hyperparameters, evaluate on the held-out test set
network %>% evaluate(test_images, test_labels)

In [None]:
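# predict_classes() is gone from recent keras versions; take the most probable class (0-9) per row instead
apply(network %>% predict(test_images), 1, which.max) - 1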

## 4 Exercises: Learning How to Tune Hyperparameters

* Play around with different hyperparameters (see below) 
* Learn how the neural net is reacting to them

### Exercise 1: Different Network Architectures
* When does learning kick in on the training data? 
* When does the validation data follow?
* Can you control these moments?
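
When comparing the architectures in this exercise, it may help to know how many trainable parameters each one has. The sketch below counts weights and biases for a stack of dense layers; `count_dense_params` is a hypothetical helper written for this illustration, and the layer widths are taken from the cells that follow.

```r
# Parameters of a stack of dense layers: (inputs + 1 bias) * units per layer
count_dense_params <- function(input_dim, units) {
  fan_in <- c(input_dim, head(units, -1))   # each layer's input width
  sum((fan_in + 1) * units)
}

architectures <- list(
  narrow_shallow = c(16, 10),
  narrow_deep    = c(rep(16, 5), 10),
  wide_shallow   = c(256, 10),
  wide_deep      = c(rep(256, 9), 10),
  funnel         = c(512, 256, 128, 10)
)

param_counts <- sapply(architectures, count_dense_params, input_dim = 784)
print(param_counts)
```

For a keras model object, `summary(network)` reports the same kind of count, so the two can be checked against each other.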

In [None]:
# Try a narrow and shallow net 
network <- keras_model_sequential() %>% 
    layer_dense(units = 16, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 10, activation = "softmax")

In [None]:
# This is the compile and evaluation block. 
# Repeat it below on your own or reuse this block.
network %>% compile(
    optimizer = 'rmsprop',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

In [None]:
# Try a narrow and deep net 
network <- keras_model_sequential() %>% 
    layer_dense(units = 16, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

In [None]:
# Try a wide and shallow net 
network <- keras_model_sequential() %>% 
    layer_dense(units = 256, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

In [None]:
# Try a wide and deep net 
network <- keras_model_sequential() %>% 
    layer_dense(units = 256, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

In [None]:
# Try a net that starts wide and narrows as it gets deeper
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

### Exercise 2: Different Activation Functions

* What do you observe regarding accuracy and loss?
* What do you observe regarding training time?
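
One way to think about the differences you observe is to look at the gradients of the activation functions. The sketch below evaluates the derivatives of sigmoid, tanh and ReLU at a few points, using their standard textbook definitions; the function names are chosen here for illustration.

```r
sigmoid <- function(z) 1 / (1 + exp(-z))
relu    <- function(z) pmax(z, 0)

# Derivatives of the three activation functions
sigmoid_grad <- function(z) sigmoid(z) * (1 - sigmoid(z))
tanh_grad    <- function(z) 1 - tanh(z)^2
relu_grad    <- function(z) as.numeric(z > 0)

z <- c(-6, -2, 0, 2, 6)
grads <- data.frame(
  z       = z,
  sigmoid = sigmoid_grad(z),
  tanh    = tanh_grad(z),
  relu    = relu_grad(z)
)
print(round(grads, 4))
```

The sigmoid gradient never exceeds 0.25 and, like the tanh gradient, shrinks towards zero for large absolute inputs, whereas the ReLU gradient stays at 1 for every positive input.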

In [None]:
# Sigmoid
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "sigmoid", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "sigmoid") %>%
    layer_dense(units = 128, activation = "sigmoid") %>%
    layer_dense(units = 10, activation = "softmax")

In [None]:
# Compile and Evaluate
network %>% compile(
    optimizer = 'rmsprop',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history) 

In [None]:
# Tanh
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "tanh", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "tanh") %>%
    layer_dense(units = 128, activation = "tanh") %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

In [None]:
# ReLU
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

# Compile and Evaluate...

### Exercise 3: Different Learning Rates
* What do you observe regarding training and validation loss and accuracy?
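
The effect of the learning rate is easiest to see on a toy problem. The sketch below runs plain gradient descent (not RMSprop) on the one-dimensional loss f(w) = w^2; the three rates are chosen to suit this toy loss and are not the values used in the exercise.

```r
# Plain gradient descent on f(w) = w^2, whose gradient is 2 * w
gradient_descent <- function(learning_rate, w_start = 1, steps = 20) {
  w <- w_start
  for (i in seq_len(steps)) {
    grad <- 2 * w
    w <- w - learning_rate * grad
  }
  w
}

rates <- c(0.01, 0.1, 1.1)
final_w <- sapply(rates, gradient_descent)
print(data.frame(learning_rate = rates, final_w = final_w))
```

The smallest rate is still far from the minimum at 0 after 20 steps, the middle rate gets close, and the largest rate overshoots on every step and moves further away from the minimum.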


In [None]:
# Use this model
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

In [None]:
# Compile 
network %>% compile(
    optimizer = optimizer_rmsprop(learning_rate = 0.001),  # older keras versions call this argument lr
    loss = "categorical_crossentropy",
    metrics = c("accuracy")
)

In [None]:
# Evaluate
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

In [None]:
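# Re-create the network first (re-run the "Use this model" cell): compile() does not reset trained weights
network %>% compile(
    optimizer = optimizer_rmsprop(learning_rate = 0.01),  # older keras versions call this argument lr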
    loss = "categorical_crossentropy",
    metrics = c("accuracy")
)

# Evaluate...

In [None]:
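# Re-create the network first (re-run the "Use this model" cell): compile() does not reset trained weights
network %>% compile(
    optimizer = optimizer_rmsprop(learning_rate = 0.1),  # older keras versions call this argument lr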
    loss = "categorical_crossentropy",
    metrics = c("accuracy")
)

# Evaluate...

### Exercise 4: Different Optimizers
* What do you observe regarding training and validation loss and accuracy?
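
To get a feeling for why optimizers behave differently, the sketch below compares plain SGD with a textbook form of the RMSprop update on a toy quadratic loss that is steep in one direction and flat in the other. Everything here is an assumption made for illustration: the loss, the step sizes, and the simplified update rules, which leave out details of the Keras implementations. Adam is omitted.

```r
# Toy loss: 0.5 * (100 * w1^2 + w2^2), steep in w1 and flat in w2
curvature <- c(100, 1)
loss_fn <- function(w) 0.5 * sum(curvature * w^2)
grad_fn <- function(w) curvature * w

# Plain SGD: one step size for every coordinate
run_sgd <- function(w, learning_rate = 0.001, steps = 50) {
  for (i in seq_len(steps)) w <- w - learning_rate * grad_fn(w)
  w
}

# Textbook RMSprop: divide each coordinate's gradient by a running
# root-mean-square of its recent gradients
run_rmsprop <- function(w, learning_rate = 0.01, rho = 0.9,
                        eps = 1e-7, steps = 50) {
  v <- rep(0, length(w))
  for (i in seq_len(steps)) {
    g <- grad_fn(w)
    v <- rho * v + (1 - rho) * g^2
    w <- w - learning_rate * g / (sqrt(v) + eps)
  }
  w
}

w_start <- c(1, 1)
w_sgd <- run_sgd(w_start)
w_rms <- run_rmsprop(w_start)
print(rbind(sgd = w_sgd, rmsprop = w_rms))
```

SGD makes quick progress along the steep coordinate but barely moves along the flat one, while the RMSprop rescaling makes both coordinates advance at nearly the same pace.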

In [None]:
# Use this model
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

In [None]:
# Compile
network %>% compile(
    optimizer = 'sgd',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

In [None]:
# Evaluate
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

In [None]:
# Compile (re-create the network first so that training starts from fresh weights)
network %>% compile(
    optimizer = 'adam',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

# Evaluate...

In [None]:
# Compile (re-create the network first so that training starts from fresh weights)
network %>% compile(
    optimizer = 'rmsprop',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

# Evaluate...

### Exercise 5: Different Batch Sizes
* Use the standard model and the default rmsprop settings.
* Re-create and recompile the model before every new run; recompiling alone does not reset the trained weights.
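
Batch size determines how many weight updates happen per epoch, which helps explain the differences in run time. The sketch below does the arithmetic for the 50,000 images in the partial training set and the three batch sizes used in this exercise; the variable names are chosen here for illustration.

```r
n_train     <- 50000            # 60,000 training images minus 10,000 for validation
epochs      <- 20
batch_sizes <- c(32, 512, 2056)

# The last batch of an epoch may be smaller, hence the ceiling
steps_per_epoch <- ceiling(n_train / batch_sizes)
total_updates   <- steps_per_epoch * epochs

print(data.frame(batch_size = batch_sizes, steps_per_epoch, total_updates))
```

With a batch size of 32 the network performs 1,563 updates per epoch, compared with 98 for 512 and 25 for 2056, so small batches mean many more, but noisier, gradient steps.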

    


In [None]:
# Use this model
network <- keras_model_sequential() %>% 
    layer_dense(units = 512, activation = "relu", input_shape = c(28*28)) %>%
    layer_dense(units = 256, activation = "relu") %>%
    layer_dense(units = 128, activation = "relu") %>%
    layer_dense(units = 10, activation = "softmax")

In [None]:
# Use these compiler settings
network %>% compile(
    optimizer = 'rmsprop',
    loss = 'categorical_crossentropy',
    metrics = c('accuracy')
)

In [None]:
# Evaluate (takes a while!)
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 32,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

In [None]:
# Evaluate
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 512,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

In [None]:
# Evaluate
history <- network %>% fit(
    train_images_partial,
    train_labels_partial,
    epochs = 20,
    batch_size = 2056,
    validation_data = list(val_images, val_labels)
)
max(history$metrics$val_accuracy)
plot(history)

## 5 Takeaway
The big lesson is how sensitive neural nets are to their hyperparameters.
* Depending on their set-up, they can yield dramatically different results.
* Neural nets are not 'a model', but rather a whole class of models.