**Chapter 11 – Training Deep Neural Networks**

Homework notebook

### Student ID:  

### Name:  

# Setup

First, let's import a few common modules, ensure Matplotlib plots figures inline, and prepare a function to save the figures. We also check that Python 3.7 or later is installed, as well as Scikit-Learn ≥0.20 and TensorFlow ≥2.8.

In [None]:
# Python ≥3.7 is required
import sys
print("Python: ", sys.version_info)
assert sys.version_info >= (3, 7)

# Scikit-Learn ≥0.20 is required
import sklearn
print("sklearn version: ", sklearn.__version__)
# Compare numeric (major, minor) tuples; comparing version strings is lexicographic
assert tuple(map(int, sklearn.__version__.split(".")[:2])) >= (0, 20)

try:
    # %tensorflow_version only exists in Colab.
    %tensorflow_version 2.x
    IS_COLAB = True
except Exception:
    IS_COLAB = False

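# TensorFlow ≥2.8 is required
import tensorflow as tf
from tensorflow import keras  # the cells below refer to keras directly
print("TF version: ", tf.__version__)
# Compare numeric (major, minor) tuples: as strings, "2.10" would sort before "2.8"
assert tuple(map(int, tf.__version__.split(".")[:2])) >= (2, 8)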

if not tf.config.list_physical_devices('GPU'):
    print("No GPU was detected. CNNs can be very slow without a GPU.")
    if IS_COLAB:
        print("Go to Runtime > Change runtime and select a GPU hardware accelerator.")

# Check whether this TensorFlow build has GPU support
print("Built with GPU support: ", tf.test.is_built_with_gpu_support())

# To prevent "CUDNN_STATUS_ALLOC_FAILED" error with GPUs
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    try:
        # Currently, memory growth needs to be the same across GPUs
        for gpu in gpus:
            tf.config.experimental.set_memory_growth(gpu, True)
        logical_gpus = tf.config.list_logical_devices('GPU')
        print(len(gpus), "Physical GPUs,", len(logical_gpus), "Logical GPUs")
    except RuntimeError as e:
        # Memory growth must be set before GPUs have been initialized
        print(e)

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

# Where to save the figures
PROJECT_ROOT_DIR = "."
CHAPTER_ID = "ann"
IMAGES_PATH = os.path.join(PROJECT_ROOT_DIR, "images", CHAPTER_ID)
os.makedirs(IMAGES_PATH, exist_ok=True)

def save_fig(fig_id, tight_layout=True, fig_extension="png", resolution=300):
    path = os.path.join(IMAGES_PATH, fig_id + "." + fig_extension)
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(path, format=fig_extension, dpi=resolution)

# Ignore useless warnings (see SciPy issue #5998)
import warnings
warnings.filterwarnings(action="ignore", message="^internal gelsd")

## Nonsaturating Activation Functions

### SELU

### Exercise 11.1  
Build the same model as above, but with the ELU activation function. Train it with the same training procedure as above and compare the results.
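
As a reminder of what the activation computes, here is a minimal plain-Python sketch of ELU. The helper name `elu` is introduced here for illustration and is not part of the notebook; `alpha=1.0` matches the default used by Keras for `activation="elu"`.

```python
import math

def elu(z, alpha=1.0):
    """ELU: identity for z > 0, alpha * (exp(z) - 1) otherwise."""
    return z if z > 0 else alpha * (math.exp(z) - 1.0)

# Positive inputs pass through unchanged; negative inputs saturate toward -alpha
for z in (-5.0, -1.0, 0.0, 2.0):
    print(f"elu({z:+.1f}) = {elu(z):+.4f}")
```

Because the negative branch is smooth and nonzero, ELU units keep a nonzero gradient for negative inputs, which is the property the exercise asks you to compare against the other activations.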

In [None]:
# Ex.11.1a Model with elu activation
keras.backend.clear_session()
tf.random.set_seed(42)

model3 = keras.models.Sequential([


#### Ex.11.1b Compare the results of loss, val_accuracy and training time for LeakyReLU, PReLU, SELU, and ELU  



## Reusing Pretrained Layers

### Reusing a Keras model

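### Exercise 11.2  
Following the "Reusing a Keras model" section, train a model on 7 Fashion MNIST classes (all classes except 5, 6, and 7). Then, for classes 5, 6, and 7, compare a model trained on its own with 300 training instances against a model that uses transfer learning. Apart from the number of excluded classes increasing to three, write the code exactly as in the lecture code above. Compare and analyze the results.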

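One possible way to do the label bookkeeping for this split is sketched below with NumPy. The function name `split_dataset_3` and the tiny stand-in arrays are hypothetical and only illustrate the relabeling; they are not the lecture's `split_dataset` and do not use Fashion MNIST.

```python
import numpy as np

def split_dataset_3(X, y):
    """Split into task A (the 7 remaining classes) and task C (classes 5, 6, 7)."""
    mask_c = np.isin(y, (5, 6, 7))
    y_a = y[~mask_c].copy()
    y_a[y_a > 7] -= 3            # classes 8, 9 become 5, 6 so labels run 0..6
    y_c = y[mask_c] - 5          # classes 5, 6, 7 become 0, 1, 2
    return (X[~mask_c], y_a), (X[mask_c], y_c)

# Tiny stand-in data: one sample per class label 0..9
y_demo = np.arange(10)
X_demo = np.arange(10).reshape(10, 1)
(X_a, y_a), (X_c, y_c) = split_dataset_3(X_demo, y_demo)
print(y_a)
print(y_c)
```

With three target classes, the small task is no longer binary, so its output layer would presumably be a 3-unit softmax trained with `sparse_categorical_crossentropy` rather than a single sigmoid unit.
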
In [None]:
# Ex.11.2a Training on the 7-class data
def split_dataset(X, y):


In [None]:
# Ex.11.2b Training on the 3 classes using 300 instances
model_C =

In [None]:
# Ex.11.2c Transfer learning using pretrained layers
model_A = keras.models.load_model("my_model_A.h5")
model_C_on_A =

#### Ex.11.2d Compare the results

# Faster Optimizers

### Exercise 11.3  
Starting from Model6, create models `model6c` through `model6i`, one for each of the optimizers above. Train them and compare the results (training time and accuracy).
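
To make the difference between the first optimizers concrete, the sketch below applies plain gradient descent, momentum, and Nesterov accelerated gradient (NAG) to a toy one-dimensional loss. The function `minimize`, the toy loss, and the hyperparameter values are assumptions chosen for illustration; in Keras the same options correspond to `keras.optimizers.SGD(momentum=0.9)` and `nesterov=True`.

```python
def minimize(grad, theta, lr=0.1, momentum=0.0, nesterov=False, n_steps=300):
    """Gradient descent with optional (Nesterov) momentum on a scalar parameter."""
    m = 0.0
    for _ in range(n_steps):
        # NAG measures the gradient slightly ahead, at theta + momentum * m
        lookahead = theta + momentum * m if nesterov else theta
        m = momentum * m - lr * grad(lookahead)
        theta += m
    return theta

grad = lambda t: 2.0 * t   # gradient of the toy loss f(t) = t**2

for name, kwargs in [("SGD", {}),
                     ("Momentum", {"momentum": 0.9}),
                     ("NAG", {"momentum": 0.9, "nesterov": True})]:
    print(name, minimize(grad, theta=5.0, **kwargs))
```

On such a simple loss all three variants reach the minimum, so this only shows the update rules; differences in speed are what the exercise measures on the real model.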

In [None]:
# Ex.11.3a
# Momentum
np.random.seed(42)
tf.random.set_seed(42)
model6c =

In [None]:
# NAG
np.random.seed(42)
tf.random.set_seed(42)
model6d =

In [None]:
# AdaGrad
np.random.seed(42)
tf.random.set_seed(42)
model6e =

In [None]:
# RMSProp
np.random.seed(42)
tf.random.set_seed(42)
model6f =

In [None]:
# Adam
np.random.seed(42)
tf.random.set_seed(42)
model6g =

In [None]:
# Adamax
np.random.seed(42)
tf.random.set_seed(42)
model6h =

In [None]:
# Nadam
np.random.seed(42)
tf.random.set_seed(42)
model6i =

#### Ex.11.3b  
Compare the results

## Learning Rate Scheduling

### Exercise 11.4  
Train the model using the `keras.optimizers.schedules.ExponentialDecay` learning rate schedule shown above.
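
For intuition, the rule that `ExponentialDecay` applies can be written in a few lines of plain Python. The function below is an illustrative re-implementation, not the Keras class, and the default values are assumptions chosen for the example.

```python
def exponential_decay(step, initial_learning_rate=0.01, decay_steps=20_000,
                      decay_rate=0.1, staircase=False):
    """lr(step) = initial_learning_rate * decay_rate ** (step / decay_steps)."""
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial_learning_rate * decay_rate ** exponent

# The rate shrinks by a factor of decay_rate every decay_steps steps
for step in (0, 10_000, 20_000, 40_000):
    print(step, exponential_decay(step))
```

With `staircase=True` the exponent is truncated to an integer, so the rate drops in discrete jumps instead of decaying continuously.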

In [None]:
# Ex.11.4
tf.random.set_seed(42)
np.random.seed(42)
model13 =

### Exercise 11.5  
### 8. Deep Learning on CIFAR10

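- Depending on your computer's performance, a single training run for this problem can take several hours, so running it on a high-performance PC with a GPU or on Colab is recommended.  
- Available rooms: the open lab (Hyungnam 1302/3) or the Department of Next-Generation Semiconductors lab (Cho Man-sik 427; reservation required)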

### a.
Build a DNN with 20 hidden layers of 100 neurons each. Use He initialization and the ELU activation function.
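
To get a feel for the size of this network, the sketch below counts its trainable parameters, assuming each 32×32×3 CIFAR10 image is flattened into 3,072 inputs. The helper `dense_param_count` is introduced here for illustration only.

```python
import math

def dense_param_count(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

n_inputs = 32 * 32 * 3                      # one flattened CIFAR10 image
hidden = [100] * 20                         # 20 hidden layers, 100 neurons each
print(dense_param_count([n_inputs] + hidden))         # hidden layers only
print(dense_param_count([n_inputs] + hidden + [10]))  # with a 10-unit output layer

# He normal initialization draws weights with standard deviation sqrt(2 / fan_in)
print(math.sqrt(2 / n_inputs), math.sqrt(2 / 100))
```

Most of the parameters sit in the first layer, because it connects all 3,072 inputs to 100 neurons.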

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)
# Create a model: 20 hidden layers, 100 neurons each
model9 =

### b.
Using Nadam optimization and early stopping, train the model built above on the CIFAR10 dataset.
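
The patience logic behind early stopping can be sketched in plain Python. The function name and the validation losses below are made up for illustration; in the notebook this behavior would come from `keras.callbacks.EarlyStopping`, not from this helper.

```python
def early_stopping_epoch(val_losses, patience=20):
    """Return (stop_epoch, best_epoch) for a sequence of per-epoch validation losses."""
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch   # stop here and roll back to the best epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses for illustration only
losses = [1.9, 1.7, 1.6, 1.62, 1.61, 1.65, 1.7]
print(early_stopping_epoch(losses, patience=3))
```

Training stops once `patience` consecutive epochs fail to improve on the best validation loss, and the weights from the best epoch are the ones worth keeping.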

Load the CIFAR10 dataset with `keras.datasets.cifar10.load_data()`. The dataset has 10 classes in total, so the model needs a softmax output layer with 10 units.

Note: search for a suitable learning rate every time you change the model architecture or the hyperparameters.

Let's add the output layer to the model:

In [None]:
# add output layer


Let's use a Nadam optimizer with a learning rate of 5e-5.

In [None]:
# Define optimizer and compile model9
optimizer =

Load the CIFAR10 dataset.

To support early stopping, set aside a validation set (the first 5,000 images of the training set).

In [None]:
# Load CIFAR10 dataset and generate train and valid dataset
cifar10 = keras.datasets.cifar10
(X_train_full, y_train_full), (X_test, y_test) = cifar10.load_data()



Analyze the CIFAR10 dataset. Refer to cells 14-23 of the Chapter 10 notebook.

In [None]:
# Print shape and datatype of X_train_full


In [None]:
class_names = ["airplane", "automobile", "bird", "cat", "deer", "dog",
               "frog", "horse", "ship", "truck"]
# check y_train


In [None]:
# Plot the image X_train[0]


In [None]:
# Plot the first 40 images with labels. (4 x 10 format)


Now we can create the callbacks for TensorBoard and early stopping and train the model:

In [None]:
# Define callbacks. Save checkpoint as my_cifar10_model.h5
early_stopping_cb =
model_checkpoint_cb =

In [None]:
# Train model9: 100 epoch


In [None]:
# TensorBoard display

# If you encounter an error, open a tab in your browser and type "http://localhost:6006"

In [None]:
# load the saved model9 to model10 and evaluate


### c.
Add Batch Normalization and compare the learning curves
- Convergence speed, performance, training speed, etc.

The code below is very similar to the code above, with a few changes:

* I added a BN layer after every Dense layer (before the activation function), except for the output layer. I also added a BN layer before the first hidden layer.
* I changed the learning rate to 5e-4. I experimented with 1e-5, 3e-5, 5e-5, 1e-4, 3e-4, 5e-4, 1e-3 and 3e-3, and I chose the one with the best validation performance after 10 epochs.
* I renamed the run directories to run_bn_* and the model file name to my_cifar10_bn_model.h5.
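
For reference, what a BN layer computes on one feature during training can be sketched in plain Python. The helper `batch_norm` and the sample values are illustrative assumptions; `eps=1e-3` matches the default `epsilon` of `keras.layers.BatchNormalization`.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-3):
    """Normalize one feature over a mini-batch, then scale by gamma and shift by beta."""
    mean = sum(batch) / len(batch)
    var = sum((x - mean) ** 2 for x in batch) / len(batch)
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]

# Made-up pre-activation values for one neuron across a mini-batch of 4
z = [12.0, 15.0, 9.0, 20.0]
z_bn = batch_norm(z)
print([round(v, 3) for v in z_bn])
```

This shows the training-time behavior only. At inference time the layer uses moving averages of the mean and variance collected during training instead of the statistics of the current batch.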

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)
# Create model11 and add BatchNormalization layer before activation. Others are the same as above
model11 =

* *Is the model converging faster than before?* Much faster! The previous model took OO epochs to reach the lowest validation loss, while the new model with BN took OO epochs. That's more than twice as fast as the previous model. The BN layers stabilized training and allowed us to use a much larger learning rate, so convergence was faster.
* *Does BN produce a better model?* Yes! The final model is also much better, with OO% accuracy instead of OO%. It's still not a very good model, but at least it's much better than before (a Convolutional Neural Network would do much better, but that's a different topic, see chapter 14).
* *How does BN affect training speed?* Although the model converged twice as fast, each epoch took about OOs instead of OOs, because of the extra computations required by the BN layers. So overall, although the number of epochs was reduced by OO%, the training time (wall time) was shortened by OO%. Which is still pretty significant!

### d.
*Exercise: Try replacing Batch Normalization with SELU, and make the necessary adjustments to ensure the network self-normalizes (i.e., standardize the input features, use LeCun normal initialization, make sure the DNN contains only a sequence of dense layers, etc.).*
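
The ingredients named in this exercise can be sketched in plain Python. The helper names `selu` and `standardize` and the sample pixel values are introduced for illustration; the two constants are the ones used by the standard SELU definition.

```python
import math

# Constants used by the standard SELU definition
SELU_ALPHA = 1.6732632423543772
SELU_SCALE = 1.0507009873554805

def selu(z):
    """SELU: scale * z for z > 0, scale * alpha * (exp(z) - 1) otherwise."""
    return SELU_SCALE * (z if z > 0 else SELU_ALPHA * (math.exp(z) - 1.0))

def standardize(values):
    """Shift and scale one input feature to mean 0 and standard deviation 1."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

print(selu(1.0), selu(-1.0))
print(standardize([0.0, 128.0, 255.0]))
# LeCun normal initialization uses standard deviation sqrt(1 / fan_in)
print(math.sqrt(1 / 100))
```

In practice the mean and standard deviation would be computed on the training set only and then reused to scale the validation and test sets.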

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model12 =

We get OO.O% accuracy, which is better than the original model, but not quite as good as the model using batch normalization. Moreover, it took OO epochs to reach the best model, which is much faster than both the original model and the BN model, plus each epoch took only OO seconds, just like the original model. So it's by far the fastest model to train (both in terms of epochs and wall time).

### e.
*Exercise: Try regularizing the model with alpha dropout. Then, without retraining your model, see if you can achieve better accuracy using MC Dropout.*

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model13 =

The model reaches OO.O% accuracy on the validation set. That's very slightly worse than without dropout (OO.O%). With an extensive hyperparameter search, it might be possible to do better (I tried dropout rates of 5%, 10%, 20% and 40%, and learning rates 1e-4, 3e-4, 5e-4, and 1e-3), but probably not much better in this case.

Let's use MC Dropout now. We will need the `MCAlphaDropout` class we used earlier, so let's just copy it here for convenience:

In [None]:
class MCAlphaDropout(keras.layers.AlphaDropout):
    def call(self, inputs):
        return super().call(inputs, training=True)

Now let's create a new model, identical to the one we just trained (with the same weights), but with `MCAlphaDropout` dropout layers instead of `AlphaDropout` layers:

In [None]:
mc_model =

Then let's add a couple utility functions. The first will run the model many times (10 by default) and it will return the mean predicted class probabilities. The second will use these mean probabilities to predict the most likely class for each instance:

In [None]:
def mc_dropout_predict_probas(mc_model, X, n_samples=10):
    # Run the stochastic model n_samples times and average the class probabilities
    Y_probas = [mc_model.predict(X) for _ in range(n_samples)]
    return np.mean(Y_probas, axis=0)

def mc_dropout_predict_classes(mc_model, X, n_samples=10):
    # Pick the class with the highest mean probability for each instance
    Y_probas = mc_dropout_predict_probas(mc_model, X, n_samples)
    return np.argmax(Y_probas, axis=1)
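
The averaging step can be tried without a trained network. In the sketch below, `NoisyStandInModel` is a hypothetical stand-in for `mc_model`: its `predict` method returns softmax outputs with random noise, which only imitates the run-to-run variation that active dropout would produce.

```python
import numpy as np

class NoisyStandInModel:
    """Hypothetical stand-in for mc_model: returns noisy softmax outputs."""
    def __init__(self, seed=42):
        self.rng = np.random.default_rng(seed)

    def predict(self, X):
        # Fixed logits favor class 2; the noise mimics active dropout
        logits = np.array([0.0, 1.0, 2.0]) + self.rng.normal(scale=1.0, size=(len(X), 3))
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

model = NoisyStandInModel()
X = np.zeros((5, 4))                                      # 5 dummy instances
Y_probas = np.stack([model.predict(X) for _ in range(100)])
y_mean = Y_probas.mean(axis=0)                            # one row of mean probabilities per instance
print(y_mean.round(2))
print(y_mean.argmax(axis=1))
```

Each individual prediction is noisy, while the mean over many stochastic passes is smoother; that mean is what the utility functions above return.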

Now let's make predictions for all the instances in the validation set, and compute the accuracy:

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)



We get virtually no accuracy improvement in this case (from OO.O% to OO.O%).

So the best model we got in this exercise is the Batch Normalization model.

### f.
*Exercise: Retrain your model using 1cycle scheduling and see if it improves training speed and model accuracy.*
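
The shape of a 1cycle schedule can be sketched as a plain function of the iteration number. This is a simplified illustration, not the `OneCycleScheduler` callback from the lecture notebook. The default ratios (`max_rate / 10`, `start_rate / 1000`, the last 10% of iterations) and the assumption of 45,000 training images left after holding out 5,000 for validation are choices made for this example.

```python
def one_cycle_rate(iteration, total_iterations, max_rate, start_rate=None,
                   last_rate=None):
    """Simplified 1cycle: linear warm-up, linear cool-down, short final decay."""
    start_rate = max_rate / 10 if start_rate is None else start_rate
    last_rate = start_rate / 1000 if last_rate is None else last_rate
    last_iterations = total_iterations // 10 + 1
    half = (total_iterations - last_iterations) // 2

    def interpolate(i, i1, i2, r1, r2):
        return r1 + (r2 - r1) * (i - i1) / (i2 - i1)

    if iteration < half:                       # phase 1: start_rate -> max_rate
        return interpolate(iteration, 0, half, start_rate, max_rate)
    if iteration < 2 * half:                   # phase 2: max_rate -> start_rate
        return interpolate(iteration, half, 2 * half, max_rate, start_rate)
    # phase 3: start_rate -> last_rate over the remaining iterations
    return interpolate(iteration, 2 * half, total_iterations, start_rate, last_rate)

# Assumed: 45,000 training images, batch size 128, 15 epochs
total = 45_000 // 128 * 15
print(total)
for i in (0, total // 4, total // 2, total - 1):
    print(i, one_cycle_rate(i, total, max_rate=0.05))
```

The rate peaks at `max_rate` near the middle of training and ends far below its starting value, which is what lets 1cycle combine fast early progress with a careful final convergence.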

In [None]:
keras.backend.clear_session()
tf.random.set_seed(42)
np.random.seed(42)

model14 =

In [None]:
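# Assumes find_learning_rate, plot_lr_vs_loss, OneCycleScheduler, and X_train_scaled
# are already defined, as in the lecture notebook
batch_size = 128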
rates, losses = find_learning_rate(model14, X_train_scaled, y_train, epochs=1, batch_size=batch_size)
plot_lr_vs_loss(rates, losses)
plt.axis([min(rates), max(rates), min(losses), (losses[0] + min(losses)) / 1.4])

In [None]:
n_epochs = 15
onecycle = OneCycleScheduler(len(X_train_scaled) // batch_size * n_epochs, max_rate=0.05)


One cycle allowed us to train the model in just OO epochs, each taking only O seconds (thanks to the larger batch size). This is over O times faster than the fastest model we trained so far. Moreover, we improved the model's performance (from OO.O% to OO.O%). The batch normalized model reaches a slightly better performance, but it's much slower to train.