<font color="#CA3532"><h1 align="left">Deep Learning</h1></font>
<font color="#6E6E6E"><h2 align="left">Introduction to Keras - Part 2</h2></font>

# <font color="#CA3532">Solving MNIST with Keras</font>

In this notebook we build a neural network for the MNIST problem (http://yann.lecun.com/exdb/mnist/) using Keras. As always, the first step is to import the required libraries:

In [None]:
import tensorflow as tf
from tensorflow import keras

import numpy as np
import matplotlib.pyplot as plt
from time import time
import shutil

We load the MNIST data:

In [None]:
(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

print(train_images.shape)
print(train_labels.shape)
print(train_labels)

print(test_images.shape)
print(test_labels.shape)
print(test_labels)

We plot some of the images:

In [None]:
plt.figure(figsize=(10,10))
for i in range(25):
    plt.subplot(5,5,i+1)
    plt.xticks([])
    plt.yticks([])
    plt.grid(False)
    plt.imshow(train_images[i], cmap=plt.cm.binary)
    plt.xlabel(train_labels[i])

Before building the models, we normalize the images by dividing by the maximum pixel value (255), so that every value lies between 0 and 1:

In [None]:
train_images = train_images / 255
test_images = test_images / 255

In [None]:
plt.imshow(train_images[0], cmap=plt.cm.binary)
plt.colorbar()
plt.show()

Next, we subtract the mean image, computed on the training set only, from both the training and the test images:

In [None]:
mean_img = train_images.mean(axis=0)
train_images = train_images - mean_img
test_images = test_images - mean_img

In [None]:
plt.imshow(mean_img, cmap=plt.cm.binary)
plt.colorbar()
plt.show()

In [None]:
plt.imshow(train_images[0], cmap='bwr', vmin=-1, vmax=1)
plt.colorbar()
plt.show()

## <font color="#CA3532">Exercise</font>

Try different hyperparameters for MNIST.

In [None]:
from google.colab import drive

drive.mount('/content/drive')

In [None]:
# Delete previous logs so that only our models are displayed
%cd /content/drive/MyDrive/
!rm -rf logs_keras_miax10_parte2

In [None]:
# Variables that we will not modify
log_dir = "/content/drive/MyDrive/logs_keras_miax10_parte2/"
input_shape = (28, 28)
num_clases = 10
n_epochs = 20

LEARNING_RATE_BASE = 0.01
BATCH_SIZE_BASE = 400

### <font color="#CA3532">Base model</font>

In [None]:
# Base case:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'base'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta"))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida"))

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

### <font color="#CA3532">Weight initialization</font>

Keras documentation: https://keras.io/api/layers/initializers/

Weights are initialized to zero with the ``tf.keras.initializers.Zeros()`` initializer. We saw in theory that initializing all the weights to 0 works very poorly. Let's verify it.
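
As a rough illustration of why this happens, the following NumPy sketch (not part of the original notebook) computes the gradients of a bias-free network with one sigmoid hidden layer and a softmax output on a synthetic batch. The helper `gradients`, the random data and the 0.1 step size are assumptions made for the example. With all weights at zero, the hidden layer receives no gradient at the first step, and after a few updates every hidden unit still has exactly the same weights as the others.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def gradients(x, y, w1, w2):
    """Gradients of the mean cross-entropy for a bias-free 1-hidden-layer net (hypothetical helper)."""
    h = sigmoid(x @ w1)                          # hidden activations
    p = softmax(h @ w2)                          # class probabilities
    delta_out = p.copy()
    delta_out[np.arange(len(y)), y] -= 1.0       # softmax + cross-entropy gradient
    delta_out /= len(y)
    grad_w2 = h.T @ delta_out
    delta_hidden = (delta_out @ w2.T) * h * (1.0 - h)
    grad_w1 = x.T @ delta_hidden
    return grad_w1, grad_w2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))                   # synthetic stand-in for flattened images
y = rng.integers(0, 10, size=32)

w1 = np.zeros((784, 64))
w2 = np.zeros((64, 10))

grad_w1, grad_w2 = gradients(x, y, w1, w2)
hidden_gets_no_gradient = np.allclose(grad_w1, 0.0)
output_rows_identical = np.allclose(grad_w2, grad_w2[0])

# A few plain gradient-descent steps: the hidden units never become different.
for _ in range(5):
    g1, g2 = gradients(x, y, w1, w2)
    w1 -= 0.1 * g1
    w2 -= 0.1 * g2
hidden_units_identical = np.allclose(w1, w1[:, :1])

print(hidden_gets_no_gradient, output_rows_identical, hidden_units_identical)
```

Because the 64 hidden units remain copies of each other, the layer behaves as if it had a single unit, which is consistent with the poor results expected from the zero-initialized model.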

In [None]:
# Case: weights initialized to 0:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'zeros'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.Zeros())) # Dense layer initialized to zeros
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.Zeros())) # Softmax layer initialized to zeros

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

Weights are drawn from a normal distribution with the ``tf.keras.initializers.RandomNormal(mean, stddev)`` initializer. We saw in theory that initializing the weights with values close to 0 is far more effective than initializing them with larger values. Let's try it.
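
One way to see the effect before training is to measure how many sigmoid units start out saturated. The sketch below is an illustration under stated assumptions, not part of the original notebook: it uses uniform random values in [-0.5, 0.5] as a stand-in for the centred images, and the 0.01 / 0.99 thresholds for "saturated" are an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_inputs, n_hidden = 784, 64

# Synthetic stand-in for normalized, mean-centred pixels (assumption).
x = rng.uniform(0.0, 1.0, size=(1000, n_inputs)) - 0.5

def saturated_fraction(stddev):
    """Fraction of sigmoid outputs below 0.01 or above 0.99 right after initialization."""
    w = rng.normal(0.0, stddev, size=(n_inputs, n_hidden))
    a = 1.0 / (1.0 + np.exp(-(x @ w)))
    return float(np.mean((a < 0.01) | (a > 0.99)))

fractions = {stddev: saturated_fraction(stddev) for stddev in (0.05, 1.0)}
for stddev, frac in fractions.items():
    print(f"stddev={stddev}: saturated fraction = {frac:.3f}")
```

With the large standard deviation a substantial share of the units sits on the flat part of the sigmoid, where the gradient is close to zero, so learning is slow from the very first step.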

In [None]:
# Case: Normal with small std:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'normal_close_to_0'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.05))) # Dense layer initialized with Normal(0, 0.05)
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.05))) # Output layer initialized with Normal(0, 0.05)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: Normal with large std:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'normal_far_from_0'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized with Normal(0, 1.0)
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1.0)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

In [None]:
# Case: Normal with estimated std:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'normal_good_std'

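# Each hidden unit receives 28 * 28 = 784 inputs after Flatten.
# train_images.shape[1] is only 28 (the image height), so we use the full input size.
n_inputs = int(np.prod(input_shape))
std = 1 / np.sqrt(n_inputs)
print(" > Estimated STD:", std)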

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=std))) # Dense layer initialized with Normal(0, std)
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

As the experiments show, a suitable weight initialization is essential for training to go well. We now compare the Normal initializer with the computed std against more elaborate initializers such as **HeNormal**, **GlorotNormal** and **GlorotUniform**.

HeNormal is practically equivalent to the initialization with the best std that we computed previously. It defines the standard deviation with the following formula:

$$stddev = \sqrt{\frac{2}{N_{input}}}$$

Keras API: https://keras.io/api/layers/initializers/#henormal-class
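
To make the comparison concrete, this short sketch evaluates both expressions for the hidden layer of this notebook. It assumes that $N_{input}$ is the number of flattened pixels, 28 * 28 = 784, and that the estimated std is $1/\sqrt{N_{input}}$; the variable names are ours.

```python
import numpy as np

n_inputs = 28 * 28                            # inputs per hidden unit after Flatten (assumption)

std_estimated = 1.0 / np.sqrt(n_inputs)       # std = 1 / sqrt(N_input)
std_he = np.sqrt(2.0 / n_inputs)              # HeNormal: std = sqrt(2 / N_input)
ratio = std_he / std_estimated                # the two differ by a factor sqrt(2)

print(f"estimated std: {std_estimated:.4f}")
print(f"HeNormal std:  {std_he:.4f}")
print(f"ratio:         {ratio:.4f}")
```

The two values are of the same order of magnitude (about 0.036 and 0.051). Note that, according to the Keras documentation, HeNormal samples from a truncated normal distribution, so the match with ``RandomNormal`` is approximate.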

In [None]:
# Case: HeNormal:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'heNormal'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.HeNormal())) # Dense layer initialized with HeNormal()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

GlorotNormal (also known as XavierNormal) defines the standard deviation with the following formula:

$$stddev = \sqrt{\frac{2}{N_{input} + N_{output}}}$$

Keras API: https://keras.io/api/layers/initializers/#glorotnormal-class

In [None]:
# Case: GlorotNormal:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'glorotNormal'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

GlorotUniform (also known as XavierUniform) draws the weights uniformly from the range ``[-limit, limit]``, where:

$$limit = \sqrt{\frac{6}{N_{input} + N_{output}}}$$

Keras API: https://keras.io/api/layers/initializers/#glorotuniform-class
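
The two Glorot variants target the same weight variance: a uniform distribution on ``[-limit, limit]`` has standard deviation $limit/\sqrt{3}$, which with the limit above equals $\sqrt{2/(N_{input}+N_{output})}$. The sketch below checks this numerically for the hidden layer of this notebook (784 inputs, 64 outputs); it uses plain NumPy sampling rather than the Keras initializers themselves.

```python
import numpy as np

fan_in, fan_out = 784, 64                            # hidden layer: 28*28 inputs, 64 units

glorot_std = np.sqrt(2.0 / (fan_in + fan_out))       # GlorotNormal formula
glorot_limit = np.sqrt(6.0 / (fan_in + fan_out))     # GlorotUniform formula
uniform_std = glorot_limit / np.sqrt(3.0)            # std of U(-limit, limit)

# Empirical check with NumPy samples (not the Keras initializer).
rng = np.random.default_rng(0)
samples = rng.uniform(-glorot_limit, glorot_limit, size=(fan_in, fan_out))
empirical_std = float(samples.std())

print(f"GlorotNormal std:        {glorot_std:.4f}")
print(f"GlorotUniform limit:     {glorot_limit:.4f}")
print(f"std implied by uniform:  {uniform_std:.4f}")
print(f"empirical std (uniform): {empirical_std:.4f}")
```

Since both initializers give weights of the same scale, any difference between them in the experiments should come from the shape of the distribution rather than from its spread.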

In [None]:
# Case: GlorotUniform:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'glorotUniform'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotUniform())) # Dense layer initialized with GlorotUniform()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

The choice of weight initialization is crucial in some problems. It is also very sensitive to the activation function used by the layer. Let's try the initializers with a ReLU activation.
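
The interaction between initializer and activation is easier to see in a deeper stack than the single hidden layer used here. The sketch below is an illustration only, under assumptions of our own: 10 ReLU layers of width 256, Gaussian inputs, and plain (not truncated) normal weights. It tracks the mean squared activation per layer for the He and Glorot standard deviations.

```python
import numpy as np

def mean_square_per_layer(stddev, n_layers=10, width=256, seed=0):
    """Mean squared activation after each ReLU layer (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(512, width))                 # synthetic input batch
    history = []
    for _ in range(n_layers):
        w = rng.normal(0.0, stddev, size=(width, width))
        a = np.maximum(a @ w, 0.0)                    # dense layer followed by ReLU
        history.append(float(np.mean(a ** 2)))
    return history

width = 256
he = mean_square_per_layer(np.sqrt(2.0 / width))                 # He: sqrt(2 / fan_in)
glorot = mean_square_per_layer(np.sqrt(2.0 / (width + width)))   # Glorot: sqrt(2 / (fan_in + fan_out))

print("He     first/last layer:", he[0], he[-1])
print("Glorot first/last layer:", glorot[0], glorot[-1])
```

ReLU zeroes roughly half of the pre-activations, and the factor 2 in the He formula compensates for that, so the signal keeps a similar scale from layer to layer. With the Glorot scale the signal shrinks at every layer in this setting. In a network with one hidden layer, as in this notebook, the difference is expected to be much milder.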

In [None]:
# Case: ReLU base:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'base-relu'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta"))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida"))

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU HeNormal:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'relu-heNormal'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.HeNormal())) # Dense layer initialized with HeNormal()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU GlorotNormal:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'relu-glorotNormal'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU GlorotUniform:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'relu-glorotUniform'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotUniform())) # Dense layer initialized with GlorotUniform()
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

**Questions**

* Is it really that important to initialize the output layer with N(0, 1)?

* Why did the Dense layer initialization work so well in previous sessions, when we did not specify anything?

Keras API: https://keras.io/api/layers/core_layers/dense/

### <font color="#CA3532">Batch normalization</font>

Batch Normalization simplifies the task of initializing the weights, because the network becomes less sensitive to a bad initialization. Let's test it with a bad initialization, **Normal(0, 0.01)**, for the sigmoid and ReLU activation functions.
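
The reason can be sketched with a few lines of NumPy. The function `batch_norm` below is a simplified training-mode version written for this example (scale fixed to 1, shift to 0, and `eps=1e-3`, which we take to be the Keras default); the input batch is synthetic. Whatever the scale of the weights, the normalized pre-activations end up with a spread close to 1.

```python
import numpy as np

def batch_norm(z, eps=1e-3):
    """Simplified training-mode batch normalization (gamma=1, beta=0)."""
    mean = z.mean(axis=0)
    var = z.var(axis=0)
    return (z - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
# Synthetic batch of 400 centred "images" (same size as BATCH_SIZE_BASE).
x = rng.uniform(0.0, 1.0, size=(400, 784)) - 0.5

stats = {}
for stddev in (0.01, 1.0):
    w = rng.normal(0.0, stddev, size=(784, 64))
    z = x @ w                                   # pre-activations of the hidden layer
    stats[stddev] = (float(z.std()), float(batch_norm(z).std()))
    print(f"stddev={stddev}: raw std={stats[stddev][0]:.3f}, after BN std={stats[stddev][1]:.3f}")
```

The raw pre-activations differ by two orders of magnitude between the two initializations, but the activation function that follows the normalization sees inputs of a similar scale in both cases.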

In [None]:
# Case: Normal(0, 0.01):
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'malaInicializacion'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Dense layer initialized with Normal(0, 0.01)
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Output layer initialized with Normal(0, 0.01)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: Normal(0, 0.01) with batch normalization:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'sigmoid'
loss = 'sparse_categorical_crossentropy'
nombre = 'malaInicializacion-batchNormalization'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Dense layer initialized with Normal(0, 0.01)
model.add(tf.keras.layers.BatchNormalization()) # Batch normalization layer
model.add(keras.layers.Activation(activation)) # Apply the activation after batch normalization
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Output layer initialized with Normal(0, 0.01)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU Normal(0, 0.01):
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'relu-malaInicializacion'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=activation, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Dense layer initialized with Normal(0, 0.01)
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Output layer initialized with Normal(0, 0.01)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU Normal(0, 0.01) with batch normalization:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'relu-malaInicializacion-batchNormalization'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Dense layer initialized with Normal(0, 0.01)
model.add(tf.keras.layers.BatchNormalization()) # Batch normalization layer
model.add(keras.layers.Activation(activation)) # Apply the activation after batch normalization
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=0.01))) # Output layer initialized with Normal(0, 0.01)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

With BatchNormalization, the weights do not need to be carefully tuned at initialization. If we set the weights very close to 0 (*Normal(0, 0.01)*), which is a bad initialization, BatchNormalization makes training go much better.

### <font color="#CA3532">Optimizers</font>

So far we have always worked with the standard gradient descent optimizer (SGD). However, as we saw in the theory part, other optimization algorithms exist. Let's try them on the model with ReLU activation, GlorotNormal initialization and Batch Normalization.
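
Before running the Keras experiments, it may help to see the update rules side by side. The sketch below applies textbook versions of SGD, momentum, Nesterov momentum, Adagrad and RMSprop to a toy quadratic loss. The loss, the starting point, the `run` helper and the values `rho=0.9` and `eps=1e-7` are assumptions for the example. The momentum and Nesterov branches follow the update rule described in the Keras SGD documentation as we understand it; the Keras implementations of Adagrad and RMSprop differ in details such as accumulator initialization.

```python
import numpy as np

curvature = np.array([1.0, 25.0])     # toy loss: f(w) = 0.5 * sum(curvature * w**2)

def loss(w):
    return float(0.5 * np.sum(curvature * w ** 2))

def grad(w):
    return curvature * w

def run(optimizer, lr=0.01, momentum=0.8, rho=0.9, eps=1e-7, steps=100):
    """Minimize the toy loss with the given update rule and return the final loss."""
    w = np.array([1.0, 1.0])
    velocity = np.zeros_like(w)
    accum = np.zeros_like(w)
    for _ in range(steps):
        g = grad(w)
        if optimizer == "sgd":
            w = w - lr * g
        elif optimizer == "momentum":
            velocity = momentum * velocity - lr * g
            w = w + velocity
        elif optimizer == "nesterov":
            velocity = momentum * velocity - lr * g
            w = w + momentum * velocity - lr * g
        elif optimizer == "adagrad":
            accum = accum + g ** 2                        # sum of squared gradients
            w = w - lr * g / (np.sqrt(accum) + eps)
        elif optimizer == "rmsprop":
            accum = rho * accum + (1.0 - rho) * g ** 2    # moving average of squared gradients
            w = w - lr * g / (np.sqrt(accum) + eps)
        else:
            raise ValueError(f"unknown optimizer: {optimizer}")
    return loss(w)

names = ("sgd", "momentum", "nesterov", "adagrad", "rmsprop")
final_losses = {name: run(name) for name in names}
for name, value in final_losses.items():
    print(f"{name:9s} final loss = {value:.6f}")
```

The ranking on this toy problem should not be read as a prediction for MNIST: it only shows how each rule uses the gradient history, which is what the following cells compare on the real model.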

In [None]:
# Case: ReLU GlorotNormal with BatchNormalization and SGD
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'sgd'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate),
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU GlorotNormal with BatchNormalization and momentum
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'momento'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.8), ### Add the momentum argument to SGD
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU GlorotNormal with BatchNormalization and Nesterov momentum
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'nesterov'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.SGD(learning_rate=learning_rate, momentum=0.8, nesterov=True), ### Add the argument nesterov=True
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU GlorotNormal with BatchNormalization and Adagrad
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'adagrad'

# Rebuild the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer initialized with GlorotNormal()
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Output layer initialized with Normal(0, 1)

model.compile(optimizer=keras.optimizers.Adagrad(learning_rate=learning_rate), ### Cambiamos SGD por ADAGRAD
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU, GlorotNormal with BatchNormalization, and RMSprop
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'rmsprop'

# Re-create the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.RMSprop(learning_rate=learning_rate), ### Replaced SGD with RMSprop
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU, GlorotNormal with BatchNormalization, and Adam
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'adam'

# Re-create the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), ### Replaced SGD with Adam
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
# Case: ReLU, GlorotNormal with BatchNormalization, and Nadam (Adam with Nesterov momentum)
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'nadam'

# Re-create the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.Nadam(learning_rate=learning_rate), ### Replaced SGD with Nadam
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(test_images, test_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
%reload_ext tensorboard
%tensorboard --logdir $log_dir

# <font color="#CA3532">Optimal hyperparameter selection with a validation set</font>

We have seen that there are many hyperparameters to tune, so searching by hand for the most suitable values is not practical. On the one hand, we need to **avoid this manual process**; on the other, we need an additional **validation set** (disjoint from the training set) on which to select them.

For the validation set, there are several options:

- Simple (hold-out) validation

- Cross-validation

For the hyperparameter search, there are several options:

- Brute-force search (GridSearch)

- Automatic search (Keras Tuner)

### <font color="#CA3532">Simple validation</font>

Simple validation runs the hyperparameter search against a single held-out validation set: we select the model that achieves the best result (highest accuracy, for example) on that set. Let's build an example validation set:

In [None]:
## We have a dataset and split it into training and test

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

## NORMALIZATION
train_images = train_images / 255
test_images = test_images / 255
mean_img = train_images.mean(axis=0)
train_images = train_images - mean_img
test_images = test_images - mean_img

print(train_images.shape)
print(train_labels.shape)
print(train_labels)
print()
print(test_images.shape)
print(test_labels.shape)
print(test_labels)

In [None]:
## To build a validation set, we need to split TRAINING into two
## subsets. We can use the train_test_split function

from sklearn.model_selection import train_test_split

train_images, validation_images, train_labels, validation_labels = train_test_split(train_images, train_labels, test_size=0.2)

print(train_images.shape)
print(train_labels.shape)
print(train_labels)
print()
print(validation_images.shape)
print(validation_labels.shape)
print(validation_labels)
print()
print(test_images.shape)
print(test_labels.shape)
print(test_labels)

In [None]:
# Case: ReLU, GlorotNormal with BatchNormalization, and Adam
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'adam-validation'

# Re-create the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), ### Adam
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model:
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs, 
                    validation_data=(validation_images, validation_labels),
                    batch_size=batch_size,
                    callbacks=callbacks)

In the previous cell we would search for the set of hyperparameters that gives the highest validation accuracy. Assuming we already have them, we now retrain a fresh model on all the training data (train plus validation) and evaluate it on test:

In [None]:
# Concatenate the train and validation data
final_train_images = np.concatenate((train_images, validation_images), axis=0)
final_train_labels = np.concatenate((train_labels, validation_labels), axis=0)

In [None]:
# Case: ReLU, GlorotNormal with BatchNormalization, and Adam
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'adam-test'

# Re-create the model so that training starts from scratch:
model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                             kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                             kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), ### Adam
              loss=loss,
              metrics=['acc'])

# TensorBoard callback:
callbacks = [keras.callbacks.TensorBoard(log_dir=log_dir+"prueba-"+nombre, histogram_freq=1, write_images=True)]

# Train the model on the concatenated data. NOTE: THERE IS NO VALIDATION_DATA
history = model.fit(final_train_images, 
                    final_train_labels, 
                    epochs=n_epochs, 
                    batch_size=batch_size,
                    callbacks=callbacks)

In [None]:
print(" > Training:", model.evaluate(final_train_images, final_train_labels))
print(" > Test:", model.evaluate(test_images, test_labels))

### <font color="#CA3532">Cross-validation</font>

Cross-validation runs the hyperparameter search over a K-fold split: the training set is divided into K disjoint partitions, and each partition is used once as the validation set while the model is trained on the other K-1. The score for a configuration is the average accuracy over the K runs.
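
The partitioning itself can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper name `kfold_indices` is hypothetical, and unlike `StratifiedKFold` it does not preserve class proportions in each fold.

```python
import random

def kfold_indices(n_samples, n_splits, seed=0):
    """Yield (train_idx, val_idx) pairs for a shuffled, non-stratified K-fold."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold k takes every n_splits-th shuffled index starting at position k,
    # so fold sizes differ by at most one sample.
    folds = [indices[k::n_splits] for k in range(n_splits)]
    for k in range(n_splits):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx

splits = list(kfold_indices(n_samples=10, n_splits=5))
for train_idx, val_idx in splits:
    print("train:", sorted(train_idx), "val:", sorted(val_idx))
```

Every sample lands in exactly one validation fold, which is why averaging the K validation scores uses all of the training data for evaluation without ever scoring a model on samples it was trained on.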

In [None]:
from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True)
# kfold.split(...) returns a generator that yields the training and
# validation indices for each of the splits

In [None]:
## We have a dataset and split it into training and test

(train_images, train_labels), (test_images, test_labels) = tf.keras.datasets.mnist.load_data()

## NORMALIZATION
train_images = train_images / 255
test_images = test_images / 255
mean_img = train_images.mean(axis=0)
train_images = train_images - mean_img
test_images = test_images - mean_img

In [None]:
batch_size = BATCH_SIZE_BASE
learning_rate = LEARNING_RATE_BASE
activation = 'relu'
loss = 'sparse_categorical_crossentropy'
nombre = 'adam-crossvalidation'

cvscores = []
for itrain, ival in kfold.split(train_images, train_labels):

    # Re-create the model so that training starts from scratch:
    model = keras.Sequential()
    model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
    model.add(keras.layers.Dense(64, activation=None, name="oculta",
                                kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
    model.add(keras.layers.BatchNormalization())
    model.add(keras.layers.Activation(activation))
    model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                                kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

    model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), ### Adam optimizer
                  loss=loss,
                  metrics=['acc'])

    # Train it on the K-1 training folds:
    history = model.fit(train_images[itrain], 
                        train_labels[itrain], 
                        epochs=n_epochs,
                        verbose=0,
                        batch_size=batch_size)
    
    # Evaluate it on the held-out fold:
    _, acc = model.evaluate(train_images[ival], train_labels[ival], verbose=0)
    
    print("Accuracy: %.2f%%" % (acc*100.0))
    cvscores.append(acc*100.0)
print("%.2f%% \u00B1 %.2f%%" % (np.mean(cvscores), np.std(cvscores)))

In [None]:
## Once you have chosen the hyperparameters that maximize the cross-validation
## accuracy, train a new model with those hyperparameters, this time using
## 100% of the training data.
##
## In this case there is no need to concatenate train and validation, because kfold.split
## only returned the indices of each split, not separate subsets of the data

model = keras.Sequential()
model.add(keras.layers.Flatten(input_shape=input_shape, name="entrada"))
model.add(keras.layers.Dense(64, activation=None, name="oculta",
                            kernel_initializer=tf.keras.initializers.GlorotNormal())) # Dense layer with GlorotNormal initialization
model.add(keras.layers.BatchNormalization())
model.add(keras.layers.Activation(activation))
model.add(keras.layers.Dense(num_clases, activation="softmax", name="salida",
                            kernel_initializer=tf.keras.initializers.RandomNormal(stddev=1.0))) # Dense layer initialized from Normal(0, 1)

model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), ### Adam optimizer
              loss=loss,
              metrics=['acc'])

# Train it on all the training data. NOTE: THERE IS NO VALIDATION_DATA
history = model.fit(train_images, 
                    train_labels, 
                    epochs=n_epochs,
                    batch_size=batch_size)

In [None]:
print(" > Training:", model.evaluate(train_images, train_labels))
print(" > Test:", model.evaluate(test_images, test_labels))

In this section we have seen how to set up a simple validation split and a cross-validation split. However, we have not yet run an actual hyperparameter search. Let's do that with the breast cancer identification problem, which is a smaller dataset.

# <font color="#CA3532">Solving Breast Cancer with Keras</font>

In [None]:
from sklearn.datasets import load_breast_cancer

data = load_breast_cancer()

x = data.data
t = data.target[:, None]

print(x.shape)
print(t.shape)

In [None]:
from sklearn.model_selection import train_test_split

x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.33, random_state=42)

print(x_train.shape)
print(t_train.shape)
print(x_test.shape)
print(t_test.shape)

## <font color="#CA3532">GridSearch</font>

GridSearch is a brute-force search that tries every combination of values in the specified ranges. We will now code the grid search by hand, combined with cross-validation. The first version below scores each configuration by accuracy; the exercise further down switches to a different evaluation metric: **F1-score**.

In [None]:
# Define the KFold that we will use for cross-validation

from sklearn.model_selection import StratifiedKFold

kfold = StratifiedKFold(n_splits=5, shuffle=True)

In [None]:
# Define the hyperparameter values we want to search over

lista_numUnits = [32, 64]
lista_learningRate = [0.01, 0.03, 0.1, 0.3]
lista_batchSize = [500, 5000]
lista_activations = ['tanh', 'relu', 'selu']
lista_regularizationL2 = [0.0, 0.0001, 0.001]
lista_dropout = [0.0, 0.2, 0.4]
lista_inicializacion = ['Normal', 'GlorotNormal']
lista_optimizadores = ['SGD', 'Adam', 'RMSprop']
lista_losses = ['categorical_crossentropy', 'categorical_hinge']

In [None]:
## Convert the labels to one-hot (categorical) form so we can use categorical_crossentropy and categorical_hinge

from tensorflow.keras.utils import to_categorical

t_train_categorical = to_categorical(t_train)
t_test_categorical = to_categorical(t_test)

In [None]:
# Model-building function; it adjusts some settings according to its arguments:
#   the Normal initializer computes its stddev as 1 / sqrt(number of input features)
#   the optimizers must be created with the given learningRate
#   if loss == categorical_hinge, the last layer's activation must be None (linear)

def build_model(learningRate, activation, l2reg, dropout, initializer, optimizer, numUnits, loss, seed=1, metrics=['acc']):
    if initializer == 'Normal':
      stddev = 1 / np.sqrt(x_train.shape[1]) # Number of input features of this dataset (30), not the MNIST image size
      initializer = tf.keras.initializers.RandomNormal(stddev=stddev, seed=seed)
    else: # GlorotNormal
      initializer = tf.keras.initializers.GlorotNormal(seed=seed)

    if optimizer == 'SGD':
      optimizer = tf.keras.optimizers.SGD(learning_rate=learningRate)
    elif optimizer == 'Adam':
      optimizer = tf.keras.optimizers.Adam(learning_rate=learningRate)
    else: # 'RMSprop'
      optimizer = tf.keras.optimizers.RMSprop(learning_rate=learningRate)

    if loss == 'categorical_crossentropy':
      output_activation = 'softmax'
    else: # 'categorical_hinge'
      output_activation = None
    
    model = keras.Sequential()
    model.add(keras.layers.Input(shape=(30,))) # (30,) is a one-element tuple; (30) is just the integer 30
    model.add(keras.layers.Dense(numUnits, activation=activation, name="oculta",
                                 kernel_initializer=initializer,
                                 kernel_regularizer=tf.keras.regularizers.l2(l2reg)))
    model.add(keras.layers.Dropout(dropout))
    model.add(keras.layers.Dense(2, activation=output_activation, name="salida", # 2 output units because the labels are one-hot: [0, 1] or [1, 0]
                                 kernel_initializer=initializer,
                                 kernel_regularizer=tf.keras.regularizers.l2(l2reg)))

    model.compile(optimizer=optimizer, 
                  loss=loss,
                  metrics=metrics)
    
    return model

**Itertools** is a Python standard-library module that makes it easy to compute the Cartesian product of a set of lists with the function ``itertools.product()``.
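
To get a feel for how quickly the product grows, the sketch below counts the configurations produced by the value lists defined earlier in this section. The `grid` dictionary and its key names are only an illustrative way of grouping those lists, not part of the original code.

```python
import itertools
import math

# Same value lists as the search grid defined in this section
grid = {
    "numUnits": [32, 64],
    "learningRate": [0.01, 0.03, 0.1, 0.3],
    "batchSize": [500, 5000],
    "activation": ['tanh', 'relu', 'selu'],
    "l2reg": [0.0, 0.0001, 0.001],
    "dropout": [0.0, 0.2, 0.4],
    "initializer": ['Normal', 'GlorotNormal'],
    "optimizer": ['SGD', 'Adam', 'RMSprop'],
    "loss": ['categorical_crossentropy', 'categorical_hinge'],
}

# The size of the Cartesian product is the product of the list lengths
n_configs = math.prod(len(values) for values in grid.values())
n_folds = 5
print(n_configs, "configurations,", n_configs * n_folds, "model fits with 5-fold CV")

# product() yields one tuple per configuration, following the order of the input lists
first = next(itertools.product(*grid.values()))
print(dict(zip(grid.keys(), first)))
```

Adding one more list with three values would triple the total again, which is why exhaustive grids are usually restricted to a handful of hyperparameters.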

In [None]:
import itertools

In [None]:
%%time

# product() builds an iterator, which is consumed as it is traversed, so it must be
# re-created each time it is going to be used
combinations = itertools.product(lista_learningRate, lista_batchSize, lista_activations, 
                                 lista_regularizationL2, lista_dropout, lista_inicializacion,
                                 lista_optimizadores, lista_numUnits, lista_losses)

n_epochs = 20

max_score = 0.0
for learningRate, batchSize, activation, l2reg, dropout, initializer, optimizer, numUnits, loss in combinations:

  # Cross-validation using kfold
  cvscores = []
  for itrain, ival in kfold.split(x_train, t_train):
    model = build_model(learningRate, activation, l2reg, dropout, initializer, optimizer, numUnits, loss)
    model.fit(x_train[itrain], 
              t_train_categorical[itrain], 
              epochs=n_epochs,
              batch_size=batchSize,
              verbose=0)
    # Evaluate it:
    _, acc = model.evaluate(x_train[ival], t_train_categorical[ival], verbose=0)
    cvscores.append(acc*100.0)
  
  if np.mean(cvscores) > max_score:
    best_config = (learningRate, batchSize, activation, l2reg, dropout, initializer, optimizer, numUnits, loss)
    max_score = np.mean(cvscores)
    print(" > NEW Best config:", best_config)

print(" > Best validation config:", best_config)


In [None]:
# This run takes a very long time. The Cartesian product has more than 5000 different
# configurations (5184), and each one is trained 5 times for the 5-fold cross-validation.
# Either we have plenty of time to leave the machines running (in parallel if possible), or
# this grid search may take days to finish

**Question**: What could we do to speed up the hyperparameter search?

In [None]:
# This is where our "intuition" comes in. There is no need to try every possible
# combination if we have already experimented with the problem and seen that certain
# configurations work better.

### <font color="#CA3532">Exercise</font>

We will run a parameter search over just three variables to find the model that obtains the best **F1-score**.
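
As a reminder, the F1-score is the harmonic mean of precision P and recall R, that is, F1 = 2·P·R / (P + R). The sketch below computes it for a single class directly from integer labels; the function name `f1_for_class` and the toy labels are hypothetical. One thing worth checking, stated here as an assumption rather than a verified fact about this notebook: with a two-column one-hot output, the string metrics `'Precision'` and `'Recall'` may aggregate over both output columns, in which case they would simply mirror accuracy. Computing F1 for one class from predicted labels sidesteps that question.

```python
def f1_for_class(y_true, y_pred, positive=1):
    """F1-score of one class, computed from two sequences of integer labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so report 0
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy labels; with a Keras model, y_pred could come from model.predict(x).argmax(axis=1)
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
print(f1_for_class(y_true, y_pred))  # 3 TP, 1 FP, 1 FN -> P = R = 0.75
```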

In [None]:
# Define the hyperparameter values we want to search over

lista_learningRate = [0.001, 0.01, 0.1]
lista_activations = ['sigmoid', 'relu']
lista_dropout = [0.0, 0.2]

In [None]:
combinations = itertools.product(lista_learningRate, lista_activations, lista_dropout)

n_epochs = 20
batch_size = 50

counter = 0
max_score = 0.0
for learningRate, activation, dropout in combinations:
  config = (learningRate, activation, dropout)
  print(" > Trying config:", config)

  # Cross-validation using kfold
  cvscores = []
  for itrain, ival in kfold.split(x_train, t_train):
    model = build_model(learningRate, activation, 0.0, dropout, 'Normal', 'Adam', 20, 'categorical_crossentropy', metrics=['acc', 'Precision', 'Recall'])
    model.fit(x_train[itrain], 
              t_train_categorical[itrain], 
              epochs=n_epochs,
              batch_size=batch_size,
              verbose=0)
    # Evaluate it:
    _, acc, prec, recall = model.evaluate(x_train[ival], t_train_categorical[ival], verbose=0)
    f1_score = 2 * prec * recall / (prec + recall + 1e-8) # Add an epsilon to avoid division by zero
    cvscores.append(f1_score)

  print("   > Score:", np.mean(cvscores))
  
  if np.mean(cvscores) > max_score:
    best_config = config
    max_score = np.mean(cvscores)
    print("   >>> NEW Best config (", max_score, "):", best_config)

print("\n > Best validation config (", max_score, "):", best_config)

We now have the best configuration. Next we train the model on the full training data and evaluate on test:

In [None]:
# Fix the hyperparameters found by the search
learningRate = 0.01
activation = 'relu'
dropout = 0.2

model = build_model(learningRate, activation, 0.0, dropout, 'Normal', 'Adam', 20, 'categorical_crossentropy', metrics=['acc', 'Precision', 'Recall'])
model.fit(x_train, 
          t_train_categorical, 
          epochs=n_epochs,
          batch_size=batch_size,
          verbose=0)

# Evaluate it:
_, acc, prec, recall = model.evaluate(x_test, t_test_categorical, verbose=0)
f1_score = 2 * prec * recall / (prec + recall + 1e-8) # Add an epsilon to avoid division by zero
print("TEST F1 SCORE:", f1_score)

## <font color="#CA3532">Keras Tuner</font>

Keras Tuner is a library that greatly simplifies tuning the hyperparameters of a neural network.

**Keras tuner**: https://keras-team.github.io/keras-tuner/

In [None]:
!pip install -q -U keras-tuner
import keras_tuner as kt
from tensorflow.keras.utils import to_categorical

We define a hypermodel: a function that builds a Keras model from a set of hyperparameters ``hp`` that the tuner will vary.

In [None]:
# hp.Choice picks one of the given values
# hp.Int and hp.Float pick a value between a minimum and a maximum

def modelo(l2reg, num_units, activation, optimizer, seed=1):
  model = keras.Sequential()
  model.add(keras.layers.Input(shape=(30,))) # (30,) is a one-element tuple; (30) is just the integer 30
  model.add(keras.layers.Dense(units = num_units, activation=activation,
                               kernel_regularizer=keras.regularizers.l2(l2reg),
                               kernel_initializer=keras.initializers.GlorotNormal(seed=seed)))
  model.add(keras.layers.Dense(2, activation="softmax",
                               kernel_initializer=keras.initializers.RandomNormal(0, 1, seed=seed)))

  model.compile(optimizer=optimizer, loss='categorical_crossentropy', metrics=['acc'])
  
  return  model

def hipermodelo(hp):
  hp_l2reg = hp.Choice('l2reg', values = [1.0, 0.1, 0.01, 0.001, 0.0001]) 
  hp_num_units = hp.Choice('num_units', values = [5, 10, 15, 20]) 
  hp_act = hp.Choice('activation', values = ['sigmoid', 'relu']) 
  hp_opt = hp.Choice('optimizer', values = ['adam', 'sgd'])
  
  return modelo(hp_l2reg, hp_num_units, hp_act, hp_opt)

There are several hyperparameter search algorithms:

https://keras.io/api/keras_tuner/tuners/

We will use the Hyperband algorithm:

Li, Lisha, Kevin Jamieson, Giulia DeSalvo, Afshin Rostamizadeh, and Ameet Talwalkar. "Hyperband: A Novel Bandit-Based Approach to Hyperparameter Optimization." Journal of Machine Learning Research 18 (2018): 1-52.

https://keras.io/api/keras_tuner/tuners/hyperband/
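
To see what `max_epochs` and `factor` control, the following sketch reproduces the bracket schedule described in the Hyperband paper: each bracket starts with many configurations trained for few epochs and repeatedly keeps about `1/factor` of them while multiplying the epoch budget by `factor`. The function name `hyperband_schedule` is hypothetical, and Keras Tuner's own implementation may round or organize the brackets differently, so treat the numbers as indicative only.

```python
def hyperband_schedule(max_epochs, factor):
    """Bracket schedule following the Hyperband paper.

    Returns a list of brackets; each bracket is a list of rounds given as
    (number_of_configurations, epochs_per_configuration).
    """
    # s_max = floor(log_factor(max_epochs)), computed with integers to avoid float error
    s_max = 0
    while factor ** (s_max + 1) <= max_epochs:
        s_max += 1

    brackets = []
    for s in range(s_max, -1, -1):
        # Initial number of configurations: ceil((s_max + 1) * factor**s / (s + 1))
        n = -(-(s_max + 1) * factor ** s // (s + 1))
        rounds = []
        for i in range(s + 1):
            n_i = n // factor ** i                # configurations still alive in round i
            r_i = max_epochs / factor ** (s - i)  # epochs given to each of them
            rounds.append((n_i, r_i))
        brackets.append(rounds)
    return brackets

for rounds in hyperband_schedule(max_epochs=50, factor=3):
    print([(n, round(r, 1)) for n, r in rounds])
```

With `max_epochs=50` and `factor=3`, the most aggressive bracket starts with 27 configurations trained for under 2 epochs each and ends with a single survivor trained for the full 50 epochs, while the last bracket trains 4 configurations for 50 epochs with no early elimination.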

In [None]:
tuner = kt.Hyperband(hipermodelo,
                     objective = 'val_acc', # Metric to optimize
                     max_epochs = 50, # Maximum number of epochs to train any one model. Either use a high value together with early stopping, or use a low value
                     factor = 3, # Reduction factor for the number of models and epochs in each bracket
                     directory = 'my_dir',
                     project_name = 'cancer')

In [None]:
tuner.search_space_summary()

In [None]:
# Prepare the data with simple validation:

data = load_breast_cancer()

x = data.data
t = data.target[:, None]

x_train, x_test, t_train, t_test = train_test_split(x, t, test_size=0.2, random_state=42)
x_train, x_val, t_train, t_val = train_test_split(x_train, t_train, test_size=0.2, random_state=30)

t_train_categorical = to_categorical(t_train)
t_val_categorical = to_categorical(t_val)
t_test_categorical = to_categorical(t_test)

print(x_train.shape)
print(t_train_categorical.shape)
print(x_val.shape)
print(t_val_categorical.shape)
print(x_test.shape)
print(t_test_categorical.shape)

In [None]:
# In EarlyStopping we set the metric to monitor, which matches the tuner's objective
# patience is the number of epochs without improvement to wait before stopping
callbacks = [tf.keras.callbacks.EarlyStopping('val_acc', patience=5)]
tuner.search(x_train, t_train_categorical,
             validation_data=(x_val, t_val_categorical),
             callbacks=callbacks)
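
The role of `patience` can be illustrated without Keras. The sketch below is a simplified model of the stopping rule, assuming that a higher score is better and ignoring options such as a minimum improvement threshold; the function name `stopping_epoch` and the score list are made up for the example.

```python
def stopping_epoch(val_scores, patience):
    """Return the 0-based epoch at which training would stop, or None if it
    runs through every epoch. Assumes that a higher score is better."""
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best:
            best = score
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return None

# Hypothetical validation accuracies, one per epoch
val_acc = [0.80, 0.85, 0.84, 0.86, 0.86, 0.85, 0.86, 0.85, 0.84, 0.90]
print(stopping_epoch(val_acc, patience=5))
```

In this made-up run the best score so far (0.86 at epoch 3) is never strictly beaten in the next five epochs, so training stops at epoch 8 and the later 0.90 is never reached. A larger `patience` trades extra training time for a lower risk of stopping too early.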

In [None]:
best_hps = tuner.get_best_hyperparameters()[0]
best_hps.values

Once we have the best hyperparameters, we fix them and train a single time on all the training data.

In [None]:
# Concatenate the train and validation data
final_x_train = np.concatenate((x_train, x_val), axis=0)
final_t_train_categorical = np.concatenate((t_train_categorical, t_val_categorical), axis=0)

In [None]:
l2reg = best_hps['l2reg']
num_units = best_hps['num_units']
activation = best_hps['activation']
optimizer = best_hps['optimizer']
epochs = 50
model = modelo(l2reg, num_units, activation, optimizer)
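# Each fit() call below runs a single epoch (the default), so index 0 holds that epoch's metric.
# NOTE: the best weights are chosen by test accuracy, so the test score reported
# afterwards is optimistic rather than an unbiased estimate.
best_test_acc = 0.0
for epoch in range(epochs):
  history = model.fit(final_x_train, final_t_train_categorical, validation_data=(x_test, t_test_categorical))
  if history.history['val_acc'][0] > best_test_acc:
    best_test_acc = history.history['val_acc'][0]
    model.save_weights('best.weights.h5') # Recent Keras versions require the .weights.h5 suffix
model.load_weights('best.weights.h5')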

In [None]:
model.evaluate(x_test, t_test_categorical)

Finally, we give the algorithm more room to search. Instead of fixing a list of candidate values, we let it search within a range:

In [None]:
def modelo_v2(l2reg, num_units, dropout, activation, learning_rate, num_capas_ocultas, seed=1):
  model = keras.Sequential()
  model.add(keras.layers.Input(shape=(30,))) # (30,) is a one-element tuple; (30) is just the integer 30
  for _ in range(num_capas_ocultas):
    model.add(keras.layers.Dense(units = num_units, activation = activation, 
                                 kernel_regularizer=keras.regularizers.l2(l2reg),
                                 kernel_initializer=keras.initializers.GlorotNormal(seed=seed)))
    model.add(keras.layers.Dropout(rate = dropout))
  model.add(keras.layers.Dense(2, activation="softmax",
                               kernel_initializer=keras.initializers.RandomNormal(0, 1, seed=seed)))

  model.compile(optimizer=keras.optimizers.Adam(learning_rate=learning_rate), 
                loss='categorical_crossentropy',
                metrics=['acc', 'Precision', 'Recall'])
  
  return model

In [None]:
def hipermodelo_v2(hp):
  hp_l2reg = hp.Float('l2reg', min_value=1e-5, max_value=0.1, sampling='log') # Log sampling searches evenly across orders of magnitude
  hp_num_units = hp.Int('num_units', min_value=5, max_value=50)
  hp_dropout = hp.Float('dropout', min_value=0.0, max_value=0.5) # Dropout is searched on a linear scale
  hp_act = hp.Choice('activation', values = ['sigmoid', 'relu']) 
  hp_learning_rate = hp.Float('learning_rate', min_value=1e-4, max_value=1.0, sampling="log") # Same as l2reg
  hp_num_capas_ocultas = hp.Int('num_layers', min_value=1, max_value=5)
  
  return modelo_v2(hp_l2reg, hp_num_units, hp_dropout, hp_act, hp_learning_rate, hp_num_capas_ocultas)
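
The difference between linear and logarithmic sampling matters for parameters such as the learning rate, whose useful values span several orders of magnitude. The sketch below illustrates the idea behind `sampling='log'` with the standard library only; it is not Keras Tuner's code, and the helper `sample_log_uniform` is a hypothetical name.

```python
import math
import random

def sample_log_uniform(min_value, max_value, rng):
    """Draw a value whose logarithm is uniform on [log(min_value), log(max_value)]."""
    log_value = rng.uniform(math.log(min_value), math.log(max_value))
    return math.exp(log_value)

rng = random.Random(0)
linear = [rng.uniform(1e-4, 1.0) for _ in range(10000)]
logarithmic = [sample_log_uniform(1e-4, 1.0, rng) for _ in range(10000)]

# Fraction of samples below 0.01: tiny for linear sampling, about half for log sampling,
# because [1e-4, 1e-2] covers two of the four decades in the range
frac_linear = sum(v < 0.01 for v in linear) / len(linear)
frac_log = sum(v < 0.01 for v in logarithmic) / len(logarithmic)
print(frac_linear, frac_log)
```

With linear sampling almost every trial would use a learning rate above 0.01, so small values would barely be explored; log sampling gives each decade the same share of trials.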

In [None]:
tuner = kt.Hyperband(hipermodelo_v2,
                     objective = 'val_acc', 
                     max_epochs = 50,
                     factor = 3,
                     directory = 'my_dir',
                     project_name = 'cancer_v2')

In [None]:
tuner.search_space_summary()

In [None]:
# In EarlyStopping we set the metric to monitor, which matches the tuner's objective
# patience is the number of epochs without improvement to wait before stopping
callbacks = [tf.keras.callbacks.EarlyStopping('val_acc', patience=5)]
tuner.search(x_train, t_train_categorical,
             validation_data=(x_val, t_val_categorical),
             callbacks=callbacks)

In [None]:
best_hps = tuner.get_best_hyperparameters()[0]
best_hps.values

Once we have the best hyperparameters, we fix them and train a single time on all the training data.

In [None]:
# Concatenate the train and validation data
final_x_train = np.concatenate((x_train, x_val), axis=0)
final_t_train_categorical = np.concatenate((t_train_categorical, t_val_categorical), axis=0)

In [None]:
l2reg = best_hps['l2reg']
num_units = best_hps['num_units']
dropout = best_hps['dropout']
activation = best_hps['activation']
learning_rate = best_hps['learning_rate']
num_capas_ocultas = best_hps['num_layers']
model = modelo_v2(l2reg, num_units, dropout, activation, learning_rate, num_capas_ocultas)
epochs = 50

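# Each fit() call below runs a single epoch (the default), so index 0 holds that epoch's metric.
# NOTE: the best weights are chosen by test accuracy, so the test score reported
# afterwards is optimistic rather than an unbiased estimate.
best_test_acc = 0.0
for epoch in range(epochs):
  history = model.fit(final_x_train, final_t_train_categorical, validation_data=(x_test, t_test_categorical))
  if history.history['val_acc'][0] > best_test_acc:
    best_test_acc = history.history['val_acc'][0]
    model.save_weights('best.weights.h5') # Recent Keras versions require the .weights.h5 suffix
model.load_weights('best.weights.h5')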

In [None]:
model.evaluate(x_test, t_test_categorical)