# Autoencoders

An autoencoder consists of two neural networks: an encoder, whose goal is to reduce the representation of each instance to a lower-dimensional space, and a decoder, which reconstructs the original instance from that reduced representation. For example, MNIST has 784 features (one per pixel), and one may want to reduce them to a smaller number. This is called dimensionality reduction, and it has several applications, such as training models that do not behave well when there are many features, removing redundant features, or reducing noise. Autoencoders can also be used to compress images; they have been shown to be competitive with industry standards such as JPEG 2000 [1]. The architecture of an autoencoder is:

<img src="https://upload.wikimedia.org/wikipedia/commons/2/28/Autoencoder_structure.png"/>

> [Autoencoder](https://en.wikipedia.org/wiki/Autoencoder)



[1] Theis, L., Shi, W., Cunningham, A., & Huszár, F. (2017). Lossy image compression with compressive autoencoders. arXiv preprint arXiv:1703.00395.


## Ejemplo de autoencoder

In this example, the images of the MNIST dataset are projected onto a 2D space, which makes it possible to plot the instances on a plane, and the images are then reconstructed from that projection.
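
Before training anything, it can help to follow the shapes through the network. The sketch below is an illustration only: it uses NumPy with random, untrained weights (the helper `dense` and the fake input batch are assumptions made for this example, not part of the notebook) and mirrors the 784 → 100 → 2 → 100 → 784 layout used in the Keras code.

```python
import numpy as np

rng = np.random.default_rng(0)

def dense(x, n_out, activation):
    """Apply a randomly initialised dense layer (illustration only, untrained)."""
    w = rng.normal(scale=0.05, size=(x.shape[1], n_out))
    b = np.zeros(n_out)
    return activation(x @ w + b)

relu = lambda v: np.maximum(v, 0.0)
linear = lambda v: v
sigmoid = lambda v: 1.0 / (1.0 + np.exp(-v))

batch = rng.random((5, 784))                       # 5 fake "images", flattened 28 x 28
hidden = dense(batch, 100, relu)                   # encoder hidden layer
code = dense(hidden, 2, linear)                    # 2D embedding, one point per image
hidden_dec = dense(code, 100, relu)                # decoder hidden layer
reconstruction = dense(hidden_dec, 784, sigmoid)   # back to 784 pixels in (0, 1)

# Mean squared error between input and reconstruction:
# this is the quantity that training with loss='mse' minimises.
mse = np.mean((batch - reconstruction) ** 2)
print(code.shape, reconstruction.shape)
```

The 2D `code` array is what gets plotted on the plane, and `reconstruction` plays the role of the regenerated images.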


In [None]:
%matplotlib inline
import tensorflow.keras
from tensorflow.keras.layers import Activation, Dense, Input
from tensorflow.keras.layers import Conv2D, Flatten
from tensorflow.keras.layers import Reshape, Conv2DTranspose
from tensorflow.keras.models import Model
from tensorflow.keras import backend as K
from tensorflow.keras.datasets import mnist
import numpy as np
import matplotlib.pyplot as plt
from PIL import Image

# MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

image_size = x_train.shape[1]
x_train = np.reshape(x_train, [-1, image_size * image_size])
x_test = np.reshape(x_test, [-1, image_size * image_size])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

In [None]:
i = Input((image_size * image_size,))
d = Dense(100, activation='relu')(i)
d = Dense(2, activation='linear')(d)
encoder = Model(i, d, name='Encoder')
encoder.summary()

d_i = Input((2,))
d_d = Dense(100, activation='relu')(d_i)
d_d = Dense(image_size * image_size, activation='sigmoid')(d_d)
decoder = Model(d_i, d_d, name='Decoder')
decoder.summary()

autoencoder = Model(i, decoder(encoder(i)), name='Autoencoder')
autoencoder.compile(loss='mse', optimizer='nadam')

autoencoder.fit(x_train,
                x_train,
                validation_data=(x_test, x_test),
                epochs=30,
                batch_size=128, verbose=2)


In [None]:
img = np.empty((image_size*10, image_size*10))
img_pred = np.empty((image_size*10, image_size*10))
x_test_pred = autoencoder.predict(x_test)
for i in range(10):
    for j in range(10):
        img[i*image_size:(i+1)*image_size, j*image_size:(j+1)*image_size] = np.reshape(x_test[i*10+j, :], (image_size, image_size))
        img_pred[i*image_size:(i+1)*image_size, j*image_size:(j+1)*image_size] = np.reshape(x_test_pred[i*10+j, :], (image_size, image_size))
plt.rcParams['figure.figsize'] = [10, 10]
plt.imshow(img, cmap='gray')
plt.show()
plt.imshow(img_pred, cmap='gray')
plt.show()

In [None]:
emb = encoder.predict(x_test)
plt.rcParams['figure.figsize'] = [10, 10]
plt.scatter(emb[:, 0], emb[:, 1], c=y_test)
plt.colorbar()
plt.show()

## Denoiser autoencoder

The following example, based on the [Keras examples](https://github.com/keras-team/keras/blob/master/examples/mnist_denoising_autoencoder.py), uses an autoencoder to remove noise from MNIST. In this example the noise is added artificially: Gaussian noise with mean 0.5 and standard deviation 0.5 is added to every pixel, and the result is clipped back to the range [0, 1]. Note that the pixels are normalized to values between 0 and 1, so the noise is significant.

The encoder has the following architecture:

1. Input of 28 x 28 x 1
1. Convolutional layer with 32 filters and a 3x3 kernel
1. Convolutional layer with 64 filters and a 3x3 kernel
1. Flatten layer. Each image becomes a vector of 3136 elements
1. Dense layer with 16 neurons


That is, at the end of the encoder each image is represented by a vector of 16 features instead of 784 pixels.
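
The figure of 3136 elements can be checked with a little arithmetic. The sketch below assumes, as in the code further down, that both convolutions use `strides=2` and `padding='same'`, in which case each layer maps a spatial size `n` to `ceil(n / 2)`. The helper name `same_conv_output` is made up for this illustration.

```python
import math

def same_conv_output(size, stride):
    """Spatial size after a convolution with padding='same' and the given stride."""
    return math.ceil(size / stride)

size = 28
filters_per_layer = [32, 64]
for filters in filters_per_layer:
    size = same_conv_output(size, stride=2)   # 28 -> 14 -> 7

# The flatten layer sees a 7 x 7 map with 64 channels.
flattened = size * size * filters_per_layer[-1]
print(size, flattened)
```

The same shape (7, 7, 64) is what the decoder has to rebuild from the 16-dimensional latent vector before it starts upsampling.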

The decoder, which is responsible for regenerating the image, has the following architecture:

1. Input of 16
1. A dense layer with 3136 outputs
1. Deconvolution with 64 filters
1. Deconvolution with 32 filters
1. Deconvolution with 1 filter, which reconstructs the original image.


Deconvolutions (implemented in the code below as transposed convolutions, `Conv2DTranspose`) are operations that make it possible to reconstruct images to which convolutional filters were applied. See: [Deconvolutional Networks](https://www.matthewzeiler.com/mattzeiler/deconvolutionalnetworks.pdf).
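
To make the upsampling effect concrete, here is a minimal one-dimensional sketch of a transposed convolution in NumPy. It is an illustration under simplifying assumptions (one channel, no padding, a hand-picked kernel); the function `conv_transpose_1d` is hypothetical and is not the Keras implementation.

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride):
    """1-D transposed convolution without padding ('valid').

    Every input value stamps a scaled copy of the kernel into the output,
    and consecutive stamps are shifted by `stride` positions.
    """
    x = np.asarray(x, dtype=float)
    kernel = np.asarray(kernel, dtype=float)
    out = np.zeros((len(x) - 1) * stride + len(kernel))
    for i, value in enumerate(x):
        start = i * stride
        out[start:start + len(kernel)] += value * kernel
    return out

# Two input values become five output values: the signal is upsampled.
upsampled = conv_transpose_1d([1.0, 2.0], kernel=[1.0, 1.0, 1.0], stride=2)
print(upsampled)
```

With `padding='same'`, as used in this notebook, Keras crops the result so that the output size is exactly the input size times the stride, which is how the decoder goes from 7 to 14 to 28 pixels.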





In [None]:
(x_train, _), (x_test, _) = mnist.load_data()

image_size = x_train.shape[1]
x_train = np.reshape(x_train, [-1, image_size, image_size, 1])
x_test = np.reshape(x_test, [-1, image_size, image_size, 1])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# Generate corrupted MNIST images by adding noise with normal dist
# centered at 0.5 and std=0.5
noise = np.random.normal(loc=0.5, scale=0.5, size=x_train.shape)
x_train_noisy = x_train + noise
noise = np.random.normal(loc=0.5, scale=0.5, size=x_test.shape)
x_test_noisy = x_test + noise

x_train_noisy = np.clip(x_train_noisy, 0., 1.)
x_test_noisy = np.clip(x_test_noisy, 0., 1.)

# Network parameters
input_shape = (image_size, image_size, 1)
batch_size = 128
kernel_size = 3
latent_dim = 16
# Encoder/Decoder number of CNN layers and filters per layer
layer_filters = [32, 64]

# Build the Autoencoder Model
# First build the Encoder Model
inputs = Input(shape=input_shape, name='encoder_input')
x = inputs
# Stack of Conv2D blocks
# Notes:
# 1) Use Batch Normalization before ReLU on deep networks
# 2) Use MaxPooling2D as alternative to strides>1
# - faster but not as good as strides>1
for filters in layer_filters:
    x = Conv2D(filters=filters,
               kernel_size=kernel_size,
               strides=2,
               activation='relu',
               padding='same')(x)

# Shape info needed to build Decoder Model
shape = K.int_shape(x)

# Generate the latent vector
x = Flatten()(x)
latent = Dense(latent_dim, name='latent_vector')(x)

# Instantiate Encoder Model
encoder = Model(inputs, latent, name='encoder')
print('Encoder')
encoder.summary()

# Build the Decoder Model
latent_inputs = Input(shape=(latent_dim,), name='decoder_input')
x = Dense(shape[1] * shape[2] * shape[3])(latent_inputs)
x = Reshape((shape[1], shape[2], shape[3]))(x)

# Stack of Transposed Conv2D blocks
# Notes:
# 1) Use Batch Normalization before ReLU on deep networks
# 2) Use UpSampling2D as alternative to strides>1
# - faster but not as good as strides>1
for filters in layer_filters[::-1]:
    x = Conv2DTranspose(filters=filters,
                        kernel_size=kernel_size,
                        strides=2,
                        activation='relu',
                        padding='same')(x)

x = Conv2DTranspose(filters=1,
                    kernel_size=kernel_size,
                    padding='same')(x)

outputs = Activation('sigmoid', name='decoder_output')(x)

# Instantiate Decoder Model
decoder = Model(latent_inputs, outputs, name='decoder')
print('Decoder')
decoder.summary()

# Autoencoder = Encoder + Decoder
# Instantiate Autoencoder Model
print('Stacked encoder-decoder for training')
autoencoder = Model(inputs, decoder(encoder(inputs)), name='autoencoder')
autoencoder.summary()

autoencoder.compile(loss='mse', optimizer='adam')

# Train the autoencoder
autoencoder.fit(x_train_noisy,
                x_train,
                validation_data=(x_test_noisy, x_test),
                epochs=10,
                batch_size=batch_size)

# Predict the Autoencoder output from corrupted test images
x_decoded = autoencoder.predict(x_test_noisy)

# Display the first 300 original, corrupted and denoised images
rows, cols = 10, 30
num = rows * cols
imgs = np.concatenate([x_test[:num], x_test_noisy[:num], x_decoded[:num]])
imgs = imgs.reshape((rows * 3, cols, image_size, image_size))
imgs = np.vstack(np.split(imgs, rows, axis=1))
imgs = imgs.reshape((rows * 3, -1, image_size, image_size))
imgs = np.vstack([np.hstack(i) for i in imgs])
imgs = (imgs * 255).astype(np.uint8)
plt.rcParams['figure.figsize'] = [25, 25]
plt.figure()
plt.axis('off')
plt.title('Original images: top rows, '
          'Corrupted Input: middle rows, '
          'Denoised Input:  third rows')
plt.imshow(imgs, interpolation='none', cmap='gray')
plt.show()

## Variational Autoencoder
Variational autoencoders try to learn a statistical representation of the latent dimensions. The encoder returns the distribution of the latent dimensions, described by its mean and standard deviation (in the code below, the logarithm of the variance). The decoder, in turn, draws a sample from this distribution to generate the images.

$encoder(x)=q(z|x)$

$decoder(z)=p(x|z)$
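
The sampling step can be sketched in NumPy. This is an illustration of the reparameterization trick, not the Keras code itself: the function name `sample_latent` and the chosen means and variances are assumptions made for the example. Instead of sampling $z$ directly, we sample $\epsilon \sim N(0, I)$ and compute $z = \mu + \sigma \epsilon$, with $\sigma = \exp(0.5 \log \sigma^2)$.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_latent(z_mean, z_log_var):
    """Draw z = mean + std * epsilon, with epsilon ~ N(0, I)."""
    epsilon = rng.standard_normal(z_mean.shape)
    return z_mean + np.exp(0.5 * z_log_var) * epsilon

n = 100_000
z_mean = np.tile([1.0, -2.0], (n, 1))
z_log_var = np.tile(np.log([0.25, 4.0]), (n, 1))   # standard deviations 0.5 and 2.0
z = sample_latent(z_mean, z_log_var)

# The empirical statistics should be close to the requested mean and std.
print(z.mean(axis=0), z.std(axis=0))
```

Because the randomness lives only in `epsilon`, the mean and log-variance remain ordinary differentiable inputs, which is what allows the encoder to be trained with gradient descent.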

The loss function used for this type of autoencoder combines a reconstruction term with the **Kullback-Leibler divergence**, also known as information divergence. It is a non-symmetric measure of how much two probability distributions differ.

$KL(p\|q)=\sum_i p_i \ln\frac{p_i}{q_i}$
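
A small numerical sketch may help to see that the divergence is not symmetric. The first function implements the discrete formula above, assuming $q_i > 0$ wherever $p_i > 0$. The second one is the closed form of the divergence between a diagonal Gaussian and a standard normal, which is the expression that appears as `kl_loss` in the code below. Both function names and the example distributions are chosen for this illustration only.

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(p || q); assumes q_i > 0 wherever p_i > 0."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0            # terms with p_i = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def kl_to_standard_normal(z_mean, z_log_var):
    """KL( N(mean, exp(log_var)) || N(0, I) ), summed over latent dimensions."""
    z_mean = np.asarray(z_mean, dtype=float)
    z_log_var = np.asarray(z_log_var, dtype=float)
    return float(-0.5 * np.sum(1 + z_log_var - z_mean ** 2 - np.exp(z_log_var)))

p = [0.5, 0.4, 0.1]
q = [0.3, 0.3, 0.4]
print(kl_divergence(p, q), kl_divergence(q, p))   # two different values
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))
```

The divergence is zero only when both distributions coincide, so in the VAE it pulls the encoder output towards a standard normal while the reconstruction term pulls it towards encoding useful information.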

In [None]:
%matplotlib inline
from tensorflow.keras.layers import Lambda, Input, Dense
from tensorflow.keras.models import Model
from tensorflow.keras.datasets import mnist
from tensorflow.keras.losses import mse, binary_crossentropy
from tensorflow.keras.utils import plot_model
from tensorflow.keras import backend as K

import numpy as np
import matplotlib.pyplot as plt
import argparse
import os


# reparameterization trick
# instead of sampling from Q(z|X), sample epsilon = N(0,I)
# z = z_mean + sqrt(var) * epsilon
def sampling(args):
    """Reparameterization trick by sampling from an isotropic unit Gaussian.
    # Arguments
        args (tensor): mean and log of variance of Q(z|X)
    # Returns
        z (tensor): sampled latent vector
    """

    z_mean, z_log_var = args
    batch = K.shape(z_mean)[0]
    dim = K.int_shape(z_mean)[1]
    # by default, random_normal has mean = 0 and std = 1.0
    epsilon = K.random_normal(shape=(batch, dim))
    return z_mean + K.exp(0.5 * z_log_var) * epsilon


def plot_results(models,
                 data,
                 batch_size=128):
    """Plots labels and MNIST digits as a function of the 2D latent vector
    # Arguments
        models (tuple): encoder and decoder models
        data (tuple): test data and label
        batch_size (int): prediction batch size
    """

    encoder, decoder = models
    x_test, y_test = data
    # display a 2D plot of the digit classes in the latent space
    z_mean, z_log_var, _ = encoder.predict(x_test,
                                   batch_size=batch_size)
    plt.figure(figsize=(12, 10))
    plt.scatter(z_mean[:, 0], z_mean[:, 1], c=y_test)
    plt.colorbar()
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.show()

    plt.figure(figsize=(12, 10))
    plt.scatter(z_log_var[:, 0], z_log_var[:, 1], c=y_test)
    plt.colorbar()
    plt.xlabel("z_log_var[0]")
    plt.ylabel("z_log_var[1]")
    plt.show()

    # display a 30x30 2D manifold of digits
    n = 30
    digit_size = 28
    figure = np.zeros((digit_size * n, digit_size * n))
    # linearly spaced coordinates corresponding to the 2D plot
    # of digit classes in the latent space
    grid_x = np.linspace(-4, 4, n)
    grid_y = np.linspace(-4, 4, n)[::-1]

    for i, yi in enumerate(grid_y):
        for j, xi in enumerate(grid_x):
            z_sample = np.array([[xi, yi]])
            x_decoded = decoder.predict(z_sample)
            digit = x_decoded[0].reshape(digit_size, digit_size)
            figure[i * digit_size: (i + 1) * digit_size,
                   j * digit_size: (j + 1) * digit_size] = digit

    plt.figure(figsize=(10, 10))
    start_range = digit_size // 2
    end_range = (n - 1) * digit_size + start_range + 1
    pixel_range = np.arange(start_range, end_range, digit_size)
    sample_range_x = np.round(grid_x, 1)
    sample_range_y = np.round(grid_y, 1)
    plt.xticks(pixel_range, sample_range_x)
    plt.yticks(pixel_range, sample_range_y)
    plt.xlabel("z[0]")
    plt.ylabel("z[1]")
    plt.imshow(figure, cmap='Greys_r')
    plt.show()


# MNIST dataset
(x_train, y_train), (x_test, y_test) = mnist.load_data()

image_size = x_train.shape[1]
original_dim = image_size * image_size
x_train = np.reshape(x_train, [-1, original_dim])
x_test = np.reshape(x_test, [-1, original_dim])
x_train = x_train.astype('float32') / 255
x_test = x_test.astype('float32') / 255

# network parameters
input_shape = (original_dim, )
intermediate_dim = 512
batch_size = 128
latent_dim = 2
epochs = 50

# VAE model = encoder + decoder
# build encoder model
inputs = Input(shape=input_shape, name='encoder_input')
x = Dense(intermediate_dim, activation='relu')(inputs)
z_mean = Dense(latent_dim, name='z_mean')(x)
z_log_var = Dense(latent_dim, name='z_log_var')(x)

# use reparameterization trick to push the sampling out as input
# note that "output_shape" isn't necessary with the TensorFlow backend
z = Lambda(sampling, output_shape=(latent_dim,), name='z')([z_mean, z_log_var])

# instantiate encoder model
encoder = Model(inputs, [z_mean, z_log_var, z], name='encoder')
encoder.summary()
plot_model(encoder, to_file='vae_mlp_encoder.png', show_shapes=True)

# build decoder model
latent_inputs = Input(shape=(latent_dim,), name='z_sampling')
x = Dense(intermediate_dim, activation='relu')(latent_inputs)
outputs = Dense(original_dim, activation='sigmoid')(x)

# instantiate decoder model
decoder = Model(latent_inputs, outputs, name='decoder')
decoder.summary()
plot_model(decoder, to_file='vae_mlp_decoder.png', show_shapes=True)

# instantiate VAE model
outputs = decoder(encoder(inputs)[2])
vae = Model(inputs, outputs, name='vae_mlp')

def run():
    models = (encoder, decoder)
    data = (x_test, y_test)

    # VAE loss = mse_loss or xent_loss + kl_loss
    #reconstruction_loss = mse(inputs, outputs)
    reconstruction_loss = binary_crossentropy(inputs, outputs)

    reconstruction_loss *= original_dim
    kl_loss = 1 + z_log_var - K.square(z_mean) - K.exp(z_log_var)
    kl_loss = K.sum(kl_loss, axis=-1)
    kl_loss *= -0.5
    vae_loss = K.mean(reconstruction_loss + kl_loss)
    vae.add_loss(vae_loss)
    vae.compile(optimizer='adam')
    vae.summary()
    
    # train the autoencoder
    vae.fit(x_train,
            epochs=epochs,
            batch_size=batch_size,
            validation_data=(x_test, None))
    #vae.save_weights('vae_mlp_mnist.h5')

    plot_results(models,
                 data,
                 batch_size=batch_size)
    
run()

# Transfer learning


Transfer learning is another way of using Deep Learning techniques. It is used when training data is scarce but models trained for similar tasks are available. As an example, we will use the dataset known as [CIFAR-10](https://www.cs.toronto.edu/~kriz/cifar.html).


| Property | Value |
| --- | --- |
| Classes | 10 |
| Image size | 32 x 32 |
| Image channels | 3 (RGB) |
| Training instances | 50,000 |
| Test instances | 10,000 |
| Minimum pixel value | 0 |
| Maximum pixel value | 255 |

The dataset contains 32 x 32 pixel color images divided into 10 classes:
1. Airplane
1. Automobile
1. Bird
1. Cat
1. Deer
1. Dog
1. Frog
1. Horse
1. Ship
1. Truck



In [None]:
from tensorflow.keras.datasets import cifar10

(x_train, y_train), (x_test, y_test) = cifar10.load_data()

print('First 100 elements of the training set')
f = plt.figure(111)
for i in range(10):
    for j in range(10):
        ax = f.add_subplot(10, 10, i + j*10 + 1)
        ax.set_xticklabels('')
        ax.set_yticklabels('')
        ax.imshow(x_train[i + j*10, :, :], cmap='gray')
plt.show()

x_train = x_train / 255
x_test = x_test / 255

In [None]:
from tensorflow.keras.layers import Conv2D, Flatten
from tensorflow.keras.utils import to_categorical
from sklearn.metrics import accuracy_score as acc, confusion_matrix

def show_confusion_matrix_nl(cm):
    fig = plt.figure()
    ax = fig.add_subplot(111)
    cax = ax.matshow(cm)
    plt.title('Confusion matrix')
    fig.colorbar(cax)
    # sklearn's confusion_matrix puts true labels on rows and predictions on columns
    plt.xlabel('Predicted')
    plt.ylabel('True')
    plt.show()


i = Input(shape=(32, 32, 3))
d = Conv2D(5, (5,5), activation='relu')(i)
d = Conv2D(5, (5,5), activation='relu')(d)
d = Conv2D(5, (5,5), activation='relu')(d)
d = Conv2D(10, (5,5), activation='relu')(d)
d = Flatten()(d)
d = Dense(10, activation='softmax')(d)
model = Model(inputs=i, outputs=d)
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['categorical_accuracy'])

predict = lambda x: np.argmax(model.predict(x), axis=-1)
show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy before training: {}'.format(acc(y_test, predict(x_test))))

h = model.fit(x_train, to_categorical(y_train), epochs=10, batch_size=100, 
              validation_data=(x_test, to_categorical(y_test)), verbose=0)


print('Loss function:')
plt.plot(h.history['loss'], 'b-', h.history['val_loss'], 'r-')
plt.show()
print('Accuracy:')
plt.plot(h.history['categorical_accuracy'], 'b-', h.history['val_categorical_accuracy'], 'r-')
plt.show()

show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy after training: {}'.format(acc(y_test, predict(x_test))))

Suppose we only have a small portion of the data for training, for example 2000 images (200 per class). Would it be possible to train the neural network?
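
Selecting a fixed number of examples per class can also be written in vectorized form. The sketch below is an alternative to an explicit loop, shown on a tiny made-up label array; the helper `first_n_per_class` is hypothetical. It returns the indices of the first `n_per_class` occurrences of every class, in their original order.

```python
import numpy as np

def first_n_per_class(labels, n_per_class):
    """Indices of the first n_per_class examples of each class, in dataset order."""
    labels = np.asarray(labels).ravel()   # accepts column vectors such as CIFAR-10 labels
    picked = [np.flatnonzero(labels == c)[:n_per_class] for c in np.unique(labels)]
    return np.sort(np.concatenate(picked))

# Toy labels shaped (n, 1), like the y_train array returned by cifar10.load_data()
labels = np.array([[0], [1], [0], [2], [1], [0], [2], [2]])
idx = first_n_per_class(labels, n_per_class=2)
print(idx)
```

With real data one would index both arrays with the result, for example `x_train[idx]` and `y_train[idx].ravel()`.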

In [None]:
sample_per_class = 200

x_small = np.empty((sample_per_class * 10, 32, 32, 3))
y_small = np.empty((sample_per_class * 10,))


counter = [0] * 10

i = 0
for x, y in zip(x_train, y_train):
    # skip classes that already have enough samples
    if counter[y[0]] == sample_per_class:
        continue
    counter[y[0]] += 1
    x_small[i, :, :, :] = x
    y_small[i] = y[0]
    i += 1
    if i == sample_per_class * 10:
        break

In [None]:
i = Input(shape=(32, 32, 3))
d = Conv2D(5, (5,5), activation='relu')(i)
d = Conv2D(5, (5,5), activation='relu')(d)
d = Conv2D(5, (5,5), activation='relu')(d)
d = Conv2D(10, (5,5), activation='relu')(d)
d = Flatten()(d)
d = Dense(10, activation='softmax')(d)
model = Model(inputs=i, outputs=d)
model.summary()
model.compile(loss='categorical_crossentropy', optimizer='nadam', metrics=['categorical_accuracy'])

show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy before training: {}'.format(acc(y_test, predict(x_test))))

h = model.fit(x_small, to_categorical(y_small), epochs=50, batch_size=100, 
              validation_data=(x_test, to_categorical(y_test)), verbose=0)


print('Loss function:')
plt.plot(h.history['loss'], 'b-', h.history['val_loss'], 'r-')
plt.show()
print('Accuracy:')
plt.plot(h.history['categorical_accuracy'], 'b-', h.history['val_categorical_accuracy'], 'r-')
plt.show()

show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy after training: {}'.format(acc(y_test, predict(x_test))))

In the accuracy plot we can see that the network learns to identify the training examples very well. It reaches an accuracy of $40\%$, but when we evaluate on the test set the value is $30\%$. This phenomenon is known as *overfitting*, and it is a major problem when this kind of technique is used with little data.

*Transfer learning* is used for this kind of problem. To apply it, we take some neural network that has already been trained to classify images on a large dataset. Many are publicly available. Keras provides several [pretrained networks](https://keras.io/applications/) trained on [ImageNet](http://www.image-net.org/), a dataset with more than 14 million images, with 1000 classes in the version used to train these networks. Because its architecture is simple, we can take VGG16, which has more than **138 million parameters**. The architecture of the network can be seen below.

In [None]:
from tensorflow.keras.applications.vgg16 import VGG16
model = VGG16(include_top=True)
model.summary()

If we consider that the hidden layers learn the features of the images, we can split the network into two parts:

1. From the `Input` layer up to the `block5_pool` layer, as a feature extractor.
2. The `fc1` and `fc2` layers, together with the final `predictions` layer, as a classifier.

If we keep only the first part, we obtain a feature extractor for generic images:

In [None]:
# The model is heavy and we do not want to run out of GPU memory
del model
# Now load it without the top (the classifier layers)
model = VGG16(include_top=False)
model.summary()

For comparison, we will create 2 new datasets:


1. **x_small_t** and **x_test_t**: the small dataset transformed with the VGG16 model.
2. **x_small_f** and **x_test_f**: the small dataset reshaped so that every pixel value of the image is one element of a vector.
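
The size of each representation follows from simple arithmetic. The sketch below assumes the standard VGG16 convolutional base, whose five 2x2 max-pooling layers each halve the height and width and whose last block has 512 channels; the helper `vgg16_feature_shape` is made up for this illustration.

```python
def vgg16_feature_shape(height, width, n_pools=5, channels=512):
    """Output shape of the VGG16 convolutional base for a given input size."""
    for _ in range(n_pools):
        height, width = height // 2, width // 2   # each pooling halves H and W
    return height, width, channels

h, w, c = vgg16_feature_shape(32, 32)
transfer_features = h * w * c        # 1 * 1 * 512
raw_pixel_features = 32 * 32 * 3     # every pixel value as a feature
print(transfer_features, raw_pixel_features)
```

So each CIFAR-10 image is described by 512 learned features in the transformed dataset, compared with 3072 raw pixel values in the flattened one.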


In [None]:
# Transfer learning dataset
x_small_t = model.predict(x_small)
# This acts as a flatten
x_small_t = np.reshape(x_small_t, (x_small.shape[0], 512))
print('Shape of the dataset transformed with VGG16 {}'.format(x_small_t.shape))
# Test set
x_test_t = model.predict(x_test)
x_test_t = np.reshape(x_test_t, (x_test.shape[0], 512))


# Image dataset
x_small_f = np.reshape(x_small, (x_small.shape[0], 32 * 32 * 3))
print('Shape of the original dataset {}'.format(x_small_f.shape))
x_test_f = np.reshape(x_test, (x_test.shape[0], 32 * 32 * 3))

We can test both kinds of features with a logistic regression:

In [None]:
from sklearn.svm import LinearSVC, SVC
from sklearn.linear_model import LogisticRegression
# These parameters avoid warnings; they were the defaults up to version 0.22
mt = LogisticRegression(solver='liblinear', multi_class='ovr')
mf = LogisticRegression(solver='liblinear', multi_class='ovr')

print('Training Transfer')
mt.fit(x_small_t, y_small)
print('Training Full')
mf.fit(x_small_f, y_small)

print('Accuracy (VGG16 features): {}'.format(acc(y_test, mt.predict(x_test_t))))
print('Accuracy (raw pixels): {}'.format(acc(y_test, mf.predict(x_test_f))))

We can observe that the transferred features perform better than using the raw pixels.

## Fine Tuning

Another use of pretrained networks as feature extractors is to incorporate them into other neural networks to speed up training. For example, in the following case VGG16 is used as the initial layer of a neural network. For this to work, the updates to the weights must be more subtle than when a network is trained from scratch, since most of the weights are assumed to be already close to an optimal value. Consequently, we can change the **learning rate** of the optimizer, in this case **Stochastic Gradient Descent**, from $0.01$ to $0.001$, that is, one order of magnitude smaller.
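
The effect of the learning rate on a single update can be sketched with plain SGD, $w \leftarrow w - \eta \, g$. The weights and gradients below are made-up numbers, and `sgd_step` is a hypothetical helper that ignores momentum.

```python
def sgd_step(weights, gradients, learning_rate):
    """One plain SGD update: w <- w - learning_rate * gradient."""
    return [w - learning_rate * g for w, g in zip(weights, gradients)]

weights = [0.50, -1.20, 0.03]     # pretend these are pretrained weights
gradients = [0.80, -0.40, 2.00]   # pretend gradients from one batch

from_scratch = sgd_step(weights, gradients, learning_rate=0.01)
fine_tuning = sgd_step(weights, gradients, learning_rate=0.001)

# Largest change applied to any weight under each setting
largest_change_scratch = max(abs(a - b) for a, b in zip(from_scratch, weights))
largest_change_fine = max(abs(a - b) for a, b in zip(fine_tuning, weights))
print(largest_change_scratch, largest_change_fine)
```

For the same gradients, every weight moves ten times less with the smaller learning rate, which keeps the pretrained values from being overwritten during the first noisy batches.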

In [None]:
from tensorflow.keras.optimizers import SGD


i = Input((32, 32, 3))
model = VGG16(include_top=False)(i)

d = Flatten()(model)
d = Dense(512, activation='relu')(d)
d = Dense(10, activation='softmax')(d)
model = Model(inputs=i, outputs=d)
model.summary()
model.compile(loss='categorical_crossentropy',
              optimizer=SGD(learning_rate=1e-3, momentum=0.0, nesterov=False),
              metrics=['categorical_accuracy'])  # 1e-4: ok with 30 epochs

show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy before training: {}'.format(acc(y_test, predict(x_test))))

h = model.fit(x_train, to_categorical(y_train), epochs=4, batch_size=100, 
              validation_data=(x_test, to_categorical(y_test)), verbose=1)


print('Loss function:')
plt.plot(h.history['loss'], 'b-', h.history['val_loss'], 'r-')
plt.show()
print('Accuracy:')
plt.plot(h.history['categorical_accuracy'], 'b-', h.history['val_categorical_accuracy'], 'r-')
plt.show()

show_confusion_matrix_nl(confusion_matrix(y_test, predict(x_test)))
print('Accuracy after training: {}'.format(acc(y_test, predict(x_test))))

# GAN

Generative Adversarial Networks (GANs) are a technique for generating new instances using two neural networks that compete with each other:

* The generator: the neural network in charge of generating fake instances.
* The discriminator: the neural network in charge of deciding whether an instance is fake or real.

Training is done in batches. First, the discriminator is fed half real data and half fake data, and the objective is for it to classify the real data as real and the fake data as fake. In a second pass, the weights of the discriminator are frozen, the generator is connected to the discriminator, and the objective is for the discriminator to label all the data coming out of the generator as real.
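
The targets used in the two passes can be sketched with NumPy. The function names and the batch size are assumptions made for this illustration; the code below builds its targets in a similar way but softens the "real" label from 1 to 0.95.

```python
import numpy as np

def discriminator_targets(n_real, n_fake):
    """Targets for the discriminator pass: 1 for real images, 0 for generated ones."""
    return np.concatenate([np.ones(n_real), np.zeros(n_fake)])

def generator_targets(n_fake):
    """Targets for the generator pass: every generated image should be judged real."""
    return np.ones(n_fake)

batch_size = 4
# Pass 1: half real, half fake, labelled truthfully.
y_disc = discriminator_targets(batch_size, batch_size)
# Pass 2: only fake images, all labelled as real while the discriminator is frozen.
y_gen = generator_targets(2 * batch_size)
print(y_disc, y_gen)
```

In the second pass the loss is low only when the frozen discriminator is fooled, so the gradient pushes the generator towards more convincing images.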

An example is presented below, based on the [AC-GAN](https://github.com/keras-team/keras/blob/master/examples/mnist_acgan.py) implementation from the Keras examples.

In [None]:
from collections import defaultdict
import pickle
from PIL import Image

from tensorflow.keras.datasets import mnist
from tensorflow.keras import layers
from tensorflow.keras.layers import Input, Dense, Reshape, Flatten, Embedding, Dropout
from tensorflow.keras.layers import BatchNormalization, LeakyReLU, Conv2DTranspose, Conv2D
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import Progbar
import numpy as np
import matplotlib.pyplot as plt


np.random.seed(1337)
num_classes = 10


def build_generator(latent_size):
    # we will map a pair of (z, L), where z is a latent vector and L is a
    # label drawn from P_c, to image space (..., 28, 28, 1)
    cnn = Sequential()

    cnn.add(Dense(3 * 3 * 384, input_dim=latent_size, activation='relu'))
    cnn.add(Reshape((3, 3, 384)))

    # upsample to (7, 7, ...)
    cnn.add(Conv2DTranspose(192, 5, strides=1, padding='valid',
                            activation='relu',
                            kernel_initializer='glorot_normal'))
    cnn.add(BatchNormalization())

    # upsample to (14, 14, ...)
    cnn.add(Conv2DTranspose(96, 5, strides=2, padding='same',
                            activation='relu',
                            kernel_initializer='glorot_normal'))
    cnn.add(BatchNormalization())

    # upsample to (28, 28, ...)
    cnn.add(Conv2DTranspose(1, 5, strides=2, padding='same',
                            activation='tanh',
                            kernel_initializer='glorot_normal'))

    # this is the z space commonly referred to in GAN papers
    latent = Input(shape=(latent_size, ))

    # this will be our label
    image_class = Input(shape=(1,), dtype='int32')

    cls = Embedding(num_classes, latent_size,
                    embeddings_initializer='glorot_normal')(image_class)

    # hadamard product between z-space and a class conditional embedding
    h = layers.multiply([latent, cls])

    fake_image = cnn(h)

    return Model([latent, image_class], fake_image)


def build_discriminator():
    # build a relatively standard conv net, with LeakyReLUs as suggested in
    # the reference paper
    cnn = Sequential()

    cnn.add(Conv2D(32, 3, padding='same', strides=2,
                   input_shape=(28, 28, 1)))
    cnn.add(LeakyReLU(0.2))
    cnn.add(Dropout(0.3))

    cnn.add(Conv2D(64, 3, padding='same', strides=1))
    cnn.add(LeakyReLU(0.2))
    cnn.add(Dropout(0.3))

    cnn.add(Conv2D(128, 3, padding='same', strides=2))
    cnn.add(LeakyReLU(0.2))
    cnn.add(Dropout(0.3))

    cnn.add(Conv2D(256, 3, padding='same', strides=1))
    cnn.add(LeakyReLU(0.2))
    cnn.add(Dropout(0.3))

    cnn.add(Flatten())

    image = Input(shape=(28, 28, 1))

    features = cnn(image)

    # first output (name=generation) is whether or not the discriminator
    # thinks the image that is being shown is fake, and the second output
    # (name=auxiliary) is the class that the discriminator thinks the image
    # belongs to.
    fake = Dense(1, activation='sigmoid', name='generation')(features)
    aux = Dense(num_classes, activation='softmax', name='auxiliary')(features)

    return Model(image, [fake, aux])



In [None]:
# batch and latent size taken from the paper
epochs = 2
batch_size = 100
latent_size = 100
# Adam parameters suggested in https://arxiv.org/abs/1511.06434
adam_lr = 0.0002
adam_beta_1 = 0.5

# build the discriminator
print('Discriminator model:')
discriminator = build_discriminator()
discriminator.compile(
    optimizer=Adam(learning_rate=adam_lr, beta_1=adam_beta_1),
    loss=['binary_crossentropy', 'sparse_categorical_crossentropy']
)
discriminator.summary()

# build the generator
generator = build_generator(latent_size)

latent = Input(shape=(latent_size, ))
image_class = Input(shape=(1,), dtype='int32')

# get a fake image
fake = generator([latent, image_class])

# we only want to be able to train generation for the combined model
discriminator.trainable = False
fake, aux = discriminator(fake)
combined = Model([latent, image_class], [fake, aux])

print('Combined model:')
combined.compile(
    optimizer=Adam(learning_rate=adam_lr, beta_1=adam_beta_1),
    loss=['binary_crossentropy', 'sparse_categorical_crossentropy']
)
combined.summary()

# get our mnist data, and force it to be of shape (..., 28, 28, 1) with
# range [-1, 1]
(x_train, y_train), (x_test, y_test) = mnist.load_data()
x_train = (x_train.astype(np.float32) - 127.5) / 127.5
x_train = np.expand_dims(x_train, axis=-1)

x_test = (x_test.astype(np.float32) - 127.5) / 127.5
x_test = np.expand_dims(x_test, axis=-1)

num_train, num_test = x_train.shape[0], x_test.shape[0]

train_history = defaultdict(list)
test_history = defaultdict(list)

for epoch in range(1, epochs + 1):
    print('Epoch {}/{}'.format(epoch, epochs))

    num_batches = int(np.ceil(x_train.shape[0] / float(batch_size)))
    progress_bar = Progbar(target=num_batches)

    epoch_gen_loss = []
    epoch_disc_loss = []

    for index in range(num_batches):
        # get a batch of real images
        image_batch = x_train[index * batch_size:(index + 1) * batch_size]
        label_batch = y_train[index * batch_size:(index + 1) * batch_size]

        # generate a new batch of noise
        noise = np.random.uniform(-1, 1, (len(image_batch), latent_size))

        # sample some labels from p_c
        sampled_labels = np.random.randint(0, num_classes, len(image_batch))

        # generate a batch of fake images, using the generated labels as a
        # conditioner. We reshape the sampled labels to be
        # (len(image_batch), 1) so that we can feed them into the embedding
        # layer as a length one sequence
        generated_images = generator.predict(
            [noise, sampled_labels.reshape((-1, 1))], verbose=0)

        x = np.concatenate((image_batch, generated_images))

        # use one-sided soft real/fake labels
        # Salimans et al., 2016
        # https://arxiv.org/pdf/1606.03498.pdf (Section 3.4)
        soft_zero, soft_one = 0, 0.95
        y = np.array(
            [soft_one] * len(image_batch) + [soft_zero] * len(image_batch))
        aux_y = np.concatenate((label_batch, sampled_labels), axis=0)

        # we don't want the discriminator to also maximize the classification
        # accuracy of the auxiliary classifier on generated images, so we
        # don't train discriminator to produce class labels for generated
        # images (see https://openreview.net/forum?id=rJXTf9Bxg).
        # To preserve sum of sample weights for the auxiliary classifier,
        # we assign sample weight of 2 to the real images.
        disc_sample_weight = [np.ones(2 * len(image_batch)),
                              np.concatenate((np.ones(len(image_batch)) * 2,
                                              np.zeros(len(image_batch))))]

        # see if the discriminator can figure itself out...
        epoch_disc_loss.append(discriminator.train_on_batch(
            x, [y, aux_y], sample_weight=disc_sample_weight))

        # make new noise. we generate 2 * batch size here such that we have
        # the generator optimize over an identical number of images as the
        # discriminator
        noise = np.random.uniform(-1, 1, (2 * len(image_batch), latent_size))
        sampled_labels = np.random.randint(0, num_classes, 2 * len(image_batch))

        # we want to train the generator to trick the discriminator
        # For the generator, we want all the {fake, not-fake} labels to say
        # not-fake
        trick = np.ones(2 * len(image_batch)) * soft_one

        epoch_gen_loss.append(combined.train_on_batch(
           [noise, sampled_labels.reshape((-1, 1))],
           [trick, sampled_labels]))

        progress_bar.update(index + 1)

    print('Testing for epoch {}:'.format(epoch))

    # evaluate the testing loss here

    # generate a new batch of noise
    noise = np.random.uniform(-1, 1, (num_test, latent_size))

    # sample some labels from p_c and generate images from them
    sampled_labels = np.random.randint(0, num_classes, num_test)
    generated_images = generator.predict(
        [noise, sampled_labels.reshape((-1, 1))], verbose=False)

    x = np.concatenate((x_test, generated_images))
    y = np.array([1] * num_test + [0] * num_test)
    aux_y = np.concatenate((y_test, sampled_labels), axis=0)

    # see if the discriminator can figure itself out...
    discriminator_test_loss = discriminator.evaluate(
        x, [y, aux_y], verbose=False)

    discriminator_train_loss = np.mean(np.array(epoch_disc_loss), axis=0)

    # make new noise
    noise = np.random.uniform(-1, 1, (2 * num_test, latent_size))
    sampled_labels = np.random.randint(0, num_classes, 2 * num_test)

    trick = np.ones(2 * num_test)

    generator_test_loss = combined.evaluate(
        [noise, sampled_labels.reshape((-1, 1))],
        [trick, sampled_labels], verbose=False)

    generator_train_loss = np.mean(np.array(epoch_gen_loss), axis=0)

    # generate an epoch report on performance
    train_history['generator'].append(generator_train_loss)
    train_history['discriminator'].append(discriminator_train_loss)

    test_history['generator'].append(generator_test_loss)
    test_history['discriminator'].append(discriminator_test_loss)

    print('{0:<22s} | {1:4s} | {2:15s} | {3:5s}'.format(
        'component', *discriminator.metrics_names))
    print('-' * 65)

    ROW_FMT = '{0:<22s} | {1:<4.2f} | {2:<15.4f} | {3:<5.4f}'
    print(ROW_FMT.format('generator (train)',
                         *train_history['generator'][-1]))
    print(ROW_FMT.format('generator (test)',
                         *test_history['generator'][-1]))
    print(ROW_FMT.format('discriminator (train)',
                         *train_history['discriminator'][-1]))
    print(ROW_FMT.format('discriminator (test)',
                         *test_history['discriminator'][-1]))

    # generate some digits to display
    num_rows = 40
    noise = np.tile(np.random.uniform(-1, 1, (num_rows, latent_size)),
                     (num_classes, 1))

    sampled_labels = np.array([
        [i] * num_rows for i in range(num_classes)
    ]).reshape(-1, 1)

    # get a batch to display
    generated_images = generator.predict(
        [noise, sampled_labels], verbose=0)

    # prepare real images sorted by class label
    real_labels = y_train[(epoch - 1) * num_rows * num_classes:
                          epoch * num_rows * num_classes]
    indices = np.argsort(real_labels, axis=0)
    real_images = x_train[(epoch - 1) * num_rows * num_classes:
                         epoch * num_rows * num_classes][indices]

    # display generated images, white separator, real images
    img = np.concatenate(
        (generated_images,
         np.repeat(np.ones_like(x_train[:1]), num_rows, axis=0),
         real_images))

    # arrange them into a grid
    img = (np.concatenate([r.reshape(-1, 28)
                           for r in np.split(img, 2 * num_classes + 1)
                           ], axis=-1) * 127.5 + 127.5).astype(np.uint8)
    plt.figure(figsize=(30, 30))
    plt.imshow(img)
    plt.show()
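
The target and sample-weight arrays built inside the training loop can be inspected in isolation. The sketch below is a minimal illustration, assuming a hypothetical batch of 4 images and random labels (neither comes from the notebook); it rebuilds `y`, `aux_y` and `disc_sample_weight` the same way and checks that both discriminator outputs receive the same total weight.

```python
import numpy as np

batch = 4          # hypothetical batch size, for illustration only
num_classes = 10

rng = np.random.default_rng(0)
label_batch = rng.integers(0, num_classes, batch)     # labels of the real images
sampled_labels = rng.integers(0, num_classes, batch)  # labels fed to the generator

# Real/fake targets: real images get the softened value 0.95, fakes get 0.
soft_zero, soft_one = 0, 0.95
y = np.array([soft_one] * batch + [soft_zero] * batch)

# Class targets for the auxiliary head: real labels first, then sampled ones.
aux_y = np.concatenate((label_batch, sampled_labels), axis=0)

# One weight vector per discriminator output. Every sample counts for the
# real/fake head; only real samples (weight 2) count for the class head.
disc_sample_weight = [
    np.ones(2 * batch),
    np.concatenate((np.ones(batch) * 2, np.zeros(batch))),
]

print(y)
print(disc_sample_weight[1])
# Both heads see the same total weight, which is why real images get weight 2.
print(disc_sample_weight[0].sum() == disc_sample_weight[1].sum())
```

Only the "real" half of the targets is softened, which is what makes the smoothing one-sided.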

# Generating text from an image

This example generates textual descriptions (captions) of images. It uses the code available on [GitHub](https://github.com/knife982000/imageCaptioning). The model is a variation of [Show, Attend and Tell: Neural Image Caption Generation with Visual Attention](https://arxiv.org/abs/1502.03044).

The example combines:
1. Attention.
2. A recurrent neural network (a GRU).
3. A pretrained Inception v3 model.
4. A beam search to find the most probable descriptions.

In [None]:
from google.colab import drive
drive.mount('/content/drive')

model_file = '/content/drive/MyDrive/DUIA-Redes-Neuronales/image_captioning/model.h5'
tokenizer_file = '/content/drive/MyDrive/DUIA-Redes-Neuronales/image_captioning/tokenizer.pickle'

In [None]:
from tensorflow.keras.layers import Layer, Embedding, Input, Dense
from tensorflow.keras import backend as K
from tensorflow.keras.models import Model
import tensorflow as tf
from tensorflow.python.ops import array_ops
from tensorflow.python.framework import dtypes
import numpy as np
import matplotlib.pyplot as plt


class GRUAttention(Layer):

    def __init__(self, units, attention_units=None,
                 return_sequences=False, return_state=False,
                 mask_zeros=False, **kargs):
        super(GRUAttention, self).__init__(**kargs)
        self.units = units
        if attention_units is None:
            attention_units = units
        self.attention_units = attention_units
        self.return_sequences = return_sequences
        self.return_state = return_state
        self.mask_zeros = mask_zeros
        pass

    # noinspection PyAttributeOutsideInit
    def build(self, input_shape):
        img_features = input_shape[0][-1]
        text_features = input_shape[1][-1]
        dtype = dtypes.as_dtype(self.dtype or K.floatx())
        self.kernel = self.add_weight('kernel', shape=(img_features + text_features, 3 * self.units), dtype=dtype)
        self.input_bias = self.add_weight('bias', shape=(3 * self.units,), dtype=dtype)
        self.recurrent_kernel = self.add_weight('recurrent_kernel', shape=(self.units, 3 * self.units), dtype=dtype)
        self.recurrent_bias = self.add_weight('recurrent_bias', shape=(3 * self.units,), dtype=dtype)

        self.att_img_kernel = self.add_weight('att_img_kernel', shape=(img_features, self.attention_units),
                                              dtype=dtype)
        self.att_img_bias = self.add_weight('att_img_bias', shape=(self.attention_units,), dtype=dtype)

        self.att_hidden_kernel = self.add_weight('att_hidden_kernel', shape=(self.units, self.attention_units),
                                                 dtype=dtype)
        self.att_hidden_bias = self.add_weight('att_hidden_bias', shape=(self.attention_units,), dtype=dtype)

        self.att_v_kernel = self.add_weight('att_v_kernel', shape=(self.attention_units, 1),
                                               dtype=dtype)
        self.att_v_bias = self.add_weight('att_v_bias', shape=(1,), dtype=dtype)
        pass

    def get_config(self):
        config = {
            'units': self.units,
            'attention_units': self.attention_units,
            'return_sequences': self.return_sequences,
            'return_state': self.return_state,
            'mask_zeros': self.mask_zeros
        }
        base_config = super(GRUAttention, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def dense(self, input, kernel, bias):
        return K.dot(input, kernel) + bias

    def attention(self, image, hidden):
        hidden_with_time_axis = tf.expand_dims(hidden, 1)

        # attention_hidden_layer shape == (batch_size, 64, attention_units)
        attention_hidden_layer = K.tanh(self.dense(image, self.att_img_kernel, self.att_img_bias) +
                                        self.dense(hidden_with_time_axis, self.att_hidden_kernel, self.att_hidden_bias))

        # score shape == (batch_size, 64, 1)
        # This gives you an unnormalized score for each image feature.
        score = self.dense(attention_hidden_layer, self.att_v_kernel, self.att_v_bias)

        # attention_weights shape == (batch_size, 64, 1)
        attention_weights = K.softmax(score, axis=1)

        # context_vector shape after sum == (batch_size, img_features)
        context_vector = attention_weights * image
        context_vector = tf.reduce_sum(context_vector, axis=1)

        return context_vector

    def call(self, input, initial_state=None):
        img, text = input

        def step(cell_inputs, cell_states):
            """Step function that will be used by Keras RNN backend."""
            h_tm1 = cell_states[0]
            features = self.attention(img, h_tm1)
            cell_inputs = K.concatenate([cell_inputs, features], axis=-1)

            # inputs projected by all gate matrices at once
            matrix_x = K.dot(cell_inputs, self.kernel)
            matrix_x = K.bias_add(matrix_x, self.input_bias)

            x_z, x_r, x_h = array_ops.split(matrix_x, 3, axis=1)

            # hidden state projected by all gate matrices at once
            matrix_inner = K.dot(h_tm1, self.recurrent_kernel)
            matrix_inner = K.bias_add(matrix_inner, self.recurrent_bias)

            recurrent_z, recurrent_r, recurrent_h = array_ops.split(matrix_inner, 3,
                                                                    axis=1)
            z = K.sigmoid(x_z + recurrent_z)
            r = K.sigmoid(x_r + recurrent_r)
            hh = K.tanh(x_h + r * recurrent_h)

            # previous and candidate state mixed by update gate
            h = z * h_tm1 + (1 - z) * hh
            return h, [h]

        if initial_state is None:
            initial_state = (array_ops.zeros((array_ops.shape(text)[0], self.units)),)
        last, sequence, hidden = K.rnn(step, text, initial_state, zero_output_for_mask=self.mask_zeros)
        if self.return_state and self.return_sequences:
            return sequence, hidden
        if self.return_state:
            return last, hidden
        if self.return_sequences:
            return sequence
        return last
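
The `attention` method above follows the additive attention pattern: the image regions and the hidden state are projected into a shared space, scored, normalized with a softmax over the regions, and used to average the region features. The NumPy sketch below reproduces those steps under stated assumptions: the function names, the random weights and the small sizes are hypothetical and chosen only to make the shapes visible; it is not the layer's actual implementation.

```python
import numpy as np


def softmax(x, axis):
    # subtract the maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def additive_attention(image, hidden, w_img, b_img, w_hid, b_hid, v, b_v):
    # image: (batch, regions, img_features), hidden: (batch, units)
    hidden_with_time_axis = hidden[:, np.newaxis, :]        # (batch, 1, units)
    mixed = np.tanh(image @ w_img + b_img
                    + hidden_with_time_axis @ w_hid + b_hid)  # (batch, regions, att_units)
    score = mixed @ v + b_v                                 # (batch, regions, 1)
    weights = softmax(score, axis=1)                        # normalized over regions
    context = (weights * image).sum(axis=1)                 # (batch, img_features)
    return context, weights


# Hypothetical sizes, far smaller than the ones used by the model.
batch, regions, img_features, units, att_units = 2, 64, 8, 5, 4
rng = np.random.default_rng(0)
image = rng.normal(size=(batch, regions, img_features))
hidden = rng.normal(size=(batch, units))

context, weights = additive_attention(
    image, hidden,
    rng.normal(size=(img_features, att_units)), np.zeros(att_units),
    rng.normal(size=(units, att_units)), np.zeros(att_units),
    rng.normal(size=(att_units, 1)), np.zeros(1))

print(context.shape, weights.shape)
```

The context vector has the size of the image features, not of the hidden state, which is why the GRU kernel in `build` has `img_features + text_features` input rows.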

In [None]:
import math

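def generate(i, pict, text, model, width, end_id):
    """Return the `width` most probable tokens for position i of one caption.

    Returns three parallel lists: token ids, their probabilities, and whether
    each token is the end-of-caption marker.
    """
    c = []
    p = []
    e = []
    # distribution over the vocabulary for position i, predicted at step i - 1
    preds = model.predict([pict, text])[0, i-1, :]
    for _ in range(width):
        m = np.argmax(preds)
        c.append(m)
        p.append(preds[m])
        e.append(m == end_id)
        # zero out the chosen token so the next argmax returns the runner-up
        preds[m] = 0
    return c, p, e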

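def beam(pict, model, tokenizer, width, maxlen):
    """Beam search over captions for one image.

    Keeps the `width` best partial captions at each step, ranked by average
    log-probability, and returns the final token sequences together with
    their per-token probabilities.
    """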
    end_id = tokenizer.word_index['<end>']
    start = np.zeros((1, maxlen), dtype=np.int32)
    start[0, 0] = tokenizer.word_index['<start>']
    candidates = [start]
    ended = [False]
    probs = [[1]]
    pict = np.reshape(pict, (1, -1, pict.shape[-1]))
    for i in range(1, maxlen):
        n_candidates = []
        n_ended = []
        n_probs = []
        all_ended = True
        for c, e, p in zip(candidates, ended, probs):
            if e:
                n_candidates.append(c)
                n_ended.append(e)
                n_probs.append(p)
            else:
                all_ended = False
                nc, n_p, ne = generate(i, pict, c, model, width, end_id)
                for vnc, vnp, vne in zip(nc, n_p, ne):
                    n_c = c.copy()
                    n_c[0, i] = vnc
                    n_candidates.append(n_c)
                    n_ended.append(vne)
                    n_probs.append(list(p) + [vnp])
                pass
        if all_ended:
            break
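        # rank candidates by average log-probability (length-normalized);
        # the floor avoids math.log(0) if a probability underflows to zero
        log_prob = {e: np.average([math.log(max(x, 1e-12)) for x in p]) for e, p in enumerate(n_probs)}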
        index = list(range(len(n_probs)))
        index.sort(key=lambda x: log_prob[x], reverse=True)
        index = index[:width]
        candidates = [n_candidates[i] for i in index]
        probs = [n_probs[i] for i in index]
        ended = [n_ended[i] for i in index]
    return [c[0, :] for c in candidates], probs
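
To see why a beam wider than 1 can help, the same search can be run on a tiny hand-written language model. The sketch below is an illustration only: the `NEXT` table, `avg_log_prob` and `toy_beam` are hypothetical names, and the probabilities are made up. It uses the same ranking rule as `beam` (average log-probability) but replaces the neural network with a lookup table.

```python
import math

# Hypothetical next-token probabilities; they depend only on the last token.
NEXT = {
    '<start>': {'a': 0.6, 'the': 0.4},
    'a': {'cat': 0.5, 'dog': 0.4, '<end>': 0.1},
    'the': {'cat': 0.9, '<end>': 0.1},
    'cat': {'<end>': 1.0},
    'dog': {'<end>': 1.0},
}


def avg_log_prob(probs):
    # length-normalized score, the same ranking rule used by beam()
    return sum(math.log(p) for p in probs) / len(probs)


def toy_beam(width, maxlen):
    # each entry is (tokens so far, probability of each token)
    beams = [(['<start>'], [1.0])]
    for _ in range(1, maxlen):
        if all(tokens[-1] == '<end>' for tokens, _ in beams):
            break
        expanded = []
        for tokens, probs in beams:
            if tokens[-1] == '<end>':
                expanded.append((tokens, probs))  # finished: carry over as-is
                continue
            for token, p in NEXT[tokens[-1]].items():
                expanded.append((tokens + [token], probs + [p]))
        # keep only the `width` best candidates
        expanded.sort(key=lambda b: avg_log_prob(b[1]), reverse=True)
        beams = expanded[:width]
    return beams


greedy = toy_beam(width=1, maxlen=5)[0][0]
best = toy_beam(width=2, maxlen=5)[0][0]
print(' '.join(greedy))  # <start> a cat <end>
print(' '.join(best))    # <start> the cat <end>
```

With `width=1` the search commits to the locally best first word ("a") and ends with a sequence of total probability 0.3; with `width=2` it keeps "the" alive long enough to find the sequence with probability 0.36.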

In [None]:
image_pre = tf.keras.applications.InceptionV3(include_top=False, weights='imagenet')

In [None]:
import pickle
with open(tokenizer_file, 'rb') as f:
    tokenizer = pickle.load(f)

In [None]:
img = Input((64, 2048))
d_img = Dense(300)(img)

txt = Input((None, ))
emb = Embedding(len(tokenizer.word_index), 300)(txt)

d = GRUAttention(300, mask_zeros=True, return_sequences=True)([d_img, emb])
d = Dense(len(tokenizer.word_index), activation='softmax')(d)

model = Model([img, txt], d)

model.summary()
model.compile(loss='sparse_categorical_crossentropy', optimizer='nadam', metrics=['sparse_categorical_accuracy'])

model.load_weights(model_file)

In [None]:
!wget https://raw.githubusercontent.com/knife982000/imageCaptioning/master/images/cafe.jpg
!wget https://raw.githubusercontent.com/knife982000/imageCaptioning/master/images/landscape.jpg

In [None]:
def load_image(image_path):
    img = tf.io.read_file(image_path)
    img = tf.image.decode_jpeg(img, channels=3)
    imgp = tf.image.resize(img, (299, 299))
    imgp = tf.keras.applications.inception_v3.preprocess_input(imgp)
    return imgp, img, image_path
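
`load_image` returns both the preprocessed tensor (for the network) and the original image (for display). The Inception v3 `preprocess_input` is understood to rescale pixel values from [0, 255] to [-1, 1]; the NumPy function below, a hypothetical stand-in named `scale_to_unit_range` that is not part of the notebook, mimics that scaling so the expected input range is easy to check.

```python
import numpy as np


def scale_to_unit_range(pixels):
    """Map pixel values from [0, 255] to [-1, 1] (assumed Inception-style scaling)."""
    return np.asarray(pixels, dtype='float32') / 127.5 - 1.0


sample = np.array([0, 127.5, 255])
scaled = scale_to_unit_range(sample)
print(scaled)
```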

In [None]:
def process_file(file_name):
    img, original, _ = load_image(file_name)

    plt.imshow(original)
    plt.show()

    img = image_pre.predict(img.numpy()[np.newaxis, ...])
    print(img.shape)

    cands, probs = beam(img, model, tokenizer, 3, 20)
    for c, p in zip(cands, probs):
        print('\t' + ' '.join([tokenizer.index_word[w] for w in c]))
        print('\t{}'.format(p))
    pass
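
The `64` in `Input((64, 2048))` and in the attention shape comments comes from flattening the Inception v3 feature map: for a 299x299 input, the network without its top layers produces an 8x8 grid of 2048-dimensional vectors, and `beam` merges the grid into 64 regions. The sketch below illustrates this with a synthetic array standing in for the output of `image_pre.predict` (an assumption made to avoid loading the model).

```python
import numpy as np

# Stand-in for image_pre.predict(...): one image, 8x8 grid, 2048 channels.
feature_map = np.arange(1 * 8 * 8 * 2048, dtype='float32').reshape(1, 8, 8, 2048)

# Same reshape as in beam(): keep the batch and channel axes, merge the grid.
regions = np.reshape(feature_map, (1, -1, feature_map.shape[-1]))
print(regions.shape)  # (1, 64, 2048)

# The reshape is row-major, so region k is the grid cell (k // 8, k % 8).
row, col = 3, 5
print(np.array_equal(regions[0, row * 8 + col], feature_map[0, row, col]))
```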

In [None]:
process_file('cafe.jpg')

In [None]:
process_file('landscape.jpg')