<div><img style="float: right; width: 120px; vertical-align:middle" src="https://www.upm.es/sfs/Rectorado/Gabinete%20del%20Rector/Logos/EU_Informatica/ETSI%20SIST_INFORM_COLOR.png" alt="ETSISI logo" />

# Transfer Learning<a id="top"></a>

<i><small>Authors: Alberto Díaz Álvarez and Félix José Fuentes Hurtado<br>Last update: 2023-03-28</small></i></div> 

***

## Introduction

In this notebook we are going to work with the concept of "Transfer Learning". The idea behind it is to take advantage of the knowledge learned by one model when training another model.

More specifically, we take a neural network that already performs well after being trained on a larger dataset and use it as the basis for a new model that solves a new task. The intuition is that the first layers learn fairly generic features (in our example, image features such as edges and blobs), so a problem that deals with the same kind of data will rely on the same or very similar features.

## Goals

We will learn how to save and load models, and partially use them to create models based on others already trained using transfer learning techniques.

## Libraries and configuration

Next we will import the libraries that will be used throughout the notebook.

In [None]:
import emnist
import matplotlib.pyplot as plt
import pandas as pd
import tensorflow as tf

We also set some parameters to adapt the graphic presentation.

In [None]:
%matplotlib inline
plt.style.use('ggplot')
plt.rcParams.update({'figure.figsize': (20, 6),'figure.dpi': 64})

***

## Feature extraction vs fine tuning

There are two extremes when using transfer learning. In one, we start from a pre-trained network but allow some of its weights to be modified (usually those of the last layer or layers). This is called _fine-tuning_ because we are slightly adjusting the weights of the pre-trained network to the new task. We usually train such a network with a lower learning rate than normal, since we expect the features to be relatively good already and not to need much change.

At the other extreme, we take the pre-trained network and freeze its weights completely, using one of its hidden layers (usually the last one) as a feature extractor whose output is the input to a smaller neural network.
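As a rough illustration of both extremes, here is a minimal sketch. It assumes a tiny, randomly initialised network standing in for a real pre-trained one; `build_model`, `count_trainable`, the layer sizes and the learning rate are hypothetical choices made only for this example.

```python
import numpy as np
import tensorflow as tf


def build_model(unfrozen_base_layers):
    """Tiny stand-in network: two 'pre-trained' base layers plus a new head."""
    inputs = tf.keras.Input(shape=(16,))
    base_layers = [
        tf.keras.layers.Dense(8, activation='relu'),
        tf.keras.layers.Dense(4, activation='relu'),
    ]
    # Freeze every base layer except the last `unfrozen_base_layers` ones
    n_frozen = len(base_layers) - unfrozen_base_layers
    x = inputs
    for i, layer in enumerate(base_layers):
        layer.trainable = i >= n_frozen
        x = layer(x)
    outputs = tf.keras.layers.Dense(3, activation='softmax')(x)
    return tf.keras.Model(inputs, outputs)


def count_trainable(model):
    """Number of scalar parameters that training would update."""
    return sum(int(np.prod(tuple(w.shape))) for w in model.trainable_weights)


# Feature extraction: the whole base is frozen, only the new head is trained
feature_extractor = build_model(unfrozen_base_layers=0)
feature_extractor.compile(loss='sparse_categorical_crossentropy', optimizer='adam')

# Fine-tuning: the last base layer is also trained, with a small learning rate
fine_tuned = build_model(unfrozen_base_layers=1)
fine_tuned.compile(loss='sparse_categorical_crossentropy',
                   optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4))

print(count_trainable(feature_extractor))  # 15 = 4 * 3 + 3 (head only)
print(count_trainable(fine_tuned))         # 51 = 15 + (8 * 4 + 4)
```

In a real setting the base layers would come from a trained model, and the number of layers to unfreeze is a design choice that depends on how similar the new task is to the original one.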

## Saving and loading our model

In this example we will first reuse the neural network from the first exercise, which solved the MNIST problem. We are going to save it, load it and train it again, this time starting from the weights already loaded. This is only a sample; for more information please visit [Save and load Keras models](https://www.tensorflow.org/guide/keras/save_and_serialize?hl=sl), where different save and load alternatives are shown.

In [None]:
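(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
# Scale pixels to [0, 1] so the inputs match the EMNIST preprocessing used later
x_train, x_test = x_train / 255, x_test / 255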

print(f'Training shape: {x_train.shape} input, {y_train.shape} output')
print(f'Test shape:     {x_test.shape} input, {y_test.shape} output')

This time we are not going to _one-hot_ encode the output. There is a _loss_ function analogous to `categorical_crossentropy`, called `sparse_categorical_crossentropy`, that computes the same value. The only difference is the format in which the expected output is represented.

If the labels are in _one-hot_ format you need `categorical_crossentropy`, and if they are integer class indices you need `sparse_categorical_crossentropy`. That is all there is to it. Which one to use depends entirely on how you load the dataset. One advantage of `sparse_categorical_crossentropy` is that it saves memory by storing a single integer per example rather than a full vector.
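To make the equivalence concrete, the following sketch computes both losses by hand. It uses NumPy rather than Keras so that the arithmetic stays visible; the probabilities and labels are made-up values for illustration.

```python
import numpy as np

# Hypothetical softmax outputs for 3 examples and 4 classes (each row sums to 1)
probs = np.array([
    [0.7, 0.1, 0.1, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.1, 0.1, 0.2, 0.6],
])
labels = np.array([0, 1, 3])   # integer ("sparse") labels
one_hot = np.eye(4)[labels]    # the same labels, one-hot encoded

# Categorical cross-entropy: -sum(y * log(p)) over classes, averaged over examples
categorical = -np.mean(np.sum(one_hot * np.log(probs), axis=1))

# Sparse version: pick the log-probability of the true class directly by index
sparse = -np.mean(np.log(probs[np.arange(len(labels)), labels]))

print(categorical, sparse)  # both values are the same
```

The one-hot vector only selects the log-probability of the true class, which is exactly what indexing with the integer label does.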

In [None]:
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(8, activation='relu'),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')
])

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics = ['sparse_categorical_accuracy'])
model.summary()
history = model.fit(x_train, y_train, epochs=100, batch_size=len(x_train), validation_split=0.1, verbose=0)

Let's see the evolution of the training graphically.

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

Now let's save the model. We will evaluate against the test set before saving and after loading to make sure it is the same.

In [None]:
# Extract the metrics
loss, accuracy = model.evaluate(x_test, y_test, verbose=0)
print(f'Original model: {loss} loss, {accuracy} accuracy')
# Save model
model.save('tmp/supermodel.h5')
# Load model
model2 = tf.keras.models.load_model('tmp/supermodel.h5')
# Extract the metrics from the loaded model
loss, accuracy = model2.evaluate(x_test, y_test, verbose=0)
print(f'Loaded model:   {loss} loss, {accuracy} accuracy')

The good thing is that we can now continue to train the model where we left off.

In [None]:
history = model2.fit(x_train, y_train, epochs=100, batch_size=len(x_train), validation_split=0.1, verbose=0)

We can examine the trend of this training phase to see that it does indeed start where it left off in the previous one.

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

Now we are going to use our model to try to recognize not only digits but also letters (a bit pretentious, yes, but the point is to learn how to use our models with _transfer learning_). For this we will rely on the [EMNIST (Extended MNIST)](https://www.nist.gov/itl/products-and-services/emnist-dataset) dataset and a _wrapper_ called `emnist` (`pip install emnist`) to avoid having to download the dataset by hand.

Once we have it installed, we will use the _dataset_ with the balanced classes: 

In [None]:
x_train_emnist, y_train_emnist = emnist.extract_training_samples('balanced')
x_test_emnist, y_test_emnist = emnist.extract_test_samples('balanced')

x_train_emnist = x_train_emnist / 255
x_test_emnist = x_test_emnist / 255

print(f'Training shape: {x_train_emnist.shape} input, {y_train_emnist.shape} output')
print(f'Test shape:     {x_test_emnist.shape}  input, {y_test_emnist.shape} output')

The examples have the following form:

In [None]:
plt.imshow(x_train_emnist[0], cmap='hot');

And their corresponding labels:

In [None]:
print(y_train_emnist[0])

What we will do is load our previously saved model and reuse its first layers without modification. We will only replace the last layer so that it can classify our new examples, which have quite a few more classes (47 instead of 10).

This will be the only layer we will train. This is done with the assumption that the first layers of a model extract or learn the relevant features that make the examples unique, and that the last ones learn to infer from these features.

In [None]:
model = tf.keras.models.load_model('tmp/supermodel.h5')

model_emnist = tf.keras.Sequential()
for layer in model.layers[:-1]:  # The last layer is left out!
    model_emnist.add(layer)
    model_emnist.layers[-1].trainable = False  # Otherwise the parameters of these layers would be trained
model_emnist.add(tf.keras.layers.Dense(47, activation='softmax'))

model_emnist.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics = ['sparse_categorical_accuracy'])
model_emnist.summary()

Looking at the summary of the network architecture, we can see that of all the parameters that exist, only 235 will be trained: the weights and biases of the new output layer that connect it to the penultimate layer ($4 \times 47 + 47$).
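The figure can be checked by hand. In the sketch below, `dense_params` is a hypothetical helper (not a Keras function) that counts the weights and biases of a fully connected layer; the layer sizes are those of the model defined earlier in this notebook.

```python
def dense_params(n_inputs, n_units):
    """Weights (n_inputs * n_units) plus one bias per unit."""
    return n_inputs * n_units + n_units


# Frozen layers reused from the MNIST model: Flatten (no parameters), Dense(8), Dense(4)
frozen = dense_params(28 * 28, 8) + dense_params(8, 4)
# New output layer: Dense(47) on top of the 4 units of the previous layer
trainable = dense_params(4, 47)

print(frozen)     # 6316
print(trainable)  # 235
```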

In [None]:
history = model_emnist.fit(
    x_train_emnist, y_train_emnist, epochs=100, batch_size=len(x_train_emnist), validation_split=0.1, verbose=0)

Now let's see how the training evolution has progressed.

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

Well, maybe it was not the best example, since the starting model is small and not very generalist. But at least we know how to manipulate a model to create a new one from a previously trained one. Now we will use another larger model to see how it behaves with the _dataset_ EMNIST.

### Using a pre-trained model

We can make use of saved models as part of new models. It is usually not trivial, but it is not too complicated either, since the models themselves usually come with documentation or at least a description of the architecture that explains how to use them.

We have multiple models to work with. In Keras alone there are dozens of pre-trained models ready to download from the `applications` API. Let's see an example with the _ResNet50_ model.

For this, we will need to transform our EMNIST images (actually two-dimensional arrays) into 3-channel color images. Also, the minimum image size that the Keras ResNet50 accepts is $32 \times 32$, so we will have to resize them as well.

In [None]:
x_train = x_train_emnist.reshape((-1, 28, 28, 1))
x_test = x_test_emnist.reshape((-1, 28, 28, 1))
x_train, x_test = tf.image.resize(x_train, (32, 32)), tf.image.resize(x_test, (32, 32))
x_train, x_test = tf.image.grayscale_to_rgb(x_train), tf.image.grayscale_to_rgb(x_test)
y_train, y_test = y_train_emnist, y_test_emnist

print(f'Training shape: {x_train.shape} input, {y_train.shape} output')
print(f'Test shape:     {x_test.shape}  input, {y_test.shape} output')
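To show what these transformations do to the array shapes, here is a NumPy-only sketch under some assumptions: the batch is random data standing in for EMNIST images, and the images are enlarged by zero-padding instead of the interpolation that `tf.image.resize` performs, so the pixel values will differ from those produced by the cell above.

```python
import numpy as np

# Hypothetical batch of 5 grayscale images with values in [0, 1]
batch = np.random.default_rng(0).random((5, 28, 28))

# Grow 28x28 to 32x32 by adding a 2-pixel black border on every side
padded = np.pad(batch, ((0, 0), (2, 2), (2, 2)))

# Copy the single gray channel three times to obtain an RGB-shaped array
rgb = np.repeat(padded[..., np.newaxis], 3, axis=-1)

print(rgb.shape)  # (5, 32, 32, 3)
```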

Now, as we have done before, we will load the ResNet50 model without its final classification layer (by setting the `include_top` argument to `False`). We will also specify that we **do not** want to train the preloaded model. Microsoft engineers have already spent plenty of time and machines so that these models are quite well trained.

In [None]:
pretrained_model = tf.keras.applications.ResNet50(input_shape=[32, 32, 3], include_top=False)
pretrained_model.trainable = False

model = tf.keras.Sequential([
    pretrained_model,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(47, activation='softmax')
])
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam", metrics = ['sparse_categorical_accuracy'])
model.summary()

Now all that would be left for us to do is to train. Fortunately, out of more than 23 million parameters, we will only adjust a little more than 96,000, so great.
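The size of the trainable head can be estimated before looking at the summary. The sketch below assumes that ResNet50 reduces each spatial dimension by a total factor of 32 and ends with 2048 channels; `resnet50_feature_map` is a hypothetical helper written for this example, not part of Keras.

```python
import math


def resnet50_feature_map(side, total_stride=32, channels=2048):
    """Approximate output shape of ResNet50 without its top for a square input."""
    out = math.ceil(side / total_stride)
    return out, out, channels


height, width, channels = resnet50_feature_map(32)
flattened = height * width * channels  # what Flatten() hands to the Dense layer
head_params = flattened * 47 + 47      # weights plus biases of Dense(47)

print((height, width, channels))  # (1, 1, 2048)
print(head_params)                # 96303
```

With larger input images the flattened vector grows quickly, and so does the number of parameters in the head.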

In [None]:
history = model.fit(x_train, y_train, epochs=10, batch_size=1024, validation_split=0.1, verbose=0)

Slow, huh? Although we have avoided training millions of parameters, every batch still has to go through the forward pass of the whole frozen network, and that is costly.

Let's see how the training has evolved.

In [None]:
pd.DataFrame(history.history).plot()
plt.xlabel('Epoch num.')
plt.show()

## Where to obtain pre-trained models?

As of today (March 28, 2023) there are more than $38$ pretrained models available in Keras through the `applications` API. The first time a model is used, its weights are automatically downloaded into the `~/.keras/models/` directory. Unfortunately, all the models in this API to date are image models.

However, there are more sources of pre-trained models available.
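To see which weights have already been downloaded, the cache directory can simply be listed. This is a small sketch using only the standard library; `list_cached_weights` is a hypothetical helper, and it returns an empty list if nothing has been downloaded yet.

```python
from pathlib import Path


def list_cached_weights(cache_dir=Path.home() / '.keras' / 'models'):
    """Return the names of the files stored in the Keras model cache."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir() if p.is_file())


for name in list_cached_weights():
    print(name)
```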

### [TensorFlow Hub](https://tfhub.dev/)

Unsurprisingly, there is a "Hub" for TensorFlow models. A model can be instantiated, for example, as follows:

In [None]:
import tensorflow_hub as hub


model = tf.keras.Sequential([
    hub.KerasLayer(
        'https://tfhub.dev/google/tf2-preview/inception_v3/feature_vector/4',
        input_shape=(224, 224, 3),
        trainable=False),
    tf.keras.layers.Dense(10)
])

model.summary()

The TensorFlow Hub API is available via pip: `pip install tensorflow-hub`.

### Embeddings

We will see them later in Natural Language Processing (NLP), but for the record, there are some independent projects famous enough to have their own download site. This is the case of _embeddings_, for example:

* [GloVe: Global Vectors for Word Representation](https://nlp.stanford.edu/projects/glove/)
* [Word2vec](https://code.google.com/archive/p/word2vec/)
* [fastText](https://fasttext.cc/docs/en/english-vectors.html)
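As a preview, pre-trained GloVe vectors are distributed as plain text with one word per line followed by its coordinates. The sketch below parses that layout from an in-memory sample; the words and numbers are invented for illustration, and `load_vectors` is a hypothetical helper rather than part of any of these projects.

```python
import io

import numpy as np

# Hypothetical 3-dimensional vectors in the "word v1 v2 ..." text layout
sample = io.StringIO(
    "cat 0.1 0.2 0.3\n"
    "dog 0.2 0.1 0.4\n"
)


def load_vectors(handle):
    """Map each word to its vector, reading one 'word v1 v2 ...' line at a time."""
    vectors = {}
    for line in handle:
        word, *values = line.rstrip().split(' ')
        vectors[word] = np.array([float(v) for v in values], dtype='float32')
    return vectors


vectors = load_vectors(sample)
print(sorted(vectors))       # ['cat', 'dog']
print(vectors['cat'].shape)  # (3,)
```

With a real file, the same function could be called on an open text file handle instead of the in-memory sample.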

### [Hugging Face](https://huggingface.co/)

Hugging Face is a company based in the United States that develops tools for building applications based on machine learning. They are the developers of the `transformers` library, which is widely used for NLP applications.

Their platform allows users to share machine learning models and datasets. They currently host thousands of pre-trained models for all kinds of tasks:

- **Computer vision**, such as image or video classification, segmentation or object detection.
- **NLP**, such as sentiment analysis, text generation or translation.
- **Audio**, such as speech recognition, audio classification or audio generation.

***

<div><img style="float: right; width: 120px; vertical-align:top" src="https://mirrors.creativecommons.org/presskit/buttons/88x31/png/by-nc-sa.png" alt="Creative Commons by-nc-sa logo" />

[Back to top](#top)

</div>