<!--NAVIGATION-->
<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>< [Scikit-learn](06.01-Scikit.ipynb) | [Contents](00.00-Index.ipynb) | [Quiz](06.03-Quiz.ipynb) ></span>

<a href="https://colab.research.google.com/github/eurostat/e-learning/blob/main/python-official-statistics/06.02-Tensorflow.ipynb"><img align="left" src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open in Colab" title="Open and Execute in Google Colaboratory"></a>

<a id='top'></a>

# Tensorflow + Keras and PyTorch
## Content  
- [What is Tensorflow?](#tensorflow)
- [And Keras?](#keras)
- [TensorFlow - Not Just DL](#classic)
- [Simple linear regression](#regression)
- [MLP Classifier: Handwritten Digits](#tf-mlp)
- [Convolutional Neural Network](#tf-cnn)
- [PyTorch](#pt)
- [Transfer Learning](#transfer)

``Sklearn`` offers only limited support for deep neural networks, so when you want or need to unleash the power of deep learning you need something more.  
Luckily, there are two powerful packages to work with: ``Tensorflow`` and ``PyTorch``.

<a id='tensorflow'></a>

## What is Tensorflow?
TensorFlow is an open-source library developed by ``Google`` primarily for ``deep learning applications``. It also supports traditional machine learning. TensorFlow was originally developed for large numerical computations without keeping deep learning in mind.  
TensorFlow is more of a low-level library. Basically, we can think of TensorFlow as the Lego bricks (similar to NumPy and SciPy) that we can use to implement machine learning algorithms, whereas Scikit-Learn comes with off-the-shelf algorithms and an easy-to-use interface.  
Starting with version 2.0, TensorFlow became more convenient to use: eager execution is enabled by default and Keras is the recommended high-level API.



<a id='keras'></a>

## What is Keras?
Keras is built on top of TensorFlow, which makes it a wrapper for deep learning purposes. It is incredibly user-friendly and easy to pick up. A solid asset is its neural network block modularity and the fact that it is written in Python, which makes it easy to debug.  

`Note`: In this lecture, "Tensorflow" refers to Keras + Tensorflow used together.

<a id='classic'></a>

## TensorFlow - Not Just for Deep Learning
Yes, TensorFlow is not just for deep learning. It provides a great variety of building blocks for general numerical computation and machine learning. Next we will introduce the wide range of general machine learning algorithms and their building blocks provided by TensorFlow.  

### High-level Estimators
TensorFlow provides a number of machine learning algorithms inside its estimators module. Besides easy-to-use deep learning APIs such as Deep Neural Networks, Recurrent Neural Networks, etc., there is also a collection of popular machine learning algorithms. Currently, the following algorithms are included:
- K-means clustering
- Random Forests
- Support Vector Machines
- Gaussian Mixture Model clustering
- Linear/logistic regression

### Statistical Distributions
A wide variety of statistical distribution functions is also provided by TensorFlow, including but not limited to Bernoulli, Beta, Chi2, Dirichlet, Gamma and Uniform. They are important building blocks for machine learning algorithms, especially for probabilistic approaches like Bayesian models.

### Loss Functions and Metrics

Machine learning algorithms rely on optimization based on the loss function provided. TensorFlow provides a wide range of loss functions to choose from, such as sigmoid and softmax cross entropy, log-loss, hinge loss, sum of squares, sum of pairwise squares, etc.  

A variety of metrics is available, such as precision, recall, accuracy, AUC and MSE, as well as their streaming versions.
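
To make these metric names concrete, here is a minimal sketch in plain Python (no TensorFlow involved) that computes accuracy, precision, recall and MSE by hand. The labels and values are made up for illustration only.

```python
# Made-up binary labels and predictions, only for illustration
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

pairs = list(zip(y_true, y_pred))
tp = sum(t == 1 and p == 1 for t, p in pairs)  # true positives
tn = sum(t == 0 and p == 0 for t, p in pairs)  # true negatives
fp = sum(t == 0 and p == 1 for t, p in pairs)  # false positives
fn = sum(t == 1 and p == 0 for t, p in pairs)  # false negatives

accuracy = (tp + tn) / len(pairs)   # share of correct predictions
precision = tp / (tp + fp)          # how many predicted positives are right
recall = tp / (tp + fn)             # how many actual positives are found

# Mean squared error for a made-up regression example
targets = [3.0, -0.5, 2.0, 7.0]
outputs = [2.5, 0.0, 2.0, 8.0]
mse = sum((t - o) ** 2 for t, o in zip(targets, outputs)) / len(targets)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} mse={mse:.3f}")
```

The framework versions compute the same quantities, only batched and on tensors.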

<a id='regression'></a>

## TensorFlow - Simple linear regression
We'll try to produce the same model as with Sklearn but now using Tensorflow:


In [None]:
import matplotlib.pyplot as plt
import numpy as np

# slope
a = 2
# intercept
b = -1
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = a * x + b + rng.randn(50)
plt.scatter(x, y)

In [None]:
import tensorflow as tf
tf.get_logger().setLevel('ERROR')

# create model
feat_cols = [tf.feature_column.numeric_column('x',shape=[1])]
model = tf.estimator.LinearRegressor(feature_columns=feat_cols)
# create input function for training
train_input_func = tf.compat.v1.estimator.inputs.numpy_input_fn(
    {'x': x}, y, batch_size=4, num_epochs=None, shuffle=True)
# training (fit in sklearn)
model.train(input_fn = train_input_func, steps = 1000)
# create input function for predictions
predict_input_func = tf.compat.v1.estimator.inputs.numpy_input_fn(
    {'x':np.linspace(-1,11,8)},shuffle=False)
# create predictions list (from model.predict)
predictions = []
for i in model.predict(input_fn = predict_input_func):
    predictions.append(i['predictions'])

plt.scatter(x, y)
plt.plot(np.linspace(-1,11,8), predictions,'ro')

It looks more complicated to build a simple regression with Tensorflow than with Sklearn.  
But what about Deep Learning models?
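
Before moving on, the fitted line can be sanity-checked against the closed-form least-squares solution. The sketch below uses only the Python standard library; it draws its own noisy data with slope 2 and intercept -1, so the numbers will not match the NumPy-generated sample above exactly.

```python
import random

# Synthetic data in the same spirit as the example: y = 2x - 1 + noise
rng = random.Random(42)
a_true, b_true = 2.0, -1.0
xs = [10 * rng.random() for _ in range(50)]
ys = [a_true * x + b_true + rng.gauss(0, 1) for x in xs]

# Closed-form simple linear regression
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
s_xx = sum((x - mean_x) ** 2 for x in xs)
slope = s_xy / s_xx
intercept = mean_y - slope * mean_x

print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

A gradient-based model such as the one trained above should converge toward roughly the same slope and intercept.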

<a id='tf-mlp'></a>

## MLP Classifier: Handwritten Digits
Back to our favorite test database, MNIST, for handwritten digit recognition. It is the same database we used in the example for MLP classification implemented in Sklearn. It is, in a way, the ``Hello World`` example for simple image recognition with DNNs.  

Even though Keras ships this database, we will use the same code and data as prepared for the Sklearn model:

In [None]:
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split

# Load data from https://www.openml.org/d/554
X, y = fetch_openml("mnist_784", version=1, return_X_y=True, as_frame=False)
# Converting data from (0-255) integer values to (0-1) float, a rescaling
X = X / 255.0

In [None]:
# Split data into train partition and test partition
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0, test_size=0.5)
X_train.shape, y_train.shape

As an extra step, one-hot encode the labels:

In [None]:
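from tensorflow import keras
from tensorflow.keras import layers
# keras.utils.np_utils is gone from recent Keras releases;
# keras.utils exposes the same to_categorical function
from tensorflow.keras import utils as np_utils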

print('y_test before:\n', y_test[0:3])
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
print('y_test after:\n', y_test[0:3])
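
What one-hot encoding does can be sketched in a few lines of plain Python. The helper `one_hot` below is hypothetical and written only for illustration; it is not the Keras implementation.

```python
def one_hot(labels, num_classes):
    """Turn each class label into a vector with a single 1.0 entry."""
    rows = []
    for label in labels:
        row = [0.0] * num_classes
        row[int(label)] = 1.0   # labels may arrive as strings such as '5'
        rows.append(row)
    return rows

encoded = one_hot(['5', '0', '4'], num_classes=10)
for label, row in zip(['5', '0', '4'], encoded):
    print(label, '->', row)
```

Each row has exactly one non-zero entry, at the position of the digit it encodes, which is the format `categorical_crossentropy` expects.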

### Create the model
- One single hidden layer with 40 neurons.

In [None]:
mlp = keras.Sequential([
    layers.Dense(40, input_dim=784, kernel_initializer='normal', activation='relu'),
    layers.Dense(10, kernel_initializer='normal', activation='softmax')])
mlp.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(mlp.summary())

### Train the Model
Up to 30 epochs, with early stopping on the validation loss.

In [None]:
from keras import callbacks

early_stopping = callbacks.EarlyStopping(
    min_delta=0,
    patience=10,
    restore_best_weights=True,
)

history = mlp.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=30,
    batch_size=256,
    verbose=0,
    callbacks=[early_stopping])
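
The idea behind the `patience` and `min_delta` arguments can be illustrated with a simplified sketch in plain Python. The function `early_stopping_epoch` and the loss values are hypothetical; this is not the Keras implementation, only the stopping rule it is based on.

```python
def early_stopping_epoch(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float('inf')
    best_epoch = -1
    wait = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Made-up validation losses: the best value is reached at epoch 5
val_losses = [0.50, 0.40, 0.35, 0.36, 0.37, 0.34, 0.38, 0.39, 0.40]
stop_epoch, best_epoch = early_stopping_epoch(val_losses, patience=3)
print(f"stopped at epoch {stop_epoch}, best epoch was {best_epoch}")
```

With `restore_best_weights=True`, the weights kept after stopping are those of the best epoch, not of the last one.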

### Evaluate and Plot Model Performance

In [None]:
import pandas as pd

df_history = pd.DataFrame(history.history)
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

cut_first_n = 0
df_history.loc[cut_first_n:, ['loss', 'val_loss']].plot(ax=ax[0])
df_history.loc[cut_first_n:, ['accuracy', 'val_accuracy']].plot(ax=ax[1])
ax[0].grid(which='both')
ax[1].grid(which='both')
plt.show()

scores = mlp.evaluate(X_test, y_test, verbose=0)
print(f"MLP Error: {100-scores[1]*100:.2f}%")

### Make Predictions

In [None]:
x = X_test[0]
plt.imshow(x.reshape(28, 28), cmap=plt.get_cmap('gray'))
for label, proba in enumerate(mlp.predict(x.reshape(1, -1))[0]):
  print(f"{label} with probability of {proba*100:.2f}%")

In [None]:
import random
import numpy as np

N = 10
fig, ax = plt.subplots(1, N, figsize=(16, 4))
predictions = []
predictions_p = []
for n in range(N):
  # randrange excludes the upper bound, so the index is always valid
  i = random.randrange(X_test.shape[0])
  ax[n].imshow(X_test[i].reshape(28, 28), cmap=plt.get_cmap('gray'))
  res = mlp.predict(X_test[i].reshape(1, -1))
  # most probable class and its probability
  label = np.argmax(res)
  predictions.append(label)
  predictions_p.append(res[0][label])
print(predictions)
print(predictions_p)
plt.show()

<a id='tf-cnn'></a>

## Convolutional Neural Network (CNN)
A Convolutional Neural Network, also known as ``CNN`` or ``ConvNet``, is a class of neural networks that specializes in processing data that has a grid-like topology, such as an image. A digital image is a binary representation of visual data.  
Specialized DNNs, like this one, are not available as models in Sklearn.

Let's try one for the same database, MNIST:

In [None]:
from keras.datasets import mnist

(X_train, y_train), (X_test, y_test) = mnist.load_data()

print('Train: ', X_train.shape, y_train.shape)
print('Test: ', X_test.shape, y_test.shape)

### Reshape & Rescale
In Keras, when we use the TensorFlow backend, the layers used for two-dimensional convolutions expect pixel values with the dimensions [height][width][channels].

In [None]:
X_train = X_train.reshape(X_train.shape[0], 28, 28, 1).astype('float32') / 255
X_test = X_test.reshape(X_test.shape[0], 28, 28, 1).astype('float32') / 255

print('Train: ', X_train.shape, y_train.shape)
print('Test: ', X_test.shape, y_test.shape)

### Same One Hot Encoding the Label

In [None]:
y_train = np_utils.to_categorical(y_train)
y_test = np_utils.to_categorical(y_test)
num_classes = y_test.shape[1]

### Build Model

1. Convolutional layer with 32 feature maps, from 5 x 5 filters and ReLU activation function. This is the input layer, expecting images with the structure outlined above, 28 x 28 x 1.

2. Pooling layer taking the max over 2 x 2 patches.

3. Dropout layer with a probability of 20%.

4. Flatten layer.

5. Fully connected layer with 128 neurons and rectifier activation function.

6. Output layer with 10 neurons for the 10 classes and a softmax activation function to output probability-like predictions for each class.
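
How the tensor shape changes through these layers can be worked out by hand. The sketch below is plain Python and assumes the Keras defaults used in this example: no padding, stride 1 for the convolution, and a pooling stride equal to the pool size.

```python
def conv2d_output(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def maxpool_output(size, pool):
    """Spatial output size of max pooling with stride equal to the pool size."""
    return size // pool

size = 28                                   # input image: 28 x 28 x 1
after_conv = conv2d_output(size, kernel=5)  # 5 x 5 filters -> 24 x 24 x 32
after_pool = maxpool_output(after_conv, 2)  # 2 x 2 pooling -> 12 x 12 x 32
flattened = after_pool * after_pool * 32    # Flatten layer -> 4608 features

print(after_conv, after_pool, flattened)
```

Dropout does not change the shape, so the fully connected layer with 128 neurons receives 4608 inputs.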

In [None]:
cnn = keras.Sequential([
    layers.Conv2D(32, (5, 5), input_shape=(28, 28, 1), activation='relu'),
    layers.MaxPooling2D(pool_size=(2, 2)),
    layers.Dropout(0.2),
    layers.Flatten(),
    layers.Dense(128, activation='relu'),
    layers.Dense(num_classes, activation='softmax')])
cnn.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

print(cnn.summary())

### Train the Model
Same as before, up to 30 epochs.  
Training will take a while, since the CNN model is far larger (592,074 parameters) than the MLP model (31,810 parameters).
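
Both parameter counts can be reproduced with simple arithmetic. The helper functions below are hypothetical and only mirror how weights and biases are counted; the value 12 x 12 x 32 is the size of the flattened feature maps after the 5 x 5 convolution and the 2 x 2 pooling.

```python
def dense_params(n_in, n_out):
    """Weights plus one bias per output neuron."""
    return n_in * n_out + n_out

def conv2d_params(kernel_h, kernel_w, channels_in, filters):
    """One kernel per (input channel, filter) pair plus one bias per filter."""
    return kernel_h * kernel_w * channels_in * filters + filters

# MLP: 784 inputs -> 40 hidden neurons -> 10 outputs
mlp_params = dense_params(784, 40) + dense_params(40, 10)

# CNN: Conv2D(32, 5x5) -> pooling/dropout/flatten (no parameters)
#      -> Dense(128) on 12*12*32 features -> Dense(10)
cnn_params = (conv2d_params(5, 5, 1, 32)
              + dense_params(12 * 12 * 32, 128)
              + dense_params(128, 10))

print(f"MLP: {mlp_params:,}  CNN: {cnn_params:,}")
```

Almost all of the CNN's parameters sit in the first fully connected layer, not in the convolution itself.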

In [None]:
early_stopping = callbacks.EarlyStopping(
    min_delta=0,
    patience=10,
    restore_best_weights=True,
)

history = cnn.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=30,
    batch_size=256,
    verbose=0,
    callbacks=[early_stopping])

### Evaluate and Plot the Model Performance
Using almost identical code:

In [None]:
df_history = pd.DataFrame(history.history)
fig, ax = plt.subplots(1, 2, figsize=(16, 8))

cut_first_n = 0
df_history.loc[cut_first_n:, ['loss', 'val_loss']].plot(ax=ax[0])
df_history.loc[cut_first_n:, ['accuracy', 'val_accuracy']].plot(ax=ax[1])
ax[0].grid(which='both')
ax[1].grid(which='both')
plt.show()

scores = cnn.evaluate(X_test, y_test, verbose=0)
print("CNN Error: %.2f%%" % (100-scores[1]*100))

Training takes longer, but the performance is better than that of the MLP!

### Make Predictions

In [None]:
x = X_test[0]
plt.imshow(x, cmap=plt.get_cmap('gray'))
for label, proba in enumerate(cnn.predict(x.reshape(1, 28, 28, 1))[0]):
    print(f"{label} with probability of {proba*100:.2f}%")

In [None]:
N = 10
fig, ax = plt.subplots(1, N, figsize=(16, 4))
predictions = []
predictions_p = []
for n in range(N):
  # randrange excludes the upper bound, so the index is always valid
  i = random.randrange(X_test.shape[0])
  ax[n].imshow(X_test[i].reshape(28, 28), cmap=plt.get_cmap('gray'))
  res = cnn.predict(X_test[i].reshape(1, 28, 28, 1))
  # most probable class and its probability
  label = np.argmax(res)
  predictions.append(label)
  predictions_p.append(res[0][label])
print(predictions)
print(predictions_p)
plt.show()

<a id='pt'></a>

## PyTorch
``PyTorch`` is an ``optimized tensor library`` primarily used for ``Deep Learning applications`` on GPUs and CPUs. It is an open-source machine learning library for Python, mainly developed by the ``Facebook`` AI Research team.  
The PyTorch framework supports over 200 different mathematical operations. Its popularity continues to rise as it simplifies the creation of artificial neural network (ANN) models. PyTorch is mainly used in research, data science and artificial intelligence (AI) applications.  
Even though it is a younger framework than Tensorflow, it has a strong community and is often considered more Python-friendly (``pythonic``).  
For example, ``Tesla`` uses PyTorch for distributed CNN training: for Autopilot, it trains around 48 networks that make 1,000 different predictions, which takes 70,000 GPU hours.

### CNN: Handwritten Digits
Now, we'll repeat the previous example written in Tensorflow/Keras.

In [None]:
import torch
import torchvision as tv

n_epochs = 3
batch_size_train = 64
batch_size_test = 1000
learning_rate = 0.01
momentum = 0.5
log_interval = 10

random_seed = 1
torch.backends.cudnn.enabled = False
torch.manual_seed(random_seed)

### Load the MNIST database

In [None]:
train_loader = torch.utils.data.DataLoader(
    tv.datasets.MNIST('/files/', train=True, download=True,
            transform=tv.transforms.Compose([
                    tv.transforms.ToTensor(),
                    tv.transforms.Normalize((0.1307,), (0.3081,))])),
    batch_size=batch_size_train, shuffle=True)

test_loader = torch.utils.data.DataLoader(
    tv.datasets.MNIST('/files/', train=False, download=True,
            transform=tv.transforms.Compose([
                    tv.transforms.ToTensor(),
                    tv.transforms.Normalize((0.1307,), (0.3081,))])),
    batch_size=batch_size_test, shuffle=True)
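
The two transforms do simple arithmetic: `ToTensor` scales pixel values to the range 0 to 1, and `Normalize` then subtracts a mean and divides by a standard deviation. The plain-Python sketch below shows the effect on single pixel values; 0.1307 and 0.3081 are the values passed to `Normalize` above, commonly quoted as the mean and standard deviation of the MNIST training set.

```python
MEAN, STD = 0.1307, 0.3081

def normalize(pixel):
    """Apply (pixel - mean) / std to a pixel already scaled to [0, 1]."""
    return (pixel - MEAN) / STD

black = normalize(0.0)        # background pixel, about -0.42
white = normalize(1.0)        # brightest stroke pixel, about 2.82
mean_pixel = normalize(MEAN)  # a pixel at the dataset mean maps to 0

print(f"black={black:.4f}, white={white:.4f}, mean={mean_pixel:.4f}")
```

After this step the inputs are centred around zero, which usually helps gradient-based training.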

### Prepare the database

In [None]:
examples = enumerate(test_loader)
batch_idx, (example_data, example_targets) = next(examples)
example_data.shape

### Little bit of visualization

In [None]:
fig = plt.figure()
for i in range(6):
    plt.subplot(2,3,i+1)
    plt.tight_layout()
    plt.imshow(example_data[i][0], cmap='gray', interpolation='none')
    plt.title("Ground Truth: {}".format(example_targets[i]), color='white')
    plt.xticks([])
    plt.yticks([])

### Building the CNN model
We'll use two 2-D convolutional layers followed by two fully-connected (or linear) layers. As activation function we'll choose rectified linear units (ReLUs in short) and as a means of regularization we'll use two dropout layers. In PyTorch a nice way to build a network is by creating a new class for the network we wish to build:

In [None]:
import torch.nn as nn
import torch.nn.functional as F
from torch import optim

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        self.conv2 = nn.Conv2d(10, 20, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(320, 50)
        self.fc2 = nn.Linear(50, 10)

    def forward(self, x):
        x = F.relu(F.max_pool2d(self.conv1(x), 2))
        x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
        x = x.view(-1, 320)
        x = F.relu(self.fc1(x))
        x = F.dropout(x, training=self.training)
        x = self.fc2(x)
        return F.log_softmax(x, dim=1)
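
The input size of 320 for the first linear layer follows from the image size and the layers before it. A short sketch of the arithmetic in plain Python, assuming the PyTorch defaults used here (no padding, stride 1 for convolutions, pooling stride equal to the window size):

```python
def conv_out(size, kernel):
    """Output size of a convolution without padding and with stride 1."""
    return size - kernel + 1

def pool_out(size, window):
    """Output size of max pooling with stride equal to the window size."""
    return size // window

size = 28                  # MNIST images are 28 x 28
size = conv_out(size, 5)   # conv1 with kernel_size=5 -> 24
size = pool_out(size, 2)   # max_pool2d(..., 2)       -> 12
size = conv_out(size, 5)   # conv2 with kernel_size=5 -> 8
size = pool_out(size, 2)   # max_pool2d(..., 2)       -> 4

flat_features = 20 * size * size   # conv2 has 20 output channels
print(size, flat_features)
```

This is why `forward` reshapes with `x.view(-1, 320)` before the fully connected layers.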

Now let's initialize the network and the optimizer.

In [None]:
network = Net()
optimizer = optim.SGD(network.parameters(), lr=learning_rate, momentum=momentum)

### Training the Model
We'll keep track of the progress with some printouts. In order to create a nice training curve later on we also create two lists for saving training and testing losses. On the x-axis we want to display the number of training examples the network has seen during training. 

In [None]:
train_losses = []
train_counter = []
test_losses = []
test_counter = [i*len(train_loader.dataset) for i in range(n_epochs + 1)]

Time to build our training loop. First we want to make sure the network is in ``training mode``. Then we iterate over all training data once per epoch. Loading the individual batches is handled by the DataLoader. For each batch, we first need to manually set the gradients to zero using optimizer.zero_grad(), since PyTorch by default accumulates gradients. We then produce the output of our network (forward pass) and compute a negative log-likelihood loss between the output and the ground truth label. With the backward() call we collect a new set of gradients, which we then use to update each of the network's parameters with optimizer.step().

In [None]:
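import os

# make sure the checkpoint directory exists before torch.save is called
os.makedirs('results', exist_ok=True)

def train(epoch):
  network.train()
  for batch_idx, (data, target) in enumerate(train_loader):
    optimizer.zero_grad()              # reset gradients left over from the previous batch
    output = network(data)             # forward pass
    loss = F.nll_loss(output, target)  # negative log-likelihood loss
    loss.backward()                    # compute gradients
    optimizer.step()                   # update the parameters
    if batch_idx % log_interval == 0:
      print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
        epoch, batch_idx * len(data), len(train_loader.dataset),
        100. * batch_idx / len(train_loader), loss.item()))
      train_losses.append(loss.item())
      # number of training examples seen so far
      train_counter.append(
        (batch_idx*batch_size_train) + ((epoch-1)*len(train_loader.dataset)))
      torch.save(network.state_dict(), 'results/model.pth')
      torch.save(optimizer.state_dict(), 'results/optimizer.pth')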
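
The loss used in the training loop can be reproduced by hand. The sketch below is plain Python rather than PyTorch: `log_softmax` and `nll_loss` are simplified re-implementations for illustration, and the logits are made up.

```python
import math

def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

def nll_loss(log_probs_batch, targets):
    """Mean negative log-probability assigned to the correct classes."""
    picked = [row[t] for row, t in zip(log_probs_batch, targets)]
    return -sum(picked) / len(picked)

# Two made-up samples with three classes each
logits_batch = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]
targets = [0, 2]
log_probs = [log_softmax(row) for row in logits_batch]
loss = nll_loss(log_probs, targets)

# A network that assigns equal probability to all 10 digits
uniform_loss = nll_loss([log_softmax([0.0] * 10)], [3])

print(f"loss={loss:.4f}, uniform_loss={uniform_loss:.4f}")
```

The uniform case gives a loss of ln(10), roughly 2.30, which is about what a randomly initialized 10-class network is expected to produce before any training.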

Now for our test loop. Here we sum up the test loss and keep track of correctly classified digits to compute the accuracy of the network.

In [None]:
def test():
  network.eval()
  test_loss = 0
  correct = 0
  with torch.no_grad():
    for data, target in test_loader:
      output = network(data)
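      # reduction='sum' replaces the deprecated size_average=False
      test_loss += F.nll_loss(output, target, reduction='sum').item()
      # index of the highest log-probability is the predicted digit
      pred = output.argmax(dim=1, keepdim=True)
      correct += pred.eq(target.view_as(pred)).sum().item()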
  test_loss /= len(test_loader.dataset)
  test_losses.append(test_loss)
  print('\nTest set: Avg. loss: {:.4f}, Accuracy: {}/{} ({:.0f}%)\n'.format(
    test_loss, correct, len(test_loader.dataset),
    100. * correct / len(test_loader.dataset)))

Time to run the training! We'll manually add a test() call before we loop over n_epochs to evaluate our model with randomly initialized parameters.

In [None]:
test()
for epoch in range(1, n_epochs + 1):
    train(epoch)
    test()

### Evaluating the Model's Performance
And that's it. With just 3 epochs of training we already managed to achieve 97% accuracy on the test set! We started out with randomly initialized parameters and as expected only got about 10% accuracy on the test set before starting the training.

Let's plot the training curve.

In [None]:
fig = plt.figure()
plt.plot(train_counter, train_losses, color='blue')
plt.scatter(test_counter, test_losses, color='red')
plt.legend(['Train Loss', 'Test Loss'], loc='upper right')
plt.xlabel('number of training examples seen')
plt.ylabel('negative log likelihood loss')

### Some predictions

In [None]:
fig = plt.figure()
for i in range(3):
    plt.subplot(1,3,i+1)
    plt.tight_layout()
    # randrange excludes the upper bound, so the index is always valid
    xm = random.randrange(len(example_data))
    plt.imshow(example_data[xm][0], cmap='gray', interpolation='none')
    plt.title("Ground Truth: {}".format(example_targets[xm]), color='white')
    plt.xticks([])
    plt.yticks([])
    with torch.no_grad():
        # add a batch dimension: (1, 28, 28) -> (1, 1, 28, 28)
        test_output = network(example_data[xm].unsqueeze(0))
        pred_y = test_output.argmax(dim=1).item()
        print(f'Prediction number: {pred_y}')

<a id='transfer'></a>

## Transfer Learning
So the data needed to train a DNN must be very large, and the models are complicated and cost a lot of time and money. Does that mean I cannot use deep learning, even if its results are remarkable?  

Actually, there may be a workaround:  

The intuition behind transfer learning for, let's say, image classification is that if a model is trained on a large and general enough dataset, this model will effectively serve as a generic model of the visual world. You can then take advantage of these learned feature maps without having to start from scratch by training a large model on a large dataset.

<span style=''><img style='background: rgb(128, 128, 128, .15); align: left; display: inline-block; padding: 20px' src='img/transfer.webp'/></span>

### Why use transfer learning?
Assume you have 100 images of cats and 100 images of dogs and want to build a model to classify them. How would you train a model using this small dataset? You can train your model from scratch, but it will most likely overfit horribly. Enter transfer learning. Generally speaking, there are three big reasons why you want to use transfer learning:
- training models with high accuracy requires a lot of data. For example, the ImageNet dataset contains over 1 million images. In the real world, you are unlikely to have such a large dataset. 
- assuming that you had that kind of dataset, you might still not have the resources required to train a model on such a large dataset. Hence transfer learning makes a lot of sense if you don’t have the compute resources needed to train models on huge datasets. 
- even if you had the compute resources at your disposal, you still have to wait for days or weeks to train such a model. Therefore using a pre-trained model will save you precious time. 

### How to implement transfer learning?
You can implement transfer learning in these general steps (with examples in Keras):
1. Obtain the pre-trained model  
You can also optionally download the pre-trained weights. If you don’t download the weights, you will have to use the architecture to train your model from scratch.
- There are more than two dozen pre-trained models available from Keras. They’re served via [Keras applications](https://keras.io/api/applications/). For instance, here is how you can initialize the MobileNet architecture trained on ImageNet:

In [None]:
# downloading MobileNet model trained on imagenet as keras application
model = keras.applications.MobileNet(
    input_shape=None,
    alpha=1.0,
    depth_multiplier=1,
    dropout=0.001,
    include_top=True,
    weights="imagenet",
    input_tensor=None,
    pooling=None,
    classes=1000,
    classifier_activation="softmax",
)
model.summary()

- It’s worth mentioning that Keras applications are not your only option for transfer learning tasks. You can also use models from [TensorFlow Hub](https://www.tensorflow.org/hub):

In [None]:
# using TensorFlow Hub to create a model; note the added output layer (say, for cats and dogs)
import tensorflow_hub as hub
model = keras.Sequential([
    hub.KerasLayer("https://tfhub.dev/google/imagenet/mobilenet_v2_100_224/feature_vector/5", trainable=False), 
    keras.layers.Dense(2, activation='softmax')])
model.build([None, 224, 224, 3])
model.summary()

2. Create a model  
Based on the pre-trained model, create your own model and remove the final output layer. Later on, you will add a final output layer that is compatible with your problem.
3. Freeze layers so they don’t change during training
4. Add new trainable layers  
The next step is to add new trainable layers that will turn old features into predictions on the new dataset. This is important because the pre-trained model is loaded without the final output layer.
5. Train the new layers on the dataset
6. Improve the model via fine-tuning  
Fine-tuning is done by unfreezing the base model, or part of it, and training the entire model again on the whole dataset at a very low learning rate. The low learning rate lets the model adapt to the new dataset while limiting the risk of overfitting.
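
The steps above can be sketched in Keras roughly as follows. This is a minimal illustration under stated assumptions, not a complete recipe: `weights=None` is used so the sketch runs without downloading anything (use `weights="imagenet"` for actual transfer learning), the 96 x 96 input size and the two output classes are arbitrary choices, and `train_ds` is a hypothetical dataset that is not defined here.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Steps 1-2: load the architecture without its final classification layer.
# Assumption: weights=None keeps the sketch offline; real transfer learning
# would pass weights="imagenet" instead.
base = keras.applications.MobileNet(
    input_shape=(96, 96, 3), include_top=False, weights=None)

# Step 3: freeze the base so its weights do not change during training
base.trainable = False

# Step 4: add new trainable layers on top (two classes, e.g. cats and dogs)
inputs = keras.Input(shape=(96, 96, 3))
x = base(inputs, training=False)   # keep BatchNorm layers in inference mode
x = layers.GlobalAveragePooling2D()(x)
outputs = layers.Dense(2, activation="softmax")(x)
model = keras.Model(inputs, outputs)

# Step 5: train only the new layers
model.compile(optimizer=keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
n_trainable_frozen = len(model.trainable_weights)
# model.fit(train_ds, epochs=5)    # train_ds is hypothetical

# Step 6: fine-tune by unfreezing the base and using a very low learning rate
base.trainable = True
model.compile(optimizer=keras.optimizers.Adam(1e-5),
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])
n_trainable_finetune = len(model.trainable_weights)
# model.fit(train_ds, epochs=5)    # train_ds is hypothetical

print(n_trainable_frozen, n_trainable_finetune)
```

While the base is frozen, only the kernel and bias of the new `Dense` layer are trainable; after unfreezing, the weights of the base model become trainable as well. The model is recompiled after changing `trainable` so that the change takes effect.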


<span style='background: rgb(128, 128, 128, .15); width: 100%; display: block; padding: 10px 0 10px 10px'>This is the Jupyter notebook version of the __Python for Official Statistics__ produced by Eurostat; the content is available [on GitHub](https://github.com/eurostat/e-learning/tree/main/python-official-statistics).
<br>The text and code are released under the [EUPL-1.2 license](https://github.com/eurostat/e-learning/blob/main/LICENSE).</span>