# 3. Deeper dive into model architectures and practical aspects in training

Now that we have a good understanding of how the Keras API works, we'll mainly work on two things, plus a few bonus topics:

1. Building more complex architectures.
    - What happens if not all inputs are numerical?
    - How can we use inputs of more than one data type?
    - What are Embedding layers and how can they help us?
2. Practical aspects regarding model training.
    - What is the history callback and how can we use it?
    - How can we visualize how our model is doing during training?
3. Bonus practical aspects:  
    - What is model calibration and how can we visualize it?
    - How can we perform hyperparameter tuning?

In [123]:
import tensorflow as tf
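# Compare the major version numerically, so that versions above 2 are accepted as well
assert int(tf.__version__.split('.')[0]) >= 2, 'this tutorial is for tensorflow versions of 2 or higher'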

import pandas as pd

from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
%matplotlib inline 

## Dataset

To better demonstrate the topics that we want to discuss, we'll use a different dataset than the toy example we've seen up till now. The dataset is called *airlines_delay* and can be found on [Kaggle](https://www.kaggle.com/datasets/jimschacko/airlines-dataset-to-predict-a-delay). It consists of 7 features (4 numerical, 3 string), and the goal is to predict whether a flight will be delayed (i.e. a binary classification task).

In [220]:
data = pd.read_csv('../data/airlines_delay.csv')
data

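| | Flight | Time | Length | Airline | AirportFrom | AirportTo | DayOfWeek | Class |
|---|---|---|---|---|---|---|---|---|
| 0 | 2313.0 | 1296.0 | 141.0 | DL | ATL | HOU | 1 | 0 |
| 1 | 6948.0 | 360.0 | 146.0 | OO | COS | ORD | 4 | 0 |
| 2 | 1247.0 | 1170.0 | 143.0 | B6 | BOS | CLT | 3 | 0 |
| 3 | 31.0 | 1410.0 | 344.0 | US | OGG | PHX | 6 | 0 |
| 4 | 563.0 | 692.0 | 98.0 | FL | BMI | ATL | 4 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 539377 | 6973.0 | 530.0 | 72.0 | OO | GEG | SEA | 5 | 1 |
| 539378 | 1264.0 | 560.0 | 115.0 | WN | LAS | DEN | 4 | 1 |
| 539379 | 5209.0 | 827.0 | 74.0 | EV | CAE | ATL | 2 | 1 |
| 539380 | 607.0 | 715.0 | 65.0 | WN | BWI | BUF | 4 | 1 |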


What is going to give us the most trouble are the dataset's 3 categorical variables: `Airline`, `AirportFrom` and `AirportTo`.

In [125]:
print(data['Airline'].value_counts(), '\n')
print(data['AirportFrom'].value_counts(), '\n')
print(data['AirportTo'].value_counts())

WN    94097
DL    60940
OO    50254
AA    45656
MQ    36604
US    34500
XE    31126
EV    27983
UA    27619
CO    21118
FL    20827
9E    20686
B6    18112
YV    13725
OH    12630
AS    11471
F9     6456
HA     5578
Name: Airline, dtype: int64 

ATL    34449
ORD    24822
DFW    22153
DEN    19843
LAX    16657
       ...  
MMH       16
SJT       15
GUM       10
ADK        9
ABR        2
Name: AirportFrom, Length: 293, dtype: int64 

ATL    34440
ORD    24871
DFW    22153
DEN    19848
LAX    16656
       ...  
MMH       16
SJT       15
GUM       10
ADK        9
ABR        2
Name: AirportTo, Length: 293, dtype: int64


## Part 1: Deeper dive into model architectures

### Attempt 1: Ignore categorical features

For our first attempt we'll completely ignore these categorical features and only deal with the numerical features.

In [126]:
X = data.drop(columns=['Airline', 'AirportFrom', 'AirportTo', 'Class'])
y = data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

In [127]:
inp = tf.keras.layers.Input((4,))
hid1 = tf.keras.layers.Dense(300, activation='relu')(inp)
hid2 = tf.keras.layers.Dense(100, activation='relu')(hid1)
out = tf.keras.layers.Dense(1, activation='sigmoid')(hid2)

model = tf.keras.models.Model(inp, out)

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy',
                                                                     'Precision',  # for some reason 
                                                                     'Recall',     # these are 
                                                                     'AUC'])       # case sensitive

model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


<keras.callbacks.History at 0x7f934295bb50>

In [128]:
metrics = ['Loss', 'Accuracy', 'Precision', 'Recall', 'AUC']

results = model.evaluate(X_test, y_test)

for name, value in zip(metrics, results):
    print(f'{name:<10}: {value:.2f}')

Loss      : 0.67
Accuracy  : 0.58
Precision : 0.54
Recall    : 0.36
AUC       : 0.60


Arguably, we're not getting really great results with this setup. We could try to tune the architecture or other hyperparameters like the learning rate some more, but I don't think this would lead to a significant boost in performance. A much more promising direction would be to try to incorporate the other features into the model.

### Attempt 2: Embed categorical features

We want to utilize the remaining features of the dataset; however, these are in a form our network can't understand, i.e. they are **categorical**. The most common way to deal with this issue is to **represent each category by a fixed-length vector**. These vectors are called **embeddings** and are **fully trainable**. But how does this work?

Before we begin, we need to define a **vocabulary size** (let's call this $V$) and an **embedding dim** (let's call this $D$). The latter is simply the size of each embedding (i.e. how many dims the vector that represents each category will have). The former determines how many categories will get their own, dedicated embedding. For features that don't have too many unique values, it is set equal to the cardinality of the feature (i.e. every unique value gets its own dedicated embedding). If the feature has too many unique values, only the $V$ most frequent categories get their own embedding. The remaining ones are usually all represented by a single embedding that we call OOV (i.e. out-of-vocabulary). Keras calls these two properties `input_dim` and `output_dim` respectively.
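To make the vocabulary idea concrete, here is a minimal plain-Python sketch of keeping only the $V$ most frequent categories and sending everything else to a shared OOV index. This is not how Keras implements it internally; the helper names `build_vocab` and `encode`, the toy list of airline codes, and the choice of index 0 for OOV are all assumptions made for illustration.

```python
from collections import Counter

OOV_INDEX = 0  # assumption: index 0 is reserved for out-of-vocabulary values


def build_vocab(values, vocab_size):
    """Map the `vocab_size` most frequent values to the indices 1..vocab_size."""
    most_common = Counter(values).most_common(vocab_size)
    return {value: idx for idx, (value, _) in enumerate(most_common, start=1)}


def encode(values, vocab):
    """Replace each value with its index, or with OOV_INDEX if it is not in the vocab."""
    return [vocab.get(value, OOV_INDEX) for value in values]


# Toy data, made up for illustration (not taken from the dataset)
airlines = ['WN', 'WN', 'WN', 'DL', 'DL', 'OO', 'HA']

vocab = build_vocab(airlines, vocab_size=2)  # only the 2 most frequent get their own index
encoded = encode(airlines, vocab)

print(vocab)    # {'WN': 1, 'DL': 2}
print(encoded)  # [1, 1, 1, 2, 2, 0, 0]
```

Because the rare categories `'OO'` and `'HA'` share the OOV index, an embedding table for this toy feature would need $V + 1 = 3$ rows: one per dedicated category plus one for OOV.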

Internally, a lookup table of dimensions $V \times D$ is created, where each row refers to a category. All of these parameters are trainable! When the network sees a specific input, it looks up the $D$-dimensional embedding of that input and feeds it to the next layer. 

![](https://github.com/djib2011/tensorflow-training/blob/main/figures/embedding.png)
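The lookup described above can be sketched in a few lines of NumPy. This is only an illustration of the idea, not the Keras implementation: the table is filled with random numbers instead of trained weights, and the sizes `V = 5` and `D = 3` are arbitrary.

```python
import numpy as np

V, D = 5, 3  # toy vocabulary size and embedding dim (arbitrary values)

rng = np.random.default_rng(0)
table = rng.normal(size=(V, D))  # the V x D lookup table; in a real model these are trainable weights

ids = np.array([2, 0, 2])  # a batch of three integer-encoded categories
vectors = table[ids]       # row lookup: one D-dimensional vector per input

print(vectors.shape)  # (3, 3)

# Identical categories always map to the same embedding row
same_row = np.array_equal(vectors[0], vectors[2])

# The lookup is equivalent to multiplying a one-hot encoding with the table
one_hot = np.eye(V)[ids]                              # shape (3, V)
matches_one_hot = np.allclose(one_hot @ table, vectors)
```

The one-hot comparison at the end shows why an embedding can be seen as a dense layer applied to one-hot inputs, just without ever materializing the one-hot vectors.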

In keras this is implemented through the [`Embedding`](https://keras.io/api/layers/core_layers/embedding/) layer. The embedding layer doesn't work by default on string inputs, though. They first need to be encoded as integers. For this purpose we will use the [`StringLookup`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/StringLookup) layer.
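As a quick standalone check of how these two layers fit together, the sketch below encodes a few strings and embeds them. The three-code vocabulary and the unseen code `'ZZ'` are made up for illustration, and the layers are used with their default settings (a single OOV index and no mask token).

```python
import tensorflow as tf

# Toy vocabulary: a handful of airline codes chosen for illustration
lookup = tf.keras.layers.StringLookup(vocabulary=['WN', 'DL', 'OO'])

codes = tf.constant(['DL', 'WN', 'ZZ'])  # 'ZZ' is not in the vocabulary
ids = lookup(codes)                      # strings -> integer indices
print(ids.numpy())                       # [2 1 0]: index 0 is reserved for OOV by default

# The table needs one row per vocabulary entry plus one row for OOV
emb = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=3)
vectors = emb(ids)
print(vectors.shape)                     # (3, 3)
```

Using `lookup.vocabulary_size()` for `input_dim` is one way to keep the embedding table in sync with the lookup layer, since it already counts the OOV slot.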

How can we use embeddings in our case, though?

There are a few things to notice in our case:
- We have very small vocabulary sizes. This means that we can have dedicated embeddings for the whole vocabulary.$^1$
- The two airport features have the exact same vocabulary, so we will use a single embedding table for both of them.
- We have both numerical and categorical features that we want to use. This means that we'll need to embed the categories (each feature separately) and then concatenate the embeddings with the numeric features. Let's say we use an embedding size of $3$ for the airline and $5$ for each of the airport features. The concatenated vector that will be fed to the dense layers will have $17$ dims: $3$ (Airline) $+5$ (AirportFrom) $+5$ (AirportTo) $+4$ (numeric) $=17$.

$^1$ *Note: this isn't good practice, as categories that don't have many samples will not get many updates for their embeddings, leaving them undertrained. Because this requires tuning, though, we won't play with OOV embeddings at all.*

In [225]:
# Input for the 4 numeric features of the dataset
numeric_inp = tf.keras.layers.Input((4,))  # shape --> (batch, 4)

# Make separate inputs for the 3 categorical features
airline_inp = tf.keras.layers.Input((1,), dtype=tf.string)
airport_from_inp = tf.keras.layers.Input((1,), dtype=tf.string)
airport_to_inp = tf.keras.layers.Input((1,), dtype=tf.string)

# Create lookup tables mapping the strings to integers
airline_look = tf.keras.layers.StringLookup(vocabulary=data['Airline'].unique())
airport_look = tf.keras.layers.StringLookup(vocabulary=data['AirportTo'].unique())

# Encode the 3 categorical features using the lookup tables
airline_encoded = airline_look(airline_inp)
airport_from_encoded = airport_look(airport_from_inp)
airport_to_encoded = airport_look(airport_to_inp)

# Create embedding tables for each of the 3 categorical features
airline_emb = tf.keras.layers.Embedding(input_dim=len(data['Airline'].unique())+1,  
                                        output_dim=3)  # the +1 is for the OOV embedding
airport_emb = tf.keras.layers.Embedding(input_dim=len(data['AirportTo'].unique())+1,
                                        output_dim=5)

# Add the embeddings as layers after their respective inputs
airline_vec = airline_emb(airline_encoded)            # shape --> (batch, 1, 3)
airport_from_vec = airport_emb(airport_from_encoded)  # shape --> (batch, 1, 5)
airport_to_vec = airport_emb(airport_to_encoded)      # shape --> (batch, 1, 5)

# Flatten the embeddings, so that they can be concatenated with the numeric features
airline_vec = tf.keras.layers.Flatten()(airline_vec)            # shape --> (batch, 3)
airport_from_vec = tf.keras.layers.Flatten()(airport_from_vec)  # shape --> (batch, 5)
airport_to_vec = tf.keras.layers.Flatten()(airport_to_vec)      # shape --> (batch, 5)

# Concatenate the embeddings together with the numeric inputs
concat = tf.keras.layers.Concatenate()([numeric_inp, airline_vec, airport_from_vec,
                                        airport_to_vec])  # shape --> (batch, 17)

# Add dense layers 
hid1 = tf.keras.layers.Dense(300, activation='relu')(concat)
hid2 = tf.keras.layers.Dense(100, activation='relu')(hid1)
out = tf.keras.layers.Dense(1, activation='sigmoid')(hid2)

# Define the model, listing all the inputs we used.
# We pass them as a dict, so that values are matched by name during training
model = tf.keras.models.Model(inputs={'numeric': numeric_inp,
                                      'airline': airline_inp,
                                      'airport_from': airport_from_inp,
                                      'airport_to': airport_to_inp},
                              outputs=out)

# Compile the model (training happens in a later cell)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy', 'Precision',
                                                                     'Recall', 'AUC'])

A sketch of our model can be seen below.

Let's prepare the dataset in the form that our model expects it.

In [226]:
X = data.drop(columns=['Class'])
y = data['Class']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)


def convert_to_dict(df):
    return {'numeric': df.drop(columns=['Airline', 'AirportFrom', 'AirportTo']).astype(float).values,
            'airline': df['Airline'],
            'airport_from': df['AirportFrom'],
            'airport_to': df['AirportTo']}


X_train = convert_to_dict(X_train)
X_test = convert_to_dict(X_test)

Train and evaluate our model.

In [230]:
hist = model.fit(X_train, y_train, epochs=10, batch_size=128, validation_data=(X_test, y_test))

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10


In [229]:
results = model.evaluate(X_test, y_test)

for name, value in zip(metrics, results):
    print(f'{name:<10}: {value:.2f}')

Loss      : 0.63
Accuracy  : 0.64
Precision : 0.63
Recall    : 0.49
AUC       : 0.69


By adding these categorical features we managed to improve the model's performance a bit (e.g. accuracy went from 0.58 to 0.64 and AUC from 0.60 to 0.69). By tuning parameters such as the embedding dim and the vocabulary size, we might get even better performance out of our embeddings.

## Part 2: Practical aspects regarding model training

### History callback

You might have noticed that in the previous training run I assigned the output of `model.fit()` to a variable called `hist`. This object is a [History callback](https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/History), which stores all information relevant to the model's training (i.e. the values that are printed on screen during training). We can use this information for analysis, visualizations, etc.