
# CrossEntropy vs Focal Loss


The goal of the project is to compare the performance of two loss functions:
1. Traditional Cross Entropy
2. Focal Loss

Cross entropy measures the discrepancy between the predicted class distribution and the true labels. Its value can be relatively low, and the model difficult to train, when the number of classes is large. To counteract this problem, Facebook AI Research suggests using [Focal Loss](https://arxiv.org/pdf/1708.02002.pdf), which in theory should focus the training on hard examples and therefore speed up the training.
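
The difference between the two losses can be illustrated for a single prediction. The sketch below is not part of the original notebook; it assumes the focal loss form from the linked paper, `-alpha * (1 - p_t)^gamma * log(p_t)`, where `p_t` is the probability assigned to the true class. The function names are hypothetical.

```python
import math


def cross_entropy(p_t):
    """Cross entropy for the probability p_t assigned to the true class."""
    return -math.log(p_t)


def focal_loss_term(p_t, gamma=2.0, alpha=0.25):
    """Focal loss for one example: -alpha * (1 - p_t)^gamma * log(p_t)."""
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# Compare both losses for a hard (0.1), an uncertain (0.5) and an easy (0.9) example
for p_t in (0.1, 0.5, 0.9):
    ce = cross_entropy(p_t)
    fl = focal_loss_term(p_t)
    print(f"p_t={p_t:.1f}  CE={ce:.4f}  FL={fl:.4f}  FL/CE={fl / ce:.4f}")
```

The ratio FL/CE equals `alpha * (1 - p_t)^gamma`, so it shrinks as the prediction becomes more confident. With `gamma=0` and `alpha=1` the focal loss reduces to plain cross entropy.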

The models are implemented in TensorFlow 2.0 with Keras as the high-level API. They are trained and analyzed on the EN-RU [wmt19_translate dataset](https://www.tensorflow.org/datasets/datasets#wmt19_translate) ([ACL 2019 FOURTH CONFERENCE ON MACHINE TRANSLATION (WMT19)](http://www.statmt.org/wmt19/translation-task.html)).

In [1]:
!pip install tensorflow-gpu==2.0.0-alpha0
!git clone https://github.com/ap-nlp-research/language_translation_en_ru_tf2.git translation_en_ru



In [0]:
import os
import pickle as pk
import subprocess
import re
import numpy as np
from functools import partial
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K

## Data ETL

The data load, extraction, and transformation were done beforehand with the [create_dataset_en_ru.py](./create_dataset_en_ru.py) script. The script stores a dictionary with the source data under the 'x' key and the target data under the 'y' key. The dictionary also contains the x and y tokenizers (stored as 'x_tk' and 'y_tk'):

dataset: dict

{
    'x': np.ndarray,
    'y': np.ndarray,
    'x_tk': keras.preprocessing.text.Tokenizer,
    'y_tk': keras.preprocessing.text.Tokenizer
}

In [0]:
with open("./translation_en_ru/data/wlm_en_ru.pkl", 'rb') as file:
    dataset = pk.load(file)

## Utility Functions

In addition to the data ETL, the code below provides two helper functions: one converts logits into word indices, and the other converts word indices into text.

In [0]:
def logits_to_id(logits):
    """
    Turns logits into word ids
    :param logits: Logits from a neural network
    """
    return np.argmax(logits, 1)

def id_to_text(idx, tokenizer):
    """
    Turns a sequence of word ids into text using the tokenizer
    :param idx: Sequence of word ids
    :param tokenizer: Keras Tokenizer fit on the labels
    :return: String that represents the text of the word ids
    """
    index_to_words = {index: word for word, index in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'

    return ' '.join([index_to_words[prediction] for prediction in idx]).replace(" <PAD>", "")
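
The two helpers can be tried on toy data without loading the dataset. The sketch below mirrors the notebook's functions so that it runs on its own. `toy_tokenizer` is a hypothetical stand-in that only mimics the `word_index` attribute of a fitted Keras Tokenizer, and the scores are made up for illustration.

```python
from types import SimpleNamespace

import numpy as np


def logits_to_id(logits):
    # Index of the highest score at each time step
    return np.argmax(logits, 1)


def id_to_text(idx, tokenizer):
    # Invert the word -> id mapping and reserve id 0 for padding
    index_to_words = {i: word for word, i in tokenizer.word_index.items()}
    index_to_words[0] = '<PAD>'
    return ' '.join([index_to_words[prediction] for prediction in idx]).replace(" <PAD>", "")


# Hypothetical stand-in for a fitted Keras Tokenizer: only word_index is mimicked
toy_tokenizer = SimpleNamespace(word_index={"hello": 1, "world": 2})

# Toy scores with shape (time steps, vocabulary size); column 0 is the padding id
toy_logits = np.array([
    [0.1, 0.7, 0.2],
    [0.2, 0.1, 0.7],
    [0.9, 0.05, 0.05],
])

ids = logits_to_id(toy_logits)
text = id_to_text(ids, toy_tokenizer)
print(ids)
print(text)
```

Because the replacement targets `" <PAD>"` with a leading space, trailing padding is removed, but a padding token at the very start of a sequence would remain in the text.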

In [5]:
print("Here is an example for a sample number 1:")
print("Source('en') example:", id_to_text(dataset['x'][0], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][0], dataset['y_tk']))
print(" ")
print("A sample number 2:")
print("Source('en') example:", id_to_text(dataset['x'][-10], dataset['x_tk']))
print("Target('ru') example:", id_to_text(dataset['y'][-10], dataset['y_tk']))
print("source vocabulary size:", dataset['x'].max())
print("target vocabulary size:", dataset['y'].max())
print("Source shape:", dataset['x'].shape)
print("Target shape:", dataset['y'].shape)

Here is an example for a sample number 1:
Source('en') example: the company has been registered with the municipal court in prague in section b file 1 4 8 5 7
Target('ru') example: фирма зарегистрирована в городском суде в г праге раздел б [rare] 1 4 8 5 7
 
A sample number 2:
Source('en') example: six years ago i had a surgery and l 4 l 5 [rare] were [rare] now l 5 s 1 [rare] [rare] and i had a second surgery that went well
Target('ru') example: шесть лет назад мне сделали операцию и на дисках l 4 l 5 сейчас [rare] [rare] диски l 5 s 1 и было необходимо второе хирургическое вмешательство которое произошло вчера и прошло хорошо
source vocabulary size: 3499
target vocabulary size: 14999
Source shape: (14751, 148)
Target shape: (14751, 148)


## Models

The models are implemented with a similar set of parameters. The main idea is to keep the models as small and simple as possible, so that they train quickly and any difference can be attributed to the loss function. The main hyperparameters are summarized below:

* Mapping:
    - Embeddings - word indices are mapped into a 16-dimensional space
    - Dense mapping - recurrent outputs are mapped by a Dense layer with softmax activation into the target-language vocabulary space (one-hot encoded, OHE)
* Layers:
    - GRU - number of units 64
    - Batch Normalization - to speed up the training, batch normalization is inserted after the embeddings and before the dense mapping
* Optimization:
    - Adam - all models are trained with the Adam optimizer and the same learning rate (1e-3)
* Loss function:
    - Model 1 uses sparse categorical cross entropy (keras.losses.sparse_categorical_crossentropy) applied to the softmax outputs; Model 2 uses the focal loss defined below

In [0]:
learning_rate = 1e-3
embeddings_units = 16
gru_units = 64
epochs = 3
validation_split = 0.1
batch_size = 64
loss = keras.losses.sparse_categorical_crossentropy

#### Model 2 - Embedded GRU - Focal Loss

In [0]:
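def focal_loss(gamma=2., alpha=.25):
    """
    Build a focal loss function for sequence outputs
    :param gamma: Focusing parameter; larger values down-weight easy examples more strongly
    :param alpha: Weight of the true-class term; (1 - alpha) weights the remaining classes
    :return: Loss function that takes (y_true, y_pred)
    """

    def focal_loss_fixed(y_true, y_pred):
        # y_true: word ids, shape (batch, length, 1)
        # y_pred: class probabilities, shape (batch, length, vocab)
        # Clip the probabilities to keep the logarithms finite
        y_pred = tf.clip_by_value(y_pred, K.epsilon(), 1-K.epsilon())
        vocab_size = y_pred.shape[-1]
        length = tf.shape(y_pred)[1]
        zeros = tf.zeros((tf.shape(y_pred)[0], vocab_size), dtype=y_pred.dtype)
        ones = tf.ones_like(zeros)

        indicies = tf.range(0, length, dtype=tf.int32)

        def fn(idx):
            # Loss for a single time step idx
            el_ids = tf.reshape(tf.cast(y_true[:, idx, :], tf.int32), (-1,))
            pred = y_pred[:, idx, :]
            truth = K.one_hot(el_ids, num_classes=vocab_size)
            truth = K.equal(truth, 1)
            # Probability of the true class; the other entries are set to 1 and contribute 0
            pt_1 = tf.where(truth, pred, ones)
            # Probabilities of the other classes; the true-class entry is set to 0 and contributes 0
            pt_0 = tf.where(tf.logical_not(truth), pred, zeros)
            return -alpha * K.sum(K.pow(1. - pt_1, gamma) * K.log(pt_1), axis=-1) \
                   - (1-alpha) * K.sum(K.pow(pt_0, gamma) * K.log(1. - pt_0), axis=-1)

        # One loss value per time step and sample, stacked as (length, batch)
        return tf.map_fn(fn, indicies, dtype=tf.float32)

    return focal_loss_fixed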
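
To check what the loss above computes, it can help to reproduce the same per-class formula in NumPy on toy inputs. This is a minimal sketch, not part of the original notebook. The name `focal_loss_numpy` is hypothetical, `y_true` is assumed to have shape (batch, length) without the trailing axis, and the result has shape (batch, length) rather than the (length, batch) layout produced by `tf.map_fn`.

```python
import numpy as np


def focal_loss_numpy(y_true, y_pred, gamma=2.0, alpha=0.25, eps=1e-7):
    """
    NumPy reference for the focal loss variant used in the notebook.
    :param y_true: Integer word ids, shape (batch, length)
    :param y_pred: Class probabilities, shape (batch, length, vocab)
    :return: Loss per sample and time step, shape (batch, length)
    """
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    vocab_size = y_pred.shape[-1]
    # Boolean one-hot mask of the true class (fine for toy vocabularies only)
    truth = np.eye(vocab_size, dtype=bool)[y_true]
    # True-class term: (1 - p)^gamma * log(p), zero elsewhere
    pos = np.where(truth, (1.0 - y_pred) ** gamma * np.log(y_pred), 0.0)
    # Other-class term: p^gamma * log(1 - p), zero at the true class
    neg = np.where(~truth, y_pred ** gamma * np.log(1.0 - y_pred), 0.0)
    return -alpha * pos.sum(axis=-1) - (1.0 - alpha) * neg.sum(axis=-1)


# One sample with two time steps and a vocabulary of three classes
y_true_toy = np.array([[1, 1]])
y_pred_toy = np.array([[
    [0.05, 0.90, 0.05],  # confident and correct
    [0.80, 0.10, 0.10],  # confident and wrong
]])
print(focal_loss_numpy(y_true_toy, y_pred_toy))
```

The confident, correct time step yields a much smaller loss than the confident, wrong one, which is the behaviour the focal loss is meant to provide.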

In [8]:
def FocalLoss_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build an RNN model with word embeddings, compiled with focal loss
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param source_vocab_size: Size of the source (English) vocabulary, including the padding id
    :param target_vocab_size: Size of the target (Russian) vocabulary, including the padding id
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
        
        
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.GRU(gru_units, return_sequences=True)(embedded_seq)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(rnn)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=focal_loss(alpha=.25, gamma=2),
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Build the neural network
keras.backend.clear_session()
model = FocalLoss_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
model.summary()

Model summary:
Model: "model"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
input_1 (InputLayer)         [(None, 148)]             0         
_________________________________________________________________
embedding (Embedding)        (None, 148, 16)           56000     
_________________________________________________________________
time_distributed (TimeDistri (None, 148, 16)           64        
_________________________________________________________________
unified_gru (UnifiedGRU)     (None, 148, 64)           15744     
_________________________________________________________________
time_distributed_1 (TimeDist (None, 148, 64)           256       
_________________________________________________________________
time_distributed_2 (TimeDist (None, 148, 15000)        975000    
Total params: 1,047,064
Trainable params: 1,046,904
Non-trainable params: 160
__________________________________
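
The parameter counts in the summary can be reproduced by hand from the hyperparameters. The sketch below is an illustration added for clarity. It assumes the GRU keeps two bias vectors per gate (the `reset_after=True` variant), which is consistent with the 15,744 parameters reported above.

```python
# Values taken from the notebook's hyperparameters and printed shapes
source_vocab_size = 3500    # dataset['x'].max() + 1
target_vocab_size = 15000   # dataset['y'].max() + 1
embeddings_units = 16
gru_units = 64

# Embedding: one vector per source word id
embedding = source_vocab_size * embeddings_units
# Batch normalization: gamma, beta, moving mean and moving variance per feature
bn_embedding = 4 * embeddings_units
# GRU: 3 gates, each with an input kernel, a recurrent kernel and two bias vectors
# (assumption: reset_after=True)
gru = 3 * (gru_units * (embeddings_units + gru_units) + 2 * gru_units)
bn_gru = 4 * gru_units
# Dense: weight matrix plus bias
dense = gru_units * target_vocab_size + target_vocab_size

total = embedding + bn_embedding + gru + bn_gru + dense
# Moving mean and variance of both batch normalization layers are not trained
non_trainable = 2 * embeddings_units + 2 * gru_units

print("embedding:", embedding)
print("gru:", gru)
print("dense:", dense)
print("total:", total)
print("trainable:", total - non_trainable)
print("non-trainable:", non_trainable)
```

The output layer accounts for most of the parameters, because its size grows linearly with the target vocabulary.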

In [0]:
model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split
)

Train on 13275 samples, validate on 1476 samples
Epoch 1/3
 1920/13275 [===>..........................] - ETA: 5:49 - loss: 2.3035 - accuracy: 0.7513

In [0]:
# Print prediction(s)
sentence_id = 2
x_sample = dataset['x'][sentence_id]
y_sample = dataset['y'][sentence_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Target('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
# The model outputs one probability distribution per time step
prediction = model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))

**Model list:**

1. Embedded GRU - CrossEntropy
2. Embedded GRU - Focal Loss

#### Model 1 - Embedded GRU - CrossEntropy

In [0]:
def xEntropy_model(input_shape, output_sequence_length, source_vocab_size, target_vocab_size):
    """
    Build an RNN model with word embeddings, compiled with cross entropy loss
    :param input_shape: Tuple of input shape
    :param output_sequence_length: Length of output sequence
    :param source_vocab_size: Size of the source (English) vocabulary, including the padding id
    :param target_vocab_size: Size of the target (Russian) vocabulary, including the padding id
    :return: Keras model built, but not trained
    """
    input_seq = keras.Input(input_shape[1:])
    if output_sequence_length>input_shape[1]:
        expanded_seq = keras.backend.squeeze(
            keras.layers.ZeroPadding1D((0, output_sequence_length-input_shape[1]))(
                keras.layers.Reshape((input_shape[1], 1))(input_seq)
            ),
            axis = -1
        )
    else:
        expanded_seq = input_seq
        
        
    embedded_seq = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.Embedding(source_vocab_size, embeddings_units, input_length=output_sequence_length)(expanded_seq)
    )
    rnn = keras.layers.TimeDistributed(keras.layers.BatchNormalization(axis=-1))(
        keras.layers.GRU(gru_units, return_sequences=True)(embedded_seq)
    )
    probabilities = keras.layers.TimeDistributed(keras.layers.Dense(target_vocab_size, activation='softmax'))(rnn)
    
    model = keras.Model(input_seq, probabilities)
    
    model.compile(loss=loss,
                  optimizer=keras.optimizers.Adam(learning_rate, clipnorm=3.0),
                  metrics=['accuracy'])
    return model

  
# Build the neural network
keras.backend.clear_session()
model = xEntropy_model(
    dataset['x'].shape,
    dataset['y'].shape[1],
    dataset['x'].max()+1,
    dataset['y'].max()+1)



print("Model summary:")
model.summary()

In [0]:
model.fit(
    dataset['x'], 
    dataset['y'][:,:, None], 
    batch_size=batch_size, 
    epochs=epochs, 
    validation_split=validation_split)

In [0]:
# Print prediction(s)
sentence_id = 2
x_sample = dataset['x'][sentence_id]
y_sample = dataset['y'][sentence_id]
print("Source('en') example:", id_to_text( x_sample, dataset['x_tk'] ))
print("Target('ru') example:", id_to_text( y_sample, dataset['y_tk'] ))
# The model outputs one probability distribution per time step
prediction = model.predict(x_sample[None, :], verbose=1).squeeze()
print("Translation(en_ru) example:", id_to_text( logits_to_id(prediction), dataset['y_tk'] ))