<img vspace="33px" align="right" src="https://www.munich-startup.de/wp-content/uploads/2019/03/TUM_logo-440x236.png" width="120px"/>
<h1>Sign Language Production</h1>
<h3>Applied Deep Learning for NLP</h3>
<p><b>Diego Miguel Lozano</b> | <b>Wenceslao Villegas Marset</b></p>
<p>March 9<sup>th</sup>, 2022</p>

---

# Table of contents

> ### [1. Introduction](#section_1)
>> [**1.1 What is Sign Language Production (SLP)?**](#section_1_1)<br>
>> [**1.2 What is the starting point for our project?**](#section_1_2)

# 0. Set-up and Imports

In [1]:
import sys
import torch
from IPython.display import Video
from pathlib import Path
from pprint import pprint
from torch import nn
from torchtext import data
# Note: `logging` is the transformers logging utility, not the stdlib module
from transformers import AutoTokenizer, AutoModelForMaskedLM, logging

<a id='section_1'></a>
# 1. Introduction

<a id='section_1_1'></a>
### 1.1 What is Sign Language Production (SLP)?

Sign Language Production (SLP) focuses on translating spoken languages into sign languages; it is the counterpart of Sign Language Recognition, which goes in the opposite direction. According to the World Health Organization (WHO), in 2020 there were more than 466 million deaf people in the world [[1]](#ref_1). This area could be of great help for the hearing-impaired community, which makes it necessary to develop techniques for both the recognition and the production of sign languages.

While Sign Language Recognition has seen numerous advances in recent years [[2](#ref_2), [3](#ref_3), [4](#ref_4), [5](#ref_5), [6](#ref_6), [7](#ref_7)], Sign Language Production is still a very challenging task, since it involves translating between linguistic and visual information [[8]](#ref_8).

<a id='section_1_2'></a>
### 1.2 What is the starting point for our project?

As we just mentioned, SLP is complex and far from being solved. Nevertheless, there have recently been promising developments, such as the application of Transformer architectures to SLP, which has come to be called "Progressive Transformers."

In this project, we take the [source code](https://github.com/BenSaunders27/ProgressiveTransformersSLP) for the paper "Progressive Transformers for End-to-End Sign Language Production" [[9]](#ref_9) as the starting point.

We propose to test several approaches to boost the model's performance, such as:

* Using pre-trained BERT embeddings and fine-tuning them during training.
* Leveraging different pre-trained models from the Hugging Face ecosystem to perform data augmentation on the source sentences.
* Measuring the change in performance under different hyperparameter configurations of the Transformer architecture.


<a id='section_1_3'></a>
### 1.3 The data.

Source data stems from the RWTH-PHOENIX-Weather-2014T dataset.

*Dataset Information*: Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public TV station PHOENIX featuring sign language interpretation have been recorded and the weather forecasts of a subset of 386 editions have been transcribed using gloss notation. Furthermore, we used automatic speech recognition with manual cleaning to transcribe the original German speech. As such, this corpus allows training end-to-end sign language translation systems from sign language video input to spoken language.

The signing is recorded by a stationary color camera placed in front of the sign language interpreters. Interpreters wear dark clothes in front of an artificial grey background with color transition. All recorded videos are at 25 frames per second and the size of the frames is 210 by 260 pixels. Each frame shows the interpreter box only.

* **Text data**:

Consists of sequences of text that represent the transcription of each video recording.

Example: 
> das bedeutet viele wolken und immer wieder zum teil kräftige schauer und gewitter 

* **Gloss representation**:

Gloss equivalent for the text transcript. A gloss is a German word or words that are used to name the corresponding Sign Language signs.

Example: 
> ES-BEDEUTET VIEL WOLKE UND KOENNEN REGEN GEWITTER KOENNEN

* **Skeleton**:

Sequence of 3D skeletal poses that the model has to learn to produce (ground truth). The format of this data is as follows:

* Tuple format: (index of a start point, index of an end point, index of a bone)

                (0)
                 |
                 |
                 0
                 |
                 |
        (2)--1--(1)--1--(3)
         |               |
         |               |
         2               2
         |               |
         |               |
        (4)             (5)

      has this structure:

      (
        (0, 1, 0),
        (1, 2, 1),
        (1, 3, 1),
        (2, 4, 2),
        (3, 5, 2),
      )

The skeleton pose for a single frame is then composed of 150 values: 50 joints with x, y and z coordinates each. Each text sample has N corresponding frames, which together represent the sequence of skeleton poses. During training, the network output has size 151, since a counter value is added for the network to predict (explained further down).
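To make the frame layout more concrete, here is a minimal sketch of how one 151-value frame could be split into joints and counter. It assumes that the coordinates are stored as consecutive (x, y, z) triples and that the counter is the last value of the frame; the function name `split_frame` is ours and is not part of the original code base.

```python
from typing import List, Tuple

COORDS_PER_JOINT = 3  # x, y, z
POSE_VALUES = 150     # number of pose values per frame, as stated above


def split_frame(frame: List[float]) -> Tuple[List[Tuple[float, ...]], float]:
    """Split one frame into (x, y, z) joints and the counter value.

    Assumption: the counter is stored as the last value of the frame.
    """
    if len(frame) != POSE_VALUES + 1:
        raise ValueError(f"expected {POSE_VALUES + 1} values, got {len(frame)}")
    pose, counter = frame[:POSE_VALUES], frame[POSE_VALUES]
    # Group the flat pose vector into consecutive (x, y, z) triples
    joints = [
        tuple(pose[i:i + COORDS_PER_JOINT])
        for i in range(0, POSE_VALUES, COORDS_PER_JOINT)
    ]
    return joints, counter


# Dummy frame: all coordinates at 0.0, counter at 0.25
frame = [0.0] * POSE_VALUES + [0.25]
joints, counter = split_frame(frame)
print("Number of joints:", len(joints))
print("Counter:", counter)
```

A full sample would then simply be a list of N such frames.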


---

# A brief intro to the "Progressive Transformers for SLP" project

In this section, we will quickly explain the main aspects of the Progressive Transformers project. If we had to summarize it in only three points, these would be the following: counter decoding, two different approaches –Text-to-Gloss-to-Pose (T2G2P) and Text-to-Pose (T2P)–, and data augmentation.

## Counter decoding

One of the main challenges of SLP is that the output has to maintain a certain continuity. The predicted pose in a video frame has to flow naturally from the previous one, and analogously for the frames that follow. This is achieved in the following manner: the model not only predicts the sign pose, but also a "counter". This counter is simply a real number in the interval [0, 1]. Its value increases monotonically from 0 to 1 over the course of the sequence, thus marking the beginning and the end of the sequence, respectively.

<img width="600px" src="./images/counter-decoding.jpg" alt="Counter Decoding"/>
<br>
<div align="center"><i>Representation of counter decoding.</i></div>
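As a rough illustration of the counter, the following sketch generates one counter value per frame. It assumes that the values are spaced linearly between 0 and 1; the helper names `counter_values` and `is_end_of_sequence` are hypothetical and only serve this example.

```python
from typing import List


def counter_values(num_frames: int) -> List[float]:
    """Return one counter value per frame, rising monotonically from 0 to 1.

    Assumption: linear spacing over the length of the sequence.
    """
    if num_frames < 1:
        raise ValueError("a sequence needs at least one frame")
    if num_frames == 1:
        return [0.0]
    return [i / (num_frames - 1) for i in range(num_frames)]


def is_end_of_sequence(counter: float, threshold: float = 1.0) -> bool:
    """At inference time, decoding can stop once the counter reaches 1."""
    return counter >= threshold


counters = counter_values(5)
print(counters)
print([is_end_of_sequence(c) for c in counters])
```

Unlike a single EOS token, every intermediate value tells the decoder how far along the sequence it currently is.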

<br>

---

**ℹ️ Question: Why not simply use BOS and EOS tokens?**

**💡 Answer:** Beginning of Sentence (BOS) and End of Sentence (EOS) tokens work well with sentences, but when producing video we need, as mentioned before, more than just markers for the beginning and the end. The counter therefore serves both as BOS and EOS and additionally captures information about the progress of the video.

---


### Two different approaches

In the paper, they experimented with two different approaches: T2G2P and T2P:

<img width="600px" src="./images/T2G2P-vs-T2P.png" alt="T2G2P vs T2P Architectures"/>
<br>
<div align="center"><i>Architecture details of (a) Symbolic and (b) Progressive Transformers. (ST: Symbolic Transformer, PT: Progressive Transformer, PE: Positional Encoding, CE: Counter Embedding, MHA: Multi-Head Attention) <a href="#ref_10">[10]</a>.</i></div>

In both cases, the models follow the architecture introduced in "Attention is All You Need" <a href="#ref_11">[10]</a>.

In the first approach –T2G2P– glosses are produced from the input tokens in a first step. Then, these glosses serve as input for another transformer, which translates them into sign poses.

The second –T2P– is an end-to-end approach, in which the text is directly translated into sign poses.

### Data augmentation

Finally, the paper explores some data augmentation techniques to determine whether they improve the base model. These augmentations were only carried out with the T2G2P architecture.

- **Future Prediction**: this type of augmentation forces the model to predict the next 10 frames from the current time step, instead of just the next frame. In this way, the model cannot just copy the previous time step, which effectively improves performance over the base architecture.


- **Just Counter**: in this case only the counter values are provided as target input to the model, omitting the 3D skeleton joint coordinates. Again, this has been shown to improve results.


- **Gaussian Noise**: the last augmentation method consists in adding Gaussian noise to the skeleton pose sequences during training. This makes the model more robust to prediction inputs.
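To give an idea of what the Gaussian noise augmentation could look like, here is a minimal sketch in plain Python. It is not the original implementation: the noise level `std` is a hypothetical value, and leaving the counter (assumed to be the last value of each frame) untouched is our own assumption.

```python
import random
from typing import List, Optional


def add_gaussian_noise(
    frames: List[List[float]], std: float = 0.05, seed: Optional[int] = None
) -> List[List[float]]:
    """Return a noisy copy of a pose sequence.

    Assumptions: the last value of each frame is the counter and is left
    unchanged; `std` is an illustrative noise level, not a tuned value.
    """
    rng = random.Random(seed)
    noisy = []
    for frame in frames:
        *pose, counter = frame
        # Perturb every coordinate independently with zero-mean noise
        noisy.append([v + rng.gauss(0.0, std) for v in pose] + [counter])
    return noisy


# Two tiny dummy frames with 3 pose values + counter each
frames = [[0.1, 0.2, 0.3, 0.0], [0.1, 0.2, 0.3, 1.0]]
noisy_frames = add_gaussian_noise(frames, std=0.05, seed=27)
print(noisy_frames)
```

Since the model is fed slightly perturbed poses during training, it has to learn to cope with imperfect inputs, which is exactly the situation it faces at inference time when it consumes its own predictions.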


The following table collects the results of the previous augmentation approaches:

<img width="700px" src="./images/augmentation-results.png" alt="Data Augmentation Results"/>
<br>
<div align="center"><i>The best BLEU-4 performance comes from a combination of future prediction and Gaussian noise augmentation. The model must learn to cope with both multi-frame prediction and a noisy input, building a firm robustness to drift <a href="#ref_10">[10]</a>.</i></div>

### Implementation description.

The model architecture is based on the "Attention is All You Need" <a href="#ref_11">[10]</a> paper. 

One relevant modification was introduced to adapt it to the SLP task and achieve good performance:
* The final layer of the decoder is a linear layer with 150 + 1 output units, which represent the *coordinates of the output skeleton* plus the *counter decoding value*.



### Overall project structure.

Below is a description of the modules present in the project after our extensions.

```
slp
│   README.md
│   Sign Language Production.ipynb    
└───images
└───ProgressiveTransformersSLP
│   └───Configs
│       │   Base.yaml - Config file to set model, data loading/processing and training parameters.
│       │   src_vocab.txt - GLOSS vocabulary.
│       │   ...
│   └───Data
│       │   train.text - Speech text for each training sample.
│       │   train.skels - Skeleton annotations for each training sample.
│       │   train.gloss - Glosses for each training sample.
│       │   ...
│   └───external_metrics
│       │   mscoco_rouge.py - ROUGE-L metric implementation.
│       │   sacrebleu.py - BLEU metric implementation like in https://github.com/mjpost/sacrebleu
|       └───optim - Some implementations of optimization algorithms such as: lamb, RAdam, Ranger, etc.
|       └───__main__.py - Main entry point to run model training.
|       └───batch.py - Wrapper over torch batch iterator, adding masking and other attributes.
|       └───builders.py - Assorted builder functions.
|       └───constants.py - Project wide constants.
|       └───data.py - Data loading utilities and main torchtext.data.Dataset class
|       └───decoders.py - Transformer decoder implementation.
|       └───dtw.py - Dynamic time warping implementation as in https://github.com/pierre-rouanet/dtw.
|       └───embeddings.py - Embedding class implementation with support for pretrained BERT embeddings.
|       └───encoders.py - Transformer encoder implementation.
|       └───helpers.py - Helper functions for logging, reporting, etc.
|       └───initialization.py - Custom NN parameter initialization functions.
|       └───loss.py - Loss function implementations.
|       └───metrics.py - Performance metric functions.
|       └───model.py - Main class assembling all the model's modules (decoder/encoder) and constructor functions.
|       └───plot_videos.py - Validation video generation for skeleton predictions.
|       └───prediction.py - Code for running validation steps on dev set data (perform dtw then loss, etc).
|       └───search.py - Greedy hyperparameter search function.
|       └───training.py - Training loop implementation.
|       └───transformer_layers.py - Layer implementations from NMT toolkit.
|       └───vocabulary.py - Vocabulary loading and checking code.
└───augmentations
    │   backtranslation.py - Code to perform DE -> EN -> DE translation for augmentation purposes.
```



---

#  Using pre-trained embeddings

The original project trains embeddings from scratch. As we have learned during the seminar, the use of pretrained embeddings can effectively improve the performance of models, especially in situations where data is scarce (which is our case).

Let's first see how the embedding initialization is happening in the original source code.

The first thing that happens when training begins is the data loading. The vocabulary of the model is initialized differently depending on whether we are using T2G2P or T2P.

- **Building vocabulary in T2G2P**: in this case, the vocabulary is taken from a file [`src_vocab`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/master/Configs/src_vocab.txt) that we will analyze a bit more in depth later.


- **Building vocabulary in T2P**: when using the End-to-End approach (Text-to-Pose), the vocabulary is built from the training input data.

<br>

In both cases, the function [`build_vocab()`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/adbd3e9ea9f1b20063d84021a0d6eb9a124ebb87/vocabulary.py#L130-L187) is used. Let's take a look at it.


```python
if vocab_file is not None:
    # load it from file
    vocab = Vocabulary(file=vocab_file)
else:
    ...
```

<br>

First of all, if we have a vocabulary file (like in the case of T2G2P), we initialize the vocabulary from it, as we already mentioned.


```python
def _from_file(self, file: str) -> None:
    """
    Make vocabulary from contents of file.
    File format: token with index i is in line i.

    :param file: path to file where the vocabulary is loaded from
    """
    tokens = []
    with open(file, "r") as open_file:
        for line in open_file:
            tokens.append(line.strip("\n"))
    self._from_list(tokens)
```

This function simply reads the vocabulary file line by line. Since each line contains only one token, there is no further processing to be done.

If there is no input vocabulary file, the tokens are extracted from the training dataset.
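As an illustration of this second case, the following sketch builds a vocabulary from a list of training sentences. It is a simplified stand-in and not the original `build_vocab()` code: the whitespace tokenization, the `min_freq` and `max_size` parameters and the function name `build_vocab_from_sentences` are assumptions made for this example.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

SPECIALS = ["<unk>", "<pad>", "<s>", "</s>"]


def build_vocab_from_sentences(
    sentences: List[str], min_freq: int = 1, max_size: Optional[int] = None
) -> Tuple[List[str], Dict[str, int]]:
    """Build (itos, stoi) from whitespace-tokenized training sentences."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    # Keep frequent tokens, most frequent first (ties broken alphabetically)
    candidates = sorted(
        (tok for tok, freq in counts.items() if freq >= min_freq),
        key=lambda tok: (-counts[tok], tok),
    )
    if max_size is not None:
        candidates = candidates[:max_size]
    # Special tokens always come first, so that <unk> gets index 0
    itos = SPECIALS + [tok for tok in candidates if tok not in SPECIALS]
    stoi = {tok: idx for idx, tok in enumerate(itos)}
    return itos, stoi


sentences = ["viele wolken und regen", "wolken und gewitter"]
itos, stoi = build_vocab_from_sentences(sentences)
print(itos)
print("Index of 'wolken':", stoi["wolken"])
```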

Let's run some code to better visualize this.

## Inside the original SLP model vocab

As we have already mentioned, the original project that we use as starting point provides a plain-text file [`src_vocab`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/master/Configs/src_vocab.txt) containing the vocabulary for which embeddings will then be trained.

Before jumping in and trying to directly use our pretrained embeddings, it is sensible to first analyze a bit how things work in the original project.

In [2]:
code_dir = Path("./ProgressiveTransformersSLP")

if str(code_dir.resolve()) not in sys.path:
    sys.path.insert(0, str(code_dir.resolve()))  # just so that imports can be resolved

from ProgressiveTransformersSLP.vocabulary import Vocabulary, build_vocab

In [3]:
# Path to the vocabulary file
vocab_file = code_dir/Path("Configs/src_vocab.txt")

# Build vocabulary
vocabulary = Vocabulary(file=vocab_file)

# Get all the tokens in the built vocabulary
tokens = [token for token in vocabulary.itos]

# We will select some tokens that are worth analyzing
selected_tokens = (tokens[1:5] + [tokens[172]] + tokens[531:541] +
                   tokens[868:870] + tokens[718:725] + tokens[1085:1088])
pprint(selected_tokens)

['<unk>',
 '<pad>',
 '<s>',
 '</s>',
 'HEISS',
 'AUSWAEHLEN',
 'BALD',
 'BEKOMMEN',
 'BITTE',
 'BODENSEE',
 'BRITANNIEN',
 'CHAOS',
 'DAMEN',
 'DAUERND',
 'DUENN',
 'J+L+I',
 'K+R+E+T+A',
 'neg-EINFLUSS',
 'neg-FUEHLEN',
 'neg-GEWITTER',
 'neg-HOEHE',
 'neg-IMMER',
 'neg-KOMMEN',
 'neg-MEHR',
 'negalp-BRAUCHEN',
 'negalp-GIBT',
 'negalp-MUSS']


From the previous vocabulary, there are three aspects that are worth mentioning:

- Words such as `AUSWAEHLEN`, `DUENN` and `HEISS` give us a hint that **normalization** is used. A popular algorithm for German normalization is the [German2 snowball algorithm](https://snowballstem.org/algorithms/german2/stemmer.html) which defines the following mappings:
  - 'ß' is replaced by 'ss'.
  - 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
  - 'ae' and 'oe' are replaced by 'a', and 'o', respectively.
  - 'ue' is replaced by 'u', when not following a vowel or q.


- As we saw during the seminar lectures, the **special tokens** `<unk>`, `<pad>`, `<s>`, `</s>` are also included in the dictionary. These tokens mark unknown words, padding, beginning of sequence (BOS), and end of sequence (EOS), respectively.


- Some of the words in the vocabulary include the prefixes `neg-` and `negalp-`. We could guess that `neg-` simply means that the word is negated, e.g., `neg-GENUG`≡ `NICHT GENUG`, but what about the `negalp-` prefix? And also, what do words such as `J+L+I` and `K+R+E+T+A` mean? A look at the paper of the RWTH-PHOENIX-Weather dataset [[10]](#ref_10) (the first version of the dataset used to train the model) gives us the answer:


<img width="400px" src="./images/RWTH-PHOENIX-Weather-Annotation-Scheme.png" alt="RWTH-PHOENIX-Weather Annotation Scheme"/>
<br>
<div align="center"><i>Source <a href="#ref_10">[10]</a>.</i></div>

So in reality `neg-` means "signs negated by headshake" and `negalp-` "signs negated by the alpha[betical] rule" <sup>[1](#note_1)</sup>. Tokens such as `K+R+E+T+A` are words that are fingerspelled letter by letter.

Interestingly enough, none of the other types of tokens appear in the source dictionary.
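A quick way to check which annotation types occur in a vocabulary is to classify each token by its form. The sketch below does this for a few of the tokens printed above; the category names and the function `classify_gloss` are ours, and treating every token that contains a `+` as fingerspelled is a simplification.

```python
from collections import Counter


def classify_gloss(token: str) -> str:
    """Assign a gloss token to a coarse annotation category."""
    if token.startswith("<") and token.endswith(">"):
        return "special token"
    if token.startswith("negalp-"):
        return "negated (alpha rule)"
    if token.startswith("neg-"):
        return "negated (headshake)"
    if "+" in token:
        # Simplification: every token with '+' is counted as fingerspelled
        return "fingerspelled"
    return "plain gloss"


sample_tokens = ["<unk>", "HEISS", "J+L+I", "K+R+E+T+A",
                 "neg-GEWITTER", "negalp-MUSS"]
for token in sample_tokens:
    print(f"{token:15} -> {classify_gloss(token)}")

category_counts = Counter(classify_gloss(t) for t in sample_tokens)
print(category_counts)
```

Running the same function over `vocabulary.itos` would give the distribution for the whole gloss vocabulary.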

---

<a id='note_1'><sup>1</sup></a> In Sign Language, there are several ways of negating words. One of these ways is using a side-to-side headshake or a frown expression. Also, some verbs have their own negated forms, which is what `negalp-` indicates here [[11]](#ref_11).

---

**ℹ️ Question: But... where is this vocabulary coming from?**

**💡 Answer:** As it turns out, this vocabulary is simply made up of **glosses**. As we mentioned before, the original project proposes two ways of carrying out the translation from text to sign language, and this file is only needed when glosses are the input of the model. That also explains why we don't use this file when going for the T2P approach.

---

## BERT it up!

If one thing is clear, it is that BERT's dictionary will not contain glosses, let alone glosses specifically tailored to SLP.

But then, what does BERT's vocabulary look like? Let's take a look!

Fortunately, Hugging Face's great API has got us covered: tokenizers expose their vocabulary through the method `get_vocab()`. Let's try with the model [`dbmdz/bert-base-german-uncased`](https://huggingface.co/dbmdz/bert-base-german-uncased).

In [15]:
# Dismiss expected "Some weights of the model checkpoint at..." warning
# when loading a pretrained model.
logging.set_verbosity_error()

# Load tokenizer and model from pretrained model
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
model = AutoModelForMaskedLM.from_pretrained("dbmdz/bert-base-german-uncased")

In [16]:
# Print some words of its dictionary
pprint(list(tokenizer.get_vocab().items())[:40])

[('elektrotechnik', 27915),
 ('trägt', 4924),
 ('lauf', 3699),
 ('weihn', 3862),
 ('##:29', 7501),
 ('##ausgleich', 17336),
 ('schwand', 24253),
 ('##öhnen', 17915),
 ('straub', 27572),
 ('##dienste', 5320),
 ('lic', 23955),
 ('kurven', 22262),
 ('##02.', 6984),
 ('##musik', 4110),
 ('theo', 27146),
 ('letztes', 11053),
 ('russischer', 19267),
 ('schmie', 29874),
 ('##politische', 14619),
 ('5000', 12509),
 ('wohnzimmer', 10453),
 ('##ō', 31039),
 ('solution', 24133),
 ('##arbeitete', 10860),
 ('herauszu', 14287),
 ('##mente', 15954),
 ('##elementen', 24297),
 ('##zur', 5215),
 ('emir', 18817),
 ('##fast', 4720),
 ('##ktisch', 15657),
 ('##ün', 311),
 ('liebte', 28487),
 ('bergl', 18508),
 ('1828', 27261),
 ('##vot', 26826),
 ('unused94', 95),
 ('38.', 26319),
 ('zertifizierung', 27349),
 ('ital', 21301)]


Above, we can see each token with its corresponding ID (just as we saw in the seminar lectures). However, there are two things that catch our attention...

---

**ℹ️ Question: Why do some tokens start with "##"?**

**💡 Answer:** Well, this is just a way of indicating that this token is "non-initial", i.e., originally it belonged to a longer word. This way of tokenizing words comes from the [WordPiece](https://paperswithcode.com/method/wordpiece) algorithm, which is used by the BERT tokenizer.
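To see where the `##` pieces come from, here is a toy version of the greedy longest-match-first splitting that WordPiece tokenizers are commonly described as using at tokenization time. It is a sketch for illustration, not the actual Hugging Face implementation, and the tiny vocabulary `toy_vocab` is made up.

```python
from typing import List, Set


def wordpiece_tokenize(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Split a single word into the longest matching vocabulary pieces."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk]  # the word cannot be built from known pieces
        pieces.append(match)
        start = end
    return pieces


toy_vocab = {"lauf", "##musik", "theo", "wohnzimmer"}
print(wordpiece_tokenize("laufmusik", toy_vocab))
print(wordpiece_tokenize("wohnzimmer", toy_vocab))
print(wordpiece_tokenize("gewitter", toy_vocab))
```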

---

**ℹ️ Question: What about the `'unused...'` tokens?**

**💡 Answer:** These are, unsurprisingly, tokens that are not used. However, they can come in handy to add more words to the vocabulary:

> Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. ([source](https://github.com/google-research/bert/issues/9#issuecomment-434796704))
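The idea from the quote can be sketched on a toy vocabulary as follows. The list `toy_vocab`, the new tokens and the helper `assign_unused_slots` are hypothetical; depending on the model, the placeholder tokens may be spelled slightly differently (e.g. `unused94` in the output above), so the prefix is a parameter.

```python
from typing import List


def assign_unused_slots(
    vocab: List[str], new_tokens: List[str], prefix: str = "[unused"
) -> List[str]:
    """Return a copy of an id-ordered vocabulary in which the first unused
    placeholder tokens are replaced by `new_tokens`. Token ids (positions)
    of all other tokens stay the same."""
    vocab = list(vocab)
    unused_positions = [i for i, tok in enumerate(vocab) if tok.startswith(prefix)]
    if len(new_tokens) > len(unused_positions):
        raise ValueError("not enough unused slots for the new tokens")
    for position, token in zip(unused_positions, new_tokens):
        vocab[position] = token
    return vocab


toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "wolken"]
extended_vocab = assign_unused_slots(toy_vocab, ["neg-GEWITTER", "K+R+E+T+A"])
print(extended_vocab)
```

Because only placeholder positions are overwritten, the pretrained embeddings of all regular tokens remain aligned with their ids.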

---

As a side note, we would like to mention that someone took the time to explore BERT's vocabulary and wrote a great article about it. The article can be found at https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html.

---

We might also want to take a look at the embedding of a word:

In [20]:
# Get token ID (lowercase, since the model is uncased)
token_id = tokenizer.get_vocab()["Frieden".lower()]
embeddings = model.get_input_embeddings()(torch.tensor(token_id))

# Print only the 32 first values
print(embeddings[:32])
print("\nDimensions:", len(embeddings))

tensor([-0.0589,  0.0175,  0.0557, -0.0638,  0.0209, -0.0444, -0.0280,  0.0746,
        -0.0919,  0.0341, -0.0629,  0.0872,  0.0065,  0.0207,  0.0331, -0.0571,
        -0.0225, -0.0220,  0.0595,  0.0685, -0.1042,  0.0522, -0.0253,  0.0563,
        -0.0017, -0.0496,  0.0112, -0.0188, -0.0332, -0.0097, -0.0394, -0.0409],
       grad_fn=<SliceBackward0>)

Dimensions: 768


### Alright, so how do we plug in the pretrained embeddings?

In order to use the pretrained embeddings, we had to carry out several modifications.

The **original codebase** is **complex** and has many intertwined components. One of our main goals was to add functionality in such a way that everything that was there before, in particular the baseline settings, could still be used.

At first we tried to clean up and refactor the code, but we soon realized that, due to the size of the original project, it would take too long and deliver no value to our project.

## Pretrained embeddings: list of changes to be made

*Note: all the files listed below are located in the `ProgressiveTransformersSLP` directory.*

### Adding a configuration parameter in `Configs/Base.yaml`

`Configs/Base.yaml` is the file that defines the different configuration parameters for the **dataset, model and training**. Here we can find, for example, training hyperparameters such as the learning rate or the number of epochs, but also the embedding dimensions of our model, etc.

It looks as follows:

In [4]:
with open(Path("./ProgressiveTransformersSLP/Configs/Base.yaml")) as f:
    print(f.read())

data:
    src: "text" # Source - Either Gloss->Pose or Text->Pose (gloss,text)
    trg: "skels" # Target - 3D body co-ordinates (skels)
    files: "files" # Filenames for each sequence

    train: "./slp/ProgressiveTransformersSLP/Data/tmp/train"
    dev: "./slp/ProgressiveTransformersSLP/Data/tmp/dev"
    test: "./slp/ProgressiveTransformersSLP/Data/tmp/test"

    max_sent_length: 300 # Max Sentence Length
    skip_frames: 1 # Skip frames in the data, to reduce the data input size
    # src_vocab: "./Configs/src_vocab.txt" # Gloss vocab. Only use when src: "gloss".

training:
    random_seed: 27 # Random seed for initialisation
    optimizer: "adam" # Chosen optimiser (adam, ..)
    learning_rate: 0.001 # Initial model learning rate
    learning_rate_min: 0.0002 # Learning rate minimum, when training will stop
    weight_decay: 0.0 # Weight Decay
    clip_grad_norm: 5.0 # Gradient clipping value
    batch_size: 8 # Batch Size for training
    scheduling: "plateau" # Scheduling at trai

Under `model` → `encoder` → `embeddings`, we have added a **new field** `model` that can take the values `"none"` (baseline model, training embeddings from scratch) or `"bert"` (using BERT's WordPiece embeddings).

### Specifying constants in `constants.py`

In order to correctly initialize constant values (e.g., special tokens such as 'PAD' or 'UNK') and still preserve the functionality of the original project, we had to write an admittedly "hacky" script that nevertheless works:

In [None]:
from transformers import AutoTokenizer

# Module-level "constants"; they stay None until initialize_constants() is called
pretrained_model_str = None  # "bert" or "none"
tokenizer = None  # Hugging Face tokenizer (only set when using BERT embeddings)
vocab = None  # BERT vocabulary as (token, id) pairs, sorted by id
UNK_TOKEN = None
PAD_TOKEN = None
BOS_TOKEN = None
EOS_TOKEN = None
TARGET_PAD = None
DEFAULT_UNK_ID = None

# (UNK, PAD, BOS, EOS) for each supported embedding setting
special_tokens = {
    "none": ('<unk>', '<pad>', '<s>', '</s>'),
    "bert": ('[UNK]', '[PAD]', '[CLS]', '[SEP]')
}


def initialize_constants(cfg: dict):
    global pretrained_model_str, tokenizer, vocab, UNK_TOKEN, PAD_TOKEN, \
        BOS_TOKEN, EOS_TOKEN, TARGET_PAD, DEFAULT_UNK_ID
    pretrained_model_str = cfg["model"]["encoder"]["embeddings"]["model"]
    if pretrained_model_str not in ("none", "bert"):
        raise ValueError(f"embeddings from model {pretrained_model_str} not supported")
    UNK_TOKEN, PAD_TOKEN, BOS_TOKEN, EOS_TOKEN = special_tokens[pretrained_model_str]
    if pretrained_model_str == "bert":
        tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
        unk_token_id = tokenizer.get_vocab()[UNK_TOKEN]
        # Get vocabulary as (token, id) pairs, sorted by the token ids
        vocab = sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])
    else:
        tokenizer = None
        unk_token_id = 0  # '<unk>' is the first token of the baseline vocabulary
    TARGET_PAD = 0.0
    DEFAULT_UNK_ID = lambda: unk_token_id


First, we declare the constant variables in order to be able to import them from other files with e.g.:

In [6]:
import ProgressiveTransformersSLP.constants as constants

print("UNK token:", constants.UNK_TOKEN)
print("PAD token:", constants.PAD_TOKEN)
print("BOS token:", constants.BOS_TOKEN)
print("EOS token:", constants.EOS_TOKEN)

UNK token: None
PAD token: None
BOS token: None
EOS token: None


Then, we need to call the function `initialize_constants` to initialize these constants properly, depending on whether we are using BERT embeddings or not:

In [93]:
model = "none"
constants.initialize_constants(cfg={"model": {"encoder": {"embeddings": {"model": model}}}})

print("Special tokens when using no pretrained embeddings:")
print("UNK token:", constants.UNK_TOKEN)
print("PAD token:", constants.PAD_TOKEN)
print("BOS token:", constants.BOS_TOKEN)
print("EOS token:", constants.EOS_TOKEN)


model = "bert"
constants.initialize_constants(cfg={"model": {"encoder": {"embeddings": {"model": model}}}})

print("\nSpecial tokens when using BERT embeddings:")
print("UNK token:", constants.UNK_TOKEN)
print("PAD token:", constants.PAD_TOKEN)
print("BOS token:", constants.BOS_TOKEN)
print("EOS token:", constants.EOS_TOKEN)

Special tokens when using no pretrained embeddings:
UNK token: <unk>
PAD token: <pad>
BOS token: <s>
EOS token: </s>

Special tokens when using BERT embeddings:
UNK token: [UNK]
PAD token: [PAD]
BOS token: [CLS]
EOS token: [SEP]


Executing the previous cell might trigger the download of a Hugging Face tokenizer (if it is not already cached). This happens because `initialize_constants` also instantiates the `tokenizer` when we are using BERT embeddings:

```python
    if pretrained_model_str == "bert":
        tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-uncased")
```

This **avoids instantiating the tokenizer multiple times**, since we will need it in different places of our code.

> **📝 Note:** Another aspect that we would like to comment on is that we had to choose the `[CLS]` token as the BOS (Beginning of Sequence) token and the `[SEP]` token as EOS. The meaning of these tokens goes a bit beyond what BOS and EOS signify, but they are also not too far off.

---

**ℹ️ Question: What do the `[CLS]` and `[SEP]` tokens mean in BERT?**

**💡 Answer:** Both these tokens have their roots in the way BERT was trained. If we remember from the seminar, BERT was trained on two tasks: Next Sentence Prediction and Masked Language Modeling. For the second of these tasks, the token `[MASK]` was used, but we don't really need it in our case. The `[CLS]` token always comes at the beginning of a sequence and is meant to hold the meaning of the whole sequence. The `[SEP]` token acts as a separator when performing Next Sentence Prediction, in order to distinguish the first and second sentences <a href="#ref_12">[12]</a>.
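The resulting input layout can be illustrated with a few lines of plain Python. The helper `build_bert_input` is hypothetical and only mimics the layout of the tokens and segment ids; in practice the Hugging Face tokenizer takes care of this.

```python
from typing import List, Optional, Tuple


def build_bert_input(
    tokens_a: List[str], tokens_b: Optional[List[str]] = None
) -> Tuple[List[str], List[int]]:
    """Arrange one or two tokenized sentences in BERT's input layout."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # first sentence, including [CLS] and [SEP]
    if tokens_b is not None:
        tokens += tokens_b + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # second sentence and its [SEP]
    return tokens, segment_ids


single_tokens, single_segments = build_bert_input(["viele", "wolken"])
pair_tokens, pair_segments = build_bert_input(["viele", "wolken"], ["und", "regen"])
print(single_tokens)
print(pair_tokens)
print(pair_segments)
```

In our setting there is always a single sentence, so `[CLS]` and `[SEP]` end up at the very beginning and end of the input, which is what makes them usable as BOS and EOS.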

---

### Loading BERT's vocabulary and pretrained embeddings in `vocabulary.py`

Now that we have sorted out the correct definition of special tokens, it's time to load BERT's vocabulary and pretrained embeddings.

We have already shown how to do that before, but let's recap.

First, we need instances of both `tokenizer` and `model`. We have already instantiated the tokenizer when initializing the constants, so we only need to take care of the model. Something that we also do when this initialization happens is **load the vocabulary**. Again, we do it at that point since we will use this vocabulary in several places of the code.

In [94]:
tokenizer = constants.tokenizer

# Load the vocabulary, sorting by ascending token id
vocab = [token for token in sorted(tokenizer.get_vocab().items(), key=lambda x: x[1])]

Something very important is to **keep the order of the tokens** based on their id so that the positions of the tokens and their embeddings match. That is why we use `sorted` with `key=lambda x: x[1]`. Now, to get the embeddings, we run:

In [95]:
# Initialize the model
embed_model = AutoModelForMaskedLM.from_pretrained(
    "dbmdz/bert-base-german-uncased"
)
# Get the embeddings
embeddings = torch.stack([
    embed_model.get_input_embeddings()(torch.tensor(token[1]))
    for token in constants.vocab
])

In [68]:
# Print the 30 first dimensions of the first embedding
embeddings[0][:30]

tensor([ 0.0173, -0.0328, -0.0282, -0.0850, -0.0297, -0.0327, -0.0585,  0.0445,
         0.0197, -0.0125, -0.0108,  0.0288,  0.0280, -0.0394, -0.0669,  0.0133,
        -0.0620, -0.0023, -0.0277,  0.0127, -0.0095, -0.0883, -0.0386, -0.0213,
        -0.0116, -0.0738, -0.0120,  0.0074, -0.0370, -0.0299],
       grad_fn=<SliceBackward0>)

We use `torch.stack` to combine the list of per-token embedding vectors into a single 2D tensor with one row per token.
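As a side note, looking up the tokens one by one should give the same result as reading the weight matrix of the embedding layer directly, as long as the ids cover every row in order. The sketch below checks this on a small, randomly initialized layer (`toy_layer` is a stand-in for the BERT embedding layer, so no download is needed).

```python
import torch
from torch import nn

torch.manual_seed(0)

# Small stand-in for the pretrained embedding layer: 6 tokens, 4 dimensions
toy_layer = nn.Embedding(6, 4)

# Variant 1: look up every token id separately and stack the results
stacked = torch.stack([
    toy_layer(torch.tensor(token_id))
    for token_id in range(toy_layer.num_embeddings)
])

# Variant 2: copy the weight matrix directly (detached from the graph)
direct = toy_layer.weight.detach().clone()

print(stacked.shape)
print("Same values:", torch.equal(stacked.detach(), direct))
```

For a vocabulary of tens of thousands of tokens, the second variant avoids a long Python loop and also yields a tensor without gradient history.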

In [72]:
embeddings.size()

torch.Size([31102, 768])

We can see that our vocabulary contains 31102 tokens, each of them with a corresponding embedding vector of 768 dimensions.
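Since row `i` of the embedding tensor must belong to the token with id `i`, it can be worth verifying that the sorted vocabulary has no gaps. A minimal sketch of such a check on a toy vocabulary (the dictionary `toy_vocab` and both helper functions are made up for this example):

```python
from typing import Dict, List, Tuple


def sort_vocab_by_id(vocab: Dict[str, int]) -> List[Tuple[str, int]]:
    """Return (token, id) pairs in ascending id order."""
    return sorted(vocab.items(), key=lambda item: item[1])


def ids_match_positions(sorted_vocab: List[Tuple[str, int]]) -> bool:
    """True if the token at position i has id i (no gaps, no duplicates)."""
    return all(
        token_id == position
        for position, (_, token_id) in enumerate(sorted_vocab)
    )


# Dictionaries do not have to be ordered by id, as we saw with get_vocab()
toy_vocab = {"wolken": 3, "[PAD]": 0, "regen": 2, "[UNK]": 1}
sorted_vocab = sort_vocab_by_id(toy_vocab)
print(sorted_vocab)
print("Positions match ids:", ids_match_positions(sorted_vocab))
```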

Next, the tokens are added to a `Vocabulary` class through its method `add_tokens`:

```python
class Vocabulary:
    ...
    
    def add_tokens(self, tokens: List[str]) -> None:
        """
        Add list of tokens to vocabulary

        :param tokens: list of tokens to add to the vocabulary
        """
        for t in tokens:
            new_index = len(self.itos)
            # add to vocab if not already there
            if t not in self.itos:
                self.itos.append(t)
                self.stoi[t] = new_index
    ...
```

The attribute `self.itos` ("index to string") is a list that stores the tokens, so that, e.g., `self.itos[13]` gives us the token with id 13. Conversely, `self.stoi` ("string to index") serves as a look-up dictionary that maps each token to its id.
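The interplay of both attributes can be seen in the following stripped-down sketch. `TinyVocabulary` is a hypothetical class written for this notebook, not the project's `Vocabulary`; it checks membership against the dictionary instead of the list, which avoids scanning the whole list for every new token.

```python
from typing import Dict, List


class TinyVocabulary:
    """Minimal vocabulary with index-to-string and string-to-index look-ups."""

    def __init__(self) -> None:
        self.itos: List[str] = []      # position -> token
        self.stoi: Dict[str, int] = {}  # token -> position

    def add_tokens(self, tokens: List[str]) -> None:
        for token in tokens:
            # Dictionary look-up is constant time, unlike `token in self.itos`
            if token not in self.stoi:
                self.stoi[token] = len(self.itos)
                self.itos.append(token)


tiny_vocab = TinyVocabulary()
tiny_vocab.add_tokens(["[PAD]", "[UNK]", "wolken", "regen", "wolken"])
print(tiny_vocab.itos)
print("Id of 'regen':", tiny_vocab.stoi["regen"])
print("Token with id 2:", tiny_vocab.itos[2])
```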


As a side remark, the class `Vocabulary` was already implemented in the original project. We just adjusted it to be able to pass it our tokens.

### Initializing the embedding layer in `embeddings.py`

The original project implements a torch `nn.Module` called `Embeddings` that stores the embedding weights. Internally, this module contains a single [`nn.Embedding`](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer.

Originally, this module was initialized like this:

In [97]:
vocab_size = len(vocab)
embedding_dim = embeddings.size()[1]
padding_idx = tokenizer.get_vocab()["[PAD]"]

embed_layer = nn.Embedding(vocab_size, embedding_dim, padding_idx=padding_idx)
print(embed_layer)

Embedding(31102, 768, padding_idx=0)


We can take a look at the created weights:

In [88]:
print(embed_layer.weight)
print()
print("Size:", embed_layer.weight.size())

Parameter containing:
tensor([[ 0.0000,  0.0000,  0.0000,  ...,  0.0000,  0.0000,  0.0000],
        [ 0.2951, -0.5569,  1.6718,  ...,  0.0194,  1.1920, -1.4272],
        [ 0.7343,  0.2580, -0.9960,  ...,  1.2261,  0.6621,  0.3058],
        ...,
        [-0.2210, -1.4156,  0.6138,  ...,  0.3274, -0.2015, -1.1469],
        [-0.7902, -1.5453, -0.9285,  ..., -0.9304, -1.2536, -0.0898],
        [-2.6411,  1.9783,  0.0937,  ..., -0.8000,  0.6434,  0.3154]],
       requires_grad=True)

Size: torch.Size([31102, 768])


As we can see, we have 31102 embeddings of dimension 768. These embeddings, however, are initialized randomly (sampled from $\mathcal{N}(0, 1)$), except for the padding row at index 0, which is all zeros.

This layer can also be initialized with pretrained embeddings, which is just what we want:

In [100]:
pretrained_embed = embeddings

pretrained_embed_layer = nn.Embedding.from_pretrained(
    pretrained_embed, padding_idx=padding_idx)

In [102]:
print(pretrained_embed_layer.weight)
print()
print("Size:", pretrained_embed_layer.weight.size())

Parameter containing:
tensor([[ 1.7323e-02, -3.2809e-02, -2.8226e-02,  ...,  2.1750e-02,
         -8.7762e-05,  2.9073e-02],
        [-2.7637e-03, -6.4053e-02, -1.3206e-02,  ...,  1.7858e-02,
          6.6482e-03, -3.0267e-02],
        [ 1.3819e-02, -9.6506e-02,  8.0860e-04,  ...,  3.8678e-02,
          6.3529e-02, -5.4168e-02],
        ...,
        [ 1.6831e-02, -5.7545e-02,  1.3673e-02,  ...,  3.3795e-02,
          5.2264e-04, -3.5393e-02],
        [-3.9432e-03, -3.7646e-02, -3.6883e-02,  ..., -2.8921e-02,
          6.6119e-03, -3.0646e-02],
        [ 4.1762e-02, -2.6133e-02, -2.7677e-02,  ...,  2.4914e-02,
          3.1606e-02, -2.7268e-02]])

Size: torch.Size([31102, 768])


Just to check, let's see if the first embedding in the newly created layer coincides with the pretrained embeddings that we loaded previously:

In [107]:
eq = torch.equal(
    pretrained_embed_layer.weight[0],  # embedding layer
    embeddings[0]  # previously loaded embeddings
)

print("Great, they are equal!" if eq else "Nope, they're not equal :(")

Great, they are equal!


### When everything seemed to go great... Houston, we have a problem.

At this point, we thought that our model was ready to be trained. So we did so and got **disastrous results**. "Why?" – we asked ourselves while we scratched our heads.

Then we realized...

Let's take a look at the input training data:

In [108]:
with open(Path("ProgressiveTransformersSLP/Data/train.text")) as f:
    pprint(f.readlines()[:10])

['und nun die wettervorhersage für morgen donnerstag den zwölften august\n',
 'mancherorts regnet es auch länger und ergiebig auch lokale überschwemmungen '
 'sind wieder möglich\n',
 'im nordwesten bleibt es heute nacht meist trocken sonst muss mit teilweise '
 'kräftigen schauern gerechnet werden örtlich mit blitz und donner\n',
 'auch am tag gibt es verbreitet zum teil kräftige schauer oder gewitter und '
 'in manchen regionen fallen ergiebige regenmengen\n',
 'größere wolkenlücken finden sich vor allem im nordwesten\n',
 'im emsland heute nacht nur neun am oberrhein bis siebzehn grad\n',
 'morgen ähnliche temperaturen wie heute neunzehn bis fünfundzwanzig in der '
 'lausitz bis siebenundzwanzig grad\n',
 'am freitag kann es in der osthälfte teilweise länger und kräftig regnen '
 'vorsicht hochwassergefahr\n',
 'sonst wechselhaft mit schauern und gewittern die uns auch am wochenende '
 'begleiten\n',
 'am temperaturniveau ändert sich wenig\n']


Achso! The data is all **lowercased**. At that moment, we were using the model [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased). Yes, a cased model. Therefore, the model was missing basically every other word (i.e., the nouns, which German capitalizes), since their lowercased forms weren't in its vocabulary.

We can make a quick test:

In [109]:
# Load the model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-german-cased")

In [110]:
# Let's use the word from the previous example
word = "frieden"

if word in tokenizer.get_vocab():
    print("Found!")
else:
    print("Not found!")

Not found!


However, if we use its cased form:

In [111]:
word = "Frieden"

if word in tokenizer.get_vocab():
    print("Found!")
else:
    print("Not found!")

Found!


### So we changed to an uncased model... But...

We trained again. And again things were not going well... So it was time to debug.

The **data loading and handling** is somewhat **complex** (excessively, perhaps?). It took us quite some time to figure out where things were going wrong.

In summary, the data is managed as follows:

1. First, the data is read from the input files (stored at `ProgressiveTransformersSLP/Data`). This is done via the function [`load_data`](https://github.com/dmlls/slp/blob/main/ProgressiveTransformersSLP/data.py#L29) in the `data.py` script. We modified this function so it also returns the pretrained embeddings, along with the loaded train data, dev data, test data, and the source and target vocabularies:

```python
 return train_data, dev_data, test_data, src_vocab, pretrained_embed, trg_vocab
```

2. The source and target vocabularies, as well as the embeddings, are passed to initialize the model. This is also when the embedding layer is initialized, just as we saw before.


3. `SignProdDataset`s are initialized with the training, dev and test data. This class inherits from `torchtext.data.Dataset` and provides some utilities to handle the data. The data itself is read from the source files line by line, and each line is then tokenized with a simple `str.split()`.


4. Batches are drawn from the `SignProdDataset`s. The original project implements a `Batch` class in `batch.py` for this. Again, this class includes some useful utilities.


5. The batches are passed to the model and the loss is calculated.


6. The loss is backpropagated and a new training step starts.

It was while looking at step 3 that **we realized**: the **words were simply split (tokenized) by whitespace**, but as we have seen before, the **tokenization in WordPiece is a little bit more complex**.
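
The effect of this mismatch can be illustrated with a tiny sketch. The vocabulary below is a made-up toy (the real one has 31102 entries), and we assume that tokens missing from the vocabulary fall back to the unknown token:

```python
# Toy WordPiece-style vocabulary (hypothetical, for illustration only).
toy_vocab = {"[UNK]", "die", "speise", "##karte", "ist", "gut"}

sentence = "die speisekarte ist gut"

# Whitespace tokenization, as done in step 3 above.
whitespace_tokens = sentence.split()

# Assumption: any token missing from the vocabulary becomes the unknown token.
mapped = [tok if tok in toy_vocab else "[UNK]" for tok in whitespace_tokens]

print(mapped)  # ['die', '[UNK]', 'ist', 'gut']
```

Even though the vocabulary contains the pieces `speise` and `##karte`, the whitespace-split word `speisekarte` never matches them, so its pretrained embedding information is lost.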

### Properly tokenizing the input data

So how do we properly tokenize the input data then? Luckily, the great Transformers 🤗 library is there to make our lives easier.

To tokenize a string with our tokenizer, we simply need to run:

In [135]:
string = "Es gibt keinen Weg zum Frieden, denn Frieden ist der Weg. – Mahatma Gandhi"

print(constants.tokenizer(string))

{'input_ids': [102, 233, 773, 2355, 1261, 348, 5654, 806, 990, 5654, 207, 127, 1261, 552, 798, 12103, 148, 608, 22906, 8291, 30939, 103], 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}


The only "problem" is that we get the input ids rather than the tokens themselves, but we can solve this easily with a dictionary lookup:

In [136]:
ids = constants.tokenizer(string)['input_ids']
pprint([constants.vocab[id_][0] for id_ in ids])

['[CLS]',
 'es',
 'gibt',
 'keinen',
 'weg',
 'zum',
 'frieden',
 ',',
 'denn',
 'frieden',
 'ist',
 'der',
 'weg',
 '.',
 '–',
 'mah',
 '##at',
 '##ma',
 'gan',
 '##dh',
 '##i',
 '[SEP]']


Cool! We can see that the words that are not in the vocabulary are split into sub-words (e.g., Mahatma Gandhi). This kind of back-off can help make sense of unknown words. "Mahatma Gandhi" is not a very good example, but if we try for example:

In [137]:
string = "Speisekarte"

ids = constants.tokenizer(string)['input_ids']
pprint([constants.vocab[id_][0] for id_ in ids])

['[CLS]', 'speise', '##karte', '[SEP]']


We see that even though "Speisekarte" is not in the vocabulary, "speise" and "karte" are, which can sometimes help infer the meaning of the complete word (this often works well in German, with its many compound words).
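
For intuition, the splitting above can be reproduced with a greedy longest-match-first strategy, which is the approach commonly used for WordPiece at inference time. The sketch below is a simplified illustration with a made-up vocabulary. It is not the Transformers implementation, which additionally handles text normalization, punctuation and special tokens:

```python
from typing import List, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of a single word into sub-words."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical).
toy_vocab = {"speise", "##karte", "karte", "##n", "wetter"}

print(wordpiece("speisekarte", toy_vocab))   # ['speise', '##karte']
print(wordpiece("speisekarten", toy_vocab))  # ['speise', '##karte', '##n']
print(wordpiece("xyz", toy_vocab))           # ['[UNK]']
```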

### Using the tokenized input data

One last thing we need to do, when using the pretrained embeddings, is to tokenize with the BERT tokenizer shown above instead of `str.split`.

For that, we added the following function to `data.py`:

In [133]:
def bert_tokenization(string: str):
    """Tokenize a string using the BERT tokenizer."""
    # Tokenize and remove [CLS] and [SEP] (first and last tokens),
    # since they will be added later.
    ids = constants.tokenizer(string)['input_ids'][1:-1]
    return [constants.vocab[id_][0] for id_ in ids]

Something that we also take care of is removing the [CLS] and [SEP] tokens, since they will be added afterwards when preparing the batches' examples.


Finally, the way we integrate our function in the existing code is as follows:

In [148]:
if constants.pretrained_model_str == "bert":  # using pretrained embeddings
    tok_fun = bert_tokenization
else:
    tok_fun = lambda s: list(s) if level == "char" else s.split()

# Source field is a tokenized version of the source words
src_field = data.Field(
    init_token=None,
    eos_token=constants.EOS_TOKEN,
    pad_token=constants.PAD_TOKEN,
    tokenize=tok_fun,
    batch_first=True,
    lower=False,  # data already lowercased
    unk_token=constants.UNK_TOKEN,
    include_lengths=True,
)

The [`torchtext.data.Field`](https://torchtext.readthedocs.io/en/latest/data.html#field) class will take care of the tokenization; we simply need to pass it the tokenization function.

### Now we ARE ready! 🥳


---

# Data Augmentation by backtranslation

Apart from pretrained embeddings, we also applied data augmentation to try to improve the baseline results.

Data Augmentation describes a set of algorithms that construct synthetic data from an available dataset. This synthetic data typically contains small changes in the data that the model’s predictions should be invariant to. Synthetic data can also represent combinations between distant examples that would be very difficult to infer otherwise. It's worth noting that data augmentation is a regularising approach, meaning that it tends to reduce model variance by making training harder. 

In our case, we propose using backtranslation to augment our dataset. This method is presented in various works [[12]](#ref_12), [[13]](#ref_13), [[15]](#ref_15), [[16]](#ref_16).

The objective is to generate paraphrases that introduce synonyms or different grammatical structures into the data distribution, so that the model develops invariance to these changes.

In order to achieve this, we leveraged models from the Hugging Face `transformers` library.

* **German to English translation:**

The model used for this task was based on a transformer-align architecture, which is an architecture that simultaneously learns how to translate and align text. This implementation was done by the NLP group at the University of Helsinki.

The dataset used is [OPUS](https://opus.nlpl.eu/), a collection of translated texts from the web. The translation pipeline also includes text normalization and tokenization with [*SentencePiece*](https://github.com/google/sentencepiece), an unsupervised text tokenizer.


* **English to German translation:**

The model used was a pretrained out-of-the-box pipeline from the `transformers` library, specifically a *Text-to-Text Transfer Transformer* (T5), an encoder-decoder Transformer that casts every NLP task, translation included, as a text-to-text problem. The variant used in this project is [T5-base](https://huggingface.co/t5-base), which was pre-trained on C4, a cleaned version of the well-known [Common Crawl dataset](https://commoncrawl.org).


Some examples:

*Original*
> und nun die wettervorhersage für morgen donnerstag den zwölften august.

*Backtranslated*
> und jetzt die wettervorhersage für morgen donnerstag, den zwölften august

*Original*
> starke winde sorgen zudem für schneeverwehungen es bestehen entsprechende unwetterwarnungen des deutschen wetterdienstes

*Backtranslated*
> starke winde sorgen auch für schneedriften, es gibt entsprechende wetterwarnungen des deutschen wetterdienstes

This process doubled the number of training samples, since for each *(phrase, skeleton)* pair we obtain an additional *(backtranslated, skeleton)* pair. We also had to apply extra normalization to the generated text so that the format is consistent across the whole dataset.
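
The overall augmentation loop can be sketched as follows. This is only an illustration under stated assumptions: the two translators are passed in as plain functions and replaced here by dictionary stubs (the real pipeline would call the translation models), and the normalization shown (lowercasing and stripping ASCII punctuation) is our guess at a format matching the training data, not necessarily the exact rules we applied:

```python
import string
from typing import Callable, List, Tuple

Translator = Callable[[str], str]


def normalize(text: str) -> str:
    """Lowercase and strip ASCII punctuation (assumed normalization)."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(no_punct.lower().split())


def augment(
    pairs: List[Tuple[str, str]],
    de_to_en: Translator,
    en_to_de: Translator,
) -> List[Tuple[str, str]]:
    """Return the original pairs plus one backtranslated copy of each."""
    augmented = list(pairs)
    for sentence, skeleton in pairs:
        paraphrase = normalize(en_to_de(de_to_en(sentence)))
        augmented.append((paraphrase, skeleton))  # same skeleton, new text
    return augmented


# Stub translators standing in for the real models (hypothetical).
fake_de_en = {"und nun die wettervorhersage": "and now the weather forecast"}
fake_en_de = {"and now the weather forecast": "Und jetzt die Wettervorhersage."}

samples = [("und nun die wettervorhersage", "skeleton_0")]
result = augment(samples, fake_de_en.__getitem__, fake_en_de.__getitem__)

print(len(result))  # 2
print(result[1])    # ('und jetzt die wettervorhersage', 'skeleton_0')
```

Keeping the translators as parameters makes it easy to swap models or to cache translations, since backtranslating the full training set is by far the slowest part.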

# Results

In the original paper, the authors used back-translation in order to **evaluate** the performance of the model. This **back-translation** consists in taking the outputs of the model, i.e., the sign poses, and translating them back to words.

However, **the code for this back-translation is not available, so we couldn't use it** to measure our results. We still have the training logs, though, which give good insight into whether the model performs better with our modifications or not.

One of the measures included in these logs is **Dynamic Time Warping (DTW)**, which measures the similarity between temporal sequences (the lower the value, the more similar they are), in our case the ground truth pose and the predicted pose. We also have the loss value, for which they used **Mean Squared Error (MSE)** (keep in mind that the outputs are coordinate vectors, which we can compare with the ground truth vectors).
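
To give an idea of what DTW computes, here is a minimal sketch of the classic dynamic-programming formulation with a Euclidean frame distance. It is an illustration only: the implementation in the original project may use a different distance or normalize the score, and the toy "poses" below are made-up 2D points:

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def euclidean(a: Frame, b: Frame) -> float:
    """Euclidean distance between two frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw(seq_a: List[Frame], seq_b: List[Frame]) -> float:
    """Cumulative cost of the cheapest monotonic alignment of two sequences."""
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            # Extend the cheapest of the three admissible predecessor cells.
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
slow = [(0.0, 0.0), (0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]  # one frame repeated
shifted = [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]           # every frame off by 1

print(dtw(truth, slow))     # 0.0
print(dtw(truth, shifted))  # 3.0
```

The first comparison shows why DTW suits this task: the same motion performed at a different speed is not penalized, whereas a plain frame-by-frame MSE would require both sequences to have the same length.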

Apart from that, we can also take a look at the training logs and at the produced videos.


All training logs can be found under `./training_logs`.

## Baseline results

We first trained the model "as-is", exactly as provided in the original project. We also left the hyperparameters unmodified, the most relevant being:

- Learning rate: 0.001
- Patience (number of epochs with no improvement until decreasing the LR): 7
- LR decrease factor: 0.7
- Minimum LR for early stopping: 0.0002
- Embedding dimensions: 512
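
As a rough sketch of how these settings interact, we can count how many learning-rate decreases it takes to reach the stopping condition. We assume here that the LR is multiplied by the decrease factor each time the patience runs out, and that training stops once the LR falls below the minimum:

```python
lr = 0.001       # initial learning rate
factor = 0.7     # LR decrease factor
min_lr = 0.0002  # training stops once the LR falls below this value

decays = 0
while lr >= min_lr:
    lr *= factor
    decays += 1

print(decays)        # 5
print(round(lr, 6))  # 0.000168
```

Under these assumptions, training ends after the fifth decrease, i.e., after the validation score has stalled five separate times.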

We trained until the learning rate dropped below the configured minimum (0.0002), for a total of 45 epochs. We used [TensorBoard](https://www.tensorflow.org/tensorboard/) to visualize the loss curve. On the training set, it looked like this:

<img width="500px" src="./training_logs/baseline/45_epochs/tensorboard/train-loss.png" alt="Loss on the training set."/>
<br>
<div align="center"><i>Loss on the training set.</i></div>

And on the validation set, the loss and the score developed as follows:

<img width="800px" src="./training_logs/baseline/45_epochs/tensorboard/dev-loss-and-score.png" alt="Loss and score on the dev set."/>
<br>
<div align="center"><i>Loss and score on the dev set.</i></div>

From these graphs, we can see that the training is running correctly, and we get nice curves in the case of the loss. However, the DTW score doesn't seem to improve during training. What is more, the lowest (i.e., best) scores were achieved during the first epochs.

Let's also take a look at some of the generated videos.

This is the sign pose translation for the input sentence "vom mittelmeer fließt feuchte luft heran in der sich zum teil kräftige gewitter entwickeln können."

In [5]:
Video(Path("./training_logs/baseline/45_epochs/test_videos/vom_mittelmeer_fließt_15_20.mp4"), width=900, embed=True)

As we can see, the baseline results are not very impressive. However, the model manages to produce a smooth result, with the pose "flowing" throughout the video without sudden jumps. The positioning of the joints is also sensible.

For comparison, we can take a look at a video of an untrained model:

In [8]:
Video(Path("training_logs/baseline/test_videos/am_sonntag_im_18_94.mp4"), width=900, embed=True)

The difference is significant.

The **best DTW** result with the baseline model was **14.592**, achieved at step 10620, although the best loss, 0.00082, came much later, at step 38940. At this point the model was probably overfitting.


All the training logs, as well as other generated videos, can be found at `./training_logs/baseline/45_epochs`.

## Pretrained embeddings results

When using the pretrained embeddings, we left all hyperparameters untouched in order to draw a fair comparison. Surprisingly or not, the results are actually pretty similar.

Let's take a look at the training losses:

<img width="500px" src="./training_logs/bert/52_epochs/tensorboard/train-loss.png" alt="Loss on the training set."/>
<br>
<div align="center"><i>Loss on the training set.</i></div>

<img width="800px" src="./training_logs/bert/52_epochs/tensorboard/dev-loss-and-score.png" alt="Loss and score on the dev set."/>
<br>
<div align="center"><i>Loss and score on the dev set.</i></div>

As we said, things look very similar to the baseline case. The best loss was 0.00080 at step 45135, and the best **DTW**, **14.569**, at step 5310, only slightly lower than that of the baseline model.

This is one of the produced videos:

In [4]:
Video(Path("./training_logs/bert/52_epochs/test_videos/heute_nacht_ist_15_41.mp4"), width=900, embed=True)

The model still keeps its smoothness, although it is still far from correctly matching the ground truth pose.


Since the best DTW was achieved around epoch 6, we trained again for this number of epochs, in order to prevent the model from overfitting.

These are the TensorBoard graphs that we obtained:

<img width="500px" src="./training_logs/bert/06_epochs/tensorboard/train-loss.png" alt="Loss on the training set."/>
<br>
<div align="center"><i>Loss on the training set.</i></div>

<img width="800px" src="./training_logs/bert/06_epochs/tensorboard/dev-loss-and-score.png" alt="Loss and score on the dev set."/>
<br>
<div align="center"><i>Loss and score on the dev set.</i></div>

Let's see a video:

In [5]:
Video(Path("./training_logs/bert/06_epochs/test_videos/morgen_gibt_es_20_26.mp4"), width=900, embed=True)

It is hard to tell whether the model performs better or not. To our eyes, it performs similarly.

## Data Augmentation results

TODO

# References

<a id='ref_1'>[1]</a> WHO: World Health Organization. Deafness and hearing loss. http://www.who.int/mediacentre/factsheets/fs300/en/, 2021

<a id='ref_2'>[2]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Multimodal deep hand sign language recognition in still images using restricted boltzmann machine. Entropy, 20, 2018.

<a id='ref_3'>[3]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand sign language recognition using multi-view hand skeleton. Expert Systems With Applications, 150, 2020.

<a id='ref_4'>[4]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Video based isolated hand sign language recognition using a deep cascaded model. Multimedia Tools And Applications, 79:22965–22987, 2020.

<a id='ref_5'>[5]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand pose aware multimodal isolated sign language recognition. Multimedia Tools And Applications, 80:127–163, 2021

<a id='ref_6'>[6]</a> Mark Borg and Kenneth P. Camilleri. Phonologically-meaningful sub-units for deep learning-based sign language recognition. ECCV, 2020

<a id='ref_7'>[7]</a> Agelos Kratimenos, Georgios Pavlakos, and Petros Maragos. 3d hands, face and body extraction for sign language recognition. ECCV, 2020.

<a id='ref_8'>[8]</a> Razieh Rastgoo and Kourosh Kiani and Sergio Escalera and Mohammad Sabokrou. Sign Language Production: A Review. 2021.

<a id='ref_9'>[9]</a> Saunders, Ben and Camgoz, Necati Cihan and Bowden, Richard. Progressive Transformers for End-to-End Sign Language Production. https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123560664.pdf. ECCV, 2020.

<a id='ref_10'>[10]</a> J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney. RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus. https://www-i6.informatik.rwth-aachen.de/publications/download/773/Forster-LREC-2012.pdf In Language Resources and Evaluation (LREC), pages 3785-3789, Istanbul, Turkey, May 2012. 

<a id='ref_11'>[11]</a> Attention Is All You Need. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. https://arxiv.org/pdf/1706.03762.pdf, June 2017.

<a id='ref_12'>[12]</a> BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context. Jean-Philippe Corbeil, Hadi Abdi Ghadivel. https://arxiv.org/abs/2009.12452, Sep 2020.

<a id='ref_13'>[13]</a> Data augmentation using back-translation for context-aware neural machine translation. Sugiyama & Yoshinaga. https://aclanthology.org/D19-6504, EMNLP 2019.


<a id='ref_14'>[14]</a> Handspeak. Negation in Sign Language. https://www.handspeak.com/learn/index.php?id=156, 2022.

<a id='ref_15'>[15]</a> Data expansion using back translation and paraphrasing for hate speech detection. Djamila Romaissa Beddiar, Saroar Jahan, Mourad Oussalah. https://doi.org/10.1016/j.osnem.2021.100153, November 2019

<a id='ref_16'>[16]</a> Text Data Augmentation for Deep Learning. Shorten, C., Khoshgoftaar, T.M. & Furht, B. https://doi.org/10.1186/s40537-021-00492-0, July 2021
