<img vspace="33px" align="right" src="https://www.munich-startup.de/wp-content/uploads/2019/03/TUM_logo-440x236.png" width="120px"/>
<h1>Sign Language Production</h1>
<h3>Applied Deep Learning for NLP</h3>
<p><b>Diego Miguel Lozano</b> | <b>Wenceslao Villegas Marset</b></p>
<p>March 9<sup>th</sup>, 2022</p>

---

# Table of contents

> ### [1. Introduction](#section_1)
>> [**1.1 What is Sign Language Production (SLP)?**](#section_1_1)<br>
>> [**1.2 What is the starting point for our project?**](#section_1_2)

<a id='section_1'></a>
# 1. Introduction

<a id='section_1_1'></a>
### 1.1 What is Sign Language Production (SLP)?

Sign Language Production focuses on translating spoken languages into sign languages and vice versa. According to the World Health Organization (WHO), in 2020 there were more than 466 million deaf people in the world [[1]](#ref_1). This area could be of great help to the hearing-impaired community, which requires developing techniques for both the recognition and the production of sign languages.

While Sign Language Recognition has seen numerous advances in recent years [[2](#ref_2), [3](#ref_3), [4](#ref_4), [5](#ref_5), [6](#ref_6), [7](#ref_7)], Sign Language Production remains a very challenging task, since it involves translating between visual and linguistic information [[8]](#ref_8).

<a id='section_1_2'></a>
### 1.2 What is the starting point for our project?

As we just mentioned, SLP is complex and far from being solved. Nevertheless, there have recently been promising developments, such as the application of Transformer architectures to SLP, which has come to be called "Progressive Transformers."

In this project, we take the [source code](https://github.com/BenSaunders27/ProgressiveTransformersSLP) for the paper "Progressive Transformers for End-to-End Sign Language Production" [[9]](#ref_9) as the starting point.

We propose to test several approaches to boost the model's performance, such as:

* Using pre-trained BERT embeddings and fine-tuning them during training.
* Leveraging different pre-trained models from the Hugging Face ecosystem to perform data augmentation on the source sentences.
* Measuring how performance changes with different hyperparameter configurations for the transformer architecture.


<a id='section_1_3'></a>
### 1.3 The data.

Source data stems from the RWTH-PHOENIX-Weather-2014T dataset.

*Dataset Information*: Over a period of three years (2009 - 2011) the daily news and weather forecast airings of the German public TV station PHOENIX featuring sign language interpretation have been recorded and the weather forecasts of a subset of 386 editions have been transcribed using gloss notation. Furthermore, we used automatic speech recognition with manual cleaning to transcribe the original German speech. As such, this corpus allows training end-to-end sign language translation systems from sign language video input to spoken language.

The signing is recorded by a stationary color camera placed in front of the sign language interpreters. Interpreters wear dark clothes in front of an artificial grey background with color transition. All recorded videos are at 25 frames per second and the size of the frames is 210 by 260 pixels. Each frame shows the interpreter box only.

* **Text data**:

Consists of sequences of text that represent the transcription of each video recording.

Example: 
> das bedeutet viele wolken und immer wieder zum teil kräftige schauer und gewitter 

* **Gloss representation**:

Gloss equivalent for the text transcript. A gloss is a German word or words that are used to name the corresponding Sign Language signs.

Example: 
> ES-BEDEUTET VIEL WOLKE UND KOENNEN REGEN GEWITTER KOENNEN

* **Skeleton**:

Sequence of 3D skeletal poses that the model has to learn to produce (ground truth). The format of this data is as follows:

* Tuple format: (index of a start point, index of an end point, index of a bone)

                (0)
                 |
                 |
                 0
                 |
                 |
        (2)--1--(1)--1--(3)
         |               |
         |               |
         2               2
         |               |
         |               |
        (4)             (5)

      has this structure:

      (
        (0, 1, 0),
        (1, 2, 1),
        (1, 3, 1),
        (2, 4, 2),
        (3, 5, 2),
      )

Each skeleton pose (one per frame) is composed of 150 values: 50 joints with x, y and z coordinates each. Each text sample has N corresponding frames, which together form its sequence of skeleton poses. During training, the network outputs 151 values per frame, since a counter value is added for the network to predict (explained further down).
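To make this layout concrete, the following minimal sketch splits one flat frame vector into its joints and its counter. It assumes that the counter is stored as the last value of the frame and that the coordinates are interleaved as x, y, z per joint; `split_frame` is a hypothetical helper written for illustration, not a function of the original code base.

```python
from typing import List, Tuple

COORDS_PER_JOINT = 3                      # x, y, z
NUM_JOINTS = 150 // COORDS_PER_JOINT      # 150 coordinate values per frame
FRAME_SIZE = NUM_JOINTS * COORDS_PER_JOINT + 1  # 150 values + 1 counter

Joint = Tuple[float, ...]


def split_frame(frame: List[float]) -> Tuple[List[Joint], float]:
    """Split a flat 151-value frame into (joints, counter).

    Assumption: the counter is the last value of the frame.
    """
    if len(frame) != FRAME_SIZE:
        raise ValueError(f"expected {FRAME_SIZE} values, got {len(frame)}")
    coords, counter = frame[:-1], frame[-1]
    # Group the flat coordinate list into (x, y, z) triples, one per joint
    joints = [
        tuple(coords[i:i + COORDS_PER_JOINT])
        for i in range(0, len(coords), COORDS_PER_JOINT)
    ]
    return joints, counter


# Toy frame: coordinates 0..149 followed by a counter value of 0.5
toy_frame = [float(v) for v in range(150)] + [0.5]
joints, counter = split_frame(toy_frame)
print(len(joints), joints[0], counter)
```

A full sample would then simply be a list of N such frames.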


---

# A brief intro to the "Progressive Transformers for SLP" project

In this section, we briefly explain the main aspects of the Progressive Transformers project. It can be summarized in three points: counter decoding, two different approaches (Text-to-Gloss-to-Pose, T2G2P, and Text-to-Pose, T2P), and data augmentation.

### Counter decoding

One of the main challenges of SLP is that the output has to maintain a certain continuity: the predicted pose in a video frame has to flow naturally from the previous one, and likewise for the frames that follow. This is achieved in the following manner: the model predicts not only the sign pose but also a "counter". This counter is simply a real number in the interval [0, 1] that increases monotonically from 0 to 1, thus marking the beginning and the end of the sequence, respectively.

<img width="600px" src="./images/counter-decoding.jpg" alt="Counter Decoding"/>
<br>
<div align="center"><i>Representation of counter decoding.</i></div>
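As a rough illustration of counter decoding, the sketch below generates one counter value per frame and checks for the end of a sequence. The text only states that the counter rises monotonically from 0 to 1; the linear spacing, the stopping threshold, and both function names are assumptions made for this example.

```python
from typing import List


def counter_values(num_frames: int) -> List[float]:
    """Return one counter value per frame, rising linearly from 0.0 to 1.0.

    Assumption: values are evenly spaced over the sequence.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to span the range [0, 1]")
    return [t / (num_frames - 1) for t in range(num_frames)]


def is_end_of_sequence(predicted_counter: float, threshold: float = 1.0) -> bool:
    """Decoding can stop once the predicted counter reaches the threshold."""
    return predicted_counter >= threshold


counters = counter_values(5)
print(counters)  # [0.0, 0.25, 0.5, 0.75, 1.0]
print([is_end_of_sequence(c) for c in counters])
```

Unlike a discrete end-of-sequence token, every intermediate value also tells the model how far along the sequence it is.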

<br>

---

**ℹ️ Question: Why not simply use a BOS and an EOS token?**

**💡 Answer:** Beginning of Sentence (BOS) and End of Sentence (EOS) tokens work well with sentences, but when producing video, as we mentioned before, we need more than just markers for the beginning and the end. The counter serves both as BOS and EOS and additionally captures information about the progress of the video.

---


### Two different approaches

The paper experiments with two different approaches, T2G2P and T2P:

<img width="600px" src="./images/T2G2P-vs-T2P.png" alt="T2G2P vs T2P Architectures"/>
<br>
<div align="center"><i>Architecture details of (a) Symbolic and (b) Progressive Transformers. (ST: Symbolic Transformer, PT: Progressive Transformer, PE: Positional Encoding, CE: Counter Embedding, MHA: Multi-Head Attention) <a href="#ref_9">[9]</a>.</i></div>

In both cases, the models follow the architecture introduced in "Attention is All You Need" <a href="#ref_11">[11]</a>.

In the first approach (T2G2P), glosses are first produced from the input tokens. These glosses then serve as input for another transformer, which translates them into sign poses.

The second approach (T2P) is end-to-end: the text is directly translated into sign poses.

### Data augmentation

Finally, the paper explores some data augmentation techniques to determine whether they improve the base model. These augmentations were only carried out with the T2G2P architecture.

- **Future Prediction**: this type of augmentation forces the model to predict the next 10 frames from the current time step, instead of just the next frame. In this way, the model cannot simply copy the previous time step, which improves performance over the base architecture.


- **Just Counter**: in this case, only the counter values are provided as target input to the model, omitting the 3D skeleton joint coordinates. Again, this has been shown to improve results.


- **Gaussian Noise**: the last augmentation method consists of adding Gaussian noise to the skeleton pose sequences during training. This makes the model more robust to the imperfect poses it receives as input at prediction time.
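Two of these augmentations are easy to sketch on plain lists of frames. The example below is only an illustration under stated assumptions: it uses a single fixed noise standard deviation, and it pads the end of the sequence by repeating the last frame so that every future-prediction target has the same length. Neither choice is taken from the paper, and `add_gaussian_noise` and `future_targets` are hypothetical names.

```python
import random
from typing import List

Frame = List[float]


def add_gaussian_noise(frames: List[Frame], std: float, rng: random.Random) -> List[Frame]:
    """Return a noisy copy of the pose sequence (zero-mean Gaussian noise)."""
    return [[value + rng.gauss(0.0, std) for value in frame] for frame in frames]


def future_targets(frames: List[Frame], horizon: int = 10) -> List[Frame]:
    """For each time step t, concatenate the next `horizon` frames.

    Assumption: near the end of the sequence the last frame is repeated
    as padding, so every target has horizon * frame_size values.
    """
    last = len(frames) - 1
    targets = []
    for t in range(len(frames)):
        window = [frames[min(t + k, last)] for k in range(1, horizon + 1)]
        targets.append([value for frame in window for value in frame])
    return targets


# Toy sequence: 4 frames with 2 values each
sequence = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
noisy = add_gaussian_noise(sequence, std=0.1, rng=random.Random(0))
targets = future_targets(sequence, horizon=2)
print(targets[0])  # frames 1 and 2, flattened: [1.0, 1.0, 2.0, 2.0]
```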


The following table collects the results of the previous augmentation approaches:

<img width="700px" src="./images/augmentation-results.png" alt="Data Augmentation Results"/>
<br>
<div align="center"><i>The best BLEU-4 performance comes from a combination of future prediction and Gaussian noise augmentation. The model must learn to cope with both multi-frame prediction and a noisy input, building a firm robustness to drift <a href="#ref_9">[9]</a>.</i></div>

### Implementation description.

The model architecture is based on the "Attention is All You Need" <a href="#ref_11">[11]</a> paper.

One relevant modification was introduced to adapt it to the SLP task and achieve good performance:
* The final layer of the decoder is a linear layer with 150 + 1 units, which represent the *coordinates of the output skeleton* + the *counter decoding value*.



### Overall project structure.

Below is a description of the modules present in the project after our extensions.

```
slp
│   README.md
│   Sign Language Production.ipynb
├───images
├───ProgressiveTransformersSLP
│   ├───Configs
│   │       Base.yaml - Config file to set model, data loading/processing and training parameters.
│   │       src_vocab.txt - Gloss vocabulary.
│   │       ...
│   ├───Data
│   │       train.text - Speech text for each training sample.
│   │       train.skels - Skeleton annotations for each training sample.
│   │       train.gloss - Glosses for each training sample.
│   │       ...
│   ├───external_metrics
│   │       mscoco_rouge.py - ROUGE-L metric implementation.
│   │       sacrebleu.py - BLEU metric implementation as in https://github.com/mjpost/sacrebleu
│   ├───optim - Implementations of optimization algorithms such as LAMB, RAdam, Ranger, etc.
│   │   __main__.py - Main entry point to run model training.
│   │   batch.py - Wrapper over torch batch iterator, adding masking and other attributes.
│   │   builders.py - Assorted builder functions.
│   │   constants.py - Project-wide constants.
│   │   data.py - Data loading utilities and main torchtext.data.Dataset class.
│   │   decoders.py - Transformer decoder implementation.
│   │   dtw.py - Dynamic time warping implementation as in https://github.com/pierre-rouanet/dtw.
│   │   embedding.py - Embedding class implementation with support for pretrained BERT embeddings.
│   │   encoders.py - Transformer encoder implementation.
│   │   helpers.py - Helper functions for logging, reporting, etc.
│   │   initialization.py - Custom NN parameter initialization functions.
│   │   loss.py - Loss function implementations.
│   │   metrics.py - Performance metric functions.
│   │   model.py - Main class assembling all the model's modules (decoder/encoder) and constructor functions.
│   │   plot_videos.py - Validation video generation for skeleton predictions.
│   │   prediction.py - Code for running validation steps on dev set data (perform DTW, then loss, etc.).
│   │   search.py - Greedy hyperparameter search function.
│   │   training.py - Training loop implementation.
│   │   transformer_layers.py - Layer implementations from NMT toolkit.
│   │   vocabulary.py - Vocabulary loading and checking code.
└───augmentations
        backtranslation.py - Code to perform DE -> EN -> DE translation for augmentation purposes.
```



---

#  Using pre-trained embeddings

The original project trains embeddings from scratch. As we have learned during the seminar, the use of pretrained embeddings can effectively improve the performance of models, especially in situations where data is scarce (which is our case).

Let's first see how the embedding initialization is happening in the original source code.

The first thing that happens when training begins is the data loading. The vocabulary of the model is initialized differently depending on whether we are using T2G2P or T2P.

- **Building the vocabulary in T2G2P**: in this case, the vocabulary is taken from a file, [`src_vocab`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/master/Configs/src_vocab.txt), that we will analyze in more depth later.


- **Building the vocabulary in T2P**: when using the end-to-end approach (Text-to-Pose), the vocabulary is built from the training input data.

<br>

In both cases, the function [`build_vocab()`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/adbd3e9ea9f1b20063d84021a0d6eb9a124ebb87/vocabulary.py#L130-L187) is used. Let's take a look at it.


```python
if vocab_file is not None:
    # load it from file
    vocab = Vocabulary(file=vocab_file)
else:
    ...
```

<br>

First of all, if we have a vocabulary file (like in the case of T2G2P), we initialize the vocabulary from it, as we already mentioned.


```python
def _from_file(self, file: str) -> None:
        """
        Make vocabulary from contents of file.
        File format: token with index i is in line i.

        :param file: path to file where the vocabulary is loaded from
        """
        tokens = []
        with open(file, "r") as open_file:
            for line in open_file:
                tokens.append(line.strip("\n"))
        self._from_list(tokens)
```

This function simply reads the vocabulary file line by line. Since each line contains only one token, there is no further processing to be done.

If there is no input vocabulary file, the tokens are extracted from the training dataset.
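The following simplified sketch shows the general idea of deriving a vocabulary from training sentences. It is not the project's `build_vocab()` implementation: `vocab_from_sentences`, the whitespace tokenization, the `min_freq` parameter, and the frequency-based ordering are all assumptions made for illustration.

```python
from collections import Counter
from typing import Iterable, List

# Special tokens placed at the start of the vocabulary
SPECIAL_TOKENS = ["<unk>", "<pad>", "<s>", "</s>"]


def vocab_from_sentences(sentences: Iterable[str], min_freq: int = 1) -> List[str]:
    """Build a token list from whitespace-tokenized sentences.

    Tokens are sorted by descending frequency, ties broken alphabetically.
    """
    counts = Counter(token for sentence in sentences for token in sentence.split())
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return SPECIAL_TOKENS + [token for token, freq in ordered if freq >= min_freq]


toy_sentences = ["viele wolken und regen", "regen und gewitter"]
toy_vocab = vocab_from_sentences(toy_sentences)
print(toy_vocab)
```

Here the index of a token in the returned list plays the role of its ID, mirroring the "token with index i is in line i" convention of the vocabulary file.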

Let's run some code to better visualize this.

## Inside the original SLP model vocab

As we have already mentioned, the original project that we use as starting point provides a plain-text file [`src_vocab`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/master/Configs/src_vocab.txt) containing the vocabulary for which embeddings will then be trained.

Before jumping in and trying to directly use our pretrained embeddings, it is sensible to first analyze a bit how things work in the original project.

```python
# Imports
import sys
from pathlib import Path
from pprint import pprint

code_dir = Path("./ProgressiveTransformersSLP")
sys.path.insert(0, str(code_dir))  # just so that the project's imports can be resolved

from ProgressiveTransformersSLP.vocabulary import Vocabulary, build_vocab
```

```python
# Path to the vocabulary file
vocab_file = code_dir / "Configs" / "src_vocab.txt"

# Build vocabulary
vocabulary = Vocabulary(file=str(vocab_file))

# Get all the tokens in the built vocabulary
tokens = list(vocabulary.itos)

# We will select some tokens that are worth analyzing
selected_tokens = (tokens[1:5] + [tokens[172]] + tokens[531:541] +
                   tokens[868:870] + tokens[718:725] + tokens[1085:1088])
pprint(selected_tokens)
```

From the previous vocabulary, there are three aspects that are worth mentioning:

- Words such as `AUSWAEHLEN`, `DUENN` and `HEISS` give us a hint that **normalization** is used: umlauts and 'ß' are written with plain ASCII letters. A popular algorithm for German normalization is the [German2 snowball algorithm](https://snowballstem.org/algorithms/german2/stemmer.html), which defines the following mappings:
  - 'ß' is replaced by 'ss'.
  - 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
  - 'ae' and 'oe' are replaced by 'a' and 'o', respectively.
  - 'ue' is replaced by 'u', when not following a vowel or 'q'.


- As we saw during the seminar lectures, the **special tokens** `<unk>`, `<pad>`, `<s>`, `</s>` are also included in the dictionary. These tokens mark unknown words, padding, beginning of sequence (BOS), and end of sequence (EOS), respectively.


- Some of the words in the vocabulary include the prefixes `neg-` and `negalp-`. We could guess that `neg-` simply means that the word is negated, e.g., `neg-GENUG` ≡ `NICHT GENUG`, but what about the `negalp-` prefix? And what do words such as `J+L+I` and `K+R+E+T+A` mean? A look at the paper of the RWTH-PHOENIX-Weather dataset [[10]](#ref_10) (the first version of the dataset used to train the model) gives us the answer:


<img width="400px" src="./images/RWTH-PHOENIX-Weather-Annotation-Scheme.png" alt="RWTH-PHOENIX-Weather Annotation Scheme"/>
<br>
<div align="center"><i>Source <a href="#ref_10">[10]</a>.</i></div>

So in reality `neg-` means "signs negated by headshake" and `negalp-` "signs negated by the alpha[betical] rule" <sup>[1](#note_1)</sup>. Words such as `K+R+E+T+A` are fingerspelled, i.e., spelled letter by letter.

Interestingly enough, none of the other types of tokens appear in the source dictionary.
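The conventions above can be turned into a small helper that labels gloss tokens. This is a minimal sketch: `describe_gloss` is a hypothetical function, and treating every token that contains `+` as fingerspelled is an assumption based only on the examples discussed here.

```python
from typing import Tuple


def describe_gloss(token: str) -> Tuple[str, str]:
    """Return (category, base form) for a gloss token."""
    if token.startswith("negalp-"):
        return "negated by alpha rule", token[len("negalp-"):]
    if token.startswith("neg-"):
        return "negated by headshake", token[len("neg-"):]
    if "+" in token:
        # Assumption: letters joined by '+' are fingerspelled words
        return "fingerspelled", token.replace("+", "")
    return "plain", token


for gloss in ["neg-GENUG", "K+R+E+T+A", "HEISS"]:
    print(gloss, "->", describe_gloss(gloss))
```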

---

<a id='note_1'><sup>1</sup></a> In Sign Language, there are several ways of negating words. One of these ways is using a side-to-side headshake or a frown expression. Also, some verbs have their own negated forms, which is what `negalp-` indicates here [[14]](#ref_14).

---

**ℹ️ Question: But... where is this vocabulary coming from?**

**💡 Answer:** As it turns out, this vocabulary is simply made up of **glosses**. As we mentioned before, the original project proposes two ways of carrying out the translation from text to sign language. Only T2G2P takes glosses as input, which explains why we don't use this file when going for the T2P approach.

---

## BERT it up!

If one thing is clear, it is that BERT's dictionary will not contain glosses, let alone glosses specifically tailored to SLP.

But then, what does BERT's vocabulary look like? Let's take a look!

Fortunately, Hugging Face's great API has got us covered: tokenizers expose their vocabulary through the method `get_vocab()`. Let's try it with the model [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased).

```python
from pprint import pprint
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer from pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
```

```python
# Print some words of its dictionary
pprint(list(tokenizer.get_vocab().items())[:40])
```

Above, we can see each token with its corresponding ID (just as we saw in the seminar lectures). However, there are two things that catch our attention...

---

**ℹ️ Question: Why do some tokens start with "##"?**

**💡 Answer:** Well, this is just a way of indicating that this token is "non-initial", i.e., originally it belonged to a longer word (remember that usually Transformers work at a sub-word level).
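To illustrate the convention, the sketch below glues sub-word pieces back into whole words. The pieces in the example are made up for demonstration and are not actual output of the `bert-base-german-cased` tokenizer; `merge_wordpieces` is a hypothetical helper.

```python
from typing import List


def merge_wordpieces(pieces: List[str]) -> List[str]:
    """Join pieces that start with '##' onto the preceding piece."""
    words: List[str] = []
    for piece in pieces:
        if piece.startswith("##") and words:
            words[-1] += piece[2:]  # drop the '##' marker and append
        else:
            words.append(piece)
    return words


toy_pieces = ["Ge", "##witter", "und", "Sch", "##auer"]
merged = merge_wordpieces(toy_pieces)
print(merged)  # ['Gewitter', 'und', 'Schauer']
```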

---

**ℹ️ Question: What about the `[unused###]` tokens?**

**💡 Answer:** These are, unsurprisingly, tokens that are not used. However, they can come in handy for adding more words to the vocabulary:

> Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. ([source](https://github.com/google-research/bert/issues/9#issuecomment-434796704))
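A minimal sketch of that idea on a plain vocabulary list follows. It only shows the bookkeeping: `fill_unused_slots` is a hypothetical helper, the toy vocabulary is invented, and applying the change to a real tokenizer and model would require further steps that are not shown here.

```python
from typing import List


def fill_unused_slots(vocab: List[str], new_tokens: List[str]) -> List[str]:
    """Replace '[unusedX]' entries with new tokens, keeping all indices stable."""
    updated = list(vocab)
    slots = [
        i for i, token in enumerate(updated)
        if token.startswith("[unused") and token.endswith("]")
    ]
    # Skip tokens that are already in the vocabulary and drop duplicates
    pending = [t for t in dict.fromkeys(new_tokens) if t not in updated]
    if len(pending) > len(slots):
        raise ValueError("not enough unused slots for the new tokens")
    for slot, token in zip(slots, pending):
        updated[slot] = token
    return updated


toy_vocab = ["[PAD]", "[unused1]", "[unused2]", "[UNK]", "und"]
extended = fill_unused_slots(toy_vocab, ["neg-GENUG", "und", "K+R+E+T+A"])
print(extended)
```

Because the vocabulary size and the IDs of existing tokens do not change, the shape of the embedding matrix stays the same.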

---

As a side note, we would like to mention that someone took the time to explore BERT's vocabulary and wrote a great article about it. The article can be found at https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html.

#  Data Augmentation by backtranslation.

Data Augmentation describes a set of algorithms that construct synthetic data from an available dataset. This synthetic data typically contains small changes in the data that the model’s predictions should be invariant to. Synthetic data can also represent combinations between distant examples that would be very difficult to infer otherwise. It's worth noting that data augmentation is a regularising approach, meaning that it tends to reduce model variance by making training harder. 

In our case, we propose using the backtranslation technique to augment our dataset. This method is presented in various works [[12]](#ref_12), [[13]](#ref_13), [[15]](#ref_15), [[16]](#ref_16).

To achieve this, we leveraged models from the Hugging Face `transformers` library.

* **German to English translation:**



* **English to German translation:**
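The DE -> EN -> DE round trip can be sketched independently of any particular model. In the example below, the two translators are passed in as plain callables. In practice they would wrap pretrained translation models from `transformers`, but here they are toy dictionary lookups so that the snippet runs on its own. `backtranslate` and the toy data are hypothetical.

```python
from typing import Callable, Iterable, List

Translator = Callable[[str], str]


def backtranslate(sentences: Iterable[str],
                  de_to_en: Translator,
                  en_to_de: Translator) -> List[str]:
    """Return paraphrases obtained by translating DE -> EN -> DE."""
    augmented = []
    for sentence in sentences:
        pivot = de_to_en(sentence)       # German -> English
        paraphrase = en_to_de(pivot)     # English -> German
        # Keep only paraphrases that differ from the original sentence
        if paraphrase.strip().lower() != sentence.strip().lower():
            augmented.append(paraphrase)
    return augmented


# Stand-in translators (assumption: real ones would call translation models)
toy_de_en = {"viele wolken": "many clouds", "es regnet": "it is raining"}
toy_en_de = {"many clouds": "zahlreiche wolken", "it is raining": "es regnet"}

paraphrases = backtranslate(toy_de_en.keys(), toy_de_en.__getitem__, toy_en_de.__getitem__)
print(paraphrases)  # ['zahlreiche wolken']
```

Filtering out unchanged sentences avoids adding exact duplicates to the training set; each kept paraphrase would reuse the skeleton sequence of its original sentence.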



# References

<a id='ref_1'>[1]</a> WHO: World Health Organization. Deafness and hearing loss. http://www.who.int/mediacentre/factsheets/fs300/en/, 2021.

<a id='ref_2'>[2]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Multimodal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy, 20, 2018.

<a id='ref_3'>[3]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand sign language recognition using multi-view hand skeleton. Expert Systems With Applications, 150, 2020.

<a id='ref_4'>[4]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Video based isolated hand sign language recognition using a deep cascaded model. Multimedia Tools And Applications, 79:22965–22987, 2020.

<a id='ref_5'>[5]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand pose aware multimodal isolated sign language recognition. Multimedia Tools And Applications, 80:127–163, 2021.

<a id='ref_6'>[6]</a> Mark Borg and Kenneth P. Camilleri. Phonologically-meaningful sub-units for deep learning-based sign language recognition. ECCV, 2020.

<a id='ref_7'>[7]</a> Agelos Kratimenos, Georgios Pavlakos, and Petros Maragos. 3D hands, face and body extraction for sign language recognition. ECCV, 2020.

<a id='ref_8'>[8]</a> Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, and Mohammad Sabokrou. Sign Language Production: A Review. 2021.

<a id='ref_9'>[9]</a> Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. Progressive Transformers for End-to-End Sign Language Production. https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123560664.pdf. ECCV, 2020.

<a id='ref_10'>[10]</a> J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney. RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus. https://www-i6.informatik.rwth-aachen.de/publications/download/773/Forster-LREC-2012.pdf. In Language Resources and Evaluation (LREC), pages 3785-3789, Istanbul, Turkey, May 2012.

<a id='ref_11'>[11]</a> Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf, June 2017.

<a id='ref_12'>[12]</a> Jean-Philippe Corbeil and Hadi Abdi Ghadivel. BET: A Backtranslation Approach for Easy Data Augmentation in Transformer-based Paraphrase Identification Context. https://arxiv.org/abs/2009.12452, September 2020.

<a id='ref_13'>[13]</a> Sugiyama and Yoshinaga. Data augmentation using back-translation for context-aware neural machine translation. https://aclanthology.org/D19-6504, EMNLP 2019.


<a id='ref_14'>[14]</a> Handspeak. Negation in Sign Language. https://www.handspeak.com/learn/index.php?id=156, 2022.

<a id='ref_15'>[15]</a> Djamila Romaissa Beddiar, Saroar Jahan, and Mourad Oussalah. Data expansion using back translation and paraphrasing for hate speech detection. https://doi.org/10.1016/j.osnem.2021.100153, November 2019.

<a id='ref_16'>[16]</a> C. Shorten, T. M. Khoshgoftaar, and B. Furht. Text Data Augmentation for Deep Learning. https://doi.org/10.1186/s40537-021-00492-0, July 2021.

