<img vspace="33px" align="right" src="https://www.munich-startup.de/wp-content/uploads/2019/03/TUM_logo-440x236.png" width="120px"/>
<h1>Sign Language Production</h1>
<h3>Applied Deep Learning for NLP</h3>
<p><b>Diego Miguel Lozano</b> | <b>Wenceslao Villegas Marset</b></p>
<p>March 9<sup>th</sup>, 2022</p>

---

# Table of contents

> ### [1. Introduction](#section_1)
>> [**1.1 What is Sign Language Production (SLP)?**](#section_1_1)<br>
>> [**1.2 What is the starting point for our project?**](#section_1_2)

<a id='section_1'></a>
# 1. Introduction

<a id='section_1_1'></a>
### 1.1 What is Sign Language Production (SLP)?

Sign Language Production focuses on translating spoken languages into sign languages, and vice versa. According to the World Health Organization (WHO), in 2020 there were more than 466 million deaf people in the world [[1]](#ref_1). This area could be of great help to the hearing-impaired community, which makes it necessary to develop techniques for both recognition and production of sign languages.

While Sign Language Recognition has seen numerous advancements in recent years [[2](#ref_2), [3](#ref_3), [4](#ref_4), [5](#ref_5), [6](#ref_6), [7](#ref_7)], Sign Language Production is still a very challenging task, since it involves interpreting between visual and linguistic information [[8]](#ref_8).

<a id='section_1_2'></a>
### 1.2 What is the starting point for our project?

As we just mentioned, SLP is complex and far from being solved. Nevertheless, there have recently been promising developments, such as the application of Transformer architectures to SLP, which has come to be called "Progressive Transformers."

In this project, we take the [source code](https://github.com/BenSaunders27/ProgressiveTransformersSLP) for the paper "Progressive Transformers for End-to-End Sign Language Production" [[9]](#ref_9) as the starting point.

**TODO**: define exactly what the scope of our project is.

---

# A brief intro to the "Progressive Transformers for SLP" project

In this section, we quickly explain the main aspects of the Progressive Transformers project. If we had to summarize it in only three points, they would be the following: counter decoding; two different approaches, Text-to-Gloss-to-Pose (T2G2P) and Text-to-Pose (T2P); and data augmentation.

## Counter decoding

One of the main challenges of SLP is that the output has to maintain a certain continuity. The predicted pose in a video frame has to flow naturally from the previous one, and likewise for the frames that follow. This is achieved in the following manner: the model predicts not only the sign pose, but also a "counter". This counter is simply a real number in the interval [0, 1] that increases monotonically from 0 to 1, thus marking the beginning and end of the sequence, respectively.

<img width="600px" src="./images/counter-decoding.jpg" alt="Counter Decoding"/>
<br>
<div align="center"><i>Representation of counter decoding.</i></div>
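
To make the idea concrete, here is a minimal sketch of how a counter could be attached to a pose sequence and later used to decide where the sequence ends. It assumes the counter grows linearly with the frame index and is stored as the last value of each frame; the helper names `add_counter` and `truncate_at_end` are ours, not taken from the original code base.

```python
from typing import List

Frame = List[float]  # pose coordinates, optionally followed by the counter


def add_counter(poses: List[Frame]) -> List[Frame]:
    """Append a counter value in [0, 1] to every pose frame."""
    last = max(len(poses) - 1, 1)  # avoids division by zero for one-frame input
    # Assumption: the counter grows linearly with the frame index.
    return [pose + [index / last] for index, pose in enumerate(poses)]


def truncate_at_end(frames: List[Frame], threshold: float = 1.0) -> List[Frame]:
    """Keep frames up to and including the first whose counter reaches the threshold."""
    for index, frame in enumerate(frames):
        if frame[-1] >= threshold:
            return frames[: index + 1]
    return frames  # the counter never reached the threshold


poses = [[0.0, 0.0, 0.0] for _ in range(5)]  # toy sequence: 5 frames, 3 coordinates
frames = add_counter(poses)
print([frame[-1] for frame in frames])  # counter values: 0.0, 0.25, 0.5, 0.75, 1.0
```

At prediction time, a decoder following this scheme would keep generating frames until the predicted counter reaches 1, which is what `truncate_at_end` imitates on an already generated sequence.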

<br>

---

**ℹ️ Question:** Why not simply use BOS and EOS tokens?

**💡 Answer:** Beginning of Sentence (BOS) and End of Sentence (EOS) tokens work well with sentences, but when producing video, as mentioned before, we need more than just markers for its beginning and end. The counter serves as both BOS and EOS, and it also captures information about the flow of the video.

---

## Two different approaches

The paper experiments with two different approaches, T2G2P and T2P:

<img width="600px" src="./images/T2G2P-vs-T2P.png" alt="T2G2P vs T2P Architectures"/>
<br>
<div align="center"><i>Architecture details of (a) Symbolic and (b) Progressive Transformers. (ST: Symbolic Transformer, PT: Progressive Transformer, PE: Positional Encoding, CE: Counter Embedding, MHA: Multi-Head Attention) <a href="#ref_9">[9]</a>.</i></div>

In both cases, the models follow the architecture introduced in "Attention Is All You Need" <a href="#ref_11">[11]</a>.

In the first approach (T2G2P), glosses are produced from the input tokens in a first step. These glosses then serve as input for another transformer, which translates them into sign poses.

The second approach (T2P) is end-to-end: the text is translated directly into sign poses.
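
The structural difference between the two approaches can be sketched as plain function composition. The snippet below is only an illustration: the `toy_*` functions are hypothetical stand-ins for trained transformers, and `make_t2g2p` is a name we introduce here.

```python
from typing import Callable, List

Pose = List[float]
TextToGloss = Callable[[List[str]], List[str]]
GlossToPose = Callable[[List[str]], List[Pose]]
TextToPose = Callable[[List[str]], List[Pose]]


def make_t2g2p(text_to_gloss: TextToGloss, gloss_to_pose: GlossToPose) -> TextToPose:
    """Chain two models: text -> glosses -> poses."""
    def pipeline(tokens: List[str]) -> List[Pose]:
        glosses = text_to_gloss(tokens)  # intermediate gloss sequence
        return gloss_to_pose(glosses)
    return pipeline


# Toy stand-ins for trained models (hypothetical, for illustration only).
def toy_text_to_gloss(tokens: List[str]) -> List[str]:
    return [token.upper() for token in tokens if token != "es"]


def toy_gloss_to_pose(glosses: List[str]) -> List[Pose]:
    return [[float(len(gloss))] for gloss in glosses]


def toy_text_to_pose(tokens: List[str]) -> List[Pose]:
    return [[float(len(token))] for token in tokens]


t2g2p = make_t2g2p(toy_text_to_gloss, toy_gloss_to_pose)  # two-stage pipeline
t2p = toy_text_to_pose                                    # single end-to-end model
print(t2g2p(["morgen", "regnet", "es"]))
print(t2p(["morgen", "regnet", "es"]))
```

Both variants expose the same text-to-pose interface; T2G2P simply makes the gloss sequence available as an intermediate result.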

## Data augmentation

Finally, the paper explores some data augmentation techniques to determine whether they improve the base model. These augmentations were only carried out with the T2G2P architecture.

- **Future Prediction**: this type of augmentation forces the model to predict the next 10 frames from the current time step, instead of just the next frame. In this way, the model cannot just copy the previous time step, which effectively improves performance over the base architecture.


- **Just Counter**: in this case only the counter values are provided as target input to the model, omitting the 3D skeleton joint coordinates. Again, this has been shown to improve results.


- **Gaussian Noise**: the last augmentation method consists of adding Gaussian noise to the skeleton pose sequences during training. This makes the model more robust to noisy inputs at prediction time.
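
The three augmentations listed above can be sketched on toy data as follows. This is not the original implementation: it assumes each frame is a list of pose coordinates followed by the counter, pads the end of the sequence by repeating the last frame, and uses a hypothetical `noise_std` hyperparameter.

```python
import random
from typing import List, Optional

Frame = List[float]  # pose coordinates followed by the counter value


def future_targets(frames: List[Frame], horizon: int = 10) -> List[Frame]:
    """For every time step, concatenate the next `horizon` frames into one target."""
    targets: List[Frame] = []
    last = len(frames) - 1
    for step in range(last):
        target: Frame = []
        for offset in range(1, horizon + 1):
            # Assumption: steps past the end repeat the last frame.
            target.extend(frames[min(step + offset, last)])
        targets.append(target)
    return targets


def just_counter(frames: List[Frame]) -> List[Frame]:
    """Drop the joint coordinates and keep only the counter of each frame."""
    return [[frame[-1]] for frame in frames]


def add_gaussian_noise(
    frames: List[Frame], noise_std: float = 0.1, rng: Optional[random.Random] = None
) -> List[Frame]:
    """Add zero-mean Gaussian noise to the coordinates, leaving the counter intact."""
    if rng is None:
        rng = random.Random()
    return [
        [value + rng.gauss(0.0, noise_std) for value in frame[:-1]] + [frame[-1]]
        for frame in frames
    ]


frames = [[0.0, 0.0], [1.0, 0.5], [2.0, 1.0]]  # one coordinate plus counter
print(future_targets(frames, horizon=2))
print(just_counter(frames))
print(add_gaussian_noise(frames, rng=random.Random(0)))
```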


The following table collects the results of the previous augmentation approaches:

<img width="700px" src="./images/augmentation-results.png" alt="Data Augmentation Results"/>
<br>
<div align="center"><i>The best BLEU-4 performance comes from a combination of future prediction and Gaussian noise augmentation. The model must learn to cope with both multi-frame prediction and a noisy input, building a firm robustness to drift <a href="#ref_9">[9]</a>.</i></div>

### #TODO: document main aspects of their code (Wences?)

---

# Using pre-trained embeddings

**TODO**: mention that the original project trains embeddings from scratch and that we could leverage pre-trained embeddings (e.g., BERT) to achieve better scores.

## Inside the SLP pretrained model vocab

The original project that we use as a starting point provides a plain-text file [`src_vocab`](https://github.com/BenSaunders27/ProgressiveTransformersSLP/blob/master/Configs/src_vocab.txt) containing the vocabulary for which embeddings are then trained. Here is a snippet of it:

```
<unk>
<pad>
<s>
</s>
...
AUSWAEHLEN
BALD
BEKOMMEN
BITTE
BODENSEE
BRITANNIEN
CHAOS
DAMEN
DAUERND
DUENN
...
IRGENDWO
J+L+I
K+R+E+T+A
...
neg-DEUTSCH
neg-FUENF
neg-GEMUETLICH
neg-GENUG
neg-GLEICH
neg-HART
neg-HEISS
...
negalp-MUSS
negalp-PASSEN
negalp-STIMMT

```

Before jumping in and trying to use pretrained embeddings directly, it is sensible to first analyze how things work in the original project.

From the previous vocabulary, there are three aspects that are worth mentioning:

- Words such as `AUSWAEHLEN`, `DUENN` and `HEISS` give us a hint that **normalization** is used. A popular algorithm for German normalization is the [German2 snowball algorithm](https://snowballstem.org/algorithms/german2/stemmer.html), which defines the following mappings:
  - 'ß' is replaced by 'ss'.
  - 'ä', 'ö', 'ü' are replaced by 'a', 'o', 'u', respectively.
  - 'ae' and 'oe' are replaced by 'a' and 'o', respectively.
  - 'ue' is replaced by 'u', when not following a vowel or q.


- As we saw during the seminar lectures, the **special tokens** `<unk>`, `<pad>`, `<s>`, `</s>` are also included in the dictionary. These tokens mark unknown words, padding, beginning of sequence (BOS), and end of sequence (EOS), respectively.


- Some of the words in the vocabulary include the prefixes `neg-` and `negalp-`. We could guess that `neg-` simply means that the word is negated, e.g., `neg-GENUG` ≡ `NICHT GENUG`, but what about the `negalp-` prefix? And also, what do words such as `J+L+I` and `K+R+E+T+A` mean? A look at the paper of the RWTH-PHOENIX-Weather dataset [[10]](#ref_10) (the first version of the dataset used to train the model) gives us the answer:


<img width="400px" src="./images/RWTH-PHOENIX-Weather-Annotation-Scheme.png" alt="RWTH-PHOENIX-Weather Annotation Scheme"/>
<br>
<div align="center"><i>Source <a href="#ref_10">[10]</a>.</i></div>

So in reality `neg-` means "signs negated by headshake" and `negalp-` "signs negated by the alpha[betical] rule" <sup>[1](#note_1)</sup>. Tokens such as `K+R+E+T+A` are words fingerspelled letter by letter.

Interestingly enough, none of the other types of tokens appear in the source dictionary.
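
The observations above can be turned into a small helper for inspecting vocabulary entries. This is a sketch under our own assumptions: `parse_gloss` and `to_gloss_spelling` are hypothetical names, an entry is treated as fingerspelled only when every `+`-separated part is a single letter, and the transliteration follows the ASCII spelling visible in the snippet (`AUSWAEHLEN`, `DUENN`, `HEISS`) rather than any particular stemmer.

```python
from typing import List, NamedTuple, Optional


class Gloss(NamedTuple):
    base: str                     # gloss without the negation prefix
    negation: Optional[str]       # "headshake", "alpha rule", or None
    letters: Optional[List[str]]  # letters if the gloss is fingerspelled


PREFIXES = (("negalp-", "alpha rule"), ("neg-", "headshake"))

# Assumption: umlauts are spelled out in ASCII, as the vocabulary snippet suggests.
TRANSLITERATION = {"Ä": "AE", "Ö": "OE", "Ü": "UE", "ẞ": "SS"}


def parse_gloss(gloss: str) -> Gloss:
    """Split a vocabulary entry into base form, negation type, and spelled letters."""
    negation, base = None, gloss
    for prefix, label in PREFIXES:
        if gloss.startswith(prefix):
            negation, base = label, gloss[len(prefix):]
            break
    parts = base.split("+")
    spelled = len(parts) > 1 and all(len(part) == 1 for part in parts)
    return Gloss(base, negation, parts if spelled else None)


def to_gloss_spelling(word: str) -> str:
    """Upper-case a German word the way the gloss vocabulary appears to spell it."""
    upper = word.upper()  # Python already maps "ß" to "SS" when upper-casing
    return "".join(TRANSLITERATION.get(char, char) for char in upper)


for entry in ["BALD", "neg-GENUG", "negalp-MUSS", "K+R+E+T+A"]:
    print(entry, parse_gloss(entry))
print(to_gloss_spelling("auswählen"), to_gloss_spelling("dünn"), to_gloss_spelling("heiß"))
```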

---

<a id='note_1'><sup>1</sup></a> In sign languages, there are several ways of negating words. One of them is a side-to-side headshake or a frown. Also, some verbs have their own negated forms, which is what `negalp-` indicates here [[14]](#ref_14).

### But... where is this vocabulary coming from?

As it turns out, this vocabulary is simply made up of **glosses**. As we mentioned before, the original project proposes two ways of carrying out the translations from text to sign language.

The first one consists of predicting glosses from text, and then translating the glosses to sign language (more precisely, coordinates).

The second approach directly translates from text to sign language, in an end-to-end fashion.

## BERT it up!

One thing is clear: BERT's vocabulary will not contain glosses, let alone glosses specifically tailored to SLP.

But then, what does BERT's vocabulary look like? Let's take a look!

Fortunately, Hugging Face's great API has got us covered: tokenizers expose their vocabulary through the method `get_vocab()`. Let's try with the model [`bert-base-german-cased`](https://huggingface.co/bert-base-german-cased).

```python
from pprint import pprint
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load tokenizer from pretrained model
tokenizer = AutoTokenizer.from_pretrained("bert-base-german-cased")
```

```python
# Print some words of its dictionary
pprint(list(tokenizer.get_vocab().items())[:40])
```

```
[('[unused1974]', 28973),
 ('Andy', 22652),
 ('Konkur', 4558),
 ('Ferdinand', 8715),
 ('Besondere', 21453),
 ('##ago', 5572),
 ('01.', 9792),
 ('Pok', 11441),
 ('fordert', 8559),
 ('58', 8393),
 ('Rezens', 14475),
 ('klass', 4457),
 ('Österreich', 2661),
 ('Anhängern', 23532),
 ('[unused546]', 27545),
 ('Beschleun', 21506),
 ('Kaufpreis', 14774),
 ('bewirken', 22453),
 ('##fär', 25424),
 ('Honorar', 14227),
 ('bestehende', 7726),
 ('Personal', 3959),
 ('Verhandlungs', 16663),
 ('Rese', 14429),
 ('177', 18927),
 ('wirkte', 6420),
 ('schien', 12867),
 ('unglück', 21829),
 ('legitim', 20663),
 ('[unused1068]', 28067),
 ('##bek', 6295),
 ('##fahrts', 13135),
 ('Wörter', 14944),
 ('Abk', 14423),
 ('Rechnungen', 17913),
 ('[unused65]', 27064),
 ('kurze', 7478),
 ('[unused343]', 27342),
 ('Prinzessin', 15653),
 ('Periode', 21859)]
```


Above, we can see each token with its corresponding ID (just as we saw in the seminar lectures). However, there are two things that catch our attention...

<br>

**Why do some tokens start with "##"?**

Well, this is just a way of indicating that the token is "non-initial", i.e., it continues a word that a previous token started (remember that Transformers usually work at a sub-word level).
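
To see where the `##` prefix comes from, here is a minimal sketch of greedy longest-match-first WordPiece splitting for a single word. The vocabulary is a toy one that we made up for illustration (only loosely inspired by the entries printed above), so the resulting splits are not necessarily those of `bert-base-german-cased`.

```python
from typing import List, Optional, Set


def wordpiece(word: str, vocab: Set[str], unk: str = "[UNK]") -> List[str]:
    """Greedy longest-match-first split of one word into WordPiece tokens."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match: Optional[str] = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # non-initial pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1  # try a shorter piece
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Toy vocabulary (hypothetical).
toy_vocab = {"Kauf", "Kaufpreis", "##preis", "##e", "Schiff", "##fahrts", "##weg"}

print(wordpiece("Kaufpreise", toy_vocab))
print(wordpiece("Schifffahrtsweg", toy_vocab))
```

With this toy vocabulary, `Kaufpreise` is split into `Kaufpreis` and `##e`: the second piece is marked with `##` because it does not start the word.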

<br>


**What about the `[unused###]` tokens?**

These are, unsurprisingly, tokens that are not used. However, they can come in handy for adding more words to the vocabulary:

> Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized. ([source](https://github.com/google-research/bert/issues/9#issuecomment-434796704))
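
As an illustration of that suggestion, the sketch below reassigns `[unusedN]` slots of a vocabulary dictionary to new tokens such as glosses. It works on a plain `dict` like the one returned by `get_vocab()`; the function name `assign_unused_slots` is ours, and writing the result back into a tokenizer is not shown here.

```python
import re
from typing import Dict, List


def assign_unused_slots(vocab: Dict[str, int], new_tokens: List[str]) -> Dict[str, int]:
    """Return a copy of `vocab` in which new tokens take over `[unusedN]` slots."""
    # Collect the placeholder tokens, ordered by their IDs.
    unused = sorted(
        (token for token in vocab if re.fullmatch(r"\[unused\d+\]", token)),
        key=lambda token: vocab[token],
    )
    # Skip duplicates and tokens that already exist in the vocabulary.
    missing = [token for token in dict.fromkeys(new_tokens) if token not in vocab]
    if len(missing) > len(unused):
        raise ValueError("not enough unused slots for the new tokens")
    updated = dict(vocab)
    for placeholder, token in zip(unused, missing):
        updated[token] = updated.pop(placeholder)  # reuse the placeholder's ID
    return updated


# Three entries taken from the output printed earlier.
toy_vocab = {"[unused65]": 27064, "[unused343]": 27342, "Personal": 3959}
print(assign_unused_slots(toy_vocab, ["neg-GENUG", "K+R+E+T+A"]))
```

Because the IDs stay the same, the embedding matrix keeps its size; the rows of the reused slots simply start from their untrained values.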

As a side note, we would like to mention that someone took the time to explore BERT's vocabulary and wrote a great article about it. The article can be found at https://juditacs.github.io/2019/02/19/bert-tokenization-stats.html.

# References

<a id='ref_1'>[1]</a> WHO: World Health Organization. Deafness and hearing loss. http://www.who.int/mediacentre/factsheets/fs300/en/, 2021.

<a id='ref_2'>[2]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Multimodal deep hand sign language recognition in still images using restricted Boltzmann machine. Entropy, 20, 2018.

<a id='ref_3'>[3]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand sign language recognition using multi-view hand skeleton. Expert Systems With Applications, 150, 2020.

<a id='ref_4'>[4]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Video based isolated hand sign language recognition using a deep cascaded model. Multimedia Tools And Applications, 79:22965-22987, 2020.

<a id='ref_5'>[5]</a> Razieh Rastgoo, Kourosh Kiani, and Sergio Escalera. Hand pose aware multimodal isolated sign language recognition. Multimedia Tools And Applications, 80:127-163, 2021.

<a id='ref_6'>[6]</a> Mark Borg and Kenneth P. Camilleri. Phonologically-meaningful sub-units for deep learning-based sign language recognition. ECCV, 2020.

<a id='ref_7'>[7]</a> Agelos Kratimenos, Georgios Pavlakos, and Petros Maragos. 3d hands, face and body extraction for sign language recognition. ECCV, 2020.

<a id='ref_8'>[8]</a> Razieh Rastgoo, Kourosh Kiani, Sergio Escalera, and Mohammad Sabokrou. Sign Language Production: A Review. 2021.

<a id='ref_9'>[9]</a> Ben Saunders, Necati Cihan Camgoz, and Richard Bowden. Progressive Transformers for End-to-End Sign Language Production. https://www.ecva.net/papers/eccv_2020/papers_ECCV/papers/123560664.pdf. ECCV, 2020.

<a id='ref_10'>[10]</a> J. Forster, C. Schmidt, T. Hoyoux, O. Koller, U. Zelle, J. Piater, and H. Ney. RWTH-PHOENIX-Weather: A Large Vocabulary Sign Language Recognition and Translation Corpus. https://www-i6.informatik.rwth-aachen.de/publications/download/773/Forster-LREC-2012.pdf. In Language Resources and Evaluation (LREC), pages 3785-3789, Istanbul, Turkey, May 2012.

<a id='ref_11'>[11]</a> Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need. https://arxiv.org/pdf/1706.03762.pdf, June 2017.

<a id='ref_12'>[12]</a>

<a id='ref_13'>[13]</a>

<a id='ref_14'>[14]</a> Handspeak. Negation in Sign Language. https://www.handspeak.com/learn/index.php?id=156, 2022.