
Add TinyBERT data augmentation #1923

Merged: 21 commits into master from tinybert_data_augmentation on Jan 4, 2022

Conversation


@MichelBartels MichelBartels commented Dec 23, 2021

Proposed changes:
This adds TinyBERT data augmentation as described in #1874.
In its current form, it is quite separate from the rest of the codebase, as there wasn't an existing Haystack abstraction that seemed appropriate. DataSilo was considered, but this didn't seem to be in its spirit. Integrating it into distil_from, for example, would probably also require a lot of additional parameters that you wouldn't expect in that method.

Currently the following works:

  • it can augment the SQuAD dataset with the same technique used in the paper
  • in contrast to the original implementation, it uses batching, which leads to better performance (about 8 times the speed); see the sketch below

Status (please check what you already did):

  • First draft (up for discussions & feedback)
  • Final code
  • Added tests
  • Updated documentation

closes #1874
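
A minimal sketch of the batched masked-LM lookup mentioned above (the function and variable names are illustrative, not the exact PR code):

import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidates_batched(masked_sentences, top_k=5, batch_size=16):
    # For each sentence containing exactly one [MASK], return the top_k candidate tokens.
    results = []
    for i in range(0, len(masked_sentences), batch_size):
        batch = masked_sentences[i : i + batch_size]
        inputs = tokenizer(batch, padding=True, return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        mask_positions = inputs["input_ids"] == tokenizer.mask_token_id
        for positions, logit_row in zip(mask_positions, logits):
            mask_logits = logit_row[positions][0]  # logits at the [MASK] position
            ranking = mask_logits.topk(top_k).indices.tolist()
            results.append(tokenizer.convert_ids_to_tokens(ranking))
    return results

print(candidates_batched(["The capital of France is [MASK]."]))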


@julian-risch julian-risch left a comment


Very much looking forward to the results of the first experiments! I thought I'd leave some feedback, although I understand this is still a draft.
Let's also compare different GloVe embedding models, and maybe even fastText, in the experiments at some point later. I would also be interested to learn how often the replaced words are single-piece words (where BERT is used) versus multi-piece words (where GloVe is used).
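
A quick way to measure that is to check whether BERT's WordPiece tokenizer keeps a word whole (a sketch, not code from the PR):

from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def is_single_piece(word: str) -> bool:
    # A word is single-piece if WordPiece does not split it into subwords.
    return len(tokenizer.tokenize(word)) == 1

words = ["hello", "unanswerable", "electroencephalography"]
single = sum(is_single_piece(w) for w in words)
print(f"{single}/{len(words)} words are single-piece")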

@@ -0,0 +1,178 @@
"""
Script to perform data augmentation on a SQuAD-like dataset to increase training data. It follows the approach outlined in the TinyBERT paper.
Member

let's add a link to the paper here as well

Contributor Author

I have added the link now.

"""
Script to perform data augmentation on a SQuAD-like dataset to increase training data. It follows the approach outlined in the TinyBERT paper.
Usage:
python augment_squad.py --squad_path <squad_path> --output_path <output_patn> \
Member

typo: patn -> path

Contributor Author

This is now fixed.

def load_glove(glove_path: Path = Path("glove.txt"), vocab_size: int = 100_000):
    if not glove_path.exists():
        zip_path = glove_path.parent / (glove_path.name + ".zip")
        request = requests.get("https://nlp.stanford.edu/data/glove.42B.300d.zip", allow_redirects=True)
Member

Let's compare the performance of https://nlp.stanford.edu/data/glove.840B.300d.zip and https://nlp.stanford.edu/data/glove.6B.zip.
It would also be interesting to see whether fastText performs better than GloVe when used for data augmentation. We need a way to use data augmentation for non-English datasets as well, and fastText could help with that.

import fasttext
ft = fasttext.load_model('cc.en.300.bin')
# German model: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz
ft.get_nearest_neighbors('hello')
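
For the GloVe side of the comparison, nearest neighbours can be computed directly from the loaded vectors (a sketch; word_to_id, id_to_word, and the vectors matrix are assumed to come from load_glove):

import numpy as np

def glove_nearest_neighbors(word, word_to_id, id_to_word, vectors, k=5):
    # Cosine-similarity nearest neighbours, skipping the query word itself.
    v = vectors[word_to_id[word]]
    sims = vectors @ v / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(v) + 1e-8)
    best = np.argsort(-sims)[1 : k + 1]
    return [id_to_word[i] for i in best]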

Contributor Author

I agree that this would be interesting to see. I think that with a model trained using whole word masking we could perhaps even skip this additional step entirely; however, it would probably be quite costly to test.

Distilling SQuAD the way they did for the original results would be equivalent, time-wise, to about 100 epochs on an unaugmented dataset. 100 of those epochs would take about 100 * 45 min = 75 h. In addition to that, data augmentation takes about 45 h because you need to do a forward pass for each word.

Although I am trying to improve data augmentation speed and we could use a smaller model as the student, trying out different tokenizers is probably not worth it, as about 94% of all words can be replaced using BERT (tested with this dataset).

        possible_words.append([word] + tokenizer.convert_ids_to_tokens(ranking))

        batch_index += 1
    elif word in glove_word_id_mapping:
Member

This is where we could try fastText. The elif would then become an else, because fastText has no out-of-vocabulary issues (it builds vectors from character n-grams):

import fasttext
ft = fasttext.load_model('cc.en.300.bin')
# German model: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.de.300.bin.gz
ft.get_nearest_neighbors('hello')

Contributor Author

That could be useful: in 43% of the cases where BERT can't be used, GloVe doesn't work either (same test on this dataset as above).

However, I'm not sure how big the difference would really be, as this would still only affect about 2.6% of all cases (43% of the roughly 6% that BERT can't handle), at the cost of no longer being comparable to the original paper, or of additional distillation runs (which, as explained above, take a lot of time).

parser.add_argument("--replace_probability", type=float, default=0.4, help="Probability of replacing a word")
parser.add_argument("--glove_path", type=Path, default="glove.txt", help="Path to the glove file")

model = BertForMaskedLM.from_pretrained("bert-base-uncased")
Member

Let's pass the model/tokenizer name as an argument to the script as well.

Contributor Author

I have added these arguments.
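
For example, the added flags might look like this (a sketch; the exact flag names in the merged script may differ):

parser.add_argument("--model", type=str, default="bert-base-uncased", help="Masked language model used to generate word replacements")
parser.add_argument("--tokenizer", type=str, default="bert-base-uncased", help="Tokenizer matching the model")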

@MichelBartels MichelBartels marked this pull request as ready for review January 4, 2022 13:23

@julian-risch julian-risch left a comment


There are two points that I would like to talk about before merging this PR. First, let's discuss whether is_impossible should be set to True and why. Second, I came up with an idea for a test case to test the script end-to-end (at least the number of generated questions and the format of the generated SQuAD file).

for topic in tqdm(squad["data"]):
    paragraphs = []
    for paragraph in topic["paragraphs"]:
        # make every question unanswerable as answer strings will probably match and aren't relevant for distillation
Member

I am not sure I understand this comment correctly. I understand that answer strings won't be relevant for distillation because we will make predictions with the teacher model anyway. However, what do you mean by "answer strings will probably match"? Why do we want to set is_impossible to True? That would result in this question being handled as not answerable. Couldn't we leave question["answers"] = [] as is but have question["is_impossible"] = False?
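
For reference, the suggestion corresponds to emitting augmented questions shaped like this (an illustrative SQuAD-format entry, not code from the PR):

question = {
    "question": "Where is the Eiffel Tower?",  # augmented question text
    "id": "augmented-0",                       # hypothetical id scheme
    "answers": [],                             # no gold answers; the teacher model predicts them later
    "is_impossible": False,                    # still treated as answerable
}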


args = parser.parse_args()

augment_squad(**vars(args))
Member

Regarding testing: I would suggest that for a test we load a small/tiny SQuAD file with SquadData and count the number of questions with

def count(self, unit="questions"):

The next step is to run the augment_squad() method and, in the end, load the result again with SquadData and count again to see whether the size of the dataset was multiplied as expected by multiplication_factor. What do you think?
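
A sketch of such a test (the import paths, sample file path, and augment_squad signature are assumptions, not the final test code):

from pathlib import Path
from haystack.utils.squad_data import SquadData  # import path assumed
from augment_squad import augment_squad  # script under test; signature assumed

def test_augment_squad(tmp_path):
    squad_path = Path("samples/squad/tiny.json")  # hypothetical tiny dataset
    output_path = tmp_path / "augmented.json"
    multiplication_factor = 4

    original = SquadData.from_file(squad_path).count(unit="questions")
    augment_squad(squad_path=squad_path, output_path=output_path, multiplication_factor=multiplication_factor)
    augmented = SquadData.from_file(output_path).count(unit="questions")
    assert augmented == original * multiplication_factor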


@julian-risch julian-risch left a comment


LGTM! Let's wait for the tests and merge if all of them are green.

@MichelBartels MichelBartels merged commit 0b0b968 into master Jan 4, 2022
@MichelBartels MichelBartels deleted the tinybert_data_augmentation branch January 4, 2022 17:34