
Lm dataset improve #2203

Merged
Merged 4 commits into flairNLP:master on May 1, 2021
Conversation

@r0mainK (Contributor) commented Apr 8, 2021

Hey there, this is the PR related to the issue I raised earlier. I split the PR into 4 commits:

  • remove the TextDataset.tokenize method, which was not called anywhere in the code
  • merge TextDataset.charsplit into TextDataset.__getitem__, since it was ugly to keep them separated and charsplit was likewise not used anywhere else
  • apply the random case changes before expanding the vocab and counting the tokens, which removes two potential errors (see the sketch after this list):
    • a case change that modified a sentence's length (this happens on rare tokens), which would shift the text and corrupt the input - it was not caught because of this statement: if token >= tokens: break
    • a case change that led to a character being added to the vocab but never used, while the actual character was UNKed
  • split the lines only once (instead of twice when vocab expansion is enabled) and use a list comprehension instead of nested for loops to create the ids
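
To make the ordering issue concrete, here is a minimal sketch of the fixed pipeline. This is not flair's actual implementation: `lines_to_ids`, `vocab`, `UNK`, and `case_flip_prob` are hypothetical placeholder names used only for illustration.

```python
import random

# Hypothetical placeholders, not flair's actual code.
UNK = "<unk>"

def lines_to_ids(lines, vocab, expand_vocab=False, case_flip_prob=0.05):
    ids = []
    for line in lines:
        # 1. Apply the random case change FIRST. Casing can change a string's
        #    length in Python (len("ß".upper()) == 2), so doing this after
        #    counting tokens would shift the text and corrupt the input.
        if random.random() < case_flip_prob:
            line = line.upper() if random.random() < 0.5 else line.lower()
        # 2. Split the line into characters only once.
        chars = list(line)
        # 3. Expand the vocab with the characters that are actually used, so
        #    no character is added to the vocab only to be UNKed afterwards.
        if expand_vocab:
            for c in chars:
                vocab.setdefault(c, len(vocab))
        # 4. Build the ids in a single list comprehension.
        ids.append([vocab.get(c, vocab[UNK]) for c in chars])
    return ids

vocab = {UNK: 0}
print(lines_to_ids(["Straße"], vocab, expand_vocab=True))  # e.g. [[1, 2, 3, 4, 5, 6]]
```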

I ran the tests and everything seems to work. I also benchmarked the code on mock inputs on my CPU: to give you an idea of the speedup, with an input of 500,000 lines of length 20 the tensor is now created in about 1 second, versus over a minute before.
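
For reference, a mock benchmark along these lines could look like the following; it reuses the hypothetical `lines_to_ids` and `UNK` placeholders from the sketch above and is not the exact script I ran.

```python
import random
import string
import time

# Mock input: 500,000 lines of 20 random lowercase characters each.
lines = ["".join(random.choices(string.ascii_lowercase, k=20))
         for _ in range(500_000)]

vocab = {UNK: 0}
start = time.perf_counter()
ids = lines_to_ids(lines, vocab, expand_vocab=True)
print(f"built ids for {len(ids):,} lines in {time.perf_counter() - start:.2f}s")
```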

Now obviously this won't be a huge gain during training, since we load data asynchronously; however, the first batch of each epoch is always loaded in real time, so this represents a net gain at every epoch. I was training on the recently released Wiki40B, where the loading time of each split went from ~20 min to ~1 min, and I trained for over 30 epochs :p

Anyway, tell me what you think!

Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@alanakbik (Collaborator) commented:

@r0mainK wow, the code is much more succinct and faster, thanks for adding this! And sorry for only getting around to reviewing it now!

@alanakbik merged commit 78e2239 into flairNLP:master on May 1, 2021