
Lm dataset improve #2203

Merged
Merged 4 commits into flairNLP:master on May 1, 2021
Conversation

@r0mainK (Contributor) commented Apr 8, 2021

Hey there, this is the PR related to the issue I raised earlier. I split the PR into 4 commits:

  • remove the TextDataset.tokenize method, which was not called anywhere in the code
  • merge TextDataset.charsplit into TextDataset.__getitem__, since it was ugly to keep them separated and charsplit was likewise not used anywhere else
  • apply the random case changes before expanding the vocab and counting the tokens, which removes two potential errors (see the sketch after this list):
    • a case change that modified a sentence's length (this happens on rare tokens), which would shift the text and corrupt the input - it was not caught because of this statement: if token >= tokens: break
    • a case change that led to a character being added to the vocab but never used, while the actual character was UNKed
  • split the lines only once (instead of twice when vocab expansion is enabled) and use a list comprehension instead of nested for loops to create the ids
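
To make the ordering issue concrete, here is a minimal sketch of the fixed pipeline. This is not flair's actual implementation: `lines_to_ids`, `vocab`, `UNK`, and `case_flip_prob` are hypothetical placeholder names used only for illustration.

```python
import random

# Hypothetical placeholders, not flair's actual code.
UNK = "<unk>"

def lines_to_ids(lines, vocab, expand_vocab=False, case_flip_prob=0.05):
    ids = []
    for line in lines:
        # 1. Apply the random case change FIRST. Casing can change a string's
        #    length in Python (len("ß".upper()) == 2), so doing this after
        #    counting tokens would shift the text and corrupt the input.
        if random.random() < case_flip_prob:
            line = line.upper() if random.random() < 0.5 else line.lower()
        # 2. Split the line into characters only once.
        chars = list(line)
        # 3. Expand the vocab with the characters that are actually used, so
        #    no character is added to the vocab only to be UNKed afterwards.
        if expand_vocab:
            for c in chars:
                vocab.setdefault(c, len(vocab))
        # 4. Build the ids in a single list comprehension.
        ids.append([vocab.get(c, vocab[UNK]) for c in chars])
    return ids

vocab = {UNK: 0}
print(lines_to_ids(["Straße"], vocab, expand_vocab=True))  # e.g. [[1, 2, 3, 4, 5, 6]]
```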

I ran the tests and everything seems to work. I also benchmarked the code on mock inputs on my CPU: to give you an idea of the speedup, with an input of 500,000 lines of length 20 the tensor is now created in about 1 second, versus over a minute before.
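
For reference, a mock benchmark along these lines could look like the following; it reuses the hypothetical `lines_to_ids` and `UNK` placeholders from the sketch above and is not the exact script I ran.

```python
import random
import string
import time

# Mock input: 500,000 lines of 20 random lowercase characters each.
lines = ["".join(random.choices(string.ascii_lowercase, k=20))
         for _ in range(500_000)]

vocab = {UNK: 0}
start = time.perf_counter()
ids = lines_to_ids(lines, vocab, expand_vocab=True)
print(f"built ids for {len(ids):,} lines in {time.perf_counter() - start:.2f}s")
```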

Now obviously this won't be a huge gain during training, since we load data asynchronously; however, the first batch of each epoch is always loaded in real time, so this represents a net gain at every epoch. I was training on the recently released Wiki40B, where the loading time of each split went from ~20 min to ~1 min, and I trained for over 30 epochs :p

Anyway, tell me what you think!

Signed-off-by: Romain Keramitas <r.keramitas@gmail.com>
@alanakbik (Collaborator) commented:

@r0mainK wow, the code is much more succinct and faster, thanks for adding this! And sorry for only getting around to reviewing it now!

@alanakbik merged commit 78e2239 into flairNLP:master on May 1, 2021