Hey there, this is the PR related to the issue I raised earlier. I split the PR into 4 commits:
- Removed the `TextDataset.tokenize` method, which was used nowhere in the code
- Merged `TextDataset.charsplit` into `TextDataset.__getitem__`, as it was ugly to have these separated, and again the method `TextDataset.charsplit` was not used elsewhere
- Added `if token >= tokens: break` in the `for` loops to create the `ids`
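For reference, here is a minimal sketch of what the early-exit change looks like. The class layout, field names, and the use of `0` as a padding id are all assumptions for illustration, not the actual code (and the real code builds a tensor rather than a list):

```python
# Hypothetical sketch (names assumed): build a fixed-length list of token ids
# from a line of text, breaking out of the loop as soon as the id budget is
# filled instead of iterating over the whole line.

class TextDataset:
    def __init__(self, lines, vocab, tokens=20):
        self.lines = lines    # raw text lines
        self.vocab = vocab    # char -> id mapping (assumed)
        self.tokens = tokens  # fixed number of ids per item

    def __getitem__(self, index):
        ids = [0] * self.tokens  # 0 used as a padding id (assumed)
        for token, char in enumerate(self.lines[index]):
            if token >= self.tokens:
                break  # early exit: the rest of the line would be truncated anyway
            ids[token] = self.vocab.get(char, 0)
        return ids  # the real code would wrap this in a tensor
```

Without the `break`, a long line is iterated to the end even though only the first `tokens` characters can land in `ids`, which is where the speedup below comes from.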
I ran the tests and everything seems to work. I also tested the code on mock inputs on my CPU; to give you an idea of the speedup, with an input of 500,000 lines of length 20, the tensor is now created in 1 s, vs. over a minute before.
Now obviously during training this won't be a huge gain, as we load data asynchronously; however, the first batch of each epoch is always loaded in real time, so this represents a net gain at each epoch. I was training on the recently released Wiki40b, and the loading time of each split went from ~20 min to ~1 min, and I trained for over 30 epochs :p
Anyway, tell me what you think!