
Enhance TextDataset for LM training #2202

Closed
r0mainK opened this issue Apr 8, 2021 · 3 comments
Labels
wontfix This will not be worked on

Comments

@r0mainK
Contributor

r0mainK commented Apr 8, 2021

Please add the appropriate label to this ticket: enhancement.

Is your feature/enhancement request related to a problem? Please describe.

Currently the TextDataset class in the language_model_trainer.py file is suboptimal and quite slow. This is due to the use of for loops instead of comprehensions. Furthermore, the way random case flipping is done is error-prone for unusual characters, an issue the code silently sweeps under the rug. Specifically, if random casing changes the length of a sentence, this is not caught, due to the if token >= tokens: break statement, and results in tokens being lost. Since we do it after expanding the vocabulary, it can also lead to tokens being UNKed for no reason, or not being added at all.
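To make the length-change hazard concrete, here is a small illustration (not project code): some Unicode characters map to a different number of characters under case conversion, so flipping case after tokenization can silently shift or drop tokens.

```python
# Illustrative only: Unicode case mapping is not length-preserving.
s = "straße"
upper = s.upper()  # 'ß' uppercases to 'SS'

print(len(s))      # 6
print(len(upper))  # 7 -- one character longer after the case change
```

Because the uppercased string is longer, any token indices computed before the case flip no longer line up with the flipped text.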

Describe the solution you'd like

I would like to refactor the code to make it faster and remove the casing errors mentioned above. Specifically, this would entail applying the case changes directly when reading the text, and then doing everything with comprehensions. Unless there is a reason to do otherwise, I would also like to remove the unused (to my knowledge) tokenize function, and merge the __getitem__ and charsplit methods, as I don't quite get why they are split.
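A minimal sketch of the kind of loop-to-comprehension refactor meant here (all names are illustrative, not the actual Flair code):

```python
# Hypothetical example: map characters to ids in one comprehension
# instead of an index-tracking for loop with a break statement.
char_to_ids = {b"a": 0, b"b": 1}  # toy vocabulary
unk_id = 2                        # id for out-of-vocabulary characters

line = "abX"
ids = [char_to_ids.get(ch.encode("utf-8"), unk_id) for ch in line]

print(ids)  # [0, 1, 2] -- 'X' is not in the vocabulary, so it maps to unk_id
```

The comprehension always consumes the whole line, so a line whose length changed under case flipping cannot lose characters the way an index-bounded loop can.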

Additional context

Although I did not directly use your trainer, as it was a bit too much for my needs, I initially was using a close replica of the TextDataset. Modifying it as I described reduced loading time by an order of magnitude, simplified the code, and removed all errors. For reference, here is a snippet of what my __getitem__ currently looks like:

        # read all lines, applying the random case change up-front
        with gzip.open(self.files[split_id]) as fin:
            lines = list(map(self.apply_random_case_change, jsonlines.Reader(fin)))
        if self.shuffle_lines:
            random.shuffle(lines)
        # map each character to its id, appending a delimiter after every line
        lines = (
            [self.char_to_ids.get(char.encode("utf-8"), self.unk_id) for char in line]
            + [self.delimiter_id]
            for line in lines
        )
        ids = torch.tensor([char_id for line in lines for char_id in line], dtype=torch.uint8)
        # a backward LM consumes the sequence in reverse
        if not self.is_forward_lm:
            ids = ids.flip(0)
@alanakbik
Collaborator

Hello @r0mainK looks very interesting. If you'd like to prepare a PR for this we'd appreciate it. Would be great to speed up training the LM!

@r0mainK
Contributor Author

r0mainK commented Apr 8, 2021

Hey @alanakbik I just prepared the PR for this, feel free to review whenever :)

@stale

stale bot commented Aug 6, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the wontfix This will not be worked on label Aug 6, 2021
@stale stale bot closed this as completed Aug 13, 2021