Please add the appropriate label to this ticket: enhancement.
Is your feature/enhancement request related to a problem? Please describe.
Currently the `TextDataset` class in the `language_model_trainer.py` file is suboptimal and quite slow. This is due to the use of for loops instead of comprehensions. Furthermore, the way random case flipping is done is error-prone for unusual characters, something the code sweeps under the rug. Specifically, if random casing changes the length of a sentence, this goes undetected because of the `if token >= tokens: break` statement, and results in tokens being lost. Since case flipping is applied after expanding the vocabulary, it can also lead to tokens being UNKed for no reason, or not being added at all.
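For context on why case flipping can change sentence length: Unicode case mapping is not length-preserving, so flipping the case of a string after tokenization can silently change the number of characters. A minimal demonstration:

```python
# Unicode case mapping is not one-to-one: the German sharp s 'ß'
# uppercases to the two-character sequence 'SS', so case flipping
# after tokenization can change the sequence length.
s = "straße"
print(len(s), len(s.upper()))  # 6 7
```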
Describe the solution you'd like
I would like to refactor the code to make it faster and remove the casing errors mentioned above. Specifically, it would entail applying the case changes directly when reading the text, and then doing everything with comprehensions. Unless there is a reason to do otherwise, I would also like to remove the (to my knowledge) unused `tokenize` function, and merge the `__getitem__` and `charsplit` methods, as I don't quite get why they are split.
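A minimal sketch of the load-time approach described above, assuming the function name, `flip_prob` parameter, and flipping policy (upper- or lowercasing a whole line) are all illustrative rather than taken from the actual trainer:

```python
import random

def load_lines(path, flip_prob=0.05):
    """Read a text file and apply random case flipping per line at load
    time, before the vocabulary is expanded, so any length changes caused
    by Unicode case mapping happen before tokens are counted.
    (Hypothetical helper; names and probabilities are illustrative.)"""
    with open(path, encoding="utf-8") as f:
        lines = [line.rstrip("\n") for line in f]
    # One comprehension replaces the per-token for loop; each line is
    # independently uppercased, lowercased, or left unchanged.
    return [
        line.upper() if random.random() < flip_prob
        else line.lower() if random.random() < flip_prob
        else line
        for line in lines
    ]
```

Because casing is resolved before the vocabulary is built, characters introduced by case mapping are counted normally instead of being UNKed or dropped.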
Additional context
Although I did not directly use your trainer as it was a bit too much for my needs, I initially was using a close replica of the `TextDataset`. Modifying it as I described reduced loading time by an order of magnitude, simplified the code, and removed all errors. For reference, here is a snippet of what my `__getitem__` currently looks like:
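(The original snippet is not reproduced in this copy of the issue. A hypothetical sketch of what a comprehension-based `__getitem__` with `charsplit` folded in might look like, with all names and the dictionary API assumed rather than taken from the actual code:)

```python
class CharDataset:
    """Hypothetical sketch: a character-level dataset whose __getitem__
    does the character splitting and id lookup in one comprehension,
    removing the need for a separate charsplit method."""

    def __init__(self, lines, char2idx, unk_idx=0):
        # Random case flipping is assumed to have been applied at load
        # time, so lengths here are final.
        self.lines = lines
        self.char2idx = char2idx
        self.unk_idx = unk_idx

    def __getitem__(self, index):
        # One comprehension replaces the per-character for loop; unknown
        # characters map to unk_idx. A real implementation would likely
        # wrap this list in a torch tensor.
        return [self.char2idx.get(ch, self.unk_idx) for ch in self.lines[index]]
```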
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.