Small Training Dataset #21
Closed
Description
Because tokenizing the full dataset takes a long time, I created a small dataset from only 10-20 of the json.gz files. Once training starts, it fails with the following error. Is it because the tokenizer/BPE has not seen this character?
  File "/CodeGen/codegen_sources/model/train.py", line 701, in <module>
    main(params)
  File "/CodeGen/codegen_sources/model/train.py", line 609, in main
    trainer.mlm_step(
  File "/CodeGen/codegen_sources/model/src/trainer.py", line 1005, in mlm_step
    show_batch(
  File "/CodeGen/codegen_sources/model/src/utils.py", line 74, in show_batch
    f"{label} sent: {restore_segmentation_sentence(source_sentence, roberta_mode)}"
  File "/CodeGen/codegen_sources/model/src/utils.py", line 563, in restore_segmentation_sentence
    return restore_roberta_segmentation_sentence(sentence)
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in restore_roberta_segmentation_sentence
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in <listcomp>
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
KeyError: '郞'
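
For context, here is a minimal sketch of how a byte-level (GPT-2/RoBERTa style) decoder table behaves. It assumes that byte_decoder in utils.py is the inverse of the usual GPT-2 byte-to-unicode map, which I have not verified against the repository. The helper names encode_bytes, restore, and find_unmapped are hypothetical and only for illustration. Under that assumption, the table holds exactly 256 characters, so a raw '郞' can only reach the decoder if the text was never byte-level encoded in the first place (for example, data tokenized in a different mode), not merely because the BPE never saw it.

```python
def bytes_to_unicode():
    """Standard GPT-2 style map from each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))


byte_encoder = bytes_to_unicode()
# Assumption: the repository's byte_decoder is built like this.
byte_decoder = {v: k for k, v in byte_encoder.items()}


def encode_bytes(text):
    """Hypothetical helper: map raw text to its byte-level character form."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def restore(text):
    """Mirrors the failing line in the traceback: raises KeyError on unmapped characters."""
    return bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")


def find_unmapped(text):
    """Hypothetical helper: list the characters that would trigger the KeyError."""
    return [c for c in text if c not in byte_decoder]


# A properly byte-encoded string round-trips, even for rare characters.
print(restore(encode_bytes("郞")))
# Raw text that skipped byte-level encoding contains characters outside the table.
print(find_unmapped("ab郞"))
```

Running find_unmapped over a few lines of the small dataset's tokenized files could show whether they were produced with byte-level encoding at all.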