Small Training Dataset

Because tokenizing the full dataset takes a long time, I created a small dataset from only 10-20 of the json.gz files. As soon as training starts, it fails with the error below. Is this because the tokenizer/BPE has not seen this character?

```
  File "/CodeGen/codegen_sources/model/train.py", line 701, in <module>
    main(params)
  File "/CodeGen/codegen_sources/model/train.py", line 609, in main
    trainer.mlm_step(
  File "/CodeGen/codegen_sources/model/src/trainer.py", line 1005, in mlm_step
    show_batch(
  File "/CodeGen/codegen_sources/model/src/utils.py", line 74, in show_batch
    f"{label} sent: {restore_segmentation_sentence(source_sentence, roberta_mode)}"
  File "/CodeGen/codegen_sources/model/src/utils.py", line 563, in restore_segmentation_sentence
    return restore_roberta_segmentation_sentence(sentence)
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in restore_roberta_segmentation_sentence
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
  File "/CodeGen/codegen_sources/model/src/utils.py", line 601, in <listcomp>
    res = bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")
KeyError: '郞'
```
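For context on what the failing line does, here is a minimal sketch. It assumes that `byte_decoder` is the inverse of the usual GPT-2/RoBERTa byte-to-unicode table, which the traceback does not confirm. Under that assumption the table has exactly 256 entries, so any character outside it (such as '郞') raises `KeyError` when it reaches the decoder without first being split into bytes. The helper names `encode_bytes`, `decode_strict`, and `decode_tolerant` are hypothetical and are not part of the CodeGen code base.

```python
def bytes_to_unicode():
    """GPT-2 style table: map each of the 256 byte values to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
# Assumption: the repository's byte_decoder is built the same way.
byte_decoder = {ch: b for b, ch in byte_encoder.items()}


def encode_bytes(text):
    """Hypothetical helper: byte-level encoding applied before BPE."""
    return "".join(byte_encoder[b] for b in text.encode("utf-8"))


def decode_strict(text):
    """Mirrors the expression on the failing line of the traceback."""
    return bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="replace")


def decode_tolerant(text):
    """Hypothetical fallback: pass characters outside the table through unchanged."""
    out = bytearray()
    for c in text:
        if c in byte_decoder:
            out.append(byte_decoder[c])
        else:
            out.extend(c.encode("utf-8"))
    return out.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print("郞" in byte_decoder)                      # the raw character is not a table entry
    print(decode_strict(encode_bytes("x = '郞'")))   # byte-encoded text round-trips
    print(decode_tolerant("郞"))                     # raw text survives the fallback
```

If this assumption holds, the error would suggest that the text reaching `restore_roberta_segmentation_sentence` was not byte-level encoded, for example because the small dataset was tokenized in a different mode than the one used for training, rather than that the BPE simply never saw the character. A tolerant decoder like the one above only hides the symptom, so checking the preprocessing mode is probably the better first step.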

Small Training Dataset #21
