
Tokenization strategy #9

Closed
IlyasMoutawwakil opened this issue Jun 24, 2022 · 2 comments

Comments

@IlyasMoutawwakil

Hi, and thanks for the awesome repo. Did you try any other tokenization strategies (SentencePiece, WordPiece, or BPE)? I see you use character-level tokenization, which is nice but probably doesn't make full use of the language model's capabilities. I would love to get some insights.
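For context, character-level tokenization of the kind described above amounts to roughly the following (a minimal sketch; the repo's actual vocabulary handling may differ, and the special tokens here are an assumption):

```python
# Minimal character-level tokenizer sketch (illustrative only;
# not taken from this repo). Ids 0 and 1 are reserved specials.
class CharTokenizer:
    def __init__(self, corpus):
        specials = ["<pad>", "<unk>"]
        chars = sorted(set("".join(corpus)))
        self.stoi = {s: i for i, s in enumerate(specials + chars)}
        self.itos = {i: s for s, i in self.stoi.items()}

    def encode(self, text):
        unk = self.stoi["<unk>"]
        return [self.stoi.get(ch, unk) for ch in text]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer(["안녕하세요", "hello world"])
print(tok.encode("hello"))               # one id per character
print(tok.decode(tok.encode("hello")))   # 'hello'
```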

@YongWookHa
Owner

Hello.
Thank you for your interest in this repository.
I used only a simple tokenization method because I built this model to recognize Korean letters.
I'm pretty sure you would get better performance with more sophisticated methods like BPE or SentencePiece.
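A minimal sketch of what that could look like with the SentencePiece library (the corpus path, model prefix, and vocab size below are placeholders, not values from this repo):

```python
import sentencepiece as spm

# Train a BPE model on a plain-text corpus (one sentence per line).
# "corpus.txt" and vocab_size=8000 are hypothetical placeholders.
spm.SentencePieceTrainer.train(
    input="corpus.txt",
    model_prefix="bpe",
    vocab_size=8000,
    model_type="bpe",
)

# Load the trained model and tokenize into subword pieces.
sp = spm.SentencePieceProcessor(model_file="bpe.model")
print(sp.encode("an example sentence", out_type=str))
```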

@WongVi commented Jul 19, 2022

@YongWookHa could you please let me know how I can make a tokenizer for a Japanese dataset?
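(Not an answer from the maintainer, but for anyone with the same question: SentencePiece operates on raw text and needs no word segmentation, so the recipe above applies to Japanese directly. The SentencePiece docs suggest raising character_coverage to 0.9995 for languages with rich character sets like Japanese or Chinese. The file names below are placeholders.)

```python
import sentencepiece as spm

# Same recipe as above, trained on raw Japanese text; no separate
# word-segmentation step is needed. Paths and sizes are placeholders.
spm.SentencePieceTrainer.train(
    input="japanese_corpus.txt",   # one sentence per line
    model_prefix="ja_bpe",
    vocab_size=8000,
    model_type="bpe",
    character_coverage=0.9995,     # suggested for CJK character sets
)

sp = spm.SentencePieceProcessor(model_file="ja_bpe.model")
print(sp.encode("これはテストです", out_type=str))
```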
