Implement custom Chinese tokenizer. #3

emfomy · 2021-01-22T07:00:26Z

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

Disable WordPiece. Convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))).
Reimplement the clean_up_tokenization method. The default method is designed for English only. Our method may remove whitespace and convert half-width punctuation marks to full-width ones.
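
A minimal sketch of the two features above, in plain Python with no dependency on transformers. The names convert_tokens_to_ids and clean_up_tokenization here are standalone hypothetical helpers that only mirror the method names mentioned above, the vocab dict is a toy stand-in for a real vocabulary, and the punctuation set is an assumption (a real implementation would likely need a larger, hand-tuned mapping).

```python
# Hypothetical standalone helpers; not the transformers API.

# Assumed subset of half-width punctuation. For these ASCII marks the
# full-width form sits exactly 0xFEE0 code points higher (e.g. "," -> U+FF0C).
HALF_TO_FULL = {ord(ch): ord(ch) + 0xFEE0 for ch in ",!?:;()"}


def convert_tokens_to_ids(tokens, vocab, unk_token="[UNK]"):
    """Map each token to its ID, falling back to the unknown-token ID."""
    unk_id = vocab[unk_token]
    return [vocab.get(token, unk_id) for token in tokens]


def encode_by_char(text, vocab):
    """Skip WordPiece entirely: every character is one token."""
    return convert_tokens_to_ids(list(text), vocab)


def clean_up_tokenization(text):
    """Remove all whitespace, then widen the assumed punctuation subset."""
    no_space = "".join(text.split())
    return no_space.translate(HALF_TO_FULL)


# Toy vocabulary for illustration only.
toy_vocab = {"[UNK]": 0, "你": 1, "好": 2, ",": 3}
ids = encode_by_char("你好,嗎", toy_vocab)
cleaned = clean_up_tokenization("你 好 , 世 界 !")
```

Note that a blanket half-width to full-width conversion would also widen punctuation inside numbers or Latin text, so a real clean-up step probably needs to be context-aware.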

emfomy · 2021-02-26T04:50:43Z

Consider using 🤗 Tokenizers instead, once it releases a stable version.

emfomy added the "Priority: High" (second priority) label Jan 22, 2021

emfomy self-assigned this Jan 22, 2021

emfomy added this to the 0.3.0 milestone Jan 22, 2021

emfomy added the "Status: 1-Assigned" (assigned an assignee) and "Type: Enhancement" (request new feature) labels Jan 22, 2021

emfomy mentioned this issue Feb 1, 2021

Unable to load weights from pytorch checkpoint. #4

Closed

emfomy added the "Status: 4-Blocked" (unsolvable) label and removed the "Status: 1-Assigned" (assigned an assignee) label Feb 26, 2021

emfomy removed this from the 0.3.0 milestone Feb 26, 2021

emfomy closed this as completed Feb 26, 2021
