Implement custom Chinese tokenizer. #3

Closed
emfomy opened this issue Jan 22, 2021 · 2 comments
Labels: Priority: High (second priority), Status: 4-Blocked (unsolvable), Type: Enhancement (request new feature)

Comments

Member

emfomy commented Jan 22, 2021

We may implement our own tokenizer rather than using BertTokenizerFast.
Our own tokenizer should have the following features:

  • Disable WordPiece. Convert text to token IDs character by character (e.g. tokenizer.convert_tokens_to_ids(list(input_text))).
  • Reimplement the clean_up_tokenization method. The default implementation targets English only. Ours could remove whitespace and convert half-width punctuation to full-width.
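
A minimal sketch of the two features above, written in plain Python so it runs without any tokenizer library. The names VOCAB, chars_to_ids, and clean_up_chinese are hypothetical and used for illustration only, and the toy vocabulary stands in for a real model vocabulary. Mapping every ASCII punctuation mark to its full-width form is an assumption, since a real implementation might convert only a subset.

```python
import string

# Hypothetical toy vocabulary; a real one would be loaded from the model's vocab file.
VOCAB = {"[UNK]": 0, "我": 1, "愛": 2, "你": 3, "!": 4}

# Full-width forms of ASCII punctuation sit at a fixed offset of 0xFEE0
# (e.g. "!" U+0021 -> "!" U+FF01). Assumption: convert all ASCII punctuation.
HALF_TO_FULL = {ord(ch): ord(ch) + 0xFEE0 for ch in string.punctuation}


def chars_to_ids(text, vocab=VOCAB, unk_token="[UNK]"):
    """Convert text to IDs one character at a time, with no WordPiece splitting."""
    unk_id = vocab[unk_token]
    return [vocab.get(ch, unk_id) for ch in text]


def clean_up_chinese(text):
    """Remove all whitespace, then convert half-width punctuation to full-width."""
    no_space = "".join(text.split())
    return no_space.translate(HALF_TO_FULL)
```

With this sketch, characters missing from the vocabulary fall back to the [UNK] ID instead of being split into sub-word pieces, and the cleanup step would typically be applied to decoded output, where tokens are joined with spaces.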
emfomy added the Priority: High label Jan 22, 2021
emfomy self-assigned this Jan 22, 2021
emfomy added this to the 0.3.0 milestone Jan 22, 2021
emfomy added the Status: 1-Assigned and Type: Enhancement labels Jan 22, 2021
Member Author

emfomy commented Feb 26, 2021

Consider using 🤗 Tokenizers instead, once it releases a stable version.

emfomy added the Status: 4-Blocked label and removed the Status: 1-Assigned label Feb 26, 2021
emfomy removed this from the 0.3.0 milestone Feb 26, 2021
emfomy closed this as completed Feb 26, 2021