add Chinese NLP support #1210
Conversation
Thanks for this! I've replied on the other thread about the data issue. One thing to note here is that the tokenizer exceptions mechanism won't work in the ordinary way, because the tokenizer is replaced by Jieba. Here's what I'd like to happen for Chinese, given effort :). I think we should probably subclass the Tokenizer class and have it segment each character into a token. However, it should group numbers and Latin-alphabet words into single tokens, since it seems to me that a lot of Chinese text does include at least some non-Chinese tokens. Once the text is segmented into tokens, we should be able to use either the parser or tagger classes to predict the token boundaries. We'd then use the …
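The character-level fallback described above could be sketched roughly like this. This is a toy illustration of the segmentation rule only (one token per Chinese character, with Latin words and numbers kept whole), not spaCy's actual Tokenizer subclass; the function name and regex are my own:

```python
import re

# One Latin-alphabet word, or one number (optionally with a decimal
# part), or any other single non-space character (e.g. one CJK char).
_TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")

def segment_chars(text):
    """Segment text character-by-character, but keep runs of Latin
    letters and digits together as single tokens."""
    return _TOKEN_RE.findall(text)

print(segment_chars("我有2个iPhone手机"))
# Each Chinese character becomes its own token, while "2" and
# "iPhone" stay grouped.
```

A real implementation would subclass `spacy.tokenizer.Tokenizer` and emit `Doc` objects rather than plain strings, and a statistical model would then learn to merge these character tokens into words.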
Thanks for your help, I will try to figure it out. The Chinese language seems to be much more complicated.
If you need it, I can provide some corpora for your work.
@wilddylan That's great, thank you. My email: benderpan@163.com
@BenDerPan I think you should know that spaCy breaks …
This feature is very useful! Is there any progress on this PR? @BenDerPan
With Chinese support moving forward in v2.1.0a0, I think it's time to finally get this merged. Sorry about the long delay on this, but I'm looking forward to finally producing some models, hopefully by September or October. Thanks for your patience!
See #2591! |
@honnibal @ines I've already built a complete spaCy model for Chinese at https://github.com/howl-anderson/Chinese_models_for_SpaCy, and I am currently working on improving the model's performance. I am looking forward to working together to make an official Chinese model for spaCy.
Add Chinese NLP implementation code files, following the steps in the tutorials.
Description
Added Chinese stop words and created a spaCy-style package structure for Chinese support.
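The spaCy-style language structure mentioned here pairs a stop-word set with a `Language` subclass. The sketch below mimics that shape without depending on spaCy itself; the class names follow spaCy conventions, but the tiny stop-word set and the standalone classes are illustrative, not the PR's actual code:

```python
# A minimal, spaCy-flavoured sketch of a Chinese language definition.
# STOP_WORDS here is a tiny illustrative subset; a real submission
# would ship a full list in its own module.
STOP_WORDS = set("的 了 和 是 在 我 有 他 这 就".split())

class ChineseDefaults:
    """Mirrors the role of spaCy's per-language Defaults class."""
    stop_words = STOP_WORDS

class Chinese:
    """Mirrors the role of a spaCy Language subclass."""
    lang = "zh"
    Defaults = ChineseDefaults

nlp = Chinese()
print("的" in nlp.Defaults.stop_words)  # True
```

In spaCy itself, the equivalent structure lives under `spacy/lang/zh/`, where the `Chinese` class wires its `Defaults` (stop words, tokenizer exceptions, etc.) into the pipeline.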
I tried to build a Chinese NLP data model, but it failed; for more details see #1209.
Types of changes
Checklist: