
add Chinese NLP support #1210

Closed

wants to merge 2 commits

Conversation

BenDerPan

Add Chinese NLP implementation code files, following the steps in the tutorials.

Description

Added Chinese stopwords and created a spaCy-style directory structure for Chinese support.
I tried to build a Chinese NLP data model, but something failed; for more details see #1209.

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (addition to documentation of spaCy)

Checklist:

  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal
Member

Thanks for this! I've replied on the other thread about the data issue.

One thing to note here is that the tokenizer exceptions mechanism won't work in the ordinary way, because the tokenizer is replaced by Jieba.

Here's what I'd like to happen for Chinese, given effort :). I think we should probably subclass the Tokenizer class, and have it segment each character into a token. However, it should group numbers and Latin-alphabet words into single tokens. It seems to me that a lot of Chinese text does include at least some non-Chinese tokens.

Once the text is segmented into tokens, we should be able to use either the parser or tagger classes to predict the token boundaries. We'd then use the merge() method to join them up.
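The plan above can be sketched in plain Python. This is a hedged illustration, not spaCy's actual `Tokenizer`/`Doc` API: segment each character into its own token while keeping Latin-alphabet words and numbers whole, then merge tokens back into words given predicted boundary labels. The boundary labels below are hard-coded stand-ins for what a tagger or parser would predict.

```python
import re

# One token per non-space character, except that runs of Latin letters
# and numbers (including decimals) are kept as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")

def segment(text):
    """Character-level segmentation; whitespace is dropped."""
    return TOKEN_RE.findall(text)

def merge(tokens, starts_word):
    """Join character tokens into words.

    starts_word[i] is True if tokens[i] begins a new word -- the kind
    of boundary decision a statistical model would predict.
    """
    words = []
    for tok, start in zip(tokens, starts_word):
        if start or not words:
            words.append(tok)
        else:
            words[-1] += tok
    return words

tokens = segment("我喜欢自然语言处理")
# Pretend the model predicted boundaries at 我 | 喜欢 | 自然语言处理:
words = merge(tokens, [True, True, False, True, False, False, False, False, False])
```

In spaCy itself, the merge step would go through the `Doc` merging API rather than string concatenation, and the boundary predictions would come from a trained component rather than a hand-written list.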

@ines ines added the lang / zh Chinese language data and models label Jul 22, 2017
@BenDerPan
Author

Thanks for your help, I will try to figure it out. The Chinese language seems to be much more complicated.

@wilddylan

wilddylan commented Jul 25, 2017

If you need it, I'll provide some corpora for this work.

@BenDerPan
Author

@wilddylan That's great, thank you. My email: benderpan@163.com

@eromoe

eromoe commented Aug 3, 2017

@BenDerPan I think you should know that spaCy splits tokenization and POS tagging into two separate pipeline steps, which makes it very difficult to integrate a Chinese tokenizer. I opened issue #854.

@howl-anderson
Contributor

This feature is very useful! Any progress on this PR now? @BenDerPan

@honnibal
Member

With Chinese support moving forward in v2.1.0a0, I think it's time to finally get this merged. Sorry about the long delay on this, but I'm looking forward to finally producing some models, hopefully by September or October.

Thanks for your patience!

@ines ines added enhancement Feature requests and improvements and removed v1 spaCy v1.x labels Jul 24, 2018
@ines
Member

ines commented Jul 24, 2018

See #2591!

@ines ines closed this Jul 24, 2018
@howl-anderson
Contributor

@honnibal @ines I have already built a complete spaCy model for Chinese at https://github.com/howl-anderson/Chinese_models_for_SpaCy, and I am currently working on improving its performance. I look forward to working together on an official Chinese model for spaCy.
