
add Chinese NLP support #1210

Closed

wants to merge 2 commits

Conversation

BenDerPan

Add Chinese NLP implementation code files, following the steps in the tutorials.

Description

Added Chinese stopwords and created a spaCy-style directory structure for Chinese support.
I tried to build a Chinese NLP data model, but something failed; for more details see #1209.

Types of changes

  • Bug fix (non-breaking change fixing an issue)
  • New feature (non-breaking change adding functionality to spaCy)
  • Breaking change (fix or feature causing change to spaCy's existing functionality)
  • Documentation (addition to documentation of spaCy)

Checklist:

  • My change requires a change to spaCy's documentation.
  • I have updated the documentation accordingly.
  • I have added tests to cover my changes.
  • All new and existing tests passed.

@honnibal
Member

Thanks for this! I've replied on the other thread about the data issue.

One thing to note here is that the tokenizer exceptions mechanism won't work in the ordinary way, because the tokenizer is replaced by Jieba.

Here's what I'd like to happen for Chinese, given effort :). I think we should probably subclass the Tokenizer class, and have it segment each character into a token. However, it should group numbers and Latin-alphabet words into single tokens. It seems to me that a lot of Chinese text does include at least some non-Chinese tokens.

Once the text is segmented into tokens, we should be able to use either the parser or tagger classes to predict the token boundaries. We'd then use the merge() method to join them up.
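The plan above can be sketched in plain Python. This is a hedged illustration, not spaCy's actual `Tokenizer`/`Doc` API: segment each character into its own token while keeping Latin-alphabet words and numbers whole, then merge tokens back into words given predicted boundary labels. The boundary labels below are hard-coded stand-ins for what a tagger or parser would predict.

```python
import re

# One token per non-space character, except that runs of Latin letters
# and numbers (including decimals) are kept as single tokens.
TOKEN_RE = re.compile(r"[A-Za-z]+|\d+(?:\.\d+)?|\S")

def segment(text):
    """Character-level segmentation; whitespace is dropped."""
    return TOKEN_RE.findall(text)

def merge(tokens, starts_word):
    """Join character tokens into words.

    starts_word[i] is True if tokens[i] begins a new word -- the kind
    of boundary decision a statistical model would predict.
    """
    words = []
    for tok, start in zip(tokens, starts_word):
        if start or not words:
            words.append(tok)
        else:
            words[-1] += tok
    return words

tokens = segment("我喜欢自然语言处理")
# Pretend the model predicted boundaries at 我 | 喜欢 | 自然语言处理:
words = merge(tokens, [True, True, False, True, False, False, False, False, False])
```

In spaCy itself, the merge step would go through the `Doc` merging API rather than string concatenation, and the boundary predictions would come from a trained component rather than a hand-written list.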

@ines ines added the lang / zh Chinese language data and models label Jul 22, 2017
@BenDerPan
Author

Thanks for your help, I will try to figure it out. The Chinese language seems to be much more complicated.

@wilddylan

wilddylan commented Jul 25, 2017

If you need it, I'll provide some corpora for this work.

@BenDerPan
Author

@wilddylan That's great, thank you. My email: benderpan@163.com

@eromoe

eromoe commented Aug 3, 2017

@BenDerPan I think you should know that spaCy splits tokenization and POS tagging into two separate pipeline steps, which makes it very difficult to integrate a Chinese tokenizer. I opened issue #854.

@howl-anderson
Contributor

This feature is very useful! Any progress on this PR now? @BenDerPan

@honnibal
Member

With Chinese support moving forward in v2.1.0a0, I think it's time to finally get this merged. Sorry about the long delay on this, but I'm looking forward to finally producing some models, hopefully by September or October.

Thanks for your patience!

@ines ines added enhancement Feature requests and improvements and removed v1 spaCy v1.x labels Jul 24, 2018
@ines
Member

ines commented Jul 24, 2018

See #2591!

@ines ines closed this Jul 24, 2018
@howl-anderson
Contributor

@honnibal @ines I have already built a complete spaCy model for Chinese at https://github.com/howl-anderson/Chinese_models_for_SpaCy, and I am currently working on improving its performance. I look forward to working together on an official Chinese model for spaCy.
