
Chinese support for spacy #2483

Open
farhaanbukhsh opened this Issue Jun 27, 2018 · 14 comments

Comments

farhaanbukhsh commented Jun 27, 2018

I have been trying to train a model for Chinese. I used https://github.com/UniversalDependencies/UD_Chinese-PUD/tree/master, converted the data to JSON with the converter, and stored it.

Since there is no separate set such as dev or train, I used the same data for both. After all this was done, I loaded the model and started using it.

In [2]: import spacy
   ...: nlp = spacy.load('models/model-final/')


In [3]: doc = nlp(u"嘿,你怎麼樣?")
   ...: for chunk in doc.noun_chunks:
   ...:     print(chunk.text, chunk.root.text, chunk.root.dep_,
   ...:           chunk.root.head.text)
Building prefix dict from the default dictionary ...
Loading model from cache /tmp/jieba.cache
Loading model cost 0.222 seconds.
Prefix dict has been built succesfully.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-730a638fc8fd> in <module>()
      1 doc = nlp(u"嘿,你怎麼樣?")
----> 2 for chunk in doc.noun_chunks:
      3     print(chunk.text, chunk.root.text, chunk.root.dep_,
      4           chunk.root.head.text)
      5 

doc.pyx in __get__()

TypeError: 'NoneType' object is not callable

Your Environment

  • Operating System: Linux
  • Python Version Used: Venv 2.7 and 3.x
  • spaCy Version Used: latest
  • Environment Information:

farhaanbukhsh commented Jun 27, 2018

Am I doing something wrong or what could be the issue?


Member

honnibal commented Jun 27, 2018

I'm not immediately sure what commands you've run. The best way to train a Chinese model at the moment is to use the ud-train command on the develop branch. See here: #2011


Member

ines commented Jun 27, 2018

About the doc.noun_chunks error you're seeing: The noun chunks are implemented via language-specific rules – see here for an example for English. Rules like this currently don't exist for Chinese. It should definitely fail more gracefully here, though – I'm pretty sure this has been fixed already for the upcoming version, but I'll double-check!
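To make the failure mode concrete: `doc.noun_chunks` blows up because the language has no noun-chunks iterator registered. A small defensive wrapper along these lines shows what failing gracefully could look like; this is only a sketch, not spaCy's actual fix, and `safe_noun_chunks` / `ChunklessDoc` are hypothetical names:

```python
# Sketch: tolerate a language with no noun_chunks rules by catching
# the TypeError spaCy raises and falling back to a user-supplied
# chunker (or an empty list).
def safe_noun_chunks(doc, fallback=None):
    try:
        return list(doc.noun_chunks)
    except (TypeError, NotImplementedError):
        return list(fallback(doc)) if fallback is not None else []

# Stand-in for a Doc whose language defines no noun_chunks rules:
class ChunklessDoc:
    @property
    def noun_chunks(self):
        raise TypeError("'NoneType' object is not callable")

print(safe_noun_chunks(ChunklessDoc()))  # []
```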


farhaanbukhsh commented Jun 27, 2018

Hey @ines, what if I want to implement the same rules for Chinese? How do I do it?


Member

ines commented Jun 27, 2018

You could either copy the syntax iterators over and implement them the way it's done in English, or you could just call the English noun_chunks function on your Chinese doc:

from spacy.lang.en.syntax_iterators import noun_chunks

doc = nlp(u"嘿,你怎麼樣?")  # nlp is your loaded Chinese model
chunks = list(noun_chunks(doc))

Not sure if this all works as expected, though, and if the label scheme matches etc.
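For anyone who wants to copy the syntax iterators over, here is a minimal, self-contained sketch of how this kind of rule-based noun chunking works, using plain tuples instead of real `Token` objects so it runs without a trained model. The dependency labels are an abbreviated English/UD set; a Chinese version would swap in whatever labels the Chinese corpus uses:

```python
# Sketch of a rule-based noun-chunks iterator (modelled loosely on
# spaCy's English one): every noun in an NP-forming dependency yields
# a span from its syntactic left edge up to and including itself.
from collections import namedtuple

Tok = namedtuple('Tok', ['i', 'text', 'pos', 'dep', 'left_edge'])

NP_DEPS = {'nsubj', 'dobj', 'pobj', 'attr', 'ROOT'}  # abbreviated

def noun_chunks(tokens):
    """Yield (start, end) index spans for noun phrases."""
    seen_end = -1
    for tok in tokens:
        if tok.pos not in ('NOUN', 'PROPN', 'PRON'):
            continue
        if tok.dep not in NP_DEPS:
            continue
        if tok.left_edge <= seen_end:  # overlaps a chunk already emitted
            continue
        seen_end = tok.i
        yield tok.left_edge, tok.i + 1

# "The quick fox saw a dog"
toks = [
    Tok(0, 'The',   'DET',  'det',   0),
    Tok(1, 'quick', 'ADJ',  'amod',  1),
    Tok(2, 'fox',   'NOUN', 'nsubj', 0),
    Tok(3, 'saw',   'VERB', 'ROOT',  3),
    Tok(4, 'a',     'DET',  'det',   4),
    Tok(5, 'dog',   'NOUN', 'dobj',  4),
]
print(list(noun_chunks(toks)))  # [(0, 3), (4, 6)]
```

The real implementation works on parsed `Doc` objects and also handles conjuncts; this mock only shows the shape of the rules you would be writing.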


farhaanbukhsh commented Jun 27, 2018

I am getting an empty list for all the inputs I give. Can you guide me on how to write one? I am not a native Chinese speaker.


Member

ines commented Jun 27, 2018

Maybe the Chinese corpus you trained on uses a different dependency label scheme? You might have to adjust the labels that are used in that function, and change them to the equivalent used in the corpus. And as I said before, I'm not sure the English function generalises well for Chinese.
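One way to handle such a label-scheme mismatch is to remap the corpus labels to the ones the English function expects before comparing. The mapping below is only illustrative, not exhaustive (for instance, UD v2 renamed `dobj` to `obj` and `nsubjpass` to `nsubj:pass`), and `normalise_dep` is a hypothetical helper:

```python
# Sketch: remap UD v2 dependency labels to the English-style labels
# used by spaCy's English noun_chunks rules. Extend as needed for
# whatever scheme your corpus actually uses.
UD_TO_EN = {
    'obj': 'dobj',            # UD v2 renamed dobj -> obj
    'nsubj:pass': 'nsubjpass',
    'obl': 'pobj',            # rough equivalent, not always correct
}

def normalise_dep(dep):
    """Return the English-style label for a UD v2 label, else unchanged."""
    return UD_TO_EN.get(dep, dep)

print(normalise_dep('obj'))    # dobj
print(normalise_dep('nsubj'))  # nsubj
```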


farhaanbukhsh commented Jun 28, 2018

Hey, I am using a Universal Dependencies Chinese corpus. Will the label scheme still be different?


farhaanbukhsh commented Jun 28, 2018

Also, if not that, which language comes closest to it? 😄


farhaanbukhsh commented Jun 29, 2018

Hey @ines, any take on this? Sorry to disturb you 😄


farhaanbukhsh commented Jul 8, 2018

Hey, I would love to work on this feature.


Contributor

howl-anderson commented Jul 9, 2018

Since the maintainers are very busy and I am also working on Chinese features, I'd like to help you with this issue. I will work on it later and try to give you a solution. @farhaanbukhsh


farhaanbukhsh commented Jul 10, 2018

@howl-anderson, can we collaborate on it? It would be really awesome if we could.


Contributor

howl-anderson commented Jul 16, 2018

@ines ines added the enhancement label Jul 30, 2018

@ines ines referenced this issue Jul 30, 2018

Closed

Chinese support #2586

@ines ines added the models label Jul 30, 2018
