
NER & Parsing not working for new language #82

Closed
bablf opened this issue Jan 17, 2022 · 2 comments

bablf commented Jan 17, 2022

I am currently trying to import stanza's NER and dependency parsing for Arabic into spaCy.
As mentioned in a different issue, there seems to be a problem with "mwt" and importing the named entities into the spaCy object. The Arabic pipeline is no different, and I have to deal with the same problem.

To deal with this I thought of the following workaround:

  • CoreNLP supports tokenization and sentence splitting for Arabic.
  • Call the CoreNLP server and generate the words and sent_starts from the returned object (see the sketch after the next code block).
  • Then we create a Doc:
import spacy_stanza
from spacy.tokens import Doc

nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True, tokenize_pretokenized=True)
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
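
For reference, a minimal sketch of the CoreNLP step (the tokenize/ssplit annotator names are standard CoreNLP; the server setup and the "arabic" properties shortcut are illustrative and assume a local CoreNLP install with the Arabic models):

from stanza.server import CoreNLPClient

text = 'انا جائع. أنا ذاهب إلى المنزل'
words, sent_starts = [], []
with CoreNLPClient(annotators=["tokenize", "ssplit"], properties="arabic") as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for i, token in enumerate(sentence.token):
            words.append(token.word)
            sent_starts.append(i == 0)  # True only for the first token of each sentence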

So far everything works fine.

But once I call the nlp pipeline, the returned object has no entities, and has_annotation is also False.
There are no error messages, so I don't know what I am doing wrong.
It seems like the stanza pipeline is not even called, and it isn't because of tokenize_pretokenized either.

Are the error messages just missing, and is this the same problem as #32?

Minimal working example (without entities). Translation is: "I am hungry. I am going home."

import stanza
import spacy_stanza
from spacy.tokens import Doc

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True, tokenize_pretokenized=True)
words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
print(nlp(doc).has_annotation("DEP"))  # False
bablf commented Jan 18, 2022

tokenize_pretokenized and calling the pipeline with a Doc turned out to be the issue; see the quoted issue for the solution. At first only sentence splitting did not work, but it works too if you follow this input format (tokens separated by spaces, sentences by newlines) and set the tokenize_pretokenized option:

'This is token.ization done my way!\nSentence split, too!'
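
A minimal sketch of that format in practice (the English pipeline and the tokenize-only processor list are illustrative; the models must be downloaded first with stanza.download):

import stanza
import spacy_stanza

stanza.download("en")
nlp = spacy_stanza.load_pipeline("en", processors="tokenize",
                                 tokenize_pretokenized=True)
# Spaces separate tokens, the newline separates sentences.
doc = nlp('This is token.ization done my way!\nSentence split, too!')
print([t.text for t in doc])          # tokens exactly as given
print([s.text for s in doc.sents])    # two sentences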

bablf closed this as completed on Jan 18, 2022

adrianeboyd (Contributor) commented
In case someone comes across this in the future:

The issue is that the whole stanza pipeline is integrated as the tokenizer in the spacy pipeline (which is a bit unexpected) and you're not running the tokenizer when you call:

doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)

Starting with a text, doc = nlp(text) does this:

doc = nlp.make_doc(text)
doc = nlp(doc)

With tokenize_pretokenized=True (which splits tokens on whitespace instead of running tokenize and mwt) and tokens from another source, you would want this to run the stanza pipeline on the tokens:

doc = nlp(" ".join(words))
