
NER & Parsing not working for new language #82

Closed
bablf opened this issue Jan 17, 2022 · 2 comments

bablf commented Jan 17, 2022

I am currently trying to import stanza's NER and dependency parsing for Arabic into spaCy.
As mentioned in a different issue, there seems to be a problem with "mwt" and importing the named entities into the spaCy object. The Arabic pipeline is no different, and I have to deal with the same problem.

To deal with this I thought of the following workaround:

  • CoreNLP supports tokenization and sentence splitting for Arabic.
  • Call the CoreNLP server and generate the words and sent_starts from the returned object (see the sketch after the next code block).
  • Then we create a Doc:
import spacy_stanza
from spacy.tokens import Doc

nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True, tokenize_pretokenized=True)
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
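
For reference, a minimal sketch of the CoreNLP step (the tokenize/ssplit annotator names are standard CoreNLP; the server setup and the "arabic" properties shortcut are illustrative and assume a local CoreNLP install with the Arabic models):

from stanza.server import CoreNLPClient

text = 'انا جائع. أنا ذاهب إلى المنزل'
words, sent_starts = [], []
with CoreNLPClient(annotators=["tokenize", "ssplit"], properties="arabic") as client:
    ann = client.annotate(text)
    for sentence in ann.sentence:
        for i, token in enumerate(sentence.token):
            words.append(token.word)
            sent_starts.append(i == 0)  # True only for the first token of each sentence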

So far everything works fine.

But once I call the nlp pipeline, the returned object has no entities, and has_annotation is also False.
There are no error messages, so I don't know what I am doing wrong.
It seems like the stanza pipeline is not even called, and it isn't because of tokenize_pretokenized either.

Are the error messages just missing, and is this the same problem as #32?

Minimal working example (without entities). Translation is: "I am hungry. I am going home."

import stanza
import spacy_stanza
from spacy.tokens import Doc

stanza.download("ar")
nlp = spacy_stanza.load_pipeline("xx", lang="ar", processors="tokenize,pos,lemma,depparse,ner",
                                 use_gpu=True, tokenize_pretokenized=True)
words = ['انا', 'جائع', '.', 'أنا', 'ذاهب', 'إلى', 'المنزل']
sent_starts = [True, False, False, True, False, False, False]
doc = Doc(nlp.vocab, words=words, sent_starts=sent_starts)
print(nlp(doc).has_annotation("DEP"))  # False
bablf commented Jan 18, 2022

tokenize_pretokenized and calling the pipeline with a Doc turned out to be the issue; see the quoted issue for the solution. At first only sentence splitting did not work, but it works too if you follow this input format (tokens separated by spaces, sentences by newlines) and set the tokenize_pretokenized option:

'This is token.ization done my way!\nSentence split, too!'
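
A minimal sketch of that format in practice (the English pipeline and the tokenize-only processor list are illustrative; the models must be downloaded first with stanza.download):

import stanza
import spacy_stanza

stanza.download("en")
nlp = spacy_stanza.load_pipeline("en", processors="tokenize",
                                 tokenize_pretokenized=True)
# Spaces separate tokens, the newline separates sentences.
doc = nlp('This is token.ization done my way!\nSentence split, too!')
print([t.text for t in doc])          # tokens exactly as given
print([s.text for s in doc.sents])    # two sentences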

bablf closed this as completed on Jan 18, 2022

adrianeboyd (Contributor) commented
In case someone comes across this in the future:

The issue is that the whole stanza pipeline is integrated as the tokenizer in the spacy pipeline (which is a bit unexpected) and you're not running the tokenizer when you call:

doc = Doc(nlp.vocab, words=words)
doc = nlp(doc)

Starting with a text, doc = nlp(text) does this:

doc = nlp.make_doc(text)
doc = nlp(doc)

With tokenize_pretokenized=True (which splits tokens on whitespace instead of running tokenize and mwt) and tokens from another source, you would want this to run the stanza pipeline on the tokens:

doc = nlp(" ".join(words))
