Offset misalignment in NER StanzaLanguage Tokenizer #33

aishwarya-agrawal · 2020-04-29T15:19:07Z

text = """ Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"""
doc = snlp(text)
print([(e.text, e.label_, text[e.start_char:e.end_char]) for e in doc.ents])

Gives the output:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Ents: [('Caffeine use', 'disease', 61, 73)]
doc = snlp(text)
[]

Printing the two texts, snlp_doc.text and doc.text, gives the following:

snlp_doc.text = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"
doc.text =   "  Tobacco / Smoke Exposure Family members smoke indoors , Daily . Caffeine use Coffee ,"
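
The effect of the extra spaces on the entity offsets can be reproduced without Stanza. This is a minimal sketch, assuming the wrapper's text is roughly the tokens joined by single spaces. The token list is copied by hand from the doc.text above, and the names `original`, `tokens`, and `rebuilt` are illustrative only.

```python
# Original input from the report, including the leading space.
original = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"

# Tokens as they appear in the doc.text shown above (copied by hand,
# not produced by running Stanza here).
tokens = ["Tobacco", "/", "Smoke", "Exposure", "Family", "members", "smoke",
          "indoors", ",", "Daily", ".", "Caffeine", "use", "Coffee", ","]

# Assumption: the wrapper's text is approximately the tokens joined by spaces.
rebuilt = " ".join(tokens)

start, end = 61, 73  # entity offsets reported in the warning

# The offsets are valid for the original text...
print(repr(original[start:end]))
# ...but select a different substring in the rebuilt text.
print(repr(rebuilt[start:end]))
# Position of the entity in the rebuilt text, for comparison with `start`.
print(rebuilt.find("Caffeine use"))
```

The offsets are computed against one string and applied to another, so they no longer fall on token boundaries.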

Because of this mismatch, the warning above is raised and the identified entities are lost.
This happens even with the basic configuration mentioned in the README.

redadmiral · 2020-05-28T15:50:40Z

I encounter the same problem with the German language model. When the input text contains contractions (synaereses), the model replaces them with the two original words, but apparently doesn't update the input's character offsets.

The input "Hans Müller isst gerne Vanilleeis am Hamburger Dom." returns the entity "dem Hamburger" instead of "Hamburger Dom":

In [21]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom.")                        

In [22]: doc.ents                                                                                
Out[22]: (Hans Müller, dem Hamburger)

Everything is fine, however, as long as the contraction "am" is already split into "an dem" in the input:

In [23]: doc = nlp("Hans Müller isst gerne Vanilleeis an dem Hamburger Dom.")                    

In [24]: doc.ents                                                                                
Out[24]: (Hans Müller, Hamburger Dom)

This seems to be a problem with the spaCy wrapper, since the stanza package itself produces the correct output:

In [3]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom")                          

In [4]: doc.ents                                                                                 
Out[4]: 
[{
   "text": "Hans Müller",
   "type": "PER",
   "start_char": 0,
   "end_char": 11
 },
 {
   "text": "Hamburger Dom",
   "type": "LOC",
   "start_char": 37,
   "end_char": 50
 }]
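
The "dem Hamburger" result is consistent with Stanza's offsets (37, 50) being applied to the expanded text rather than to the original input. This is a minimal sketch in plain Python, assuming the expansion simply replaces "am" with "an dem". The `replace` call stands in for the model's expansion and is not how Stanza or the wrapper does it.

```python
original = "Hans Müller isst gerne Vanilleeis am Hamburger Dom"

# Hand-written stand-in for the expansion of the contraction "am" into "an dem".
expanded = original.replace(" am ", " an dem ")

start, end = 37, 50  # offsets Stanza reports for "Hamburger Dom"

print(original[start:end])  # Hamburger Dom
print(expanded[start:end])  # dem Hamburger

# An expansion before the entity lengthens the text, so the offsets would
# have to be shifted by the length difference to remain valid.
shift = len(expanded) - len(original)
print(expanded[start + shift:end + shift])  # Hamburger Dom
```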

The German model does not seem to have the issues with special characters/punctuation that @aishwarya-agrawal encountered.

In [28]: doc = nlp("Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")             

In [29]: doc.text                                                                                
Out[29]: 'Hans Müller isst gerne Vanilleeis/Himbeereis an dem Hamburger Dom '

aishwarya-agrawal · 2020-05-28T15:57:16Z

@redadmiral Please try this:
doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")

Note the space at the beginning of the sentence.

redadmiral · 2020-05-28T16:01:04Z

Oh, okay – this leads to the same warning you encountered:

In [30]: doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")            
<ipython-input-30-b238c1353442>:1: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Hans', 'Müller', 'isst', 'gerne', 'Vanilleeis/Himbeereis', 'an', 'dem', 'Hamburger', 'Dom']
Entities: [('Hans Müller', 'PER', 1, 12), ('Hamburger Dom', 'LOC', 49, 62)]
  doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
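
One way to keep entity offsets valid is to align the tokenizer's surface tokens back to the original text, so that leading whitespace and unexpanded contractions are accounted for. The following is only a sketch of that idea, not the implementation from the wrapper or from any fix. The helpers `align_tokens` and `char_span_to_token_span` are hypothetical, and the surface token list is written by hand (with "am" kept unexpanded and the final "." included).

```python
def align_tokens(text, tokens):
    """Return the (start, end) character offsets of each token in `text`."""
    offsets = []
    cursor = 0
    for token in tokens:
        # Search from the cursor so repeated tokens map to successive positions.
        start = text.find(token, cursor)
        if start == -1:
            raise ValueError(f"token {token!r} not found after position {cursor}")
        end = start + len(token)
        offsets.append((start, end))
        cursor = end
    return offsets


def char_span_to_token_span(offsets, ent_start, ent_end):
    """Map a character span to (first_token, last_token + 1), or None if misaligned."""
    starts = {start: i for i, (start, _) in enumerate(offsets)}
    ends = {end: i for i, (_, end) in enumerate(offsets)}
    if ent_start in starts and ent_end in ends:
        return starts[ent_start], ends[ent_end] + 1
    return None


text = " Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom."
# Assumed surface tokens: the contraction "am" is kept as it appears in the input.
surface_tokens = ["Hans", "Müller", "isst", "gerne", "Vanilleeis/Himbeereis",
                  "am", "Hamburger", "Dom", "."]
entities = [("Hans Müller", "PER", 1, 12), ("Hamburger Dom", "LOC", 49, 62)]

offsets = align_tokens(text, surface_tokens)
spans = [char_span_to_token_span(offsets, start, end) for _, _, start, end in entities]
for (ent_text, label, _, _), span in zip(entities, spans):
    print(ent_text, label, surface_tokens[span[0]:span[1]])
```

Because the offsets are computed against the unmodified input, the leading space only shifts every token's start position and both entities from the warning map to whole tokens.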

adrianeboyd mentioned this issue Jun 25, 2020

Rewrite alignment to preserve whitespace tokens #41

Merged

ines closed this as completed in #41 Jun 26, 2020
