Offset misalignment in NER StanzaLanguage Tokenizer #33

Closed
aishwarya-agrawal opened this issue Apr 29, 2020 · 3 comments · Fixed by #41

Comments

aishwarya-agrawal commented Apr 29, 2020

text = """ Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"""
doc = snlp(text)
print([(e.text, e.label_, text[e.start_char:e.end_char]) for e in doc.ents])

Gives the output:

UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Ents: [('Caffeine use', 'disease', 61, 73)]
doc = snlp(text)
[]

Printing the two texts, snlp_doc.text and doc.text, gives the following:

snlp_doc.text = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"
doc.text =   "  Tobacco / Smoke Exposure Family members smoke indoors , Daily . Caffeine use Coffee ,"

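Because the two texts differ, the character offsets no longer line up, the warning above is raised, and the identified entities are lost.
This happens even with the basic configuration from the README (screenshot attached in the original issue).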
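
The mismatch can be reproduced without Stanza. The following is a minimal sketch, assuming plain whitespace tokenization (a simplification, not Stanza's tokenizer) and a hypothetical helper `on_token_boundaries`. It checks whether the reported offsets (61, 73) start and end on token edges in each of the two texts printed above.

```python
def on_token_boundaries(text, start, end):
    """Return True if [start, end) begins and ends on whitespace-delimited token edges."""
    starts, ends = set(), set()
    pos = 0
    for tok in text.split():
        begin = text.index(tok, pos)  # locate the token left to right
        starts.add(begin)
        ends.add(begin + len(tok))
        pos = begin + len(tok)
    return start in starts and end in ends


# Input text as passed to the pipeline.
original = " Tobacco/Smoke Exposure Family members smoke indoors, Daily. Caffeine use Coffee,"
# Text as printed from the wrapper's doc (punctuation separated by spaces).
rebuilt = "  Tobacco / Smoke Exposure Family members smoke indoors , Daily . Caffeine use Coffee ,"

ent_start, ent_end = 61, 73  # offsets from the warning

print(repr(original[ent_start:ent_end]))  # 'Caffeine use'
print(on_token_boundaries(original, ent_start, ent_end))
print(repr(rebuilt[ent_start:ent_end]))
print(on_token_boundaries(rebuilt, ent_start, ent_end))
```

The offsets are valid for the original input but fall mid-token in the text with extra spaces, which is consistent with the warning.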

redadmiral commented May 28, 2020

I encounter the same problem with the German language model. When the input text contains contractions (synaereses) such as "am", the model expands them into their two constituent words ("an dem"), but apparently doesn't update the character offsets accordingly.

The input "Hans Müller isst gerne Vanilleeis am Hamburger Dom." returns the entity "dem Hamburger" instead of "Hamburger Dom":

In [21]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom.")                        

In [22]: doc.ents                                                                                
Out[22]: (Hans Müller, dem Hamburger)

Everything works as long as the contraction "am" is already written out as "an dem" in the input:

In [23]: doc = nlp("Hans Müller isst gerne Vanilleeis an dem Hamburger Dom.")                    

In [24]: doc.ents                                                                                
Out[24]: (Hans Müller, Hamburger Dom)

This seems to be a problem with the spaCy wrapper, since the Stanza package itself returns the correct output:

In [3]: doc = nlp("Hans Müller isst gerne Vanilleeis am Hamburger Dom")                          

In [4]: doc.ents                                                                                 
Out[4]: 
[{
   "text": "Hans Müller",
   "type": "PER",
   "start_char": 0,
   "end_char": 11
 },
 {
   "text": "Hamburger Dom",
   "type": "LOC",
   "start_char": 37,
   "end_char": 50
 }]
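
These offsets fit the explanation above. As a small check, assuming the wrapper's text differs from the input only by the expansion of "am" to "an dem", slicing the expanded text with Stanza's offsets (37, 50) gives exactly the wrong entity seen earlier:

```python
# Input as passed to Stanza.
original = "Hans Müller isst gerne Vanilleeis am Hamburger Dom"
# Assumption: the wrapper rebuilds the text with the contraction expanded.
expanded = original.replace(" am ", " an dem ")

start, end = 37, 50  # start_char / end_char reported by Stanza for the LOC entity

print(original[start:end])  # Hamburger Dom
print(expanded[start:end])  # dem Hamburger
```

The expansion adds four characters before the entity, so offsets computed on the input point four characters too early in the expanded text.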

The German model does not seem to have the issue with special characters and punctuation that @aishwarya-agrawal encountered.

In [28]: doc = nlp("Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")             

In [29]: doc.text                                                                                
Out[29]: 'Hans Müller isst gerne Vanilleeis/Himbeereis an dem Hamburger Dom '

@aishwarya-agrawal
Author

@redadmiral Please try this
doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")

Note the leading space at the beginning of the sentence.

@redadmiral

Oh, okay – this leads to the same warning you encountered:

In [30]: doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")            
<ipython-input-30-b238c1353442>:1: UserWarning: Can't set named entities because the character offsets don't map to valid tokens produced by the Stanza tokenizer:
Words: ['Hans', 'Müller', 'isst', 'gerne', 'Vanilleeis/Himbeereis', 'an', 'dem', 'Hamburger', 'Dom']
Entities: [('Hans Müller', 'PER', 1, 12), ('Hamburger Dom', 'LOC', 49, 62)]
  doc = nlp(" Hans Müller isst gerne Vanilleeis/Himbeereis am Hamburger Dom.")
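
One way such offsets could be mapped back to words is to align on token boundaries of the original input rather than on the rebuilt text. The sketch below is only an illustration and not necessarily how the fix in #41 works. The token records and the helper `entity_to_word_range` are hypothetical: offsets are relative to the original input (leading space included), and the contraction "am" is modeled as one token that carries two words.

```python
# Hypothetical token records: (start_char, end_char, words).
# Offsets refer to the original input, including the leading space.
tokens = [
    (1, 5, ["Hans"]),
    (6, 12, ["Müller"]),
    (13, 17, ["isst"]),
    (18, 23, ["gerne"]),
    (24, 45, ["Vanilleeis/Himbeereis"]),
    (46, 48, ["an", "dem"]),  # the single token "am" expands to two words
    (49, 58, ["Hamburger"]),
    (59, 62, ["Dom"]),
]


def entity_to_word_range(tokens, ent_start, ent_end):
    """Map entity character offsets to a (first, last_exclusive) word index range.

    Returns None if the offsets do not fall on token boundaries.
    """
    first = last = None
    word_index = 0
    for start, end, words in tokens:
        if start == ent_start:
            first = word_index
        word_index += len(words)
        if end == ent_end:
            last = word_index
    if first is None or last is None:
        return None
    return first, last


words = [w for _, _, ws in tokens for w in ws]
for ent_start, ent_end in [(1, 12), (49, 62)]:  # offsets from the warning
    word_range = entity_to_word_range(tokens, ent_start, ent_end)
    print(word_range, words[word_range[0]:word_range[1]])
```

Counting words per token keeps the entity aligned even though the word list is longer than the token list.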
