Suffix doesn't match for sentence ending in uppercase. #6695

Open
jdupl123 opened this issue Jan 8, 2021 · 3 comments
Labels
feat / tokenizer Feature: Tokenizer lang / en English language data and models

Comments

jdupl123 commented Jan 8, 2021

How to reproduce the behaviour

import spacy
nlp = spacy.load("en_core_web_sm")
list(nlp.tokenizer("about the P&L."))

I get

[about, the, P&L.]

The . should be separated from P&L here.

This behaviour comes from the following suffix rule:

r"(?<=[{au}][{au}])\.".format(au=ALPHA_UPPER),

The requirement for two uppercase letters is likely there for acronyms, but perhaps an ampersand followed by an uppercase letter is acceptable too,

e.g. r"(?<=&[{au}])\.".format(au=ALPHA_UPPER)

Your Environment

  • spaCy version: 2.3.2
  • Platform: Darwin-19.6.0-x86_64-i386-64bit
  • Python version: 3.6.12
@svlandeg svlandeg added feat / tokenizer Feature: Tokenizer lang / en English language data and models labels Jan 8, 2021
@svlandeg
Member

Yes, I see your point. I think you'd want to add the rule

r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER),

to the _suffixes. Unfortunately you can't just put an optional & in the existing rule, because the look-behind can't be variable-width.
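
The difference between the two rules, and the fixed-width restriction, can be checked with Python's `re` module alone. This is only a sketch: `AU` is an ASCII-only stand-in for spaCy's `ALPHA_UPPER` (an assumption made for illustration), and the trailing `$` mimics how a suffix pattern is anchored to the end of the token.

```python
import re

# Assumption: ASCII-only stand-in for spaCy's ALPHA_UPPER character class.
AU = "A-Z"

# Existing rule: a final "." is split off only after two uppercase letters.
existing_rule = re.compile(r"(?<=[{au}][{au}])\.$".format(au=AU))
# Suggested extra rule: uppercase letter, ampersand, uppercase letter.
ampersand_rule = re.compile(r"(?<=[{au}]&[{au}])\.$".format(au=AU))

existing_matches_usa = existing_rule.search("USA.") is not None
existing_matches_pl = existing_rule.search("P&L.") is not None
ampersand_matches_pl = ampersand_rule.search("P&L.") is not None

# An optional "&" makes the look-behind variable-width, which `re` rejects
# at compile time.
try:
    re.compile(r"(?<=[{au}]&?[{au}])\.".format(au=AU))
    variable_width_ok = True
except re.error:
    variable_width_ok = False

print(existing_matches_usa, existing_matches_pl, ampersand_matches_pl, variable_width_ok)
```

Because of that restriction, the ampersand case has to be a separate entry in the suffix list rather than a tweak to the existing one.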

If you're training a custom model, you could modify this behaviour in your own custom tokenizer; see https://spacy.io/usage/linguistic-features#native-tokenizer-additions. You could also replace the tokenizer of a pretrained model with your own custom tokenizer, although that may affect accuracy slightly (probably not much in this case).
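
A minimal sketch of that kind of customisation, following the pattern from the linked usage page: append the extra rule to the default suffixes and recompile the tokenizer's suffix regex. It assumes spaCy is installed and uses `spacy.blank("en")` instead of `en_core_web_sm` so that no model download is needed; the variable names are illustrative only.

```python
import spacy
from spacy.lang.char_classes import ALPHA_UPPER
from spacy.util import compile_suffix_regex

# Assumption: a blank English pipeline is enough to exercise the tokenizer rules.
nlp = spacy.blank("en")

# The extra rule suggested above: split a final "." after <upper>&<upper>.
ampersand_suffix = r"(?<=[{au}]&[{au}])\.".format(au=ALPHA_UPPER)

# Extend the default suffix rules and swap in the recompiled regex.
suffixes = list(nlp.Defaults.suffixes) + [ampersand_suffix]
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search

tokens = [token.text for token in nlp.tokenizer("about the P&L.")]
print(tokens)
```

The change only affects this `nlp` object, so the core punctuation rules and other languages are left untouched.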

We're typically hesitant to change the punctuation rules in the core library, though, because there may be unwanted side effects, especially when changing the lang/punctuation.py file that is used as the base for many other languages. On spaCy's develop branch, we have a specific punctuation file for English, https://github.com/explosion/spaCy/blob/develop/spacy/lang/en/punctuation.py, where we could consider adding this change for English only.

I've been trying to think of "bad" consequences of adding your proposed "ampersand" rule to the English tokenizer and can't immediately think of one. I'm less sure about other languages. Would be interested to hear what my colleagues think - e.g. @adrianeboyd ?

@adrianeboyd
Contributor

I can't think of anything major, but to be on the safe side we should test it with all the internal training corpora. Let me see...

MucAlex commented Jan 27, 2021

I am experiencing a similar behavior with the German word "GmbH".

from spacy.lang.de import German
nlp = German()
[tok for tok in nlp("Herr Bert ist Geschäftsführer der Ernie GmbH.")]

Results in

[Herr, Bert, ist, Geschäftsführer, der, Ernie, GmbH.]

I followed the example above and added a specific rule to _suffixes
