
Informal contractions are not lemmatized properly #9980

@jambran


How to reproduce the behaviour

I'm seeking to parse sentences that have informal contractions like wanna, gonna, gimme.

I'd like wanna to be tokenized as wan, na and lemmatized as want, to.
I'd like gonna to be tokenized as gon, na and lemmatized as go, to.
I'd like gimme to be tokenized as gim, me and lemmatized as give, I.
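
As a quick check (just a sketch), the tokenizer already handles the splitting for at least some of these; gonna is covered by the built-in English tokenizer exceptions, and the others may or may not be, depending on the pipeline version:

import spacy

nlp = spacy.load('en_core_web_sm')
# See how each informal contraction is tokenized out of the box
for word in ("wanna", "gonna", "gimme"):
    print(word, '->', [token.text for token in nlp(word)])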

Out-of-the-box behavior

Out of the box, we have this snippet

import spacy

nlp = spacy.load('en_core_web_sm')
sentence = "we're gonna have a great day"
print(f'For sentence {sentence}, out of the box, we have lemmas')
print([token.lemma_ for token in nlp(sentence)])

outputting

For sentence we're gonna have a great day, out of the box, we have lemmas
['we', 'be', 'gon', 'na', 'have', 'a', 'great', 'day']

No good: gon and na are tokenized properly but not lemmatized properly.

Adding exceptions to the lemmatizer

I've looked into options for customizing the lemmatization process and found this Stack Overflow post about adding rules using the following code snippet:

nlp.get_pipe('lemmatizer').lookups.get_table("lemma_exc")["noun"]["data"] = ["data"]

Adding these exceptions works fine for gonna but not wanna. The POS tags of the two differ, and the rule-based lemmatization system sees wan as the base form of the word, so it skips the lemmatization process entirely.
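
One way to see this directly (a sketch, assuming the pipe is the EnglishLemmatizer, which exposes is_base_form):

doc = nlp("we wanna have a great day")
wan = doc[1]  # the "wan" token
print(wan.text, wan.pos_, wan.morph)
# If this returns True, the rule lemmatizer skips the exception lookup
# and returns the token text unchanged
print(nlp.get_pipe('lemmatizer').is_base_form(wan))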

# Let's add custom exceptions for these words
exceptions = [("gon", "go"),  # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
lemmatizer = nlp.get_pipe('lemmatizer')
for slang, lemma in exceptions:
    lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]
    
print('adding exceptions for informal contractions yields:')
print([token.lemma_ for token in nlp(sentence)])

outputting

adding exceptions for informal contractions yields:
['we', 'be', 'gon', 'na', 'have', 'a', 'great', 'day']

Still, we see gon and na, not the lemmas we expected.
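
A quick sanity check (a sketch; the lookup table behaves like a nested dict) shows whether the write actually landed in the exception table. One possible gotcha, as far as I can tell: the rule lemmatizer keeps an internal cache, so exceptions added after a document has already been processed may not show up when the same text is re-run.

exc_table = nlp.get_pipe('lemmatizer').lookups.get_table("lemma_exc")
# If the assignment took effect, this prints ['go']
print(exc_table['verb'].get('gon'))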

Customizing the lemmatizer

from spacy.lang.en import English
from spacy.lang.en.lemmatizer import EnglishLemmatizer
from spacy.language import Language
from spacy.tokens import Token


@English.factory(
    "custom_english_lemmatizer",
    assigns=["token.lemma"],
    default_config={},
    default_score_weights={"lemma_acc": 1.0},
)
def make_lemmatizer(
    nlp: Language,
    name: str = 'custom_english_lemmatizer',
):
    # The lemmatizer takes the shared vocab, not the nlp object itself
    return CustomEnglishLemmatizer(nlp.vocab, model=None, name=name, mode='rule')


class CustomEnglishLemmatizer(EnglishLemmatizer):
    """
    In `en_core_web_sm`, words like "gonna" are getting lemmatized as "gon", "na".

    This custom lemmatizer allows us more control over the lemmatization process.

    Only overrides is_base_form.
    """

    def is_base_form(self, token: Token) -> bool:
        """
        Check whether we're dealing with an uninflected paradigm, so we can
        avoid lemmatization entirely.

        token (Token): The token to check.
        """
        # Add an additional check for slang words that aren't base forms
        # but would pass the parent class's checks: words like "wanna",
        # "gimme". Their pieces get a NORM that differs from the token text.
        if token.norm_ != token.text:
            return False
        return super().is_base_form(token)
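
For context, the norm check works because the English tokenizer exceptions assign NORM values to the pieces of these contractions, e.g. (a sketch; exact values may differ by version):

doc = nlp("we're gonna have a great day")
print([(token.text, token.norm_) for token in doc])
# something like: [('we', 'we'), ("'re", 'are'), ('gon', 'going'), ('na', 'to'), ...]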

Now running nlp as below outputs the proper lemma for gon

# Let's try customizing the lemmatizer
nlp = spacy.load('en_core_web_sm', exclude='lemmatizer')
custom_lemmatizer = nlp.add_pipe("custom_english_lemmatizer",
                                 name='lemmatizer',
                                 last=True)
custom_lemmatizer.initialize()
exceptions = [("gon", "go"),  # `gonna`
              ("gim", "give"),  # `gimme`
              ("wan", "want"),  # `wanna`
              ]
for slang, lemma in exceptions:
    # this table has entries for verb, noun, adjective, adverb,
    # but not part, which is what we need for -na in gonna, wanna
    custom_lemmatizer.lookups.get_table("lemma_exc")['verb'][slang] = [lemma]

print('with custom lemmatizer, we have: ')
print([token.lemma_ for token in nlp(sentence)])

outputting

with custom lemmatizer, we have: 
['we', 'be', 'go', 'na', 'have', 'a', 'great', 'day']

Woohoo! We have the proper lemmatization for gon!

But what about na?

This gets us almost there, with the correct verb form for gon. So how do we get na to lemmatize to to?
Adding another exception for the POS part raises an error:

Traceback (most recent call last):
  File "/Users/jamiebrandon/Code/api-writing-tool/scripts/bug_report.py", line 92, in <module>
    custom_lemmatizer.lookups.get_table("lemma_exc")['part']['na'] = ['to']
  File "/Users/jamiebrandon/Code/api-writing-tool/venv/lib/python3.8/site-packages/spacy/lookups.py", line 109, in __getitem__
    return OrderedDict.__getitem__(self, key)
KeyError: 4485934323942657167

It seems that there can only be exceptions for NOUN, VERB, ADJECTIVE, and ADVERB.
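
For what it's worth, the KeyError may just mean the part sub-table was never created. A hedged workaround sketch (assuming the exception table behaves like a nested dict, and untested against the lemmatizer's internal POS handling) would be to create it first:

exc_table = custom_lemmatizer.lookups.get_table("lemma_exc")
# Create the missing sub-table before writing to it (assumption: the
# rule lemmatizer will actually consult it for tokens tagged PART)
if 'part' not in exc_table:
    exc_table['part'] = {}
exc_table['part']['na'] = ['to']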

How can I handle informal contractions like wanna, gonna, gimme such that they are lemmatized properly? It seems there is some work on this, as I'm seeing code snippets to handle gonna here. Perhaps I am not using this as intended?

Appreciate any help. Thank you for building an incredible tool!

Your Environment

  • spaCy version: 3.0.1
  • Platform: macOS-11.6.1-x86_64-i386-64bit
  • Python version: 3.8.2
  • Pipelines: en_core_web_sm (3.0.0)
