Training a parser for custom semantics with Doc.retokenize applied to Named Entities example #5921
-
In the docs about Training a parser for custom semantics there is a tip that says: "To achieve even better accuracy, try merging multi-word tokens and entities specific to your domain into one token before parsing your text. You can do this by running the entity recognizer or rule-based matcher to find relevant spans, and merging them using Doc.retokenize. You could even add your own custom pipeline component to do this automatically – just make sure to add it before='parser'." I tried to implement this using a custom pipeline component before the parser, but couldn't make it work. I am probably doing something wrong. I would like to know if someone has already done this, or if an example of this could be included in the docs. Here is what I tried (which is probably wrong; it is basically the docs example with my new component added before the parser, and with "london" (in the training set) and "berlin" (in the dev set) changed to "new york"):
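A custom merging component along these lines might look like the following sketch. To be clear, this is not the poster's actual code (which is not shown here); the component name and the hard-coded "new york" check are illustrative, and the spaCy v3 registration API is assumed:

```python
import spacy
from spacy.language import Language

@Language.component("merge_new_york")
def merge_new_york(doc):
    # Naive sketch: find the first "new york" and merge it into one token.
    # Doc.retokenize applies the merge when the context manager exits.
    with doc.retokenize() as retokenizer:
        for i in range(len(doc) - 1):
            if doc[i].lower_ == "new" and doc[i + 1].lower_ == "york":
                retokenizer.merge(doc[i : i + 2])
                break  # merged spans must not overlap; stop after one match
    return doc

nlp = spacy.blank("en")
# In a full pipeline this would be: nlp.add_pipe("merge_new_york", before="parser")
nlp.add_pipe("merge_new_york")
doc = nlp("i like new york")
print([t.text for t in doc])  # → ['i', 'like', 'new york']
```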
I get the following error (which I suspect occurs because it somehow isn't merging the two tokens into one):
Appreciate any help.

Your Environment
-
You have to consider the difference between training your pipeline and applying it. Adding a custom merging component before the parser is useful when applying the pipeline as a whole, because then the parser will only see the one merged token.

However, during training this custom component is not run on your texts, because you disable all other components when training the parser and feed it the raw texts as input. This means that the training step will still see "new york" as two tokens, and it will want annotations for both, resulting in the out-of-bounds indexing error you saw.

During training, you can remedy this by explicitly telling the parser which are the (merged) words in the texts:
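A sketch of what that could look like with the spaCy v3 training API (the sentence and the head/dep annotations here are made up for illustration, not taken from the original thread): the gold-standard tokenization, with "new york" already merged, is passed via the "words" key when building the Example, and spaCy aligns it to the raw tokenization:

```python
import spacy
from spacy.training import Example

nlp = spacy.blank("en")

# The predicted side uses the raw tokenization ("new" and "york" separate)...
doc = nlp.make_doc("i like new york")

# ...while the reference side declares the merged tokenization via "words",
# so heads/deps are annotated for 3 tokens instead of 4.
example = Example.from_dict(
    doc,
    {
        "words": ["i", "like", "new york"],
        "heads": [1, 1, 1],
        "deps": ["nsubj", "ROOT", "dobj"],
    },
)
print([t.text for t in example.reference])  # → ['i', 'like', 'new york']
```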
-
Thank you @svlandeg! I'll try that!