
Creating and loading custom pipeline components example #2947

Closed
nyejon opened this issue Nov 19, 2018 · 6 comments
Labels: docs (Documentation and website), feat / pipeline (Feature: Processing pipeline and components), usage (General spaCy usage)

Comments


nyejon commented Nov 19, 2018

Hi, I have been struggling with loading a custom pipeline component after saving the model. I found this solution here: #2682 (comment)

Would it be a good idea to update the docs to suggest users do something like this, to avoid requiring the nlp object to initialize the pipe when loading from disk? (Kind of a chicken-and-egg situation.) Perhaps I am overlooking something and there is already a better solution.

Therefore this (with the imports the snippet needs):

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc
could become this, deferring initialization to the first call:

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        self.nlp = nlp
        self.terms = terms
        self.label = label
        self.initialized = False
        self.matcher = PhraseMatcher(self.nlp.vocab)

    def __call__(self, doc):
        # Initialize the pipeline component on the first call
        if not self.initialized:
            self.initialized = True
            patterns = [self.nlp(text) for text in self.terms]
            self.matcher.add(self.label, None, *patterns)

        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc

The source can be found here:
https://spacy.io/usage/processing-pipelines#custom-components

Edit: self.initialized = True needs to be set directly after the if statement, before self.nlp is called.
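The "initialize on first use" pattern above can be sketched without any spaCy dependency; the class and names below are illustrative stand-ins, not spaCy's API:

```python
class LazyInitComponent:
    """Defers expensive setup until the component is first called."""

    def __init__(self, build):
        self.build = build            # callable that does the heavy setup
        self.initialized = False
        self.patterns = None

    def __call__(self, doc):
        if not self.initialized:
            # Flip the flag *before* running setup, so that if setup
            # itself triggers this component again, we don't recurse.
            self.initialized = True
            self.patterns = self.build()
        return doc, self.patterns


component = LazyInitComponent(lambda: ['pattern-a', 'pattern-b'])
doc, patterns = component('some doc')
```

Because the heavy setup only runs inside `__call__`, the component can be constructed (and a saved model loaded) before the rest of the pipeline is ready.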

@ines ines added usage General spaCy usage docs Documentation and website feat / pipeline Feature: Processing pipeline and components labels Nov 19, 2018

ines commented Nov 19, 2018

Yes, that's definitely an option!

Alternatively, you could also add an entry to Language.factories, a writable dictionary where spaCy looks up how to initialize pipeline components. For example:

from spacy.language import Language

Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp)

The factory receives the shared nlp object, as well as the config keyword arguments that are passed to spacy.load. So you could even do something like this to make the entity_matcher_label available as a keyword argument in your EntityMatcher factory:

nlp = spacy.load('some_custom_model', entity_matcher_label='SOME_LABEL')
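The lookup described here can be mimicked in plain Python to show the flow of the config keyword arguments; factories and EntityMatcherStub below are illustrative stand-ins, not spaCy internals:

```python
# A writable dict mapping component names to factory callables that
# receive the shared nlp object plus config keyword arguments.
factories = {}


class EntityMatcherStub:
    def __init__(self, nlp, label='DEFAULT'):
        self.nlp = nlp
        self.label = label


factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcherStub(
    nlp, label=cfg.get('entity_matcher_label', 'DEFAULT'))

# Roughly what happens for a call like
# spacy.load('some_custom_model', entity_matcher_label='SOME_LABEL'):
matcher = factories['entity_matcher'](None, entity_matcher_label='SOME_LABEL')
```

The point is that the factory, not the component's constructor, decides how the config keywords map onto the component's arguments.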

For a more detailed end-to-end example of how to package custom models or components in a model, you might find my comment on this thread helpful. We're also hoping that we can make this process easier in the upcoming version(s).

Btw, another related thing to check out: if you're working with rule-based NER (or a combination of rule-based and statistical NER), you can also try my new EntityRuler component in the current v2.1.x nightly version. See #2513 for details and documentation. The nightly build is available as spacy-nightly, or on the develop branch if you want to build the latest state from source.


nyejon commented Nov 20, 2018

Hi Ines,

Thanks, I had already added the entity_matcher to Language.factories, and that is exactly where it required the nlp object: just to initialise the entity matcher, the terms have to be run through nlp.

The problem is this: patterns = [nlp(text) for text in terms]

In #2682 you stated that the weights are not loaded yet, and that they are only loaded after everything is initialised. If I just added the entity_matcher to Language.factories, I would get a "bool has no tok2vec" error or something similar. This is not a problem when using nlp.add_pipe(), as the nlp object already has the weights loaded.

I will definitely look into 2.1 and the EntityRuler - thanks for the tip!


ines commented Nov 20, 2018

Ahh, sorry, I think I missed that. (That "bool has no xxx" error will also be fixed in 2.1, btw.)

I wonder what happens if you just use nlp.make_doc(text) to create the patterns? In fact, this should probably be fixed in the docs as well, since the PhraseMatcher really only needs a tokenized Doc object, not a fully processed one. Tagging and parsing it is unnecessary and just takes longer.
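The difference can be illustrated with a toy stub (not spaCy's actual implementation): make_doc runs only the tokenizer, whereas calling the pipeline would also run the tagger and parser, which the PhraseMatcher never looks at:

```python
class FakePipeline:
    """Toy stand-in for a spaCy Language object."""

    def make_doc(self, text):
        # Tokenization only -- cheap.
        return text.split()

    def __call__(self, text):
        doc = self.make_doc(text)
        # In a real pipeline, tagging and parsing would run here,
        # doing work the PhraseMatcher never uses.
        return doc


nlp = FakePipeline()
patterns = [nlp.make_doc(t) for t in ['machine learning', 'neural network']]
```

Since the matcher only compares token sequences, the tokenizer-only path produces exactly the patterns it needs.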


nyejon commented Nov 21, 2018

Thanks, nlp.make_doc(text) seems to do the trick!

In general, how stable is the nightly? Would it be OK for development purposes until 2.1 is officially released?

Edit: Ok, I can confirm that 2.1 solves many issues... Will continue developing with it and am looking forward to the stable release!

I have just followed the notes https://github.com/explosion/spaCy/releases/tag/v2.1.0a1 and I assume those are the latest changes published?


ines commented Nov 24, 2018

Yes, the nightly should be good in development – in fact, it's always super helpful to have more people test it in "real life" so we can make sure there are no bugs or regressions.

The latest alpha version is v2.1.0a2 (see diff) and mostly includes internals, e.g. allowing pre-training (#2931). I'm not 100% sure if the alpha models currently work with it or if we have to train new ones, so v2.1.0a1 should be safest for now.

@ines ines mentioned this issue Feb 17, 2019
ines added a commit that referenced this issue Feb 17, 2019

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
@ines ines closed this as completed Feb 17, 2019

lock bot commented Mar 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 19, 2019