
Creating and loading custom pipeline components example #2947

Closed
nyejon opened this issue Nov 19, 2018 · 6 comments
Labels: docs (Documentation and website), feat / pipeline (Feature: Processing pipeline and components), usage (General spaCy usage)

Comments


nyejon commented Nov 19, 2018

Hi, I have been struggling with loading a custom pipeline component after saving the model. I found this solution here: #2682 (comment)

Would it be a good idea to update the docs to suggest users do something like this, to avoid requiring the nlp object to initialize the pipe when loading from disk? (Kind of a chicken-and-egg situation.) Perhaps I am overlooking something and there is already a better solution.

Therefore this (with the imports the snippet needs):

from spacy.matcher import PhraseMatcher
from spacy.tokens import Span


class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        patterns = [nlp(text) for text in terms]
        self.matcher = PhraseMatcher(nlp.vocab)
        self.matcher.add(label, None, *patterns)

    def __call__(self, doc):
        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc
could become this, deferring initialization to the first call:

class EntityMatcher(object):
    name = 'entity_matcher'

    def __init__(self, nlp, terms, label):
        self.nlp = nlp
        self.terms = terms
        self.label = label
        self.initialized = False
        self.matcher = PhraseMatcher(self.nlp.vocab)

    def __call__(self, doc):
        # Initialize the pipeline component on the first call
        if not self.initialized:
            self.initialized = True
            patterns = [self.nlp(text) for text in self.terms]
            self.matcher.add(self.label, None, *patterns)

        matches = self.matcher(doc)
        for match_id, start, end in matches:
            span = Span(doc, start, end, label=match_id)
            doc.ents = list(doc.ents) + [span]
        return doc

The source can be found here:
https://spacy.io/usage/processing-pipelines#custom-components

Edit: self.initialized = True needs to be set directly after the if statement, before self.nlp is called.
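The "initialize on first use" pattern above can be sketched without any spaCy dependency; the class and names below are illustrative stand-ins, not spaCy's API:

```python
class LazyInitComponent:
    """Defers expensive setup until the component is first called."""

    def __init__(self, build):
        self.build = build            # callable that does the heavy setup
        self.initialized = False
        self.patterns = None

    def __call__(self, doc):
        if not self.initialized:
            # Flip the flag *before* running setup, so that if setup
            # itself triggers this component again, we don't recurse.
            self.initialized = True
            self.patterns = self.build()
        return doc, self.patterns


component = LazyInitComponent(lambda: ['pattern-a', 'pattern-b'])
doc, patterns = component('some doc')
```

Because the heavy setup only runs inside `__call__`, the component can be constructed (and a saved model loaded) before the rest of the pipeline is ready.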

@ines ines added usage General spaCy usage docs Documentation and website feat / pipeline Feature: Processing pipeline and components labels Nov 19, 2018

ines commented Nov 19, 2018

Yes, that's definitely an option!

Alternatively, you could also add an entry to Language.factories, a writable dictionary where spaCy looks up how to initialize pipeline components. For example:

from spacy.language import Language

Language.factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcher(nlp)

The factory receives the shared nlp object, as well as the config keyword arguments that are passed to spacy.load. So you could even do something like this to make the entity_matcher_label available as a keyword argument in your EntityMatcher factory:

nlp = spacy.load('some_custom_model', entity_matcher_label='SOME_LABEL')
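The lookup described here can be mimicked in plain Python to show the flow of the config keyword arguments; factories and EntityMatcherStub below are illustrative stand-ins, not spaCy internals:

```python
# A writable dict mapping component names to factory callables that
# receive the shared nlp object plus config keyword arguments.
factories = {}


class EntityMatcherStub:
    def __init__(self, nlp, label='DEFAULT'):
        self.nlp = nlp
        self.label = label


factories['entity_matcher'] = lambda nlp, **cfg: EntityMatcherStub(
    nlp, label=cfg.get('entity_matcher_label', 'DEFAULT'))

# Roughly what happens for a call like
# spacy.load('some_custom_model', entity_matcher_label='SOME_LABEL'):
matcher = factories['entity_matcher'](None, entity_matcher_label='SOME_LABEL')
```

The point is that the factory, not the component's constructor, decides how the config keywords map onto the component's arguments.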

For a more detailed end-to-end example of how to package custom models or components in a model, you might find my comment on this thread helpful. We're also hoping that we can make this process easier in the upcoming version(s).

Btw, another related thing to check out: if you're working with rule-based NER (or a combination of rule-based and statistical NER), you can also try my new EntityRuler component in the current v2.1.x nightly version. See #2513 for details and documentation. The nightly build is available as spacy-nightly, or on the develop branch if you want to build the latest state from source.


nyejon commented Nov 20, 2018

Hi Ines,

Thanks, I had already added the entity_matcher to Language.factories, and that is exactly where it required the nlp object: just to initialise the entity matcher, the terms have to be run through nlp.

The problem is this: patterns = [nlp(text) for text in terms]

In #2682 you stated that the weights are not loaded yet, and that they are only loaded after everything is initialised. If I just added the entity_matcher to Language.factories, I would get a "bool has no tok2vec" error or something similar. This is not a problem when using nlp.add_pipe(), as the nlp object already has the weights loaded.

I will definitely look into 2.1 and the EntityRuler - thanks for the tip!


ines commented Nov 20, 2018

Ahh, sorry, I think I missed that. (That "bool has no xxx" error will also be fixed in 2.1, btw.)

I wonder what happens if you just use nlp.make_doc(text) to create the patterns? In fact, this should probably be fixed in the docs as well, since the PhraseMatcher really only needs a tokenized Doc object, not a fully processed one. Tagging and parsing it is unnecessary and just takes longer.
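The difference can be illustrated with a toy stub (not spaCy's actual implementation): make_doc runs only the tokenizer, whereas calling the pipeline would also run the tagger and parser, which the PhraseMatcher never looks at:

```python
class FakePipeline:
    """Toy stand-in for a spaCy Language object."""

    def make_doc(self, text):
        # Tokenization only -- cheap.
        return text.split()

    def __call__(self, text):
        doc = self.make_doc(text)
        # In a real pipeline, tagging and parsing would run here,
        # doing work the PhraseMatcher never uses.
        return doc


nlp = FakePipeline()
patterns = [nlp.make_doc(t) for t in ['machine learning', 'neural network']]
```

Since the matcher only compares token sequences, the tokenizer-only path produces exactly the patterns it needs.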


nyejon commented Nov 21, 2018

Thanks, nlp.make_doc(text) seems to do the trick!

In general, how stable is the nightly? Would it be OK for development purposes until 2.1 is officially released?

Edit: Ok, I can confirm that 2.1 solves many issues... Will continue developing with it and am looking forward to the stable release!

I have just followed the notes https://github.com/explosion/spaCy/releases/tag/v2.1.0a1 and I assume those are the latest changes published?


ines commented Nov 24, 2018

Yes, the nightly should be good in development – in fact, it's always super helpful to have more people test it in "real life" so we can make sure there are no bugs or regressions.

The latest alpha version is v2.1.0a2 (see diff) and mostly includes internals, e.g. allowing pre-training (#2931). I'm not 100% sure if the alpha models currently work with it or if we have to train new ones, so v2.1.0a1 should be safest for now.

@ines ines mentioned this issue Feb 17, 2019
ines added a commit that referenced this issue Feb 17, 2019

## Description

The new website is implemented using [Gatsby](https://www.gatsbyjs.org) with [Remark](https://github.com/remarkjs/remark) and [MDX](https://mdxjs.com/). This allows authoring content in **straightforward Markdown** without the usual limitations. Standard elements can be overwritten with powerful [React](http://reactjs.org/) components and wherever Markdown syntax isn't enough, JSX components can be used. Hopefully, this update will also make it much easier to contribute to the docs. Once this PR is merged, I'll implement auto-deployment via [Netlify](https://netlify.com) on a specific branch (to avoid building the website on every PR). There's a bunch of other cool stuff that the new setup will allow us to do – including writing front-end tests, service workers, offline support, implementing a search and so on.

This PR also includes various new docs pages and content.
Resolves #3270. Resolves #3222. Resolves #2947. Resolves #2837.


### Types of change
enhancement

## Checklist
- [x] I have submitted the spaCy Contributor Agreement.
- [x] I ran the tests, and all new and existing tests passed.
- [x] My changes don't require a change to the documentation, or if they do, I've added all required information.
@ines ines closed this as completed Feb 17, 2019

lock bot commented Mar 19, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Mar 19, 2019