
Slow loading when the nlp object used is a StanfordNLPLanguage instance #6

Open
buhrmann opened this issue Apr 24, 2019 · 2 comments

@buhrmann
Contributor

This is probably a rare case, occurring only when a spacymoji step is added to the pipeline of a StanfordNLPLanguage instance. What happens is that the spacymoji constructor uses the StanfordNLPLanguage's tokenizer to convert each emoji into a Doc for the PhraseMatcher. For standard spaCy Language instances this is close to optimal, since the tokenizer does only the minimal work needed here. But the StanfordNLPLanguage's tokenizer executes the whole StanfordNLP pipeline at once, making this step very slow (>2 min vs. <1 s in the normal case on my laptop, just to create a spacymoji instance).
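
For context, a simplified sketch of what the spacymoji constructor does, as I understand it (names and the exact emoji source are approximations of the real code):

```python
from emoji import UNICODE_EMOJI  # emoji lexicon used by spacymoji
from spacy.matcher import PhraseMatcher

class Emoji(object):
    def __init__(self, nlp, pattern_id="EMOJI"):
        self.matcher = PhraseMatcher(nlp.vocab)
        # Each emoji string is tokenized into a pattern Doc here. With a
        # StanfordNLPLanguage, every call runs the full StanfordNLP
        # pipeline (tagging, parsing, ...), which is what makes
        # construction so slow.
        emoji_patterns = list(nlp.tokenizer.pipe(UNICODE_EMOJI.keys()))
        self.matcher.add(pattern_id, None, *emoji_patterns)
```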

I'm not sure how to fix this elegantly. One option would be to make the code conditional on the class of the nlp object. Another would be to always load the default English tokenizer and use it to process the emoji, instead of using the passed nlp object's tokenizer. Both options are sketched below.
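
A rough sketch of both options (the class-name check and the helper function are illustrative only, not an agreed design; and as the later comments show, the second option runs into a vocabulary problem):

```python
from spacy.lang.en import English

def make_emoji_patterns(nlp, emoji_strings):
    # Option 1: special-case the StanfordNLP wrapper. Checking the class
    # name avoids a hard dependency on spacy-stanfordnlp.
    if type(nlp).__name__ == "StanfordNLPLanguage":
        # Option 2, used as the fallback here: tokenize with a plain
        # English tokenizer instead of the slow wrapped pipeline.
        tokenizer = English().tokenizer
    else:
        tokenizer = nlp.tokenizer
    return list(tokenizer.pipe(emoji_strings))
```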

I'm happy to create a PR if we agree on the best solution.

@buhrmann
Contributor Author

Oops, actually using a StanfordNLPLanguage instance will break right now, since its tokenizer doesn't have a pipe() method. I was testing with my own wrapper, which replaces an nlp's tokenizer with my own preprocessing tokenizer (one that does have a pipe method). I will add a pipe() method to StanfordNLPLanguage's Tokenizer for duck-type compatibility with spaCy tokenizers, along the lines of the sketch below...
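
A minimal pipe() for the wrapped tokenizer, mirroring spaCy's default Tokenizer.pipe behavior (a sketch; the extra arguments are accepted only for signature compatibility):

```python
def pipe(self, texts, batch_size=1000, n_threads=-1):
    """Tokenize a stream of texts, yielding one Doc per text.

    Duck-types spacy.tokenizer.Tokenizer.pipe so that components like
    spacymoji can use this tokenizer as a drop-in replacement.
    """
    for text in texts:
        yield self(text)
```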

@buhrmann
Contributor Author

Hmm, this seems to be more complicated than I thought. I assumed that simply using some default tokenizer would be enough to quickly create the match patterns. But it seems the tokenizer needs to use the same vocabulary that the matcher uses later; otherwise the two get out of sync.

So unless I'm overlooking something, the only solution I can see would be to break up the StanfordNLPLanguage into individual pipeline steps (each wrapping a StanfordNLP step), so that the tokenizer can be called on its own. But that would be a bigger change.
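
For concreteness, the vocabulary constraint looks like this: the PhraseMatcher is created with nlp.vocab, so the pattern Docs must come from that same Vocab. One possible way around the tokenizer, sketched here under the assumption that each emoji ends up as a single token (which may not hold for all of them), would be to construct the pattern Docs by hand:

```python
from spacy.tokens import Doc
from spacy.matcher import PhraseMatcher

matcher = PhraseMatcher(nlp.vocab)
# Build one-token pattern Docs directly against the shared vocab,
# without invoking the (slow) StanfordNLP tokenizer at all.
patterns = [Doc(nlp.vocab, words=[e]) for e in emoji_strings]
matcher.add("EMOJI", None, *patterns)
```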
