Phrase Matcher fails on OOV tokens #4473

Closed
tnmcneil opened this issue Oct 18, 2019 · 3 comments
Labels
feat / matcher Feature: Token, phrase and dependency matcher more-info-needed This issue needs more information

Comments

@tnmcneil

How to reproduce the behaviour

I created a PhraseMatcher to match titles (e.g. queen, manager, mayor) and it fails when applied to a document containing out-of-vocabulary (OOV) tokens.

The error it throws is:
ERROR:root:error: "[E018] Can't retrieve string for hash '4332798303416328849'."

I got around this by creating a "clean doc" from the original doc to feed through the phrase matcher like so:

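import re
clean_doc = doc  # fall back to the original doc when nothing is OOV
if any(t.is_oov for t in doc):
    # Swap each OOV token for the placeholder 'OOV'; add a space if the token had trailing whitespace or starts with whitespace
    clean_toks = [t.text_with_ws if not t.is_oov else 'OOV ' if t.text_with_ws != t.text or re.match(r'\s', t.text) else 'OOV' for t in doc]
    clean_doc = nlp(''.join(clean_toks))
matches = phrase_matcher(clean_doc)
# Map the match offsets back onto the original doc
spans = [doc[start:end] for match_id, start, end in matches]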

(I replaced the OOV tokens with the string 'OOV' because I needed the token indices to match the original doc.)
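
To illustrate why one placeholder per OOV token can keep the indices aligned, here is a minimal pure-Python sketch of the same substitution rule. It does not use spaCy: the `Token` tuple and the `replace_oov` helper are hypothetical stand-ins invented for this illustration, and the example sentence is made up.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    """Hypothetical stand-in for a spaCy token (not the spaCy API)."""
    text: str
    whitespace: str  # trailing whitespace, "" if none
    is_oov: bool


def replace_oov(tokens: List[Token], placeholder: str = "OOV") -> str:
    """Rebuild the text, swapping each OOV token for one placeholder word."""
    parts = []
    for tok in tokens:
        if not tok.is_oov:
            # In-vocabulary tokens are copied through unchanged.
            parts.append(tok.text + tok.whitespace)
        elif tok.whitespace or re.match(r"\s", tok.text):
            # Same rule as the workaround: placeholder plus a single space.
            parts.append(placeholder + " ")
        else:
            parts.append(placeholder)
    return "".join(parts)


tokens = [
    Token("The", " ", False),
    Token("mayor", " ", False),
    Token("of", " ", False),
    Token("Xyzzyville", "", True),  # pretend this token is out of vocabulary
    Token(".", "", False),
]
clean_text = replace_oov(tokens)
```

The alignment only holds if the tokenizer splits the rebuilt text into the same number of tokens as the original, which is an assumption rather than a guarantee.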

I am wondering if there is a better way around this, or a way for the PhraseMatcher to ignore OOV tokens rather than trying to process them.

Info about spaCy

  • Models: en
  • Python version: 3.5.1
  • spaCy version: 2.1.6
  • Platform: Darwin-18.6.0-x86_64-i386-64bit
  • Operating System: Mac OS
@adrianeboyd adrianeboyd added bug Bugs and behaviour differing from documentation feat / matcher Feature: Token, phrase and dependency matcher labels Oct 19, 2019
@adrianeboyd
Contributor

Hmm, I wouldn't have expected OOV to interact with the vocab/StringStore in the PhraseMatcher. Do you have a short example that reproduces the error? Or at least a sketch of where/how the error crops up, since it can sometimes be hard to create short test cases for some of the PhraseMatcher issues?

I would try v2.1.8 to see if there were any bugfixes related to this and I would also strongly recommend trying the PhraseMatcher in v2.2 instead. It was completely rewritten and obviously there's no guarantee that it's bug-free, but at least I'm sure that they are different bugs than in v2.1. :) It can handle large lists of phrases better than the PhraseMatcher in v2.1.

@ines ines added more-info-needed This issue needs more information and removed bug Bugs and behaviour differing from documentation labels Oct 19, 2019
@no-response

no-response bot commented Nov 2, 2019

This issue has been automatically closed because there has been no response to a request for more information from the original author. With only the information that is currently in the issue, there's not enough information to take action. If you're the original author, feel free to reopen the issue if you have or find the answers needed to investigate further.

@no-response no-response bot closed this as completed Nov 2, 2019
@lock

lock bot commented Dec 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked as resolved and limited conversation to collaborators Dec 2, 2019