How to restrict fuzzy search #18

Closed
kaykhancheckpoint opened this issue Aug 17, 2020 · 1 comment

Comments


kaykhancheckpoint commented Aug 17, 2020

Is it possible to restrict the fuzzy search? In my example it is returning unwanted entities.

Text: """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options""" # Spelling errors intentional.

patterns = [
{'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
{'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

('Dr Disrespect', 'PERSON')
('YouTube', 'PERSON')

The unwanted entity here is ('YouTube', 'PERSON'). Is there some way to restrict the fuzzy search so that it does not identify YouTube in the text as a person?

Full Code:

        import spacy
        from spaczz.pipeline import SpaczzRuler

        nlp = spacy.blank("en")
        text = """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options""" # Spelling errors intentional.
        doc = nlp(text)

        patterns = [
            {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
            {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
        ]

        ruler = SpaczzRuler(nlp)
        ruler.add_patterns(patterns)
        doc = ruler(doc)

        data = [{
            "label": ent.label_,
            "name": ent.text,
        } for ent in doc.ents]

        for ent in doc.ents:
            print((ent.text, ent.label_))

EDIT:

I noticed that the rapidfuzz library provides score_cutoff as a parameter. I'm looking to set this to 95 so matching is strict, and I was hoping something like this could be exposed.

Owner

gandersen101 commented Aug 17, 2020

Hi again @kaykhancheckpoint. The docs make this unclear, but the current version of spaczz on PyPI (v0.1.1) still uses fuzzywuzzy instead of rapidfuzz. That will change in the next release (v0.2.0, which will also include the ent_id enhancement you asked for).

Regarding your current ask however, fuzzywuzzy vs rapidfuzz shouldn't be an issue. Spaczz's fuzzy matching optional kwargs already expose two fuzzy ratio cutoffs (see min_r1 and min_r2 in the spaczz.fuzzysearcher.match docstring.) min_r1 is mostly a tradeoff in speed vs number of comparisons, but min_r2 is analogous to the score_cutoff parameter in rapidfuzz.
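To make the effect of a ratio cutoff concrete, here is a minimal sketch that uses the standard library's difflib as a stand-in for fuzzywuzzy/rapidfuzz. The helper names `similarity` and `passes_cutoff` are hypothetical and are not part of spaczz, and spaczz's own scoring and span search may produce different numbers. It only illustrates why 'YouTube' can clear a cutoff of 75 against 'JZRyoutube' but not a cutoff of 95.

```python
from difflib import SequenceMatcher


def similarity(pattern: str, candidate: str, ignore_case: bool = True) -> float:
    """Return a 0-100 similarity score between two strings.

    Assumption: difflib's ratio is used here only as an illustrative
    stand-in for the fuzzy ratio computed by fuzzywuzzy or rapidfuzz.
    """
    if ignore_case:
        pattern, candidate = pattern.lower(), candidate.lower()
    return 100.0 * SequenceMatcher(None, pattern, candidate).ratio()


def passes_cutoff(pattern: str, candidate: str, min_ratio: float) -> bool:
    """Keep a candidate only if its score reaches the minimum ratio."""
    return similarity(pattern, candidate) >= min_ratio


# Compare each pattern with the text span it matched in the issue.
for pattern, candidate in [("DrDisrespect", "Dr Disrespect"),
                           ("JZRyoutube", "YouTube")]:
    score = similarity(pattern, candidate)
    print(f"{pattern!r} vs {candidate!r}: score={score:.1f}, "
          f"passes 75: {passes_cutoff(pattern, candidate, 75)}, "
          f"passes 95: {passes_cutoff(pattern, candidate, 95)}")
```

Under this stand-in scoring, 'YouTube' lands between the two cutoffs, which is consistent with it being matched at the default but excluded at 95.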

If you want a minimum fuzzy ratio of 95 like you're asking for, you can change an individual pattern's match behavior in your patterns list. Let's say you want 'JZRyoutube' to only match at min_r2 >= 95, but are okay with 'DrDisrespect' matching at the default min_r2 of >= 75:

        patterns = [
            {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
            {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy', 'kwargs': [{'min_r2': 95}]}
        ]

If you wanted to change the default minimum fuzzy ratio for all fuzzy matches to 95, you could instantiate the SpaczzRuler as follows:

ruler = SpaczzRuler(nlp, spaczz_fuzzy_defaults={'min_r2': 95})

Then you wouldn't have to add the kwargs to each pattern.
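As a rough mental model of how the two levels of configuration relate, the sketch below merges three layers of settings. It is not spaczz code: `effective_fuzzy_kwargs` and `BUILT_IN_FUZZY_DEFAULTS` are hypothetical names, and the precedence shown (per-pattern kwargs override ruler-wide defaults, which override the built-in default) is an assumption that has not been checked against the spaczz source.

```python
# Built-in default quoted in this thread: min_r2 of 75.
BUILT_IN_FUZZY_DEFAULTS = {"min_r2": 75}


def effective_fuzzy_kwargs(ruler_defaults=None, pattern_kwargs=None):
    """Merge settings, with later layers overriding earlier ones.

    Assumed precedence (illustrative only):
    built-in defaults < ruler-wide defaults < per-pattern kwargs.
    """
    merged = dict(BUILT_IN_FUZZY_DEFAULTS)
    merged.update(ruler_defaults or {})
    merged.update(pattern_kwargs or {})
    return merged


# No overrides: the built-in default applies.
print(effective_fuzzy_kwargs())
# Ruler-wide default raised to 95 for every fuzzy pattern.
print(effective_fuzzy_kwargs(ruler_defaults={"min_r2": 95}))
# Only one pattern is made strict; others keep the built-in default.
print(effective_fuzzy_kwargs(pattern_kwargs={"min_r2": 95}))
```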

The current methods to optimize fuzzy matches in spaczz are available through the optional kwargs you can pass to patterns (the keyword arguments in spaczz.fuzzysearcher.match), namely min_r1, min_r2, fuzzy_func, flex, and ignore_case.

There is more granular match filtering I would like to implement, but that is already part of issue #14 (I will provide more details in that issue soon). Because that issue already exists and methods for solving this one are already implemented in spaczz, I'm going to close this issue.

If you feel that you cannot solve your current problem with the methods I've outlined, and issue #14 will not address it, please let me know.

Thanks.
