How to restrict fuzzy search #18

kaykhancheckpoint · 2020-08-17T08:14:29Z

Is it possible to restrict the fuzzy search? In my example it is returning unwanted entities.

Text: """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options""" # Spelling errors intentional.

patterns = [
{'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
{'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
]

('Dr Disrespect', 'PERSON')
('YouTube', 'PERSON')

The unwanted entity here is ('YouTube', 'PERSON'). Is there some way to restrict the fuzzy search so that it does not identify YouTube in the text as a person?

Full Code:

        import spacy
        from spaczz.pipeline import SpaczzRuler

        nlp = spacy.blank("en")
        text = """Dr Disrespect to Returns Aug. 7 With YouTube Stream, Will Explore Other Platform Options""" # Spelling errors intentional.
        doc = nlp(text)

        patterns = [
            {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
            {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy'}
        ]

        ruler = SpaczzRuler(nlp)
        ruler.add_patterns(patterns)
        doc = ruler(doc)

        data = [{
            "label": ent.label_,
            "name": ent.text,
        } for ent in doc.ents]

        for ent in doc.ents:
            print((ent.text, ent.label_))

EDIT:

I noticed that the rapidfuzz library provides a score_cutoff parameter. I'm looking to set this to 95 so that matching is strict. I was hoping something like this could be exposed.


gandersen101 · 2020-08-17T14:01:42Z

Hi again @kaykhancheckpoint. The docs make this unclear, but the current version of spaczz on PyPI (v0.1.1) still uses fuzzywuzzy instead of rapidfuzz. That will change in the next release (v0.2.0, which will also include the ent_id enhancement you asked for).

Regarding your current ask however, fuzzywuzzy vs rapidfuzz shouldn't be an issue. Spaczz's fuzzy matching optional kwargs already expose two fuzzy ratio cutoffs (see min_r1 and min_r2 in the spaczz.fuzzysearcher.match docstring.) min_r1 is mostly a tradeoff in speed vs number of comparisons, but min_r2 is analogous to the score_cutoff parameter in rapidfuzz.

If you want a minimum fuzzy ratio of 95 like you're asking for, you can change an individual pattern's match behavior in your patterns list. Let's say you want 'JZRyoutube' to match only at min_r2 >= 95, but are okay with 'DrDisrespect' matching at the default min_r2 of >= 75:

        patterns = [
            {'label': 'PERSON', 'pattern': 'DrDisrespect', 'type': 'fuzzy'},
            {'label': 'PERSON', 'pattern': 'JZRyoutube', 'type': 'fuzzy', 'kwargs': [{'min_r2': 95}]}
        ]

If you wanted to change the default minimum fuzzy ratio for all fuzzy matches to 95, you could instantiate the SpaczzRuler as follows:

        ruler = SpaczzRuler(nlp, spaczz_fuzzy_defaults={'min_r2': 95})

Then you wouldn't have to add the kwargs to each pattern.
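To make the effect of the cutoff concrete, here is a rough sketch that scores the two patterns from this thread against the text spans they matched. It is an illustration only: it uses the standard library's difflib.SequenceMatcher as a stand-in for the fuzzy ratio, and the helper names fuzzy_ratio and passes_cutoff are hypothetical, not part of spaczz, fuzzywuzzy, or rapidfuzz. The scores spaczz itself computes may differ.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(pattern: str, candidate: str, ignore_case: bool = True) -> int:
    """Return a 0-100 similarity score.

    Assumption: difflib stands in for the real fuzzy scorer here;
    this is NOT the function spaczz calls internally.
    """
    if ignore_case:
        pattern, candidate = pattern.lower(), candidate.lower()
    return round(100 * SequenceMatcher(None, pattern, candidate).ratio())


def passes_cutoff(pattern: str, candidate: str, min_ratio: int) -> bool:
    """Hypothetical helper: keep a candidate only if its score meets the cutoff."""
    return fuzzy_ratio(pattern, candidate) >= min_ratio


# Pattern / matched-span pairs taken from the example in this thread.
pairs = [("JZRyoutube", "YouTube"), ("DrDisrespect", "Dr Disrespect")]

for pattern, candidate in pairs:
    score = fuzzy_ratio(pattern, candidate)
    print(
        f"{pattern!r} vs {candidate!r}: score={score}, "
        f"kept at 75: {passes_cutoff(pattern, candidate, 75)}, "
        f"kept at 95: {passes_cutoff(pattern, candidate, 95)}"
    )
```

With this stand-in scorer, the 'YouTube' candidate clears a cutoff of 75 but not 95, while 'Dr Disrespect' clears both, which is the behavior the stricter min_r2 is meant to produce.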

The current methods to optimize fuzzy matches in spaczz are available through the optional kwargs you can pass to patterns (the keyword arguments in spaczz.fuzzysearcher.match), namely min_r1, min_r2, fuzzy_func, flex, and ignore_case.

There is more granular match filtering I would like to implement, but that is already part of issue #14 (I will provide more details in that issue soon). Because that issue already exists and methods for solving this one are already implemented in spaczz, I'm going to close this issue.

If you feel that you cannot solve your current issue with the methods I've outlined, and issue #14 will not address them, please let me know.

Thanks.

gandersen101 closed this as completed Aug 17, 2020

lfoppiano mentioned this issue Mar 22, 2021

SpaczzRuler configuration #56

Closed
