Improve cpredicates.pyx #1145

Merged
merged 4 commits into from
Jan 29, 2023

Conversation

lmores
Contributor

@lmores lmores commented Jan 22, 2023

Changes:

  • Avoid nested for loop in ngrams() function inside cpredicates.pyx.
  • Fix docstrings to match actual implementation.
  • Add tests.
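For readers without the diff at hand, the first item can be illustrated with a minimal pure-Python sketch. It assumes ngrams() returns every contiguous n-character substring of a field, in order, with duplicates kept; the actual Cython code in cpredicates.pyx may differ in signature and details.

```python
def ngrams(field: str, n: int) -> list[str]:
    """Return every contiguous n-character substring of field.

    Illustrative sketch only, not the code from cpredicates.pyx.
    Duplicates are kept and the original order is preserved.
    """
    # A single pass over the start positions; slicing replaces the inner loop.
    return [field[i:i + n] for i in range(len(field) - n + 1)]


print(ngrams("banana", 2))  # ['ba', 'an', 'na', 'an', 'na']
```

A field shorter than n yields an empty list, because the range of start positions is empty.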

I am not sure how to measure the runtime improvement: running python benchmarks/benchmarks/canonical.py takes 11,xxx seconds both before and after the change.
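One possible way to isolate the effect is a micro-benchmark with timeit, since an end-to-end benchmark may be dominated by work outside ngrams(). The sketch below is an assumption-laden illustration: ngrams_nested and ngrams_sliced are hypothetical pure-Python stand-ins for the "before" and "after" versions, not the Cython functions from this PR.

```python
import timeit


def ngrams_nested(field, n):
    """Hypothetical 'before' version: builds each n-gram with an inner loop."""
    grams = []
    for i in range(len(field) - n + 1):
        gram = ""
        for j in range(n):
            gram += field[i + j]
        grams.append(gram)
    return grams


def ngrams_sliced(field, n):
    """Hypothetical 'after' version: one loop, one slice per n-gram."""
    return [field[i:i + n] for i in range(len(field) - n + 1)]


sample = "the quick brown fox jumps over the lazy dog " * 50

# Both versions must agree before their timings are worth comparing.
assert ngrams_nested(sample, 3) == ngrams_sliced(sample, 3)

for fn in (ngrams_nested, ngrams_sliced):
    # Take the best of several repeats to reduce noise from other processes.
    best = min(timeit.repeat(lambda: fn(sample, 3), number=50, repeat=3))
    print(f"{fn.__name__}: {best:.4f} s for 50 calls")
```

To time the compiled functions instead, the same loop could be pointed at the functions imported from the built extension module.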

@fgregg: I am likely to open many more PRs like this one. Please tell me if you are fine with that. Of course, if I plan to submit bigger changes, I will open a thread to discuss them before implementing them.

@codecov

codecov bot commented Jan 22, 2023

Codecov Report

Base: 73.84% // Head: 73.84% // No change to project coverage 👍

Coverage data is based on head (e88470b) compared to base (baa6071).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1145   +/-   ##
=======================================
  Coverage   73.84%   73.84%           
=======================================
  Files          28       28           
  Lines        2294     2294           
=======================================
  Hits         1694     1694           
  Misses        600      600           
Impacted Files Coverage Δ
dedupe/predicates.py 83.95% <100.00%> (ø)


@fgregg
Contributor

fgregg commented Jan 22, 2023

the uniqueness guarantee is required

@lmores
Contributor Author

lmores commented Jan 22, 2023

the uniqueness guarantee is required

But was not enforced, right?

@fgregg
Contributor

fgregg commented Jan 22, 2023

it was enforced by the set object

@fgregg
Contributor

fgregg commented Jan 22, 2023

hmm that’s true

@lmores
Contributor Author

lmores commented Jan 22, 2023

Sorry, but I don't understand. At the moment I see no set() object inside cpredicates.pyx; in particular, ngrams is a list.
I can make it a set if necessary.

@fgregg
Contributor

fgregg commented Jan 22, 2023

you are right that the current code doesn’t enforce uniqueness, i’ll have to check where that is enforced

@fgregg
Contributor

fgregg commented Jan 23, 2023

everywhere we call this in predicates.py, we call set on it.

it's a bit silly to do this. let's have cpredicates fill out a set and then not have those set calls in predicates.py.

@lmores
Contributor Author

lmores commented Jan 23, 2023

Actually there is one exception:

class TfidfNGramPredicate(IndexPredicate):
    def preprocess(self, doc: str) -> Sequence[str]:
        return tuple(sorted(ngrams(" ".join(strip_punc(doc).split()), 2)))

But we probably want the ngrams to be unique here as well?

@fgregg
Contributor

fgregg commented Jan 23, 2023

ah.. we actually don't want ngrams to be unique there.
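The thread does not spell out the reason, but a plausible one is that TF-IDF weighting depends on how often each term occurs in a document, so collapsing repeated n-grams would discard the term-frequency signal. A small sketch of that effect, using a hypothetical pure-Python ngrams helper rather than the Cython one:

```python
from collections import Counter


def ngrams(field, n):
    # Hypothetical stand-in: all contiguous n-grams, duplicates kept.
    return [field[i:i + n] for i in range(len(field) - n + 1)]


grams = ngrams("banana", 2)

# With duplicates, the counts reflect how often each bigram occurs.
term_counts = Counter(grams)
print(term_counts)  # Counter({'an': 2, 'na': 2, 'ba': 1})

# After deduplication, every bigram has a count of exactly 1.
deduplicated_counts = Counter(set(grams))
print(sorted(deduplicated_counts.items()))  # [('an', 1), ('ba', 1), ('na', 1)]
```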

@lmores
Contributor Author

lmores commented Jan 23, 2023

How about the unique_ngrams function I added in the last commit?
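The split being proposed can be sketched as two functions: one that keeps duplicates for the TF-IDF case, and one that builds a set directly so callers no longer need to wrap the result in set(). This is a pure-Python illustration under those assumptions; only the names ngrams and unique_ngrams come from the thread, and the bodies are not the code from the commit.

```python
def ngrams(field: str, n: int) -> list[str]:
    """All contiguous n-grams, in order, duplicates kept (sketch)."""
    return [field[i:i + n] for i in range(len(field) - n + 1)]


def unique_ngrams(field: str, n: int) -> set[str]:
    """Distinct contiguous n-grams, collected straight into a set (sketch)."""
    return {field[i:i + n] for i in range(len(field) - n + 1)}


print(ngrams("banana", 2))                 # ['ba', 'an', 'na', 'an', 'na']
print(sorted(unique_ngrams("banana", 2)))  # ['an', 'ba', 'na']
```

Building the set in one step avoids creating an intermediate list that is immediately converted, which is the redundancy pointed out earlier in the conversation.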

@fgregg
Contributor

fgregg commented Jan 29, 2023

looks good!

@fgregg fgregg merged commit 1f5dfbc into dedupeio:main Jan 29, 2023
@lmores lmores deleted the fix/cpredicates branch February 17, 2023 08:25
2 participants