-
Notifications
You must be signed in to change notification settings - Fork 543
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve cpredicates.pyx #1145
Improve cpredicates.pyx #1145
Conversation
Codecov ReportBase: 73.84% // Head: 73.84% // No change to project coverage 👍
Additional details and impacted files@@ Coverage Diff @@
## main #1145 +/- ##
=======================================
Coverage 73.84% 73.84%
=======================================
Files 28 28
Lines 2294 2294
=======================================
Hits 1694 1694
Misses 600 600
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here. ☔ View full report at Codecov. |
the uniqueness guarantee is required |
But was not enforced, right? |
it was enforced by the set object |
hmm that’s true |
Sorry, but I don't understand. At the moment I see not set() object inside cpredicates.pyx, in particular ngrams is a list. |
you are right that the current code doesn’t enforce uniqueness, i’ll have to check where that is enforced |
everywhere we call this in predicates.py, we call set on it. it's a bit silly to do this. let's have cpredicates fill out a set and then not have those set calls in predicates.py. |
Actually there is one exception: class TfidfNGramPredicate(IndexPredicate):
def preprocess(self, doc: str) -> Sequence[str]:
return tuple(sorted(ngrams(" ".join(strip_punc(doc).split()), 2))) But we probably want the ngrams to be unique also here? |
ah.. we actually don't want ngrams to be unique there. |
How about the |
looks good! |
Changes:
Not sure how to check runtime improvement, using
python benchmarks/benchmarks/canonical.py
execution time is11,xxx
seconds both before and after the change.@fgregg: I am likely to open many more PR like this. Please tell me if you are fine with them. Of course, if I plan to submit bigger changes I will open a thread to discuss them before actually implementing them.