-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FR] Allow inequality conditions in block_by
#100
Comments
This is a very interesting feature request, but I think that it could be tough to implement. It might also be too far outside what I see as the main goals for the package, namely to accelerate big-data joins with locality sensitive hashing techniques. I am ambivalent about the current blocking functionality in the package and if I could go back in time, I might not have included it as it creates extra code to maintain. If it is useful to people (and it sounds like some are using it), I would also be happy to change my mind about it. The reason I originally included the blocking is that it's easy to implement / add on to the existing Locality Sensitive Hashing algorithms. Internally the blocking variables are called "salts" because they are literally used to salt hashes. This ensures that only units with the same salting / blocking variables hash collisions, which is a prerequisite for being compared as a potential match. If we want to add an inequality constraint it seems much tougher to do this within the locality sensitive hashing framework. If we can find away to do it, I would love to have a try at implementing it. I can ask a professor I know about this possibility. If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm. In this case, I think we shouldn't call it blocking as much as we should include it as part of the Best, |
Thanks for your detailed answer! I have to think more about this (also curious to know what the prof you mention would say).
I don't think this is worth implementing in the package. Blocking on inequality conditions would be a nice feature of the package IMO, because it reduces the work that the LSH has to do, but filtering afterwards based on some condition is a simple operation and is not related to LSH anymore. |
Gotcha. Sounds like we are totally of the same mind about this. I will continue searching to see if there's a way to integrate the inequality constraints with the LSH and will keep you posted if I find anything. Until then, I won't try and add any filters to the package. |
Is your feature request related to a problem? Please describe.
It is nice to be able to use
block_by
to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows ifyear.y
is larger thanyear.x
, i.e I'd like to block matching by the conditionyear.x > year.y
.I don't know how hard that would be to implement. I could track the usage of
block_by
(renamedsalt
in the internals) until the methodnew()
for theShingleset
struct in Rust, but I don't know how that works exactly.Describe the solution you'd like
Not sure of the syntax this should take. It could be something similar to
dplyr::join_by()
, so that one could doblock_by = block_by(year.x > year.y)
. Also, that only works if the variable has a different name in each dataset.Describe alternatives you've considered
Currently I don't use
block_by
. Instead I match on the full data and filter out by my condition afterwards.Additional context
/
The text was updated successfully, but these errors were encountered: