Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FR] Allow inequality conditions in block_by #100

Open
etiennebacher opened this issue Feb 9, 2024 · 3 comments
Open

[FR] Allow inequality conditions in block_by #100

etiennebacher opened this issue Feb 9, 2024 · 3 comments

Comments

@etiennebacher
Copy link
Collaborator

etiennebacher commented Feb 9, 2024

Is your feature request related to a problem? Please describe.
It is nice to be able to use block_by to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows if year.y is larger than year.x, i.e I'd like to block matching by the condition year.x > year.y.

I don't know how hard that would be to implement. I could track the usage of block_by (renamed salt in the internals) until the method new() for the Shingleset struct in Rust, but I don't know how that works exactly.

Describe the solution you'd like
Not sure of the syntax this should take. It could be something similar to dplyr::join_by(), so that one could do block_by = block_by(year.x > year.y). Also, that only works if the variable has a different name in each dataset.

Describe alternatives you've considered
Currently I don't use block_by. Instead I match on the full data and filter out by my condition afterwards.

Additional context
/

@beniaminogreen
Copy link
Owner

This is a very interesting feature request, but I think that it could be tough to implement. It might also be too far outside what I see as the main goals for the package, namely to accelerate big-data joins with locality sensitive hashing techniques.

I am ambivalent about the current blocking functionality in the package and if I could go back in time, I might not have included it as it creates extra code to maintain. If it is useful to people (and it sounds like some are using it), I would also be happy to change my mind about it.

The reason I originally included the blocking is that it's easy to implement / add on to the existing Locality Sensitive Hashing algorithms. Internally the blocking variables are called "salts" because they are literally used to salt hashes. This ensures that only units with the same salting / blocking variables hash collisions, which is a prerequisite for being compared as a potential match.

If we want to add an inequality constraint it seems much tougher to do this within the locality sensitive hashing framework. If we can find away to do it, I would love to have a try at implementing it. I can ask a professor I know about this possibility.

If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm. In this case, I think we shouldn't call it blocking as much as we should include it as part of the by argument (i.e. by = by(name_1 == name_2, year_1 > year_2)). I think that I would be very slightly less enthusiastic about this approach, but I would also be happy to implement it if you think there is sufficient demand for this kind of functionality.

Best,
Ben

@etiennebacher
Copy link
Collaborator Author

etiennebacher commented Feb 14, 2024

Thanks for your detailed answer! I have to think more about this (also curious to know what the prof you mention would say).

If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm.

I don't think this is worth implementing in the package. Blocking on inequality conditions would be a nice feature of the package IMO, because it reduces the work that the LSH has to do, but filtering afterwards based on some condition is a simple operation and is not related to LSH anymore.

@beniaminogreen
Copy link
Owner

Gotcha. Sounds like we are totally of the same mind about this. I will continue searching to see if there's a way to integrate the inequality constraints with the LSH and will keep you posted if I find anything. Until then, I won't try and add any filters to the package.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants