[FR] Allow inequality conditions in `block_by` #100

etiennebacher · 2024-02-09T16:32:03Z

Is your feature request related to a problem? Please describe.
It is nice to be able to use block_by to filter out some comparisons before computing the string similarity. Currently, it is limited to equality conditions (e.g rows must have the same year to be considered for matching). I have a setting in which I don't want to compare rows if year.y is larger than year.x, i.e I'd like to block matching by the condition year.x > year.y.

I don't know how hard that would be to implement. I could track the usage of block_by (renamed salt in the internals) until the method new() for the Shingleset struct in Rust, but I don't know how that works exactly.

Describe the solution you'd like
Not sure of the syntax this should take. It could be something similar to dplyr::join_by(), so that one could do block_by = block_by(year.x > year.y). Also, that only works if the variable has a different name in each dataset.

Describe alternatives you've considered
Currently I don't use block_by. Instead I match on the full data and filter out by my condition afterwards.

Additional context
/

The text was updated successfully, but these errors were encountered:

beniaminogreen · 2024-02-14T17:40:53Z

This is a very interesting feature request, but I think that it could be tough to implement. It might also be too far outside what I see as the main goals for the package, namely to accelerate big-data joins with locality sensitive hashing techniques.

I am ambivalent about the current blocking functionality in the package and if I could go back in time, I might not have included it as it creates extra code to maintain. If it is useful to people (and it sounds like some are using it), I would also be happy to change my mind about it.

The reason I originally included the blocking is that it's easy to implement / add on to the existing Locality Sensitive Hashing algorithms. Internally the blocking variables are called "salts" because they are literally used to salt hashes. This ensures that only units with the same salting / blocking variables hash collisions, which is a prerequisite for being compared as a potential match.

If we want to add an inequality constraint it seems much tougher to do this within the locality sensitive hashing framework. If we can find away to do it, I would love to have a try at implementing it. I can ask a professor I know about this possibility.

If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm. In this case, I think we shouldn't call it blocking as much as we should include it as part of the by argument (i.e. by = by(name_1 == name_2, year_1 > year_2)). I think that I would be very slightly less enthusiastic about this approach, but I would also be happy to implement it if you think there is sufficient demand for this kind of functionality.

Best,
Ben

etiennebacher · 2024-02-14T21:13:34Z

Thanks for your detailed answer! I have to think more about this (also curious to know what the prof you mention would say).

If we can't find a way to implement the inequality blocking as part of the locality sensitive hashing, we would likely have to add on the inequality as a filter step once the matching rows have been returned from the LSH algorithm.

I don't think this is worth implementing in the package. Blocking on inequality conditions would be a nice feature of the package IMO, because it reduces the work that the LSH has to do, but filtering afterwards based on some condition is a simple operation and is not related to LSH anymore.

beniaminogreen · 2024-02-15T16:14:52Z

Gotcha. Sounds like we are totally of the same mind about this. I will continue searching to see if there's a way to integrate the inequality constraints with the LSH and will keep you posted if I find anything. Until then, I won't try and add any filters to the package.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FR] Allow inequality conditions in `block_by` #100

[FR] Allow inequality conditions in `block_by` #100

etiennebacher commented Feb 9, 2024 •

edited

beniaminogreen commented Feb 14, 2024

etiennebacher commented Feb 14, 2024 •

edited

beniaminogreen commented Feb 15, 2024

[FR] Allow inequality conditions in block_by #100

[FR] Allow inequality conditions in block_by #100

Comments

etiennebacher commented Feb 9, 2024 • edited

beniaminogreen commented Feb 14, 2024

etiennebacher commented Feb 14, 2024 • edited

beniaminogreen commented Feb 15, 2024

[FR] Allow inequality conditions in `block_by` #100

[FR] Allow inequality conditions in `block_by` #100

etiennebacher commented Feb 9, 2024 •

edited

etiennebacher commented Feb 14, 2024 •

edited