Comparison tasks now based on blocks #527
Consider using |
80ae8a7 to 42a2da3
During review this is deployed at https://blocked-anonlink.easd.data61.xyz/
You wrapped some logic around anonlink.concurrency.split_to_chunks. The alternative would be to modify/extend this functionality in anonlink itself. Did you consider that, and what's your reasoning?
5c8b6a2 to 5467362
Most of the extra stuff relates to the database fields dataprovider id and block id which anonlink doesn't need to worry about. That said |
yield block_name.strip(), count


def iterate_cursor_results(cur, one=True, page_size=4096):
The 'one' argument is a bit confusing. From the name alone it is not obvious what it does.
Do we really need that?
Wouldn't it be cleaner to always yield the full results? That's what the function name says.
We could just call it like this:
for block_name, _ in iterate_cursor_results(cur):
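A minimal sketch of what that simplified generator could look like, assuming `cur` is an already executed DB-API 2.0 cursor. This is an illustration of the suggestion, not the code in this PR; only the function name and `page_size` come from the diff.

```python
def iterate_cursor_results(cur, page_size=4096):
    """Yield every row of an executed DB-API cursor, one row at a time.

    Rows are fetched in pages of `page_size` so the full result set
    never has to be held in memory at once.
    """
    while True:
        rows = cur.fetchmany(page_size)
        if not rows:
            # An empty page means the cursor is exhausted.
            break
        # Always yield the full row; callers unpack what they need.
        yield from rows
```

Callers that only want the first column unpack and discard the rest, as in the loop above, so no flag is needed.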
# Need to read/remove the Postgres Binary Header, Trailer, and the per tuple info
# https://www.postgresql.org/docs/current/sql-copy.html
ignored_header = raw_data[:15]
header_extension = raw_data[16:20]
assert header_extension == b'\x00\x00\x00\x00', "Need to implement skipping postgres binary header extension"
binary_trailer = raw_data[-2:]
assert binary_trailer == b'\xff\xff', "Corrupt COPY of binary data from postgres"
raw_data = raw_data[19:-2]
This code to handle the Postgres binary format could be a helper function. We might need it again somewhere else.
It would also clean up the code in here a bit, and make this function more single purpose.
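A rough sketch of such a helper, assuming the layout given in the Postgres COPY documentation: an 11-byte signature, a 4-byte flags field, a 4-byte header extension length (bytes 15 to 19), and a 2-byte trailer of `\xff\xff`. The function and constant names are hypothetical, and per-tuple parsing is deliberately left to the caller.

```python
import struct

# Hypothetical names; the byte values follow the Postgres COPY BINARY docs.
PG_COPY_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # 11 bytes
PG_COPY_TRAILER = b"\xff\xff"                # int16 -1 marks end of data
FIXED_HEADER_LEN = 19                        # signature + flags + extension length


def strip_pg_binary_copy_envelope(raw_data: bytes) -> bytes:
    """Return the tuple section of a binary COPY stream.

    Removes the file header (including any header extension area)
    and the file trailer. The per-tuple field counts and lengths
    are left in place for the caller to parse.
    """
    if len(raw_data) < FIXED_HEADER_LEN + len(PG_COPY_TRAILER):
        raise ValueError("Binary COPY data is too short")
    if raw_data[:11] != PG_COPY_SIGNATURE:
        raise ValueError("Missing postgres binary COPY signature")
    if raw_data[-2:] != PG_COPY_TRAILER:
        raise ValueError("Corrupt COPY of binary data from postgres")

    # Flags and extension length are 32-bit integers in network byte order.
    _flags, extension_length = struct.unpack("!II", raw_data[11:FIXED_HEADER_LEN])

    # Skip the extension area rather than asserting that it is empty.
    start = FIXED_HEADER_LEN + extension_length
    return raw_data[start:-2]
```

Reading the extension length instead of asserting it is zero means the helper would keep working if a future server version starts using the extension area.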
That's a good point, will do.
Builds on #522, which moved encodings to the database.
This PR changes the comparison tasks to use blocking info. Where the individual blocks are too large we break them into "chunks".
This has a negative performance impact on small record linkage jobs: many small blocks are compared more slowly than fewer, larger blocks. When the jobs get really massive, though, linkage is infeasible without blocking.
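To illustrate the idea of breaking an oversized block into chunks, here is a hypothetical sketch. It is not the code in this PR and not `anonlink.concurrency.split_to_chunks`; the function name, its parameters, and `max_chunk_len` are assumptions made for the example.

```python
from itertools import product


def split_block_into_chunks(size_a, size_b, max_chunk_len):
    """Yield ((start_a, stop_a), (start_b, stop_b)) index ranges.

    `size_a` and `size_b` are the number of encodings each data
    provider has in one block. Together the yielded chunks cover
    every pairwise comparison in the block exactly once, and no
    chunk spans more than `max_chunk_len` records on either side.
    """
    if max_chunk_len <= 0:
        raise ValueError("max_chunk_len must be positive")

    def ranges(size):
        # Half-open [start, stop) ranges of at most max_chunk_len records.
        return [(start, min(start + max_chunk_len, size))
                for start in range(0, size, max_chunk_len)]

    # Every range from provider A is paired with every range from provider B.
    yield from product(ranges(size_a), ranges(size_b))
```

A block that already fits within `max_chunk_len` on both sides yields a single chunk, so small blocks pay only the per-task overhead described above.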