Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Comparison tasks now based on blocks #527

Merged
merged 27 commits into from
Mar 24, 2020

Conversation

hardbyte
Copy link
Collaborator

@hardbyte hardbyte commented Mar 9, 2020

Builds on from #522 which moved encodings to the database

This PR changes the comparison tasks to use blocking info. Where the individual blocks are too large we break them into "chunks".

This has a negative performance impact on small record linkage jobs. Many small blocks are compared slower than larger blocks. When the jobs get really massive though it is infeasible without blocking.

@hardbyte hardbyte requested a review from wilko77 March 11, 2020 00:22
@hardbyte
Copy link
Collaborator Author

Consider using anonlink.candidate_generation._merge_similarities instead for deduplication.

@hardbyte hardbyte changed the base branch from feature-use-encodings-from-db to develop March 16, 2020 01:17
@hardbyte hardbyte force-pushed the create-comparisons-to-use-blocking branch from 80ae8a7 to 42a2da3 Compare March 16, 2020 21:24
@hardbyte
Copy link
Collaborator Author

During review this is deployed at https://blocked-anonlink.easd.data61.xyz/

Copy link
Collaborator

@wilko77 wilko77 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you wrapped some logic around anonlink.concurrency.split_to_chunks. The alternative would be to modify/extend this functionality in anonlink itself. Did you consider that and what's your reasoning?

backend/entityservice/database/selections.py Outdated Show resolved Hide resolved
backend/entityservice/tasks/comparing.py Show resolved Hide resolved
backend/entityservice/tasks/comparing.py Outdated Show resolved Hide resolved
backend/entityservice/tasks/comparing.py Outdated Show resolved Hide resolved
backend/entityservice/tasks/comparing.py Outdated Show resolved Hide resolved
backend/entityservice/tasks/comparing.py Outdated Show resolved Hide resolved
backend/entityservice/tasks/run.py Show resolved Hide resolved
backend/entityservice/utils.py Outdated Show resolved Hide resolved
@hardbyte hardbyte force-pushed the create-comparisons-to-use-blocking branch from 5c8b6a2 to 5467362 Compare March 18, 2020 03:51
@hardbyte
Copy link
Collaborator Author

you wrapped some logic around anonlink.concurrency.split_to_chunks. The alternative would be to modify/extend this functionality in anonlink itself. Did you consider that and what's your reasoning?

Most of the extra stuff relates to the database fields dataprovider id and block id which anonlink doesn't need to worry about. That said _create_work_chunks could, with some modification, find its way into the anonlink library at which point we would update the entity service to use the higher level functionality.

@hardbyte
Copy link
Collaborator Author

I wasn't that happy with the time taken to retrieve the encodings from the database so I've changed to using a binary COPY command.

image

@hardbyte hardbyte requested a review from wilko77 March 22, 2020 21:59
yield block_name.strip(), count


def iterate_cursor_results(cur, one=True, page_size=4096):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the 'one' argument is a bit confusing. From the name alone it is not obvious what it does.
Do we really need that?
Wouldn't it be cleaner to always yield the full results? That's what the function name says.
We could just call it like this:
for block_name, _ in iterate_cursor_results(cur):

Comment on lines 354 to 361
# Need to read/remove the Postgres Binary Header, Trailer, and the per tuple info
# https://www.postgresql.org/docs/current/sql-copy.html
ignored_header = raw_data[:15]
header_extension = raw_data[16:20]
assert header_extension == b'\x00\x00\x00\x00', "Need to implement skipping postgres binary header extension"
binary_trailer = raw_data[-2:]
assert binary_trailer == b'\xff\xff', "Corrupt COPY of binary data from postgres"
raw_data = raw_data[19:-2]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code to handle the Postgres binary format could be a helper function. We might need it again somewhere else.
It would also clean up the code in here a bit, and make this function more single purpose.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, will do.

@hardbyte hardbyte merged commit 7dbcf2d into develop Mar 24, 2020
@hardbyte hardbyte deleted the create-comparisons-to-use-blocking branch March 22, 2021 01:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants