Comparison tasks now based on blocks #527

hardbyte · 2020-03-09T22:23:18Z

Builds on #522, which moved encodings to the database.

This PR changes the comparison tasks to use blocking info. Where the individual blocks are too large we break them into "chunks".

This has a negative performance impact on small record linkage jobs: comparing many small blocks is slower than comparing a few large ones. For really massive jobs, though, comparison is infeasible without blocking.
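For readers following along, here is a minimal sketch of the chunking idea. It is not the `_create_work_chunks` code from this PR. The names `split_range`, `chunk_block`, and `max_chunk_size` are hypothetical, and it assumes exactly two data providers per block.

```python
from itertools import product


def split_range(size, max_chunk_size):
    """Split range(size) into consecutive (start, stop) intervals of at most max_chunk_size."""
    return [(start, min(start + max_chunk_size, size))
            for start in range(0, size, max_chunk_size)]


def chunk_block(block_id, size_a, size_b, max_chunk_size):
    """Yield comparison chunks for one block shared by two data providers.

    Each chunk pairs an interval of provider A's records with an interval of
    provider B's records, so a block that is already small yields one chunk.
    """
    for range_a, range_b in product(split_range(size_a, max_chunk_size),
                                    split_range(size_b, max_chunk_size)):
        yield {"block_id": block_id, "range_a": range_a, "range_b": range_b}


# A block with 5 records from A and 3 from B, limited to 2 records per side.
chunks = list(chunk_block("block-1", size_a=5, size_b=3, max_chunk_size=2))
```

Each chunk carries fixed scheduling overhead no matter how few records it holds, which is one plausible reason many small blocks compare more slowly than a few large ones.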

hardbyte · 2020-03-16T01:15:47Z

Consider using anonlink.candidate_generation._merge_similarities instead for deduplication.
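To illustrate why deduplication is needed: two records that share more than one block are compared once per shared block, so the same candidate pair can appear in several chunk results. The sketch below is plain Python and is not anonlink's `_merge_similarities`. The function name and the `(similarity, index_a, index_b)` tuple layout are assumptions made for illustration.

```python
def merge_candidate_pairs(chunk_results):
    """Merge per-chunk results, keeping each (index_a, index_b) pair once."""
    best = {}
    for results in chunk_results:
        for similarity, index_a, index_b in results:
            pair = (index_a, index_b)
            # The same pair compared in two blocks should score the same, so
            # keeping the maximum only guards against small numeric differences.
            if pair not in best or similarity > best[pair]:
                best[pair] = similarity
    # Highest similarity first; ties broken by record indices for determinism.
    return sorted(((s, a, b) for (a, b), s in best.items()),
                  key=lambda t: (-t[0], t[1], t[2]))
```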

hardbyte · 2020-03-16T22:01:46Z

While under review, this branch is deployed at https://blocked-anonlink.easd.data61.xyz/

wilko77 · 2020-03-17

You wrapped some logic around anonlink.concurrency.split_to_chunks. The alternative would be to modify/extend this functionality in anonlink itself. Did you consider that, and what's your reasoning?

backend/entityservice/database/selections.py

backend/entityservice/tasks/comparing.py

backend/entityservice/tasks/run.py

backend/entityservice/utils.py

hardbyte · 2020-03-18T04:18:37Z

> You wrapped some logic around anonlink.concurrency.split_to_chunks. The alternative would be to modify/extend this functionality in anonlink itself. Did you consider that, and what's your reasoning?

Most of the extra logic relates to the database fields dataprovider id and block id, which anonlink doesn't need to worry about. That said, _create_work_chunks could, with some modification, find its way into the anonlink library, at which point we would update the entity service to use the higher-level functionality.

hardbyte · 2020-03-22T21:46:41Z

I wasn't that happy with the time taken to retrieve the encodings from the database, so I've switched to using a binary COPY command.
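As a rough sketch of the approach (not the code in this PR): psycopg2 cursors provide `copy_expert`, which streams the output of a `COPY` statement into a file-like object. The helper name `copy_query_as_binary` is hypothetical, and `select_query` is assumed to be trusted SQL because it is interpolated directly into the statement.

```python
import io


def copy_query_as_binary(cur, select_query):
    """Run select_query via COPY ... TO STDOUT in binary format and return the raw bytes."""
    buffer = io.BytesIO()
    # copy_expert writes the COPY output to any object with a write() method.
    cur.copy_expert(f"COPY ({select_query}) TO STDOUT WITH (FORMAT binary)", buffer)
    return buffer.getvalue()
```

The returned bytes still include the binary COPY header, the per-tuple field counts and lengths, and the trailer, so they need to be parsed before the encodings can be used.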

wilko77 · 2020-03-23T00:28:24Z

backend/entityservice/database/selections.py

+        yield block_name.strip(), count
+
+
+def iterate_cursor_results(cur, one=True, page_size=4096):


The 'one' argument is a bit confusing. From the name alone it is not obvious what it does.
Do we really need that?
Wouldn't it be cleaner to always yield the full results? That's what the function name says.
We could just call it like this:
for block_name, _ in iterate_cursor_results(cur):
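A sketch of what that simplification might look like, assuming `cur` is a DB-API cursor on which a query has already been executed. This is an illustration of the suggestion, not the code that was merged.

```python
def iterate_cursor_results(cur, page_size=4096):
    """Yield every row from an executed DB-API cursor, fetching page_size rows at a time."""
    while True:
        rows = cur.fetchmany(page_size)
        if not rows:
            break
        # Always yield the full row; callers unpack the columns they need.
        yield from rows
```

Callers that only want the first column can then unpack it at the call site, as in the loop above, rather than passing a flag.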

wilko77 · 2020-03-23T01:05:49Z

backend/entityservice/database/selections.py

+    # Need to read/remove the Postgres Binary Header, Trailer, and the per tuple info
+    # https://www.postgresql.org/docs/current/sql-copy.html
+    ignored_header = raw_data[:15]
+    header_extension = raw_data[15:19]
+    assert header_extension == b'\x00\x00\x00\x00', "Need to implement skipping postgres binary header extension"
+    binary_trailer = raw_data[-2:]
+    assert binary_trailer == b'\xff\xff', "Corrupt COPY of binary data from postgres"
+    raw_data = raw_data[19:-2]


This code to handle the Postgres binary format could be a helper function. We might need it again somewhere else.
It would also clean up the code in here a bit, and make this function more single purpose.
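One possible shape for such a helper, sketched from the binary format described in the Postgres COPY documentation: an 11-byte signature, a 32-bit flags field, a 32-bit header extension length, then tuples, then a 16-bit trailer of -1. The name `iter_binary_copy_rows` is hypothetical, and this is not the function that was later added in this PR.

```python
import io
import struct

PG_SIGNATURE = b"PGCOPY\n\xff\r\n\x00"  # fixed 11-byte signature


def iter_binary_copy_rows(raw_data):
    """Yield each row of a Postgres binary COPY dump as a list of bytes (None for NULL)."""
    stream = io.BytesIO(raw_data)
    if stream.read(11) != PG_SIGNATURE:
        raise ValueError("Not a Postgres binary COPY stream")
    # 32-bit flags field, then 32-bit header extension length (network byte order).
    _flags, extension_length = struct.unpack("!II", stream.read(8))
    stream.read(extension_length)  # skip any header extension
    while True:
        # Each tuple starts with a 16-bit field count; -1 marks the file trailer.
        (field_count,) = struct.unpack("!h", stream.read(2))
        if field_count == -1:
            return
        row = []
        for _ in range(field_count):
            # Each field is a 32-bit byte length (-1 for NULL) followed by the data.
            (length,) = struct.unpack("!i", stream.read(4))
            row.append(None if length == -1 else stream.read(length))
        yield row
```

A truncated stream surfaces as a `struct.error` from the short read, which a production helper would probably want to wrap in a clearer exception.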

hardbyte · 2020-03-23

That's a good point, will do.

hardbyte requested a review from wilko77 March 11, 2020 00:22

hardbyte changed the base branch from feature-use-encodings-from-db to develop March 16, 2020 01:17

hardbyte force-pushed the create-comparisons-to-use-blocking branch from 80ae8a7 to 42a2da3 Compare March 16, 2020 21:24

wilko77 reviewed Mar 17, 2020


hardbyte and others added 18 commits March 18, 2020 16:38

Add function to fetch block ids and sizes from db

065b11a

Retrieve blocking info in create_comparison_jobs task

6f7c75c

WIP - identify blocks that need to be broken up further

ebcb248

Query for getting encodings in a block

a066ccc

Split tasks into chunks using blocking information

fb550de

Refactor create comparison jobs function

610b3bb

More refactoring of chunk creation

d838fe4

Add a few unit tests for chunking

ec36e8d

Add database index on encodings table

ddcbcc3

clknblocks not clksnblocks and other minor cleanup

4ab16e6

cleanup

d66bf58

Add blocking concept to docs

1e5151f

Deduplicate candidate pairs before solving

aec9b5c

Catch the empty candidate pair case

f30c819

Simplify solver task by using anonlink's _merge_similarities function

9dc59e1

Update celery

6219e44

Address code review feedback

0b6a4c2

Bump version to beta2

5467362

hardbyte force-pushed the create-comparisons-to-use-blocking branch from 5c8b6a2 to 5467362 Compare March 18, 2020 03:51

hardbyte added 4 commits March 19, 2020 15:00

Celery concurrency defaults

f342d5a

Add another layer of tracing into the comparison task

2add5ef

Update task names in celery routing

e2ebe99

Faster encoding retrieval by using COPY.

38b624f

Pass on stored size when retrieving encodings from DB

7ec7fef

hardbyte requested a review from wilko77 March 22, 2020 21:59

Increase time on test

24caa79

wilko77 approved these changes Mar 23, 2020


hardbyte added 3 commits March 24, 2020 11:48

Refactor binary copy into own function for easier reuse and testing

dc1983b

Add more detailed tracing around binary encoding insertions.

8bae410

Add tests for binary copy function

88e968d

hardbyte merged commit 7dbcf2d into develop Mar 24, 2020

This was referenced Apr 30, 2020

Release v1.13.0 beta.2 #553

Merged

Merge develop into master for v1.13.0-beta2 #554

Merged

hardbyte deleted the create-comparisons-to-use-blocking branch March 22, 2021 01:48
