Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sort similarity scores #335

Closed
nbgl opened this issue Mar 5, 2019 · 4 comments · Fixed by #339
Closed

Sort similarity scores #335

nbgl opened this issue Mar 5, 2019 · 4 comments · Fixed by #339

Comments

@nbgl
Copy link
Contributor

nbgl commented Mar 5, 2019

Similarity scores need to be sorted before solving. There are some helper tools in anonlink.concurrency for this.

@nbgl
Copy link
Contributor Author

nbgl commented Mar 6, 2019

I’ve disabled two tests in test_result_correctness.py to account for the output of anonlink v0.11 being correct, but the Entity Service being still incorrect. They need to be reenabled.

@hardbyte
Copy link
Collaborator

hardbyte commented Jun 28, 2019

I think this may have been incorrectly closed by a comment in #339. The entity service doesn't do a merge sort on similarity scores, however the solving occurs in a single celery task (on a high memory machine) and the anonlink library sorts before solving. @nbgl is there any reason to still want the entity service to merge sort (other than if/when we have to support a parallel solver?)

See:
#336

@hardbyte hardbyte removed this from the Entity Service v1.10 milestone Jun 28, 2019
@nbgl
Copy link
Contributor Author

nbgl commented Jun 28, 2019

The greedy solver needs to consider scores from highest to lowest to maximise accuracy. Sorting before solving is the usual way of doing this (I describe a more efficient way in data61/anonlink#212, but it is more complex to implement).

The Entity Service does actually sort similarity scores now, so this issue may be closed. The code is here as part of #339.

A concern might be that this aggregation is single-threaded, so it might not be the most efficient. Parallelising this merge sort is not a research question, but a software engineering one. (Not hard, but annoying.) But this should be identified as a bottleneck before any work is done, and it should be a separate issue.

@hardbyte
Copy link
Collaborator

Thanks for the clarification Jakub

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants