Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Entity service only requesting the top 5 matches for each chunks comparison #59

Closed
mpnd opened this issue Sep 27, 2017 · 6 comments
Closed

Comments

@mpnd
Copy link

mpnd commented Sep 27, 2017

I'm concerned with the following line on compute_filter_similarity() in async_worker.py:

chunk_results = anonlink.entitymatch.calculate_filter_similarity(chunk_dp1, chunk_dp2,
                                                                 threshold=threshold,
                                                                 k=5,
                                                                 use_python=False)

Why is k arbitrarily set to 5? Is there a better value for k and why?

@hardbyte
Copy link
Collaborator

We can probably get rid of k entirely - or handle None so we return all matches over threshold.

@hardbyte hardbyte added this to the ES 1.8 Release milestone Jan 10, 2018
@gusmith gusmith modified the milestones: ES 1.8 Release, Sprint 18-01-15 Jan 10, 2018
@unzvfu
Copy link
Contributor

unzvfu commented Jan 14, 2018

The function anonlink.entitymatch.calculate_filter_similarity has a Python implementation and a C implementation; the choice being determined by the parameter use_python. For some reason the Python version doesn't use k or threshold at all.

@unzvfu
Copy link
Contributor

unzvfu commented Jan 14, 2018

The primary purpose of both k and threshold seems to be to prune the number of relevant entries in the similarity matrix; threshold being a kind of soft limit and k being a hard limit. Is that right? If so, the correct value of k could just be some system parameter that's determined by how much memory we have available.

@unzvfu
Copy link
Contributor

unzvfu commented Jan 15, 2018

As mentioned in data61/anonlink#56 the default values of k and threshold vary from function to function which suggests that maybe there are no suitable default values. If not, then the defaults should be removed everywhere and the caller should be obliged to specify them each time.

@unzvfu
Copy link
Contributor

unzvfu commented Jan 15, 2018

After discussion with @hardbyte we are leaning towards removing k altogether---essentially setting it equal to n---, as (i) it is often set to n in practice and (ii) we don't have a very clear example of a case where setting it is an obviously good idea or net win. The main argument for keeping k (indeed the only one that I can see) seems to be that we can use it to reduce the memory footprint when we are confident that the data is clean enough that the correct match will be within the top k scores.

@unzvfu unzvfu self-assigned this Jan 15, 2018
@hardbyte hardbyte modified the milestones: ES 1.8 Release, Sprint 18-01-15 Jan 19, 2018
@hardbyte hardbyte removed their assignment Jan 25, 2018
@hardbyte hardbyte mentioned this issue Jan 30, 2018
@unzvfu
Copy link
Contributor

unzvfu commented Jan 31, 2018

I think this is closed by #84. Some related discussion is at anonlink/issues/56.

@unzvfu unzvfu closed this as completed Jan 31, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants