Entity service only requesting the top 5 matches for each chunks comparison #59

mpnd · 2017-09-27T23:23:51Z

I'm concerned about the following line in compute_filter_similarity() in async_worker.py:

chunk_results = anonlink.entitymatch.calculate_filter_similarity(chunk_dp1, chunk_dp2,
                                                                 threshold=threshold,
                                                                 k=5,
                                                                 use_python=False)

Why is k arbitrarily set to 5? Is there a better value for k, and if so, why?


hardbyte · 2017-11-22T22:43:13Z

We can probably get rid of k entirely, or handle None so that we return all matches over the threshold.
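To make the "handle None" option concrete, here is a minimal sketch in plain Python. It is not anonlink's code: the function name select_matches, the list-of-scores input, and the choice of >= for "over threshold" are all assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple


def select_matches(scores: List[float],
                   threshold: float,
                   k: Optional[int] = None) -> List[Tuple[int, float]]:
    """Hypothetical helper: return (index, score) pairs, best score first.

    Only scores >= threshold are kept (assumed meaning of "over threshold").
    If k is None, every candidate over the threshold is returned;
    otherwise at most the k best candidates are returned.
    """
    candidates = [(i, s) for i, s in enumerate(scores) if s >= threshold]
    if k is None:
        return sorted(candidates, key=lambda pair: pair[1], reverse=True)
    return heapq.nlargest(k, candidates, key=lambda pair: pair[1])


# Made-up similarity scores of one record against eight candidates.
scores = [0.95, 0.40, 0.91, 0.88, 0.97, 0.93, 0.90, 0.86]

all_matches = select_matches(scores, threshold=0.85)   # k=None: no hard cap
top5 = select_matches(scores, threshold=0.85, k=5)     # current behaviour: cap at 5
```

With these made-up scores, seven candidates pass the threshold, so k=5 silently drops two of them while k=None keeps them all.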

unzvfu · 2018-01-14T23:38:23Z

The function anonlink.entitymatch.calculate_filter_similarity has a Python implementation and a C implementation, with the choice determined by the use_python parameter. For some reason the Python version doesn't use k or threshold at all.

unzvfu · 2018-01-14T23:56:51Z

The primary purpose of both k and threshold seems to be to prune the number of relevant entries in the similarity matrix; threshold being a kind of soft limit and k being a hard limit. Is that right? If so, the correct value of k could just be some system parameter that's determined by how much memory we have available.
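The soft-limit versus hard-limit distinction can be sketched on a toy similarity matrix. This is an illustration only, not the anonlink implementation: prune_similarities and the nested-list matrix are hypothetical, and >= is assumed for the threshold comparison.

```python
from typing import Dict, List, Optional


def prune_similarities(matrix: List[List[float]],
                       threshold: float,
                       k: Optional[int] = None) -> Dict[int, List[int]]:
    """Hypothetical helper: for each row, list the retained column indices.

    threshold drops low scores but gives no bound on how many remain;
    k (if given) additionally caps each row at its k best columns.
    """
    kept: Dict[int, List[int]] = {}
    for row_index, row in enumerate(matrix):
        above = [col for col, score in enumerate(row) if score >= threshold]
        above.sort(key=lambda col: row[col], reverse=True)  # best score first
        kept[row_index] = above if k is None else above[:k]
    return kept


# Made-up 3 x 4 similarity matrix.
matrix = [
    [0.90, 0.80, 0.70, 0.20],
    [0.10, 0.30, 0.20, 0.40],
    [0.60, 0.95, 0.85, 0.75],
]

soft = prune_similarities(matrix, threshold=0.5)       # threshold only
hard = prune_similarities(matrix, threshold=0.5, k=2)  # threshold and k

soft_count = sum(len(cols) for cols in soft.values())
hard_count = sum(len(cols) for cols in hard.values())
```

With the threshold alone, the number of retained entries depends on the data and can be as large as the whole matrix. With k, it is bounded by (number of rows) x k regardless of the data, which is what would make k usable as a memory-driven parameter.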

unzvfu · 2018-01-15T05:54:53Z

As mentioned in data61/anonlink#56, the default values of k and threshold vary from function to function, which suggests that maybe there are no suitable default values. If there are none, then the defaults should be removed everywhere and the caller should be obliged to specify them each time.

unzvfu · 2018-01-15T23:02:43Z

After discussion with @hardbyte we are leaning towards removing k altogether (essentially setting it equal to n), as (i) it is often set to n in practice and (ii) we don't have a clear example of a case where setting it is an obviously good idea or a net win. The main argument for keeping k (indeed the only one that I can see) seems to be that we can use it to reduce the memory footprint when we are confident that the data is clean enough that the correct match will be within the top k scores.

unzvfu · 2018-01-31T05:00:54Z

I think this is closed by #84. Some related discussion is at anonlink/issues/56.

hardbyte added bug P1: urgent labels Oct 3, 2017

hardbyte self-assigned this Oct 3, 2017

hardbyte added this to the ES 1.8 Release milestone Jan 10, 2018

gusmith modified the milestones: ES 1.8 Release, Sprint 18-01-15 Jan 10, 2018

unzvfu mentioned this issue Jan 15, 2018

Refactor main C++ function to avoid use "constant" memory and avoid new/delete data61/anonlink#55

Merged

unzvfu self-assigned this Jan 15, 2018

hardbyte modified the milestones: ES 1.8 Release, Sprint 18-01-15 Jan 19, 2018

hardbyte removed their assignment Jan 25, 2018

hardbyte modified the milestones: Sprint 18-01-15, Sprint 2018-01-29 Jan 25, 2018

hardbyte mentioned this issue Jan 30, 2018

k was a hack #84

Merged

unzvfu closed this as completed Jan 31, 2018
