Threshold-finding plots #142

Closed
nbgl opened this issue Aug 12, 2018 · 3 comments
nbgl commented Aug 12, 2018

For the last few weeks, I've been thinking about the following question: How can we find the best linkage threshold using only the similarity matrix?

I have now realised that this is the wrong question to ask. There is no such thing as the single best threshold; rather, our choice of the threshold is inherently bound by the compromise between precision and recall that we wish to achieve. Hence, instead of trying to make the choice for the analyst, we should be providing them with the tools they need to make their own decision.

Plots are very helpful in this endeavour! The plots below were generated using synthetic data from the GeCo tool. It's a two-party linkage problem where each party has just over 3,200 records and the intersection is about 1,450.

The first plot is a histogram of the similarity scores:
[Plot 1: histogram of similarity scores]

Such a histogram can be helpful if we assume that the matches and the non-matches each form an approximately Gaussian cluster of scores. Good thresholds will lie in the valley between the two Gaussians.
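For reference, such a histogram is cheap to produce once the similarity scores are available. A minimal sketch in Python, assuming `scores` is a 1-D NumPy array of candidate-pair similarities (the file name is a stand-in, not anything anonlink produces):

```python
import matplotlib.pyplot as plt
import numpy as np

# Assumed input: one similarity score per candidate pair, in [0, 1].
scores = np.load('similarity_scores.npy')  # stand-in data source

plt.hist(scores, bins=100, range=(0.0, 1.0))
plt.xlabel('similarity score')
plt.ylabel('number of candidate pairs')
plt.title('Histogram of similarity scores')
plt.show()
```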

Another plot we can make is the number of matches for a particular threshold:
[Plot 2: number of matches vs threshold]

Sometimes this line will have a region that is flatter (in this case around 0.8). In that region we are least sensitive to changes in the threshold. That region is also our set of reasonable thresholds.
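For concreteness, here is a sketch of how this curve can be derived from a single greedy run at a deliberately low threshold. The `greedy_matches` helper below is a plain reimplementation for illustration, not anonlink's solver, and the toy `candidate_pairs` input is made up:

```python
import numpy as np

def greedy_matches(pairs):
    """Greedy matching: take pairs in order of decreasing score and
    accept a pair only if neither of its records is already matched."""
    matched_a, matched_b, accepted = set(), set(), []
    for score, a, b in sorted(pairs, reverse=True):
        if a not in matched_a and b not in matched_b:
            matched_a.add(a)
            matched_b.add(b)
            accepted.append((score, a, b))
    return accepted

# Toy input: (score, record id in A, record id in B).
candidate_pairs = [(0.93, 0, 0), (0.88, 1, 1), (0.85, 0, 1), (0.61, 2, 1)]

accepted = greedy_matches(candidate_pairs)
accepted_scores = np.sort([score for score, _, _ in accepted])

# Number of matches at threshold t = accepted pairs scoring >= t.
thresholds = np.linspace(0.5, 1.0, 200)
n_matches = len(accepted_scores) - np.searchsorted(accepted_scores, thresholds)
```

A single run suffices because the greedy solver considers pairs in decreasing score order, so its decisions on pairs scoring at least t never depend on pairs scoring below t.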

The last plot can be obtained by running the greedy solver and splitting the candidate pairs into those that may be matched (depending on the threshold) and those that will never be matched (because one of their records has a more similar alternative). We plot the proportion of candidate pairs that it is possible to match:
[Plot 3: proportion of candidate pairs that it is possible to match, vs threshold]
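One plausible way to compute that proportion, continuing the sketch above (reusing `greedy_matches` and `candidate_pairs` from it): pairs the solver accepts can be matched at any threshold up to their score, pairs it rejects can never be matched, and for each threshold we take the fraction of surviving pairs that were accepted.

```python
import matplotlib.pyplot as plt
import numpy as np

accepted_scores = np.sort([s for s, _, _ in greedy_matches(candidate_pairs)])
all_scores = np.sort([s for s, _, _ in candidate_pairs])

thresholds = np.linspace(0.5, 1.0, 200)
n_accepted = len(accepted_scores) - np.searchsorted(accepted_scores, thresholds)
n_candidates = len(all_scores) - np.searchsorted(all_scores, thresholds)

# Fraction of candidate pairs at or above each threshold that it is
# possible to match, i.e. that the greedy solver accepted.
proportion = n_accepted / np.maximum(n_candidates, 1)

plt.plot(thresholds, proportion)
plt.xlabel('threshold')
plt.ylabel('proportion of matchable candidate pairs')
plt.show()
```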

We immediately see what the good thresholds are. They are in the region that is not fully orange (that region has unreasonably low precision) but also not fully blue (which has very low recall). Since we have the ground truth here, we can verify this:
[Plot: results against the ground truth]

Noisy data

Let's try that on a dataset that is much noisier than the one above!
[Plots 1 and 2 for the noisy dataset: similarity histogram and number of matches vs threshold]

The above plots fail us. The density is not easily split into two nice Gaussians (although it is possible to guess). Our plot of the number of matches vs the threshold does not have a flat region. However, the third plot provides useful insight, even with noisy data:
[Plot 3 for the noisy dataset: proportion of matchable candidate pairs]

By eye alone, the good thresholds appear to be between 0.72 and 0.82, depending on whether you prioritise precision or recall.

Since in this case we have the ground truth, we can plot the actual results:
[Plot: results against the ground truth for the noisy dataset]

And indeed, our guess made using just the third plot (the proportion of candidate pairs that it is possible to match) is spot on!

Conclusions

These plots are useful, especially the last kind (the proportion of candidate pairs that it is possible to match). We should work out a good way of incorporating it into Anonlink.

@hardbyte @unzvfu @wilko77 thoughts?

Aha! Link: https://csiro.aha.io/features/ANONLINK-62

wilko77 commented Aug 14, 2018

It would have been good to include the previous related work on this problem; that makes it easier for the reader to get the full picture. There is another issue, data61/anonlink-entity-service#246, discussing similar things, and it is good form to link related items in your issue. We also talk about the precision/recall trade-off in the tutorials in clkhash and the entity service.

I don't believe that the similarity scores for matches will form a Gaussian. This distribution is clearly not symmetric around the mean.

All plots show:

  • that the low-score side of the similarity-score mountain is of no use to us;
  • essentially the same area of interest for the choice of the threshold.

Since the similarity matrix is potentially huge, we should also consider efficiency:

  • How expensive is it to compute each of those graphs?
  • Can the computation be parallelized in a similar fashion to computing the similarity scores?
  • Could the graph be a by-product of the computations we do already without incurring any additional costs?
  • Can the graph be computed without having to store the whole similarity matrix?

There is also a second aspect for real-world usage: how well suited are the provided hashes for linking?
Whereas the differences between good and bad data in the histogram are very clear, it is a lot harder to judge the quality of the data looking at the proportion graph.

It's a bit premature to generalize from experiments on just one synthetic dataset; I think we should generate these graphs over a diverse set of datasets.

Is there a way that anonlink can only store the scores for 'possible matches' in the similarity matrix and discard the 'definite non-matches' on the fly?

There are also other graphs to consider:

  • cumulative histogram: plot the number of scores above each threshold against the threshold. (This also gives you an upper bound on the number of matches you can find at each threshold; see the sketch after this list.)
  • ?
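A sketch of that cumulative plot (the synthetic `scores` array is only a stand-in so the snippet runs on its own):

```python
import matplotlib.pyplot as plt
import numpy as np

scores = np.sort(np.random.beta(2, 8, size=100_000))  # synthetic stand-in

thresholds = np.linspace(0.0, 1.0, 500)
# Number of scores >= t: an upper bound on the matches findable at t.
counts = len(scores) - np.searchsorted(scores, thresholds)

plt.plot(thresholds, counts)
plt.xlabel('threshold')
plt.ylabel('number of scores above threshold')
plt.show()
```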

nbgl commented Aug 14, 2018

> I don't believe that the similarity scores for matches will form a Gaussian. This distribution is clearly not symmetric around the mean.

I agree. Another argument is that the distribution has finite support. Nonetheless, I still claim that a Gaussian is a reasonable approximation.

> All plots show that the low-score side of the similarity-score mountain is of no use to us.

This is true when you're doing matching on a bipartite graph: with datasets of sizes a and b there are n = ab similarity scores but at most min(a, b) ≤ √(ab) = √n matches. If either of the datasets contains duplicates, then you may have more matches than that. Indeed, it is theoretically possible that in an extreme case most similarity scores are matches.

> All plots show (…) essentially the same area of interest for the choice of the threshold.

Yes, but I would argue that this interesting area is easier to spot on some graphs than others. (In particular for noisy data, it’s easier to spot in the proportion graph than the other two.)

> How expensive is it to compute each of those graphs?

For a particular threshold, the graphs are no more expensive to produce than a matching at that threshold.

> Can the computation be parallelized in a similar fashion to computing the similarity scores?

Yes; the graphs require us to compute the similarity scores, which is an easily parallelisable step. However, the second (number of matches) and third (proportion of possible matches) graphs also require us to run the greedy solver, which cannot be parallelised.

> Could the graph be a by-product of the computations we do already without incurring any additional costs?

Yes. We could (1) compute the similarity scores with a low threshold, (2) produce graphs, (3) choose threshold, (4) discard similarity scores below that threshold.

> Can the graph be computed without having to store the whole similarity matrix?

To an extent. You can produce part of the graph without storing the whole similarity matrix: for all of these graphs, producing the part between t and 1 only requires computing the similarity matrix with a threshold of t.

> Whereas the differences between good and bad data in the histogram are very clear, it is a lot harder to judge the quality of the data looking at the proportion graph.

Good point!

> Is there a way that anonlink can only store the scores for 'possible matches' in the similarity matrix and discard the 'definite non-matches' on the fly?

Not in general, as this requires running the greedy solver. But we can try approximating it by setting a low value for k.
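For what it's worth, a sketch of that kind of on-the-fly approximation (illustrative only, not anonlink's implementation): stream the scores past a small per-record heap, keeping just each record's k best candidates.

```python
import heapq
from collections import defaultdict

def top_k_per_record(score_stream, k):
    """Keep only each left-hand record's k highest-scoring candidate
    pairs; everything else is discarded as the scores stream past, so
    the full similarity matrix is never stored."""
    heaps = defaultdict(list)  # record_a -> min-heap of its best pairs
    for score, a, b in score_stream:
        heap = heaps[a]
        if len(heap) < k:
            heapq.heappush(heap, (score, b))
        elif score > heap[0][0]:
            heapq.heapreplace(heap, (score, b))
    return {a: sorted(h, reverse=True) for a, h in heaps.items()}
```

Note this only approximates the split: a pair kept here may still be rejected by the greedy solver, and with very heavy duplication a discarded pair could in principle have been a match.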

nbgl commented Nov 7, 2018

The helper code to generate these kinds of plots has been included in the library, together with examples.
