Threshold-finding plots #142
It would have been good if you had included the previous related work on this problem; it makes it easier for the reader to get the full picture. I don't believe that the similarity scores for matches will form a Gaussian: this distribution is clearly not symmetric around the mean. All plots show …
Since the similarity matrix is potentially huge, we should also consider efficiency:
There is also a second aspect for real world usage: how well suited are the provided hashes for linking? It's a bit premature to generalize from experiments over just one synthetic dataset. Is there a way that anonlink can only store the scores for 'possible matches' in the similarity matrix and discard the 'definite non-matches' on the fly? There are also other graphs to consider:
I agree. Another argument is that the distribution has finite support. Nonetheless, I still claim that a Gaussian is a reasonable approximation.
This is true when you’re doing matching on a bipartite graph: if you have n similarity scores, then there are at most √n matches (with datasets of sizes a and b there are n = ab scores, but at most min(a, b) ≤ √(ab) = √n one-to-one matches). If either of those datasets contains duplicates then you may have more matches than that. Indeed, it is theoretically possible that in an extreme case most similarity scores are matches.
Yes, but I would argue that this interesting area is easier to spot on some graphs than others. (In particular for noisy data, it’s easier to spot in the proportion graph than the other two.)
For a particular threshold, the graphs are no more expensive to make than a matching using this threshold.
Yes, indeed the graphs require us to compute the similarity scores, which is an easily parallelisable step. The second (number of matches) and third (proportion of possible matches) graphs require us to run the greedy solver, which cannot be parallelised.
Yes. We could (1) compute the similarity scores with a low threshold, (2) produce graphs, (3) choose threshold, (4) discard similarity scores below that threshold.
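The four steps above can be sketched as follows. This is a minimal illustration, not anonlink code: `compute_similarity_scores` and `choose_threshold` are hypothetical placeholders standing in for the real scoring step and for the analyst's decision made from the plots.

```python
# Sketch of the four-step workflow. `compute_similarity_scores` and
# `choose_threshold` are hypothetical placeholders, not the anonlink API.

def compute_similarity_scores(records_a, records_b, threshold):
    """Toy stand-in: score record pairs by the fraction of equal fields."""
    scores = []
    for i, a in enumerate(records_a):
        for j, b in enumerate(records_b):
            s = sum(x == y for x, y in zip(a, b)) / len(a)
            if s >= threshold:
                scores.append((i, j, s))
    return scores

def choose_threshold(scores):
    """Placeholder for the analyst's decision, made by inspecting the plots."""
    return 0.5

# (1) compute scores with a deliberately low threshold
records_a = [("alice", "1980"), ("bob", "1990")]
records_b = [("alice", "1981"), ("carol", "1990")]
scores = compute_similarity_scores(records_a, records_b, threshold=0.1)
# (2) produce the plots from `scores` (omitted), (3) choose a threshold ...
t = choose_threshold(scores)
# (4) ... and discard every similarity score below it
kept = [(i, j, s) for i, j, s in scores if s >= t]
```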
To an extent. You can produce part of the graph without storing the whole similarity matrix. With all those graphs, if you want to produce the part of the graph between t and 1, you only need to compute the similarity matrix with a threshold of t.
Good point!
Not in general, as this requires us to run the greedy solver. But we can try approximating it by setting a low value for …
The helper code to generate these kinds of plots has been included in the library, together with examples.
For the last few weeks, I've been thinking about the following question: How can we find the best linkage threshold using only the similarity matrix?
I have now realised that this is the wrong question to ask. There is no such thing as the single best threshold; rather, our choice of the threshold is inherently bound by the compromise between precision and recall that we wish to achieve. Hence, instead of trying to make the choice for the analyst, we should be providing them with the tools they need to make their own decision.
Plots are very helpful in this endeavour! The below plots were generated using synthetic data from the GeCo tool. It's a two-party linkage problem where each party has just over 3,200 records, and where the intersection is about 1,450.
The first plot is a histogram of the similarity scores:
![1](https://user-images.githubusercontent.com/12507513/44009116-b2f5c738-9eec-11e8-9dc4-d4a9d623a464.png)
These plots can be helpful if we assume that the matches and non-matches will each form an approximate Gaussian. Good thresholds will lie in the valley between the two Gaussians.
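A minimal sketch of how the histogram data can be built. The scores below are synthetic stand-ins for a real similarity matrix; in practice something like matplotlib's `plt.hist(scores, bins=...)` would do the binning and the rendering in one call.

```python
# Build histogram data for a similarity-score distribution.
# The scores are synthetic stand-ins for a real similarity matrix.
def histogram(scores, n_bins=10):
    """Count scores falling into n_bins equal-width bins over [0, 1]."""
    counts = [0] * n_bins
    for s in scores:
        # clamp s == 1.0 into the last bin
        counts[min(int(s * n_bins), n_bins - 1)] += 1
    return counts

scores = [0.05, 0.12, 0.18, 0.22, 0.31, 0.78, 0.81, 0.85, 0.9, 1.0]
counts = histogram(scores, n_bins=10)
# Non-matches cluster in the low bins and matches in the high bins;
# under the two-Gaussian assumption a good threshold lies in the
# empty "valley" between the two clusters (bins 4-6 here).
```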
Another plot we can make is the number of matches for a particular threshold:
![2](https://user-images.githubusercontent.com/12507513/44009118-b574d09e-9eec-11e8-98b0-bf1ed727f8ae.png)
Sometimes this line will have a region that is flatter (in this case around 0.8). In that region we are least sensitive to changes in the threshold. That region is also our set of reasonable thresholds.
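The curve behind this plot can be sketched as below. The `greedy_match` here is a simplified greedy one-to-one assignment for illustration only; anonlink's actual greedy solver may differ in its details.

```python
# Sketch: number of matches as a function of the threshold, using a
# simplified greedy one-to-one solver (anonlink's real solver differs).
def greedy_match(scores, threshold):
    """Greedily accept the highest-scoring pairs above the threshold,
    using each left/right record at most once."""
    used_a, used_b, matches = set(), set(), []
    for i, j, s in sorted(scores, key=lambda p: -p[2]):
        if s >= threshold and i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            matches.append((i, j))
    return matches

scores = [(0, 0, 0.95), (0, 1, 0.40), (1, 1, 0.85), (2, 2, 0.55), (3, 3, 0.30)]
thresholds = [0.2, 0.5, 0.8]
curve = [len(greedy_match(scores, t)) for t in thresholds]
# Flat stretches of this curve indicate the thresholds to which the
# matching is least sensitive.
```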
The last plot can be obtained by running the greedy solver and splitting the candidate pairs into those that may be matched (depending on the threshold) and those that will never be matched (because one of their records has a more similar alternative). We plot the proportion of candidate pairs that it is possible to match:
![3](https://user-images.githubusercontent.com/12507513/44009120-ba4bce06-9eec-11e8-88ef-32df5bce9feb.png)
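The proportion plotted above can be sketched as follows. Again this uses a simplified greedy one-to-one matching as a stand-in for anonlink's solver: a pair counts as a "possible match" if the greedy solver would accept it with the threshold set to zero, and everything else is a definite non-match.

```python
# Sketch: split candidate pairs into "possible matches" (pairs the greedy
# solver could accept at some threshold) and "definite non-matches"
# (pairs where one record has a strictly better alternative), then compute
# the proportion of possible matches among the pairs above each threshold.
def possible_matches(scores):
    """A pair is a possible match here if it is accepted when we run a
    greedy one-to-one matching with the threshold set to 0."""
    used_a, used_b, possible = set(), set(), set()
    for i, j, s in sorted(scores, key=lambda p: -p[2]):
        if i not in used_a and j not in used_b:
            used_a.add(i)
            used_b.add(j)
            possible.add((i, j))
    return possible

def proportion_possible(scores, threshold):
    above = [(i, j) for i, j, s in scores if s >= threshold]
    if not above:
        return 0.0
    possible = possible_matches(scores)
    return sum(pair in possible for pair in above) / len(above)

scores = [(0, 0, 0.9), (0, 1, 0.6), (1, 1, 0.8), (2, 0, 0.5)]
# At a low threshold many never-matchable pairs are still included ...
low = proportion_possible(scores, 0.4)
# ... while at a high threshold only possible matches remain.
high = proportion_possible(scores, 0.7)
```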
We immediately see what the good thresholds are. They are in the region that is not fully orange (that region has unreasonably low precision) but also not fully blue (which has very low recall). Since we have the ground truth here, we can verify this:
![real](https://user-images.githubusercontent.com/12507513/44009137-e10b5052-9eec-11e8-9eca-459ad06c5ed2.png)
Noisy data
Let's try that on a dataset that is much noisier than the above!
![1](https://user-images.githubusercontent.com/12507513/44009124-c7c68b16-9eec-11e8-97c0-0c1a5dc7a2e8.png)
![2](https://user-images.githubusercontent.com/12507513/44009126-c7f10f44-9eec-11e8-8d45-662429fa0b22.png)
The above plots fail us. The density is not easily split into two nice Gaussians (although it is possible to guess). Our plot of the number of matches vs the threshold does not have a flat region. However, the third plot provides useful insight, even with noisy data:
![3](https://user-images.githubusercontent.com/12507513/44009128-cd2857ba-9eec-11e8-9bcc-41fb7fbf77b0.png)
By eye alone, the good thresholds appear to be between 0.72 and 0.82, depending on whether you prioritise precision or recall.
Since in this case we have the ground truth, we can plot the actual results:
![real](https://user-images.githubusercontent.com/12507513/44009130-d3770878-9eec-11e8-9e1a-5ad2c9d6a691.png)
And indeed, our guess, made using just the third plot (the proportion of candidate pairs that it is possible to match), is spot on!
Conclusions
These plots are useful, especially the last kind (the proportion of candidate pairs that it is possible to match). We should work out a good way of incorporating this into Anonlink.
@hardbyte @unzvfu @wilko77 thoughts?
Aha! Link: https://csiro.aha.io/features/ANONLINK-62