-
Notifications
You must be signed in to change notification settings - Fork 8
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Similarity stats #173
Similarity stats #173
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple of suggestions inline, I'll leave it to you to carefully consider whether including the matplotlib
functions. If so I suggest adding it as an optional dependency in setup.py
.
I like the hypothesis testing!
anonlink/stats.py
Outdated
return num_matches, thresholds | ||
|
||
|
||
def plot_similarities_hist(candidate_pairs, bins=100): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suspect it would be better to return a matplotlib.axes.Axes
instead of calling pyplot.show
.
I'd probably vote for not including the matplotlib plotting in the library at all though - and move this into an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the example idea.
I wanted to include the plotting because I had put a bunch of thought into doing it well, and I didn’t want this effort to get lost. But putting it in an example works as well!
anonlink/stats.py
Outdated
|
||
nonmatches_ratio = [nm * 100 / (m + nm) if (m + nm) else 0 | ||
for nm, m in zip(nonmatches_num, matches_num)] | ||
matches_ratio = [m * 100 / (m + nm) if (m + nm) else 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Trivial but could be useful to refactor this out into a non-matplotlib function that is exposed.
I'm just imagining a javascript/html frontend where it might be nicer - not to mention slightly less revealing - if the Python library could compute the nonmatches/matches ratio too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that was my first thought too! I chose not to do it as some values are undefined (bins where matches + nonmatches = 0), so these need special handling. Options are:
- NaNs
- NumPy masked arrays
These are both fine options, but I thought it’d be neater to just return 2 arrays, and let the user perform the division and decide what to do with the undefined values. I don’t feel strongly about this though.
Would you have a preferred way of handling these values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fair point - let's just leave it how you've done it
@hardbyte, could you please have another look? In particular, I would appreciate feedback on the Jupyter notebook (anonlink/docs/examples/similarity-plots/similarity-plots.ipynb) to which I moved all the plots. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice - I like the jupyter notebook example.
Utilities for viewing statistics on similarities. Mainly, it lets us make plots like in #142.
This PR looks reasonably big, but it should be easy to review. Everything is very self-contained, which lowers the cognitive load.
Three functions return statistics as NumPy arrays:
similarities_hist
matches_nonmatches_hist
cumul_number_matches_vs_threshold
Three more use the above and make Matplotlib plots:plot_similarities_hist
plot_matches_nonmatches_ratio_hist
plot_cumul_number_matches_vs_threshold
Example plots: