Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Similarity stats #173

Merged
merged 7 commits into from
Oct 29, 2018
Merged

Similarity stats #173

merged 7 commits into from
Oct 29, 2018

Conversation

nbgl
Copy link
Contributor

@nbgl nbgl commented Oct 25, 2018

Utilities for viewing statistics on similarities. Mainly, it lets us make plots like in #142.

This PR looks reasonably big, but it should be easy to review. Everything is very self-contained, which lowers the cognitive load.

Three functions return statistics as NumPy arrays:

  • similarities_hist
  • matches_nonmatches_hist
  • cumul_number_matches_vs_threshold

Three more use the above and make Matplotlib plots:

  • plot_similarities_hist
  • plot_matches_nonmatches_ratio_hist
  • plot_cumul_number_matches_vs_threshold

Example plots:
1
2
3

@nbgl nbgl self-assigned this Oct 25, 2018
@nbgl nbgl requested a review from hardbyte October 25, 2018 07:34
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of suggestions inline, I'll leave it to you to carefully consider whether including the matplotlib functions. If so I suggest adding it as an optional dependency in setup.py.

I like the hypothesis testing!

return num_matches, thresholds


def plot_similarities_hist(candidate_pairs, bins=100):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suspect it would be better to return a matplotlib.axes.Axes instead of calling pyplot.show.

I'd probably vote for not including the matplotlib plotting in the library at all though - and move this into an example.

Copy link
Contributor Author

@nbgl nbgl Oct 25, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the example idea.

I wanted to include the plotting because I had put a bunch of thought into doing it well, and I didn’t want this effort to get lost. But putting it in an example works as well!


nonmatches_ratio = [nm * 100 / (m + nm) if (m + nm) else 0
for nm, m in zip(nonmatches_num, matches_num)]
matches_ratio = [m * 100 / (m + nm) if (m + nm) else 0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial but could be useful to refactor this out into a non-matplotlib function that is exposed.

I'm just imagining a javascript/html frontend where it might be nicer - not to mention slightly less revealing - if the Python library could compute the nonmatches/matches ratio too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was my first thought too! I chose not to do it as some values are undefined (bins where matches + nonmatches = 0), so these need special handling. Options are:

  • NaNs
  • NumPy masked arrays

These are both fine options, but I thought it’d be neater to just return 2 arrays, and let the user perform the division and decide what to do with the undefined values. I don’t feel strongly about this though.

Would you have a preferred way of handling these values?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair point - let's just leave it how you've done it

@nbgl
Copy link
Contributor Author

nbgl commented Oct 29, 2018

@hardbyte, could you please have another look? In particular, I would appreciate feedback on the Jupyter notebook (anonlink/docs/examples/similarity-plots/similarity-plots.ipynb) to which I moved all the plots.

@nbgl nbgl requested a review from hardbyte October 29, 2018 03:18
Copy link
Collaborator

@hardbyte hardbyte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice - I like the jupyter notebook example.

@nbgl nbgl merged commit 8574594 into develop Oct 29, 2018
@nbgl nbgl deleted the feature-similarity-stats branch October 29, 2018 06:29
@hardbyte hardbyte added this to the Anonlink 0.11 release milestone Feb 13, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants