Suggestion: Box Plot Summary Charts #377
Great suggestion, it would be interesting to see such a plot. Do you have the time to contribute this, Thomas? (Because I don't :-( ) I've uploaded the raw results from the December 2021 run here. The easiest way forward is probably to use … With regard to the AUC, I suppose you don't want to go all the way up to recall 1.0 but rather stop at 0.9 or 0.95. Highest QPS with recall above 0.9 is even simpler?
My hesitation with box plots is that they force us to pick more parameters. What about just breaking the chart up into a few subplots with, say, 5 algorithms each? We could do this alphabetically; the bucketing is not super important.
Nice, have you thought about putting it on the GitHub repo as a "release"? You can host files there up to 2 GB. Your file is 2.17 GB, so either compress it a bit more, or split it in two.
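A minimal sketch of the split-and-reassemble approach suggested above, using standard `gzip` and `split`; the archive name is hypothetical:

```shell
# Hypothetical file name; GitHub release assets are capped at 2 GB each,
# so compress first and then split into chunks below that limit.
gzip -9 results.tar
split -b 1900m results.tar.gz results.tar.gz.part-
# Downloaders reassemble with:
#   cat results.tar.gz.part-* > results.tar.gz
```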
I was going to just add a point at (qps=0, recall=1) to all the curves. An alternative could be sampling the qps at recalls .5, .6, etc., and considering those as separate tasks/datasets. That is, for each recall we find the top qps and divide every other score by that, then add it to the pool of values for the boxplot.
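The pooling step described above could be sketched as follows; the input format (a dict mapping algorithm name to a list of `(recall, qps)` points) is a hypothetical one, not what the benchmark code actually produces:

```python
def normalized_scores(runs, recalls=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """For each recall threshold, take each algorithm's best QPS at or
    above that recall, divide by the overall best at that threshold,
    and pool the normalized scores for a box plot.

    `runs` maps algorithm name -> list of (recall, qps) points
    (hypothetical input format).
    """
    pooled = {algo: [] for algo in runs}
    for r in recalls:
        # Best QPS each algorithm achieves at recall >= r (0 if none).
        best = {
            algo: max((q for rec, q in pts if rec >= r), default=0.0)
            for algo, pts in runs.items()
        }
        top = max(best.values())
        if top == 0:
            continue  # no algorithm reached this recall
        for algo, q in best.items():
            pooled[algo].append(q / top)  # 1.0 = best on this "task"
    return pooled
```

Each recall threshold then acts as a separate "task", and every algorithm contributes one normalized score per task to its box.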
This solves the first problem (charts are hard to read), but makes the second problem worse (there are too many charts).
I didn't know that the limit is that high, good suggestion.
What does this script actually do? It seems to be downloading every dataset to my computer, even though I was just interested in the timing results. Also, how are the timing results 6 GB?
Ok, this is what it looks like, using the above process: Code here: https://gist.github.com/thomasahle/edf90cad1c074c0f589ae46be758485d |
Nice – I think this is clearly much better from a legibility perspective, but a bit trickier from an interpretability perspective. Is there some way to make the score (y-axis) something more natural? I'm thinking you could fix the recall to some distribution and compute the average weighted QPS as the score, or vice versa (fix the QPS to some distribution and compute the average weighted recall). Would something like that work?
You are right that the tricky thing is the freedom we get in defining the score. There's also the question of whether to use log(qps) or qps when computing the area.
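To make the two variants concrete, here is a small sketch of the truncated area-under-curve score, with a flag for integrating log(qps) instead of qps; the point format and function name are assumptions, not code from the repo:

```python
import math

def auc_score(points, max_recall=0.9, use_log=True):
    """Trapezoidal area under the recall-vs-QPS curve, truncated at
    `max_recall` (stopping at 0.9/0.95 as discussed above).

    `points` is a list of (recall, qps) pairs sorted by increasing
    recall (hypothetical format).  With `use_log` the integrand is
    log(qps), which keeps one very fast but inaccurate run from
    dominating the area.
    """
    pts = [(r, math.log(q) if use_log else q)
           for r, q in points if r <= max_recall]
    return sum((r2 - r1) * (y1 + y2) / 2
               for (r1, y1), (r2, y2) in zip(pts, pts[1:]))
```

Whether to normalize this per dataset before pooling is the same question as for the per-recall scores.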
Nice work! I agree that it's both a great summary and difficult to interpret. It's also very curious that the hnsw implementations score so differently.
The size comes from the raw results containing all the actual answers to the search queries, to allow computing other metrics without re-running everything. It downloads the datasets because the ground truth is attached to them. (Though since the raw results likely had the recall values cached, this was probably not necessary.)
Yeah, I think so – this is just normalized 1/QPS, right? Maybe just plotting QPS would make it more interpretable? Happy to play around with this too btw. I want to prioritize running the benchmarks first though, but I'll soon get to the point of having to plot the results :)
It's actually 1/(normalized QPS), that was just easier. But yeah, should probably do normalized 1/QPS instead :-)
The only reason for normalizing is to have a standard scale for each dataset.
Definitely, happy to leave it to you from here :-)
As the number of ANN libraries grows, the QPS vs recall curves are getting harder to read.
On top of that, each curve only represents one dataset.
Could there possibly be a graphical representation that includes all the information from all the curves in a single, easy to read plot?
I suggest taking inspiration from The Computer Language 23.03 Benchmarks Game.
In more detail:
It should be pretty easy to compute, given the raw data. I don't think that is in the git repository though?
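For the kind of single-chart summary proposed here, a box plot over pooled normalized scores could look something like the following; the algorithm names and score values are made-up placeholders:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Hypothetical pooled scores: algorithm -> normalized QPS across datasets,
# where 1.0 means "best algorithm on that dataset".
scores = {
    "hnswlib": [0.9, 1.0, 0.8],
    "annoy":   [0.4, 0.5, 0.3],
    "faiss":   [0.7, 0.6, 1.0],
}

fig, ax = plt.subplots()
ax.boxplot(list(scores.values()))
ax.set_xticks(range(1, len(scores) + 1))
ax.set_xticklabels(scores.keys())
ax.set_ylabel("QPS relative to best (per dataset)")
fig.savefig("summary_boxplot.png")
```

One box per algorithm, pooling all datasets, gives exactly one easy-to-read plot in the spirit of the Benchmarks Game summary.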