
use BLAS in brute force (via NumPy) #5

Closed
wants to merge 5 commits

Conversation

@piskvorky

Replace the slow brute-force algorithm (sklearn) with a direct, fast BLAS call.

Make sure you have a fast BLAS installed -- a recent ATLAS, OpenBLAS, Intel's MKL, Apple's Accelerate, etc. This can make a huge difference in performance.

The PR is not tested at all. @erikbern can you run it on the AWS machine? I just wrote the code and didn't get a chance to run it (the install script assumes Debian). Sorry for any typos.
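For illustration, here is a minimal sketch of the idea -- brute-force cosine search as a single BLAS-backed NumPy call. The names and structure below are only an approximation, not the PR's actual code:

```python
import numpy as np

def brute_force_cosine(index_vectors, query, k=10):
    """Return indices of the k rows of index_vectors most cosine-similar to query.

    Assumes both index_vectors (n_items x dim) and query (dim,) are L2-normalized,
    so cosine similarity reduces to a single matrix-vector product.
    """
    similarities = np.dot(index_vectors, query)   # dispatched to BLAS by NumPy
    return np.argsort(-similarities)[:k]          # full sort; fine for a sketch

# Example usage with random data.
data = np.random.rand(10000, 100)
data /= np.linalg.norm(data, axis=1, keepdims=True)
q = np.random.rand(100)
q /= np.linalg.norm(q)
print(brute_force_cosine(data, q, k=5))
```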

@@ -0,0 +1,3 @@
sudo apt-get install -y python-pip python-dev
sudo apt-get install -y libatlas-dev libatlas3gf-base
sudo apt-get install -y python-numpy
@piskvorky
Author

@erikbern I'm not terribly sure how the Debian-packaged NumPy plays with BLAS... can you check that ATLAS is being picked up by NumPy (i.e. that dot calls are fast)?
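(One quick, generic way to check which BLAS NumPy is linked against -- not part of the PR itself:)

```python
import numpy as np

# Lists the BLAS/LAPACK libraries this NumPy build is linked against;
# an ATLAS/OpenBLAS/MKL entry here means np.dot goes through an optimized BLAS.
np.show_config()
```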

@erikbern
Owner

It seems like the PyPI version of NumPy gets pulled in from some other package, so it might not be needed anyway.

@piskvorky
Author

Alright... can you check the timings for np.dot anyway, just to be sure? What version of NumPy is that?

@erikbern
Owner

LGTM, except it needs support for Euclidean distance as well.

Btw, I wonder if Annoy could leverage BLAS for fast dot products etc.... that probably means I'd have to rewrite some of it in matrix form.

@piskvorky
Author

Hmm, I could have sworn I read "tests are cossim only" in the README Principles, but I don't see it there now.

Will try to add Euclidean when I have time again, but can't promise any ETA :(

Btw, as a sanity check, what is the time for numpy.dot(x, x.T) for x = numpy.random.rand(1000, 1000) on the "benchmark machine"?
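(A minimal way to run that sanity check; the expected time obviously depends on the machine and the BLAS in use:)

```python
import time
import numpy as np

x = np.random.rand(1000, 1000)
t0 = time.time()
np.dot(x, x.T)                # a 1000x1000 matrix product, handled by BLAS
print("np.dot(x, x.T): %.3f s" % (time.time() - t0))
```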

@erikbern
Owner

I changed the benchmarks a bit so that both cosine and Euclidean are represented

@piskvorky
Author

I ran some of the tests on our machines (~dedicated server, not AWS, using ATLAS). The PR seems to work, no typos :)

These look a bit different from your results; I'm not sure why. It's possible I messed something up -- it's not clear to me what the launch process is (I just deleted all files under ./results and then ran install.sh, then ann_benchmarks.py and plot.py). Or maybe it's machine-specific.

Everything except FLANN, ANNOY and KGraph seems to perform worse than brute force, on the GloVe dataset.

@piskvorky
Author

Btw the chosen colour palette of the plot is a benchmark on its own (for my eyesight) :)

@maciejkula

What surprises me is that LSHF does so badly. Does the author know about this benchmark? It's such a recent addition to sklearn.

@maciejkula

@maheshakya pinging you in case you'd like to do some parameter tweaking on this benchmark.

@erikbern
Owner

Thanks for running!

I think it's possible I'm doing something weird with LSHF, but I need to figure it out.

On the benchmark discrepancy: it might be caused by different compiler settings etc. I might look at it later. In particular, Annoy doesn't use any compiler optimizations, so I want to enable some :)

On vacation for another week so this might take a while.

Btw do you think it makes sense to look at bigger data sets?


@hectorgon

Have you tried using hierarchical k-means / SVD (using the eigenvectors as splitting planes)? When I experimented with video embeddings they seemed to do much better than LSH, at least for embeddings trained using dot / cosine distance loss functions. The key thing that got the performance up for us was adding back-tracking to check nearby partitions.

http://www.jmlr.org/proceedings/papers/v28/weston13.pdf

@piskvorky
Author

OK, I've added the Euclidean metric as well. SIFT results:

(I only ran Annoy and brute force because they're fast to build -- there were no changes to the other algos, so their relative performance should stay the same.)

I also noticed that the bottleneck is now NOT the distance computations, but rather the sorting for "k nearest neighbours" at the very end :) So I optimized that too (assumes NumPy >= 1.8). GloVe results:

The brute force algo is now practically on par with its implementation in gensim, at least for the single-query-vector version. So it should be a worthy baseline for all the ANNs :)
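(A rough sketch of the two tricks mentioned above -- Euclidean distance expressed through a BLAS dot product, and top-k selection with numpy.argpartition, which is available since NumPy 1.8. Names are illustrative, not the PR's actual code:)

```python
import numpy as np

def brute_force_euclidean(index_vectors, query, k=10):
    """Top-k Euclidean neighbours using a single BLAS matrix-vector product.

    Uses ||a - q||^2 = ||a||^2 - 2*a.q + ||q||^2; the ||q||^2 term is the
    same for every row, so it can be dropped when only the ranking matters.
    """
    sq_norms = (index_vectors ** 2).sum(axis=1)   # ||a||^2, precomputable per index
    dots = np.dot(index_vectors, query)           # BLAS matrix-vector product
    dists = sq_norms - 2.0 * dots
    # argpartition is a partial sort: roughly O(n) to isolate the k smallest,
    # instead of O(n log n) for a full argsort.
    top_k = np.argpartition(dists, k)[:k]
    return top_k[np.argsort(dists[top_k])]        # order only the k survivors

data = np.random.rand(100000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(brute_force_euclidean(data, q, k=5))
```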

@searchivarius
Contributor

@piskvorky does BLAS create its own copy of the data?

@piskvorky
Author

@searchivarius I think numpy.dot sometimes did memory copies when the matrix order was "wrong" (Fortran order vs. C order, i.e. column-major vs. row-major). For that reason, I used to call GEMM manually in HPC apps (via scipy.linalg.blas), setting the GEMM transposition flags by hand, rather than risk any silent memory copies or inefficiencies.

I actually asked a question about this, many years ago, but there was no good answer.

In any case, this is mostly relevant for BLAS level 3 calls (GEMM, i.e. matrix-matrix multiplications etc.). This benchmark uses only level 2 (matrix-vector), and I'm pretty sure recent NumPy versions always do the sane thing, no matter the underlying BLAS implementation.
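(For illustration, roughly what such a manual GEMM call looks like via scipy.linalg.blas; a generic sketch, not code from this PR:)

```python
import numpy as np
from scipy.linalg import blas

# Two C-ordered (row-major) float32 matrices.
a = np.random.rand(2000, 128).astype(np.float32)
b = np.random.rand(1000, 128).astype(np.float32)

# Compute a @ b.T with an explicit GEMM call: the transposition is expressed
# through the BLAS flag instead of building a transposed array in Python.
# (Whether internal copies are avoided still depends on the memory layout.)
result = blas.sgemm(alpha=1.0, a=a, b=b, trans_b=True)
print(result.shape)  # (2000, 1000)
```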

@piskvorky
Author

One more optimization for brute force: I switched to single-precision math. About a 50% further speed-up:

Now about 30x faster than the original naive version :-)

@searchivarius
Contributor

I haven't even bothered with doubles. Is there any chance that BLAS uses multiple threads? And what is the server that you are using?
Thanks!

@piskvorky
Author

@searchivarius the server is the same one as used in my previous benchmarks. And yes, most BLAS implementations use threads internally (where they deem it fit).

@searchivarius
Contributor

@piskvorky this makes a lot of sense. If you have 8-16 cores, this can easily compensate for the random memory layout. However, this doesn't make for a fair comparison.

@piskvorky
Author

@searchivarius The machine has 4 cores. And all contestants run on the same machine -- isn't that the point of the benchmark?

@searchivarius
Contributor

I don't think so:

Use single core benchmarks. I believe most real world scenarios could be parallelized in other ways (eg. do multiple queries in parallel).

@searchivarius
Contributor

Well, some algorithms might not scale well with the number of threads (though that's not very likely). Still, it is always possible to test this by running a multi-threaded benchmark harness that executes a single-threaded implementation.

@piskvorky
Author

Oops, good catch @searchivarius ! Must be one of the newer Principles I missed, I don't remember reading that earlier.

Anyway, dumbing down BLAS would be too hard to do in general; I'm not going there. I'll leave this PR as is -- up to Erik. It's possible the speed-up is then only 10x :-)

@maciejkula

Does annoy use floats or doubles for this benchmark?

@aaalgo
Contributor

aaalgo commented Jun 18, 2015

Just want to clarify that the pull request is not from me.

@erikbern
Owner

Yes, I was hoping @piskvorky can rebase :)

I can do it if necessary, though.

@piskvorky
Author

This seems to have died; I still think fast pure-Python similarities (BLAS via NumPy) are a worthy baseline: simple, easy to deploy and maintain, fast.

I won't have time for this, but if @aaalgo wants to add the OpenBLAS threading restrictions, that would be cool! (I didn't realize the contenders must be single-threaded, sorry.)
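(For reference, restricting OpenBLAS and most other BLAS builds to one thread is typically done via environment variables set before NumPy is imported; the exact variable depends on the BLAS build:)

```python
import os

# These must be set before NumPy (and hence the BLAS library) is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np  # imported only after the thread limits are set
```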

@erikbern
Owner

I'm also happy to change all benchmarks to be multi-threaded (in that case I'll just use a thread pool to do multiple nearest neighbor searches simultaneously for all libraries).
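(A minimal sketch of that approach, with a stand-in NumPy query function in place of a real library call:)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def single_query(index_vectors, q, k=10):
    # Placeholder for whichever library's single-query search is being measured.
    dists = ((index_vectors - q) ** 2).sum(axis=1)
    return np.argpartition(dists, k)[:k]

def run_queries(index_vectors, queries, n_threads=4, k=10):
    """Issue many nearest-neighbour queries concurrently against one shared index."""
    # Threads overlap well for libraries that release the GIL inside native code.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda q: single_query(index_vectors, q, k), queries))

data = np.random.rand(10000, 64).astype(np.float32)
queries = np.random.rand(100, 64).astype(np.float32)
results = run_queries(data, queries)
print(len(results))
```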


@piskvorky
Author

I think that makes good sense.

I love that these ANN benchmarks are practical -- practical datasets, practical lib installs, practical implementations. A practical, reproducible HW setup fits the theme nicely IMO.

@aaalgo
Contributor

aaalgo commented Aug 27, 2015

I don't write the fastest Python code, but I'll add a parallel BLAS mode to my KGraph API just for the purpose of this benchmark. Multi-threading can be enabled in KGraph by passing the "threads=8" parameter to the Python search API. It should be a little bit faster than an external thread pool.

@aaalgo
Contributor

aaalgo commented Aug 28, 2015

I have updated my KGraph repository. After rebuilding the source, BLAS mode can be enabled by passing "blas=True", as in "index.search(dataset, query, K=K, blas=True)". There is no need to call index.build if only brute force with BLAS is to be used. The speedup will only start to show once the dimension is > 100.

@erikbern
Owner

erikbern commented May 3, 2016

I merged this as a separate algo that is used to compute the correct results:

3c21b7d

Seems to be about 100x faster than before :)

erikbern closed this May 3, 2016
@searchivarius
Contributor

It could be, because the Python brute force is 10x slower than a single-threaded brute force.

@piskvorky
Author

Hooray! \o/ Thanks @erikbern.

Feel free to add a note that people can bug me regarding this code -- I'll be happy to maintain it, in case of any bugs / questions / extensions.

@piskvorky
Author

piskvorky commented May 27, 2016

@erikbern have the graphs on the main README page been regenerated?

The numbers seem dodgy (brute force slower than FLANN at 100% accuracy, and almost on par with KD), which doesn't match my results on GloVe above. What BLAS was this using?

@erikbern
Owner

Brute force doesn't use BLAS in the benchmarks.

@piskvorky
Author

piskvorky commented May 27, 2016

Do you mean you're letting NumPy automatically link against whatever BLAS is already installed on your system, or that you're specifically disabling external BLAS during the NumPy installation?

@erikbern
Owner

The reason I don't want to use BruteForceBLAS for the benchmarks is that it uses multiple CPU cores by default, and I'm not sure how to disable that.

@piskvorky
Author

piskvorky commented May 27, 2016

Aah, sorry, I thought BruteForce == BruteForceBLAS, for some reason. Never mind then :)

@erikbern
Owner

I might rewrite the benchmarks so they run on multiple threads instead... that way I don't have to worry about it. More realistic, too.

erikbern pushed a commit that referenced this pull request May 16, 2018
maumueller pushed a commit to maumueller/ann-benchmarks-1 that referenced this pull request Nov 20, 2018
Added MIH through subprocess system.
tinkerlin added a commit to tinkerlin/ann-benchmarks that referenced this pull request Jul 1, 2020
erikbern pushed a commit that referenced this pull request Apr 14, 2023
erikbern pushed a commit that referenced this pull request Apr 14, 2023
Added MIH through subprocess system.
erikbern pushed a commit that referenced this pull request Jun 9, 2023
fix Docker build paths in install.py