
use BLAS in brute force (via NumPy) #5

Closed
wants to merge 5 commits

Conversation

@piskvorky

Replace the slow brute-force algorithm (sklearn) with a direct, fast BLAS call.

Make sure you have a fast BLAS installed -- a recent ATLAS, OpenBLAS, Intel's MKL, Apple's Accelerate, etc. This can make a huge difference in performance.

The PR is not tested at all. @erikbern can you run it on the AWS machine? I just wrote the code and didn't get a chance to run it (the install script assumes Debian). Sorry for any typos.
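For illustration, here is a minimal sketch of the idea -- brute-force cosine search as a single BLAS-backed NumPy call. The names and structure below are only an approximation, not the PR's actual code:

```python
import numpy as np

def brute_force_cosine(index_vectors, query, k=10):
    """Return indices of the k rows of index_vectors most cosine-similar to query.

    Assumes both index_vectors (n_items x dim) and query (dim,) are L2-normalized,
    so cosine similarity reduces to a single matrix-vector product.
    """
    similarities = np.dot(index_vectors, query)   # dispatched to BLAS by NumPy
    return np.argsort(-similarities)[:k]          # full sort; fine for a sketch

# Example usage with random data.
data = np.random.rand(10000, 100)
data /= np.linalg.norm(data, axis=1, keepdims=True)
q = np.random.rand(100)
q /= np.linalg.norm(q)
print(brute_force_cosine(data, q, k=5))
```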

@@ -0,0 +1,3 @@
sudo apt-get install -y python-pip python-dev
sudo apt-get install -y libatlas-dev libatlas3gf-base
sudo apt-get install -y python-numpy
@piskvorky
Author

@erikbern I'm not terribly sure how the Debian-packaged NumPy plays with BLAS... can you check that ATLAS is being picked up by NumPy (i.e. that dot calls are fast)?
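(One quick, generic way to check which BLAS NumPy is linked against -- not part of the PR itself:)

```python
import numpy as np

# Lists the BLAS/LAPACK libraries this NumPy build is linked against;
# an ATLAS/OpenBLAS/MKL entry here means np.dot goes through an optimized BLAS.
np.show_config()
```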

@erikbern
Owner

It seems like the PyPI version of NumPy gets pulled in from some other package, so it might not be needed anyway.

@piskvorky
Author

Alright... can you check the timings for np.dot anyway, just to be sure? What version of NumPy is that?

@erikbern
Owner

LGTM, except it needs support for Euclidean distance as well.

Btw, I wonder if Annoy could leverage BLAS for fast dot products etc.... that probably means I'd have to rewrite some of it in matrix form.

@piskvorky
Author

Hmm, I could have sworn I read "tests are cossim only" in the README Principles, but I don't see it there now.

Will try to add Euclidean when I have time again, but can't promise any ETA :(

Btw, as a sanity check, what is the time for numpy.dot(x, x.T) for x = numpy.random.rand(1000, 1000) on the "benchmark machine"?
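(A minimal way to run that sanity check; the expected time obviously depends on the machine and the BLAS in use:)

```python
import time
import numpy as np

x = np.random.rand(1000, 1000)
t0 = time.time()
np.dot(x, x.T)                # a 1000x1000 matrix product, handled by BLAS
print("np.dot(x, x.T): %.3f s" % (time.time() - t0))
```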

@erikbern
Owner

I changed the benchmarks a bit so that both cosine and Euclidean are represented

@piskvorky
Author

I ran some of the tests on our machines (~dedicated server, not AWS, using ATLAS). The PR seems to work, no typos :)

These look a bit different from your results; I'm not sure why. It's possible I messed something up -- it's not clear to me what the launch process is (I just deleted all files under ./results and then ran install.sh, then ann_benchmarks.py and plot.py). Or maybe it's machine-specific.

Everything except FLANN, ANNOY and KGraph seems to perform worse than brute force, on the GloVe dataset.

@piskvorky
Author

Btw the chosen colour palette of the plot is a benchmark on its own (for my eyesight) :)

@maciejkula

What surprises me is that LSHF does so badly. Does the author know about this benchmark? It's such a recent addition to sklearn.

@maciejkula

@maheshakya pinging you in case you'd like to do some parameter tweaking on this benchmark.

@erikbern
Owner

Thanks for running!

I think it's possible I'm doing something weird with LSHF, but I need to figure it out.

On the benchmark discrepancy: it might be caused by different compiler settings etc. I might look at it later. In particular, Annoy doesn't use any compiler optimizations, so I want to enable some :)

On vacation for another week so this might take a while.

Btw do you think it makes sense to look at bigger data sets?


@hectorgon

Have you tried using hierarchical k-means / SVD (using the eigenvectors as splitting planes)? When I experimented with video embeddings they seemed to do much better than LSH, at least for embeddings trained using dot / cosine distance loss functions. The key thing that got the performance up for us was adding back-tracking to check nearby partitions.

http://www.jmlr.org/proceedings/papers/v28/weston13.pdf

@piskvorky
Author

OK, I've added the Euclidean metric as well. SIFT results:

(I only ran Annoy and brute force because they're fast to build -- there were no changes to the other algos, so their relative performance should stay the same.)

I also noticed that the bottleneck is now NOT the distance computations, but rather the sorting for "k nearest neighbours" at the very end :) So I optimized that too (assumes NumPy >= 1.8). GloVe results:

The brute force algo is now practically on par with its implementation in gensim, at least for the single-query-vector version. So it should be a worthy baseline for all the ANNs :)
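(A rough sketch of the two tricks mentioned above -- Euclidean distance expressed through a BLAS dot product, and top-k selection with numpy.argpartition, which is available since NumPy 1.8. Names are illustrative, not the PR's actual code:)

```python
import numpy as np

def brute_force_euclidean(index_vectors, query, k=10):
    """Top-k Euclidean neighbours using a single BLAS matrix-vector product.

    Uses ||a - q||^2 = ||a||^2 - 2*a.q + ||q||^2; the ||q||^2 term is the
    same for every row, so it can be dropped when only the ranking matters.
    """
    sq_norms = (index_vectors ** 2).sum(axis=1)   # ||a||^2, precomputable per index
    dots = np.dot(index_vectors, query)           # BLAS matrix-vector product
    dists = sq_norms - 2.0 * dots
    # argpartition is a partial sort: roughly O(n) to isolate the k smallest,
    # instead of O(n log n) for a full argsort.
    top_k = np.argpartition(dists, k)[:k]
    return top_k[np.argsort(dists[top_k])]        # order only the k survivors

data = np.random.rand(100000, 128).astype(np.float32)
q = np.random.rand(128).astype(np.float32)
print(brute_force_euclidean(data, q, k=5))
```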

@searchivarius
Contributor

@piskvorky does BLAS create its own copy of the data?

@piskvorky
Author

@searchivarius I think numpy.dot sometimes did memory copies when the matrix order was "wrong" (Fortran order vs. C order, i.e. column-major vs. row-major). For that reason, I used to call GEMM manually in HPC apps (via scipy.linalg.blas), setting the GEMM transposition flags by hand, rather than risk any silent memory copies or inefficiencies.

I actually asked a question about this, many years ago, but there was no good answer.

In any case, this is mostly relevant for BLAS level 3 calls (GEMM, i.e. matrix-matrix multiplications etc.). This benchmark uses only level 2 (matrix-vector), and I'm pretty sure recent NumPy versions always do the sane thing, no matter the underlying BLAS implementation.
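(For illustration, roughly what such a manual GEMM call looks like via scipy.linalg.blas; a generic sketch, not code from this PR:)

```python
import numpy as np
from scipy.linalg import blas

# Two C-ordered (row-major) float32 matrices.
a = np.random.rand(2000, 128).astype(np.float32)
b = np.random.rand(1000, 128).astype(np.float32)

# Compute a @ b.T with an explicit GEMM call: the transposition is expressed
# through the BLAS flag instead of building a transposed array in Python.
# (Whether internal copies are avoided still depends on the memory layout.)
result = blas.sgemm(alpha=1.0, a=a, b=b, trans_b=True)
print(result.shape)  # (2000, 1000)
```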

@piskvorky
Author

One more optimization for brute force: I switched to single-precision math. About a 50% further speed-up:

Now about 30x faster than the original naive version :-)

@searchivarius
Contributor

I haven't even bothered with doubles. Is there any chance that BLAS uses multiple threads? And what is the server that you are using?
Thanks!

@piskvorky
Author

@searchivarius the server is the same one as used in my previous benchmarks. And yes, most BLAS implementations use threads internally (where they deem it fit).

@searchivarius
Contributor

@piskvorky this makes a lot of sense. If you have 8-16 cores, this can easily compensate for the random memory layout. However, this doesn't make for a fair comparison.

@piskvorky
Author

@searchivarius The machine has 4 cores. And all contestants run on the same machine -- isn't that the point of the benchmark?

@searchivarius
Contributor

I don't think so:

Use single core benchmarks. I believe most real world scenarios could be parallelized in other ways (eg. do multiple queries in parallel).

@searchivarius
Contributor

Well, some algorithms might not scale well with the number of threads (though that's not very likely). Still, it is always possible to test this by running a multi-threaded benchmark harness that executes a single-threaded implementation.

@piskvorky
Author

Oops, good catch @searchivarius ! Must be one of the newer Principles I missed, I don't remember reading that earlier.

Anyway, dumbing down BLAS would be too hard to do in general; I'm not going there. I'll leave this PR as is -- up to Erik. It's possible the speed-up is then only 10x :-)

@maciejkula

Does annoy use floats or doubles for this benchmark?

@aaalgo
Contributor

aaalgo commented Jun 18, 2015

Just want to clarify that the pull request is not from me.

@erikbern
Owner

Yes, I was hoping @piskvorky can rebase :)

I can do it if necessary, though.

@piskvorky
Author

This seems to have died; I still think fast pure-Python similarities (BLAS via NumPy) are a worthy baseline: simple, easy to deploy and maintain, fast.

I won't have time for this, but if @aaalgo wants to add the OpenBLAS threading restrictions, that would be cool! (I didn't realize the contenders must be single-threaded, sorry.)
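(For reference, restricting OpenBLAS and most other BLAS builds to one thread is typically done via environment variables set before NumPy is imported; the exact variable depends on the BLAS build:)

```python
import os

# These must be set before NumPy (and hence the BLAS library) is loaded.
os.environ["OPENBLAS_NUM_THREADS"] = "1"   # OpenBLAS
os.environ["OMP_NUM_THREADS"] = "1"        # OpenMP-based BLAS builds
os.environ["MKL_NUM_THREADS"] = "1"        # Intel MKL

import numpy as np  # imported only after the thread limits are set
```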

@erikbern
Owner

I'm also happy to change all benchmarks to be multi-threaded (in that case I'll just use a thread pool to do multiple nearest neighbor searches simultaneously for all libraries).
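(A minimal sketch of that approach, with a stand-in NumPy query function in place of a real library call:)

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def single_query(index_vectors, q, k=10):
    # Placeholder for whichever library's single-query search is being measured.
    dists = ((index_vectors - q) ** 2).sum(axis=1)
    return np.argpartition(dists, k)[:k]

def run_queries(index_vectors, queries, n_threads=4, k=10):
    """Issue many nearest-neighbour queries concurrently against one shared index."""
    # Threads overlap well for libraries that release the GIL inside native code.
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        return list(pool.map(lambda q: single_query(index_vectors, q, k), queries))

data = np.random.rand(10000, 64).astype(np.float32)
queries = np.random.rand(100, 64).astype(np.float32)
results = run_queries(data, queries)
print(len(results))
```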


@piskvorky
Author

I think that makes good sense.

I love that these ANN benchmarks are practical -- practical datasets, practical lib installs, practical implementations. A practical, reproducible HW setup fits the theme nicely IMO.

@aaalgo
Contributor

aaalgo commented Aug 27, 2015

I don't write the fastest Python code, but I'll add a parallel BLAS mode to my KGraph API just for the purpose of this benchmark. Multi-threading can be enabled in KGraph by passing the "threads=8" parameter to the Python search API. It should be a little bit faster than an external thread pool.

@aaalgo
Contributor

aaalgo commented Aug 28, 2015

I have updated my KGraph repository. After rebuilding the source, BLAS mode can be enabled by passing "blas=True", as in "index.search(dataset, query, K=K, blas=True)". There is no need to call index.build if only brute force with BLAS is to be used. The speedup will only start to show once the dimension is > 100.

@erikbern
Owner

erikbern commented May 3, 2016

I merged this as a separate algo that is used to compute the correct results:

3c21b7d

Seems to be about 100x faster than before :)

erikbern closed this May 3, 2016
@searchivarius
Contributor

It could be, because the Python brute force is 10x slower than a single-threaded brute force.

@piskvorky
Author

Hooray! \o/ Thanks @erikbern.

Feel free to add a note that people can bug me regarding this code -- I'll be happy to maintain it, in case of any bugs / questions / extensions.

@piskvorky
Author

piskvorky commented May 27, 2016

@erikbern have the graphs on the main README page been regenerated?

The numbers seem dodgy (brute force slower than FLANN at 100% accuracy, and almost on par with KD), which doesn't match my results on GloVe above. What BLAS was this using?

@erikbern
Owner

Brute force doesn't use BLAS in the benchmarks.

@piskvorky
Author

piskvorky commented May 27, 2016

Do you mean you're letting NumPy automatically link against whatever BLAS is already installed on your system, or that you're specifically disabling external BLAS during the NumPy installation?

@erikbern
Owner

The reason I don't want to use BruteForceBLAS for the benchmarks is that it uses multiple CPU cores by default, and I'm not sure how to disable that.

@piskvorky
Author

piskvorky commented May 27, 2016

Aah, sorry, I thought BruteForce == BruteForceBLAS, for some reason. Never mind then :)

@erikbern
Owner

I might rewrite the benchmarks so they run on multiple threads instead... that way I don't have to worry about it. More realistic, too.

erikbern pushed a commit that referenced this pull request May 16, 2018
maumueller pushed a commit to maumueller/ann-benchmarks-1 that referenced this pull request Nov 20, 2018
Added MIH through subprocess system.
tinkerlin added a commit to tinkerlin/ann-benchmarks that referenced this pull request Jul 1, 2020
erikbern pushed a commit that referenced this pull request Apr 14, 2023
erikbern pushed a commit that referenced this pull request Apr 14, 2023
Added MIH through subprocess system.
erikbern pushed a commit that referenced this pull request Jun 9, 2023
fix Docker build paths in install.py