
Integrate Eigen Library to Remove BLAS Dependency #8

Closed
init-random opened this issue Mar 6, 2016 · 27 comments

@init-random
Contributor

The first step will be to integrate the Eigen library into the codebase and have both BLAS and Eigen code paths. Once that is in place, we can then remove the BLAS dependency.

@honnibal
Member

honnibal commented Mar 6, 2016

Really happy you're looking at this!

We'd like to use Eigen in the machine learning library, thinc, as well. The neural network code there is now very close to release, but we won't be able to ship it until we sort out this BLAS question. So, this is super timely.

@init-random
Contributor Author

OK, great. Hope I can help; I'll have to check that out.

We want to remove the dependency on BLAS. I assume this means we want to remove numpy as a requirement as well?

@honnibal
Member

honnibal commented Mar 6, 2016

numpy doesn't require you to have a BLAS library. It fills in its own, with a basic C implementation. I don't usually rely on numpy for performance-critical stuff, but it's a nice type to return, so we'll probably keep the dependency.

What we want to avoid is a situation where to make the library perform adequately, you have to modify your system, or compile some code, etc.

We want pip install sense2vec to just work. And working includes having adequate performance. With numpy, scipy, etc., users install off pip and then the library is crazy slow. The reply from the maintainers is "well, you didn't install BLAS". We're hoping we can do a bit better than that.

@init-random
Contributor Author

OK, thank you for the detail on the build and numpy.

Here is the feature branch: https://github.com/init-random/sense2vec/tree/eigen-integration
It builds with the Eigen headers, but I'm still working on fleshing it out. It is probably not necessary to retain all of the Eigen headers.

@honnibal
Member

honnibal commented Mar 7, 2016

Had no idea this would look so simple!

We need to come up with a benchmark for this. Maybe just fetch the similarity results for the top N words?

@init-random
Contributor Author

Just the initial commit... but it does not look like it should be too bad.
Are there other examples of benchmarks in the codebase? Should this be just a baseline for Eigen, or a comparison, e.g. against BLAS or the math.h implementation?

@honnibal
Member

honnibal commented Mar 7, 2016

There are no benchmark examples, sorry. I can have a look if it's confusing?

It should just take the most frequent N words from the vocabulary and run the similarity queries for them.
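Schematically, the query being timed looks like the sketch below: one dot product per vocabulary row, then a partial sort for the top K. This is not the sense2vec code; the matrix, dimensions, and helper name are made up for illustration, and it assumes the Eigen path we're discussing.

// Illustration only (not sense2vec code): brute-force "top K most similar"
// query of the kind the benchmark would time, using Eigen with made-up data.
#include <Eigen/Dense>
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Indices of the k rows of `vectors` most similar to row `query`. Rows are
// assumed to be L2-normalised, so a dot product equals cosine similarity.
std::vector<int> most_similar(const Eigen::MatrixXf& vectors, int query, int k) {
    Eigen::VectorXf scores = vectors * vectors.row(query).transpose();
    std::vector<std::pair<float, int> > ranked(vectors.rows());
    for (int i = 0; i < vectors.rows(); ++i)
        ranked[i] = std::make_pair(scores(i), i);
    std::partial_sort(ranked.begin(), ranked.begin() + k, ranked.end(),
                      [](const std::pair<float, int>& a, const std::pair<float, int>& b) {
                          return a.first > b.first;
                      });
    std::vector<int> out(k);
    for (int i = 0; i < k; ++i)
        out[i] = ranked[i].second;
    return out;
}

int main() {
    const int n_words = 10000, dim = 128, top_n = 50, k = 50;
    Eigen::MatrixXf vectors = Eigen::MatrixXf::Random(n_words, dim);
    vectors.rowwise().normalize();  // pre-normalise so dot product == cosine

    // Benchmark shape: for each of the top_n most frequent words (here just
    // the first top_n rows), fetch the k most similar entries.
    for (int w = 0; w < top_n; ++w)
        most_similar(vectors, w, k);
    std::cout << "done" << std::endl;
    return 0;
}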

@init-random
Contributor Author

Not confusing, just wanted to follow a standard if there were existing benchmarks. Happy to take a look.

@init-random
Contributor Author

What corpus are you using for merge_text.py? Is that what you want the benchmark based on?

@honnibal
Member

honnibal commented Mar 8, 2016

We actually don't need a corpus here, just the trained model. You've downloaded that, right? We want to benchmark the similarity queries.

The corpus is the Reddit comment corpus.

@init-random
Contributor Author

Got it... No, I had not downloaded the model yet; I had built my own. I'll do that, thanks.

@init-random
Contributor Author

I added a simple eigen benchmark.

Basically it grabs the top 50, 100, 500, and 1000 most frequent model terms and, for each word in the top N, finds the 50 most similar tokens. If you were looking for something different, let me know.

> python -m sense2vec.eigen_benchmark
Similarity timing for top 50 model terms.
Finding top 50 most similar terms.
Completed in 9.61673880298622 seconds.
----------------------------------------------------------

Similarity timing for top 100 model terms.
Finding top 50 most similar terms.
Completed in 10.223447176860645 seconds.
----------------------------------------------------------

Similarity timing for top 500 model terms.
Finding top 50 most similar terms.
Completed in 85.45848718006164 seconds.
----------------------------------------------------------

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms.
Completed in 121.82304402487352 seconds.
----------------------------------------------------------

@henningpeters
Contributor

Thanks, I just ran a comparison:

math.h:

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms for each model term.
Completed in 66.8856822180096 seconds.

eigen:

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms for each model term.
Completed in 153.325510979 seconds.

I would expect Eigen to perform as well or better. Any idea what could be wrong?

@init-random
Contributor Author

I'll look into this to see where the differences may be. Thanks.

@init-random
Contributor Author

I wrote a small C++ program to time just the dot product. The math.h implementation seems about 13x more efficient. Each iteration is 500000 dot product calculations on a 128-element float vector. Maybe it is the creation of the Eigen vector, Map<VectorXf> v((float*)arr, N), that takes the time? I will look into that next.

eigen dot product
iter: 0
seconds: 2.68831
iter: 1
seconds: 2.68928
iter: 2
seconds: 2.64486
iter: 3
seconds: 2.6382
iter: 4
seconds: 2.64097

math dot product
iter: 0
seconds: 0.209042
iter: 1
seconds: 0.206591
iter: 2
seconds: 0.208673
iter: 3
seconds: 0.210579
iter: 4
seconds: 0.206569

@init-random
Contributor Author

It looks like, for the Eigen implementation, the dot product accounts for about 40% of the compute time and the creation of the vectors for the other 60% (about 30% per vector). So, even with no overhead for creating the vectors, the Eigen dot product would still take about 1 second (2.6 * 40%), which is still roughly 5x the math.h implementation.
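To isolate that, here is a minimal standalone sketch of the comparison I have in mind (not the sense2vec wrapper; buffer contents, sizes, and iteration count are arbitrary), contrasting constructing the Map on every call with mapping the buffers once:

// Standalone sketch: per-call Map construction vs. mapping the buffers once.
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main() {
    const int N = 128;
    const int ITER = 500000;
    float a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 0.01f * i; b[i] = 0.02f * i; }

    using Clock = std::chrono::steady_clock;
    float sink = 0.f;  // keep results alive so the loops are not optimised away

    // Variant 1: build the Map views inside the loop (roughly what a per-call
    // wrapper around a raw float* does).
    auto t0 = Clock::now();
    for (int i = 0; i < ITER; ++i) {
        Eigen::Map<const Eigen::VectorXf> va(a, N), vb(b, N);
        sink += va.dot(vb);
    }
    auto t1 = Clock::now();

    // Variant 2: map the buffers once and reuse the views.
    Eigen::Map<const Eigen::VectorXf> va(a, N), vb(b, N);
    auto t2 = Clock::now();
    for (int i = 0; i < ITER; ++i)
        sink += va.dot(vb);
    auto t3 = Clock::now();

    std::cout << "map per call: " << std::chrono::duration<double>(t1 - t0).count() << " s" << std::endl;
    std::cout << "map once:     " << std::chrono::duration<double>(t3 - t2).count() << " s" << std::endl;
    std::cout << "(checksum: " << sink << ")" << std::endl;
    return 0;
}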

@init-random
Contributor Author

BLAS timings below.

Have you looked at FLENS? http://apfel.mathematik.uni-ulm.de/~lehn/FLENS/index.html
This is a header-only BLAS implementation, which looks like it might be efficient: http://arxiv.org/pdf/1103.3020.pdf

Do you think it is worthwhile to look into this?

blas dot product
iter: 0
seconds: 0.00746393
iter: 1
seconds: 0.00756788
iter: 2
seconds: 0.00784111
iter: 3
seconds: 0.00788403
iter: 4
seconds: 0.00781107

@henningpeters
Contributor

Looking at the code, I suspect Eigen copies the vector every time cblas_sdot or cblas_snrm2 is called. To compare performance fairly, you have to initialize them earlier, specifically here: https://github.com/init-random/sense2vec/blob/eigen-integration/sense2vec/vectors.pyx#L0140.

Alternatively, could you run the timings on dot/norm of Eigen vs. math.h outside sense2vec? I can then help you with integrating it more deeply into sense2vec.

@henningpeters
Contributor

No, we didn't look into FLENS yet. Any reliable performance numbers would be great for making a good decision here.

@honnibal
Member

FLENS looks like it's asking for a system BLAS for optimisation. We're anxious to avoid that.

Have you seen the implementations here? https://bitbucket.org/eigen/eigen/src/c3c494ec0a006d25dd6e6d65864b0fb51fe4da56/blas/?at=default

This issue is starting to block the release of the neural network code in thinc. I'm actually very keen to get that code out there. It's very different from the Theano-based solutions. I think it's quite nice for training small neural networks on CPU.

@init-random
Contributor Author

The timings provided are standalone, outside sense2vec. In essence the code is:

const float vec[] = { 1.2, 3.9, ... };
// iterate and measure the time for each
call_eigen_dot(128, vec, 1, vec, 1);
call_math_dot(128, vec, 1, vec, 1);
call_blas_dot(128, vec, 1, vec, 1);

I can look into FLENS timings. Also, I think it has its own BLAS implementation. From http://apfel.mathematik.uni-ulm.de/~lehn/FLENS/index.html

-- FLENS gives you generic implementation of BLAS
-- If high performance BLAS libraries like ATLAS or GotoBLAS are available...

I can install it and see.

Numpy is a dependency. What if there is a BLAS used there (np.show_config())? Would it be OK to link against that BLAS and otherwise use Eigen or FLENS? I suppose that is not the case for thinc, however.

@init-random
Contributor Author

FLENS did not need to be linked with a BLAS implementation. Timings below, which are much better than Eigen's. This library also has the option to be linked against a BLAS (-DWITH_OPENBLAS and other BLAS implementations are supported), and those numbers look very good.

The compiler needs the -std=c++11 flag; not sure if this is an issue.

flens dot product
iter: 0
seconds: 0.366519
iter: 1
seconds: 0.362471
iter: 2
seconds: 0.365611
iter: 3
seconds: 0.363058
iter: 4
seconds: 0.36223

And when linked with openblas.

flens dot product, linked with openblas
iter: 0
seconds: 0.0203369
iter: 1
seconds: 0.0214162
iter: 2
seconds: 0.0209489
iter: 3
seconds: 0.0212181
iter: 4
seconds: 0.0132542

@henningpeters
Contributor

henningpeters commented Apr 18, 2016

Please take a look at the simd branch (https://github.com/spacy-io/sense2vec/tree/simd). It relies on the libsimdpp library (https://github.com/p12tic/libsimdpp). Implementing fast vector operations on top of it was pretty simple, and performance should be comparable to sense2vec's BLAS backend.

@init-random
Contributor Author

Great. Interested in taking a look. I'll create a new branch for this implementation.

@init-random
Contributor Author

Just to follow up on this. Here are the timings for SIMD.

simd dot product
iter: 0
seconds: 5.49406
iter: 1
seconds: 5.29704
iter: 2
seconds: 5.09631
iter: 3    
seconds: 4.96923
iter: 4
seconds: 4.9608

Here is the code I used for this comparison.

#include <iostream>
#include <sys/time.h>
#include <simdpp/simd.h>

using namespace simdpp;
using namespace std;

static float dot(float const *__pyx_v_v1, float const *__pyx_v_v2,  int __pyx_v_n) {
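  // Accumulates element-wise products SIMDPP_FAST_FLOAT32_SIZE floats at a time
  // and horizontally sums the lanes with reduce_add at the end; variable names
  // are kept from the Cython-generated C. Note that simdpp::load assumes aligned
  // input, so load_u may be needed if the buffers are not SIMD-aligned.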
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_x;
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_z;
  int __pyx_v_i;
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_y;
  float __pyx_r;
  int __pyx_t_1;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __pyx_v_z = make_float(0.0, 0.0, 0.0, 0.0);   
  __pyx_v_i = 0;

  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_n) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_x = load((&(__pyx_v_v1[__pyx_v_i])));
    __pyx_v_y = load((&(__pyx_v_v2[__pyx_v_i])));
    __pyx_v_z = add(mul(__pyx_v_x, __pyx_v_y), __pyx_v_z);
    __pyx_v_i = (__pyx_v_i + SIMDPP_FAST_FLOAT32_SIZE);
  }

  __pyx_r = reduce_add(__pyx_v_z);
  goto __pyx_L0;

  __pyx_r = 0;
  __pyx_L0:;
  return __pyx_r;
}

int main()
{
    const float vec[] = {  1.01527765e-01,   2.79432237e-01,  -3.73641342e-01,
          2.51039743e-01,  -4.02947888e-02,  -1.81458637e-01,
          3.38296473e-01,  -1.52945951e-01,  -4.86327589e-01,
         -1.40676364e-01,  -2.61033531e-02,  -3.77920508e-01,
         -7.91965276e-02,  -5.02799675e-02,   2.30213434e-01,
         -1.41876683e-01,  -1.01002000e-01,   1.13062605e-01,
          3.13347168e-02,  -5.69926053e-02,  -1.59407547e-03,
          6.95948908e-03,  -6.85513392e-02,   2.09068850e-01,
         -1.97719410e-01,   7.33529553e-02,   1.18763991e-01,
          2.59771552e-02,   1.22092061e-01,   2.99735460e-02,
          2.31065676e-01,   1.11942671e-01,   1.76496938e-01,
          2.40554616e-01,   1.23780323e-02,  -1.49769649e-01,
         -6.78487942e-02,  -1.68823540e-01,   1.25860432e-02,
         -1.42609745e-01,  -2.79608935e-01,   4.05597091e-01,
         -1.30016565e-01,  -1.61046922e-01,  -1.19268715e-01,
         -2.28311032e-01,   9.18834880e-02,  -2.17846423e-01,
         -3.90507914e-02,  -8.53851587e-02,  -4.70737405e-02,
         -2.18207464e-01,   2.75754511e-01,   9.13140923e-02,
         -2.79673934e-01,  -5.34516461e-02,   4.93734449e-01,
          1.04101725e-01,  -2.91903373e-02,  -5.06813973e-02,
         -1.94194764e-01,   2.23975092e-01,   8.44103694e-02,
         -9.23509710e-03,  -2.19202667e-01,  -2.17131674e-01,
          6.37218237e-01,   1.22713091e-05,  -2.01738223e-01,
         -9.08939913e-02,   1.71261504e-01,  -1.41390860e-01,
          1.66930482e-01,   1.15627356e-01,   1.80309758e-01,
         -6.89798743e-02,  -9.28448811e-02,  -1.64074153e-01,
          1.57544822e-01,   6.61634728e-02,  -5.09247221e-02,
         -1.29333317e-01,  -2.13789865e-01,   9.47066844e-02,
         -2.44404078e-01,   4.49547440e-01,  -6.95184572e-03,
          3.74490559e-01,   1.68442249e-01,  -6.65905029e-02,
          3.05999577e-01,  -2.00129256e-01,   1.12281077e-01,
         -4.92399186e-02,   2.75324523e-01,   8.90553892e-02,
         -1.88322157e-01,   8.06960985e-02,   1.14517249e-01,
          2.04371959e-02,   2.05912948e-01,  -1.99612379e-01,
          2.54161924e-01,  -1.54119819e-01,   4.35664386e-01,
          3.43942016e-01,   2.20982641e-01,  -1.42827898e-03,
          3.42738152e-01,  -1.75066385e-02,  -1.07859090e-01,
          1.51815578e-01,  -2.30473414e-01,   1.95806384e-01,
         -2.15485111e-01,   4.48448863e-03,   2.25995764e-01,
          9.53625664e-02,   3.98220867e-02,   1.11700244e-01,
          1.67724952e-01,   1.43995574e-02,  -2.46458724e-01,
          1.89194545e-01,  -1.57280952e-01,  -2.26236075e-01,
          1.81051582e-01,   3.59262735e-01};

    struct timeval tim;

    cout << "simd dot product" << endl;
    for(int e=0; e<5; e++) {
        gettimeofday(&tim, NULL);
        double t1=tim.tv_sec+(tim.tv_usec/1000000.0);
        for(int i=0; i<500000; i++) {
           dot(&vec[0], &vec[0], 128);
        }
        gettimeofday(&tim, NULL);
        double t2=tim.tv_sec+(tim.tv_usec/1000000.0);
        cout << "iter: " << e << endl;
        cout << "seconds: " << t2-t1 << endl;
    }
    cout << endl;    
    return 0;
}

The dot function comes from the cosine_similarity function (minus the normalization) in the compiled vectors.pyx. So, in summary, here are the average timings, in seconds for 500000 dot products, for the different libraries.

openblas: 0.0077
flens linked with blas: 0.0194
math.h: 0.2082
flens: 0.3640
eigen: 2.6603
simd: 5.1635

@henningpeters
Contributor

henningpeters commented Apr 25, 2016

Are the numbers milliseconds per iteration? The results from your benchmark don't align with mine. I assume you overlooked something. Two ideas:

  • Is openblas compiled with OpenMP? To compare raw kernel performance you should turn it off.
  • What are your compile flags for the simd solution? You need to enable SIMD instructions via libsimdpp options and compiler flags for a particular instruction set (see the example below).

You should be very suspicious as long as the SIMD approach is slower than a naive math.h implementation.
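For example, something along these lines for the standalone benchmark (the file name is illustrative, and the instruction set should match the CPU; AVX shown here):

g++ -O3 -std=c++11 -mavx -DSIMDPP_ARCH_X86_AVX simd_bench.cpp -o simd_bench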

@init-random
Contributor Author

Timings are seconds for the complete 500000 iterations, but you are correct on both counts. I'll re-run the simd times this evening with the proper compile flags. Thanks.
