
Integrate Eigen Library to Remove BLAS Dependency #8

Closed
init-random opened this issue Mar 6, 2016 · 27 comments

@init-random
Contributor

The first step will be to integrate the Eigen library into the codebase and have both BLAS and Eigen code paths. Once that is in place, we can then remove the BLAS dependency.

@honnibal
Member

honnibal commented Mar 6, 2016

Really happy you're looking at this!

We'd like to use Eigen in the machine learning library, thinc, as well. The neural network code there is now very close to release, but we won't be able to ship it until we sort out this BLAS question. So, this is super timely.

@init-random
Contributor Author

OK, great. Hope I can help; I'll have to check that out.

We want to remove the dependency on BLAS. I assume this means we want to remove numpy as a requirement as well?

@honnibal
Member

honnibal commented Mar 6, 2016

numpy doesn't require you to have a BLAS library. It fills in its own, with a basic C implementation. I don't usually rely on numpy for performance-critical stuff, but it's a nice type to return, so we'll probably keep the dependency.

What we want to avoid is a situation where to make the library perform adequately, you have to modify your system, or compile some code, etc.

We want pip install sense2vec to just work. And working includes having adequate performance. With numpy, scipy, etc., users install off pip and then the library is crazy slow. The reply from the maintainers is "well, you didn't install BLAS". We're hoping we can do a bit better than that.

@init-random
Contributor Author

OK, thank you for the detail on the build and numpy.

Here is the feature branch: https://github.com/init-random/sense2vec/tree/eigen-integration
It builds with the Eigen headers, but I'm still working on fleshing it out. It is probably not necessary to retain all of the Eigen headers.

@honnibal
Member

honnibal commented Mar 7, 2016

Had no idea this would look so simple!

We need to come up with a benchmark for this. Maybe just fetch the similarity results for the top N words?

@init-random
Contributor Author

Just the initial commit... but it does not look like it should be too bad.
Are there other examples of benchmarks in the codebase? Should this be just a baseline for Eigen, or a comparison, e.g. against BLAS or the math.h implementation?

@honnibal
Member

honnibal commented Mar 7, 2016

There are no benchmark examples, sorry. I can have a look if it's confusing?

It should just take the most frequent N words from the vocabulary and run the similarity queries for them.
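Schematically, the query being timed looks like the sketch below: one dot product per vocabulary row, then a partial sort for the top K. This is not the sense2vec code; the matrix, dimensions, and helper name are made up for illustration, and it assumes the Eigen path we're discussing.

// Illustration only (not sense2vec code): brute-force "top K most similar"
// query of the kind the benchmark would time, using Eigen with made-up data.
#include <Eigen/Dense>
#include <algorithm>
#include <iostream>
#include <utility>
#include <vector>

// Indices of the k rows of `vectors` most similar to row `query`. Rows are
// assumed to be L2-normalised, so a dot product equals cosine similarity.
std::vector<int> most_similar(const Eigen::MatrixXf& vectors, int query, int k) {
    Eigen::VectorXf scores = vectors * vectors.row(query).transpose();
    std::vector<std::pair<float, int> > ranked(vectors.rows());
    for (int i = 0; i < vectors.rows(); ++i)
        ranked[i] = std::make_pair(scores(i), i);
    std::partial_sort(ranked.begin(), ranked.begin() + k, ranked.end(),
                      [](const std::pair<float, int>& a, const std::pair<float, int>& b) {
                          return a.first > b.first;
                      });
    std::vector<int> out(k);
    for (int i = 0; i < k; ++i)
        out[i] = ranked[i].second;
    return out;
}

int main() {
    const int n_words = 10000, dim = 128, top_n = 50, k = 50;
    Eigen::MatrixXf vectors = Eigen::MatrixXf::Random(n_words, dim);
    vectors.rowwise().normalize();  // pre-normalise so dot product == cosine

    // Benchmark shape: for each of the top_n most frequent words (here just
    // the first top_n rows), fetch the k most similar entries.
    for (int w = 0; w < top_n; ++w)
        most_similar(vectors, w, k);
    std::cout << "done" << std::endl;
    return 0;
}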

@init-random
Contributor Author

Not confusing, just wanted to follow a standard if there were existing benchmarks. Happy to take a look.

@init-random
Contributor Author

What corpus are you using for merge_text.py? Is that what you want the benchmark based on?

@honnibal
Member

honnibal commented Mar 8, 2016

We actually don't need a corpus here, just the trained model. You've downloaded that, right? We want to benchmark the similarity queries.

The corpus is the Reddit comment corpus.

@init-random
Contributor Author

Got it... No, I had not downloaded the model yet; I had built my own. I'll do that, thanks.

@init-random
Contributor Author

I added a simple eigen benchmark.

Basically it grabs the top 50, 100, 500, and 1000 most frequent model terms and, for each word in the top N, finds the 50 most similar tokens. If you were looking for something different, let me know.

> python -m sense2vec.eigen_benchmark
Similarity timing for top 50 model terms.
Finding top 50 most similar terms.
Completed in 9.61673880298622 seconds.
----------------------------------------------------------

Similarity timing for top 100 model terms.
Finding top 50 most similar terms.
Completed in 10.223447176860645 seconds.
----------------------------------------------------------

Similarity timing for top 500 model terms.
Finding top 50 most similar terms.
Completed in 85.45848718006164 seconds.
----------------------------------------------------------

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms.
Completed in 121.82304402487352 seconds.
----------------------------------------------------------

@henningpeters
Contributor

Thanks, I just ran a comparison:

math.h:

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms for each model term.
Completed in 66.8856822180096 seconds.

eigen:

Similarity timing for top 1000 model terms.
Finding top 50 most similar terms for each model term.
Completed in 153.325510979 seconds.

I would expect Eigen to perform as well or better. Any idea what could be wrong?

@init-random
Contributor Author

I'll look into this to see where the differences may be. Thanks.

@init-random
Contributor Author

I wrote a small C++ program to time just the dot product. The math.h implementation seems about 13x more efficient. Each iteration is 500000 dot product calculations on a 128-element float vector. Maybe it is the creation of the Eigen vector, Map<VectorXf> v((float*)arr, N), that takes the time? I will look into that next.

eigen dot product
iter: 0
seconds: 2.68831
iter: 1
seconds: 2.68928
iter: 2
seconds: 2.64486
iter: 3
seconds: 2.6382
iter: 4
seconds: 2.64097

math dot product
iter: 0
seconds: 0.209042
iter: 1
seconds: 0.206591
iter: 2
seconds: 0.208673
iter: 3
seconds: 0.210579
iter: 4
seconds: 0.206569

@init-random
Contributor Author

It looks like, for the Eigen implementation, the dot product accounts for about 40% of the compute time and the creation of the vectors for the other 60% (about 30% per vector). So, even with no overhead for creating the vectors, the Eigen dot product would still take about 1 second (2.6 * 40%), which is still roughly 5x the math.h implementation.
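To isolate that, here is a minimal standalone sketch of the comparison I have in mind (not the sense2vec wrapper; buffer contents, sizes, and iteration count are arbitrary), contrasting constructing the Map on every call with mapping the buffers once:

// Standalone sketch: per-call Map construction vs. mapping the buffers once.
#include <Eigen/Dense>
#include <chrono>
#include <iostream>

int main() {
    const int N = 128;
    const int ITER = 500000;
    float a[N], b[N];
    for (int i = 0; i < N; ++i) { a[i] = 0.01f * i; b[i] = 0.02f * i; }

    using Clock = std::chrono::steady_clock;
    float sink = 0.f;  // keep results alive so the loops are not optimised away

    // Variant 1: build the Map views inside the loop (roughly what a per-call
    // wrapper around a raw float* does).
    auto t0 = Clock::now();
    for (int i = 0; i < ITER; ++i) {
        Eigen::Map<const Eigen::VectorXf> va(a, N), vb(b, N);
        sink += va.dot(vb);
    }
    auto t1 = Clock::now();

    // Variant 2: map the buffers once and reuse the views.
    Eigen::Map<const Eigen::VectorXf> va(a, N), vb(b, N);
    auto t2 = Clock::now();
    for (int i = 0; i < ITER; ++i)
        sink += va.dot(vb);
    auto t3 = Clock::now();

    std::cout << "map per call: " << std::chrono::duration<double>(t1 - t0).count() << " s" << std::endl;
    std::cout << "map once:     " << std::chrono::duration<double>(t3 - t2).count() << " s" << std::endl;
    std::cout << "(checksum: " << sink << ")" << std::endl;
    return 0;
}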

@init-random
Contributor Author

BLAS timings below.

Have you looked at FLENS? http://apfel.mathematik.uni-ulm.de/~lehn/FLENS/index.html
This is a header-only BLAS implementation, which looks like it might be efficient: http://arxiv.org/pdf/1103.3020.pdf

Do you think it is worthwhile to look into this?

blas dot product
iter: 0
seconds: 0.00746393
iter: 1
seconds: 0.00756788
iter: 2
seconds: 0.00784111
iter: 3
seconds: 0.00788403
iter: 4
seconds: 0.00781107

@henningpeters
Contributor

Looking at the code, I suspect Eigen copies the vector every time cblas_sdot or cblas_snrm2 is called. To compare performance fairly, you have to initialize them earlier, specifically here: https://github.com/init-random/sense2vec/blob/eigen-integration/sense2vec/vectors.pyx#L0140.

Alternatively, could you run the timings on dot/norm of Eigen vs. math.h outside sense2vec? I can then help you with integrating it more deeply into sense2vec.

@henningpeters
Contributor

No, we didn't look into FLENS yet. Any reliable performance numbers would be great for making a good decision here.

@honnibal
Member

FLENS looks like it's asking for a system BLAS for optimisation. We're anxious to avoid that.

Have you seen the implementations here? https://bitbucket.org/eigen/eigen/src/c3c494ec0a006d25dd6e6d65864b0fb51fe4da56/blas/?at=default

This issue is starting to block the release of the neural network code in thinc. I'm actually very keen to get that code out there. It's very different from the Theano-based solutions. I think it's quite nice for training small neural networks on CPU.

@init-random
Contributor Author

The timings provided are standalone, outside sense2vec. In essence the code is:

const float vec[] = { 1.2, 3.9, ... };
// iterate and measure the time for each
call_eigen_dot(128, vec, 1, vec, 1);
call_math_dot(128, vec, 1, vec, 1);
call_blas_dot(128, vec, 1, vec, 1);

I can look into FLENS timings. Also, I think it has its own BLAS implementation. From http://apfel.mathematik.uni-ulm.de/~lehn/FLENS/index.html

-- FLENS gives you generic implementation of BLAS
-- If high performance BLAS libraries like ATLAS or GotoBLAS are available...

I can install it and see.

Numpy is a dependency. What if there is a BLAS used there (np.show_config())? Would it be OK to link against that BLAS and otherwise use Eigen or FLENS? I suppose that is not the case for thinc, however.

@init-random
Contributor Author

FLENS did not need to be linked with a BLAS implementation. Timings below, which are much better than Eigen's. This library also has the option to be linked against a BLAS (-DWITH_OPENBLAS and other BLAS implementations are supported), and those numbers look very good.

The compiler needs the -std=c++11 flag; not sure if this is an issue.

flens dot product
iter: 0
seconds: 0.366519
iter: 1
seconds: 0.362471
iter: 2
seconds: 0.365611
iter: 3
seconds: 0.363058
iter: 4
seconds: 0.36223

And when linked with openblas.

flens dot product, linked with openblas
iter: 0
seconds: 0.0203369
iter: 1
seconds: 0.0214162
iter: 2
seconds: 0.0209489
iter: 3
seconds: 0.0212181
iter: 4
seconds: 0.0132542

@henningpeters
Contributor

henningpeters commented Apr 18, 2016

Please take a look at the simd branch (https://github.com/spacy-io/sense2vec/tree/simd). It relies on the libsimdpp library (https://github.com/p12tic/libsimdpp). Implementing fast vector operations on top of it was pretty simple, and performance should be comparable to sense2vec's BLAS backend.

@init-random
Contributor Author

Great. Interested in taking a look. I'll create a new branch for this implementation.

@init-random
Contributor Author

Just to follow up on this. Here are the timings for SIMD.

simd dot product
iter: 0
seconds: 5.49406
iter: 1
seconds: 5.29704
iter: 2
seconds: 5.09631
iter: 3    
seconds: 4.96923
iter: 4
seconds: 4.9608

Here is the code I used for this comparison.

#include <iostream>
#include <sys/time.h>
#include <simdpp/simd.h>

using namespace simdpp;
using namespace std;

static float dot(float const *__pyx_v_v1, float const *__pyx_v_v2,  int __pyx_v_n) {
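  // Accumulates element-wise products SIMDPP_FAST_FLOAT32_SIZE floats at a time
  // and horizontally sums the lanes with reduce_add at the end; variable names
  // are kept from the Cython-generated C. Note that simdpp::load assumes aligned
  // input, so load_u may be needed if the buffers are not SIMD-aligned.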
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_x;
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_z;
  int __pyx_v_i;
  float32<SIMDPP_FAST_FLOAT32_SIZE>  __pyx_v_y;
  float __pyx_r;
  int __pyx_t_1;
  int __pyx_lineno = 0;
  const char *__pyx_filename = NULL;
  int __pyx_clineno = 0;
  __pyx_v_z = make_float(0.0, 0.0, 0.0, 0.0);   
  __pyx_v_i = 0;

  while (1) {
    __pyx_t_1 = ((__pyx_v_i < __pyx_v_n) != 0);
    if (!__pyx_t_1) break;
    __pyx_v_x = load((&(__pyx_v_v1[__pyx_v_i])));
    __pyx_v_y = load((&(__pyx_v_v2[__pyx_v_i])));
    __pyx_v_z = add(mul(__pyx_v_x, __pyx_v_y), __pyx_v_z);
    __pyx_v_i = (__pyx_v_i + SIMDPP_FAST_FLOAT32_SIZE);
  }

  __pyx_r = reduce_add(__pyx_v_z);
  goto __pyx_L0;

  __pyx_r = 0;
  __pyx_L0:;
  return __pyx_r;
}

int main()
{
    const float vec[] = {  1.01527765e-01,   2.79432237e-01,  -3.73641342e-01,
          2.51039743e-01,  -4.02947888e-02,  -1.81458637e-01,
          3.38296473e-01,  -1.52945951e-01,  -4.86327589e-01,
         -1.40676364e-01,  -2.61033531e-02,  -3.77920508e-01,
         -7.91965276e-02,  -5.02799675e-02,   2.30213434e-01,
         -1.41876683e-01,  -1.01002000e-01,   1.13062605e-01,
          3.13347168e-02,  -5.69926053e-02,  -1.59407547e-03,
          6.95948908e-03,  -6.85513392e-02,   2.09068850e-01,
         -1.97719410e-01,   7.33529553e-02,   1.18763991e-01,
          2.59771552e-02,   1.22092061e-01,   2.99735460e-02,
          2.31065676e-01,   1.11942671e-01,   1.76496938e-01,
          2.40554616e-01,   1.23780323e-02,  -1.49769649e-01,
         -6.78487942e-02,  -1.68823540e-01,   1.25860432e-02,
         -1.42609745e-01,  -2.79608935e-01,   4.05597091e-01,
         -1.30016565e-01,  -1.61046922e-01,  -1.19268715e-01,
         -2.28311032e-01,   9.18834880e-02,  -2.17846423e-01,
         -3.90507914e-02,  -8.53851587e-02,  -4.70737405e-02,
         -2.18207464e-01,   2.75754511e-01,   9.13140923e-02,
         -2.79673934e-01,  -5.34516461e-02,   4.93734449e-01,
          1.04101725e-01,  -2.91903373e-02,  -5.06813973e-02,
         -1.94194764e-01,   2.23975092e-01,   8.44103694e-02,
         -9.23509710e-03,  -2.19202667e-01,  -2.17131674e-01,
          6.37218237e-01,   1.22713091e-05,  -2.01738223e-01,
         -9.08939913e-02,   1.71261504e-01,  -1.41390860e-01,
          1.66930482e-01,   1.15627356e-01,   1.80309758e-01,
         -6.89798743e-02,  -9.28448811e-02,  -1.64074153e-01,
          1.57544822e-01,   6.61634728e-02,  -5.09247221e-02,
         -1.29333317e-01,  -2.13789865e-01,   9.47066844e-02,
         -2.44404078e-01,   4.49547440e-01,  -6.95184572e-03,
          3.74490559e-01,   1.68442249e-01,  -6.65905029e-02,
          3.05999577e-01,  -2.00129256e-01,   1.12281077e-01,
         -4.92399186e-02,   2.75324523e-01,   8.90553892e-02,
         -1.88322157e-01,   8.06960985e-02,   1.14517249e-01,
          2.04371959e-02,   2.05912948e-01,  -1.99612379e-01,
          2.54161924e-01,  -1.54119819e-01,   4.35664386e-01,
          3.43942016e-01,   2.20982641e-01,  -1.42827898e-03,
          3.42738152e-01,  -1.75066385e-02,  -1.07859090e-01,
          1.51815578e-01,  -2.30473414e-01,   1.95806384e-01,
         -2.15485111e-01,   4.48448863e-03,   2.25995764e-01,
          9.53625664e-02,   3.98220867e-02,   1.11700244e-01,
          1.67724952e-01,   1.43995574e-02,  -2.46458724e-01,
          1.89194545e-01,  -1.57280952e-01,  -2.26236075e-01,
          1.81051582e-01,   3.59262735e-01};

    struct timeval tim;

    cout << "simd dot product" << endl;
    for(int e=0; e<5; e++) {
        gettimeofday(&tim, NULL);
        double t1=tim.tv_sec+(tim.tv_usec/1000000.0);
        for(int i=0; i<500000; i++) {
           dot(&vec[0], &vec[0], 128);
        }
        gettimeofday(&tim, NULL);
        double t2=tim.tv_sec+(tim.tv_usec/1000000.0);
        cout << "iter: " << e << endl;
        cout << "seconds: " << t2-t1 << endl;
    }
    cout << endl;    
    return 0;
}

The dot function comes from the cosine_similarity function (minus the normalization) in the compiled vectors.pyx. So, in summary, here are the average timings, in seconds for 500000 dot products, for the different libraries.

openblas: 0.0077
flens linked with blas: 0.0194
math.h: 0.2082
flens: 0.3640
eigen: 2.6603
simd: 5.1635

@henningpeters
Contributor

henningpeters commented Apr 25, 2016

Are the numbers milliseconds per iteration? The results from your benchmark don't align with mine. I assume you overlooked something. Two ideas:

  • Is openblas compiled with OpenMP? To compare raw kernel performance you should turn it off.
  • What are your compile flags for the simd solution? You need to enable SIMD instructions via libsimdpp options and compiler flags for a particular instruction set (see the example below).

You should be very suspicious as long as the SIMD approach is slower than a naive math.h implementation.
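For example, something along these lines for the standalone benchmark (the file name is illustrative, and the instruction set should match the CPU; AVX shown here):

g++ -O3 -std=c++11 -mavx -DSIMDPP_ARCH_X86_AVX simd_bench.cpp -o simd_bench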

@init-random
Contributor Author

Timings are seconds for the complete 500000 iterations, but you are correct on both counts. I'll re-run the simd times this evening with the proper compile flags. Thanks.
