Performance #1

Open
faroit opened this Issue Apr 18, 2016 · 8 comments

faroit commented Apr 18, 2016

Compared to numpy.einsum, Einsum.jl seems to be quite slow. I wonder if there is room for improvement here:

numpy

import numpy as np
import timeit

# P = np.random.random((20, 15, 100, 30, 2))

A = np.random.random((20, 15, 100, 5))
H = np.random.random((30, 5))
C = np.random.random((2, 5))


def run():
    return np.einsum('abfj,tj,cj->abftc', A, H, C)

times = timeit.Timer(run).timeit(number=10)

print(times)  # total time for all 10 runs

Elapsed time: 1.44563603401s
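A note on the Python timing: `timeit.Timer(...).timeit(number=10)` returns the total wall-clock time for all ten calls, not a per-call average, so the figure above covers ten runs. A minimal sketch (`work` is a hypothetical stand-in for the einsum call being timed):

```python
# timeit.Timer(f).timeit(number=10) returns the TOTAL wall-clock time
# for all 10 calls, not a per-call average.
import timeit

def work():
    # hypothetical stand-in for the actual einsum call
    return sum(range(1000))

total = timeit.Timer(work).timeit(number=10)
per_run = total / 10  # divide by `number` to recover the per-run time
```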

Julia

using Einsum

P = zeros(20,15,100,30,2); 

A = randn(20,15,100,5);
H = randn(30,5);
C = randn(2,5);

tic()
for i = 1:10
  @einsum P[a,b,f,t,c] = A[a,b,f,j]*H[t,j]*C[c,j]
end
toc()

elapsed time: 85.405141333 seconds

Jutho commented Apr 18, 2016

Timing correctly (i.e. putting the code in a function and not counting the first run, which includes compilation; see the Julia manual) brings this down to approximately 2 s per run on my machine, i.e. about 20 s for the 10 runs. That is still slower than Python. (Is Python reporting the time per run or the total for all 10 runs?)

Inspecting the resulting function with @code_warntype shows that the accumulation variable s generated by the macro is type-unstable, i.e. its type is not inferred correctly. The reason is that it is initialized as 0 (an Int64), while the contributions added to it are of type Float64 in this example. Quickly changing the code to s=zero($lhs) brings this down to about 0.3 s for the 10 runs, i.e. about 0.03 s per run.

However, a more sensible initialization of s is required; it should depend on the type of the left-hand side (or on the type of the evaluated right-hand side).

ahwillia (Owner) commented Apr 18, 2016

Yes, sorry, this was on my to-do list. I just pushed some commits (0968d13) and it is much faster for me now.

https://github.com/ahwillia/Einsum.jl#benchmarks

If this works for you both then I'll close this issue.

faroit commented Apr 19, 2016

I tried your new implementation. While there is a significant speedup over the older version, performance still depends on the particular summation.

I therefore created some small benchmark scripts to compare the einsum methods. I've also included opt_einsum, which is an optimized einsum package for numpy.

benchmark

  • The results show the measured average time in seconds for one run.
  • n is a parameter that influences the tensor dimensions but differs between sums.
  • For the simple 2D parafac the differences are very small.
  • For the 5D parafac (like the one you have in https://github.com/ahwillia/Einsum.jl#benchmarks), Einsum.jl is faster than numpy but not as fast as opt_einsum.
  • On the common fate index sum, Einsum.jl is actually the slowest, so there is probably more room for improvement.

@Jutho feel free to comment on the measured timings or send me a PR in the benchmark repo

ahwillia (Owner) commented Apr 19, 2016

Wow, nice work! I didn't know about the opt_einsum package; its documentation is quite good. I will give it a careful read when I get the time. Would you mind opening a PR here to include your more rigorous benchmarks?

Any ideas on why we lose out on the common fate benchmark? Switching the order of the indices (H[t,j] -> H[j,t] and C[c,j] -> C[j,c]) should improve our speed (fewer cache misses), though I'd assume numpy would face the same problem.

Looks like beating opt_einsum will take a decent (but feasible!) amount of extra code.

Edit: Two relevant Stack Overflow questions: Q1, Q2
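The layout point can be sketched in NumPy terms (NumPy is row-major, so the favorable order is the mirror image of Julia's column-major case). H2 and C2 below are hypothetical re-laid-out copies; the sketch only shows that the contraction's value is layout-independent, while the access pattern, and hence speed, differs by machine:

```python
# Sketch of the memory-layout point.  NumPy is row-major (Julia is
# column-major), so the contracted index j is fastest-moving when it is
# the LAST axis of each operand.  H2/C2 are relaid copies moving j to
# the first axis; the result is identical, only the access pattern changes.
import numpy as np

A = np.random.random((20, 15, 100, 5))
H = np.random.random((30, 5))
C = np.random.random((2, 5))

out1 = np.einsum('abfj,tj,cj->abftc', A, H, C)

H2 = np.ascontiguousarray(H.T)  # shape (5, 30), j on the first axis
C2 = np.ascontiguousarray(C.T)  # shape (5, 2)
out2 = np.einsum('abfj,jt,jc->abftc', A, H2, C2)

print(np.allclose(out1, out2))  # → True: same contraction, different layout
```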

ahwillia (Owner) commented Jul 7, 2016

A brief update: adding @inbounds and @simd leads to a roughly 2x speed-up on the CPD benchmark. The common fate benchmark is still slower than I'd like; my guess is that we need to be clever about the order of the for loops.

Even better would be to follow opt_einsum and compute intermediate results. This is outside my bandwidth at the moment, but all input and PRs are welcome. I'll tag a new release soon, after more testing to make sure this doesn't break anything.
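The "intermediate results" idea can be sketched with the operands from the original benchmark: split the three-operand contraction into two pairwise einsum calls. The contraction order below is one possible choice, not necessarily the one opt_einsum would pick:

```python
# Sketch of pairwise contraction with an intermediate: split the
# three-operand einsum from the original benchmark into two steps.
# The order chosen here (combine H and C first) is one assumption about
# a good path, not opt_einsum's actual choice.
import numpy as np

A = np.random.random((20, 15, 100, 5))
H = np.random.random((30, 5))
C = np.random.random((2, 5))

# Direct three-operand contraction (what numpy does naively).
direct = np.einsum('abfj,tj,cj->abftc', A, H, C)

# Pairwise: combine the two small factors first, then contract with A.
T = np.einsum('tj,cj->tcj', H, C)             # shape (30, 2, 5)
pairwise = np.einsum('abfj,tcj->abftc', A, T)

print(np.allclose(direct, pairwise))  # → True: both sum over j identically
```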

mihirparadkar commented Apr 18, 2017

Is it possible for the package to use TensorOperations' @tensor when possible? The difference between @tensor and @einsum seems to be that @einsum allows indices on the RHS to also appear on the LHS (i.e. broadcasting). Since @tensor can be orders of magnitude faster than the loop approach for contractions (because of BLAS), is there a way to replace the inner loops with @tensor while keeping the broadcasting in the outer loops? I imagine that would greatly improve performance. For example, if

C = zeros(50,100)
B = randn(50,100)
A = randn(50,50,100)

Then the generated code could look like:

#@einsum C[i,k] = A[i,j,k] * B[j,k]
for k in 1:size(A,3)
  vC = view(C,:,k)
  vA = view(A,:,:,k)
  vB = view(B,:,k)
  @tensor vC[i] = vA[i,j] * vB[j]
end
ahwillia (Owner) commented Apr 18, 2017

This would be a great addition to this package! Feel free to open a PR. At the moment, I am working on other projects and do not have time to implement this. Similarly, you could think about swapping out operations for BLAS calls directly (e.g. identifying sub-problems that are matrix multiplies).

GiggleLiu commented Sep 13, 2018

Unreasonable allocation compared with @Jutho's TensorOperations.jl (even with pre-allocation):

a = randn(200,2,200)
b = randn(200,2,190)
c = randn(200,2,2,190)

############# Einsum #################
julia> @benchmark @einsum c[i,j,l,m] = a[i,j,k]*b[k,l,m]
BenchmarkTools.Trial: 
  memory estimate:  2.74 GiB
  allocs estimate:  152762661
  --------------
  minimum time:     5.796 s (6.30% GC)
  median time:      5.796 s (6.30% GC)
  mean time:        5.796 s (6.30% GC)
  maximum time:     5.796 s (6.30% GC)
  --------------
  samples:          1
  evals/sample:     1

############# TensorOperations.jl ##############
julia> @benchmark tensorcontract(a, (1,2,3), b, (3,4,5), (1,2,4,5))
BenchmarkTools.Trial: 
  memory estimate:  1.16 MiB
  allocs estimate:  30
  --------------
  minimum time:     1.244 ms (0.00% GC)
  median time:      1.287 ms (0.00% GC)
  mean time:        1.453 ms (3.73% GC)
  maximum time:     60.382 ms (97.04% GC)
  --------------
  samples:          3415
  evals/sample:     1