
dot_product #24

Open
gha3mi opened this issue Jan 23, 2024 · 17 comments

Comments

@gha3mi
Owner

gha3mi commented Jan 23, 2024

          This might be more representative:
a = 0._rk
do nl = 1, bench%nloops
   a = a + dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, a

The cumulative variable plus the print forces the compiler to actually compute the value instead of optimizing the loop away. Removed m2 as it was stagnating:
ifort: [plot: dot_ifort_speedup_avg]

ifx: [plot: dot_ifx_speedup_avg]

gfortran: [plot: dot_gfortran_speedup_avg]

Originally posted by @jalvesz in jalvesz/fast_math#8 (comment)

@gha3mi gha3mi changed the title This might be more representative: dot_product Jan 23, 2024
@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

@jalvesz, thanks for your work. I found another solution for this issue: using the volatile attribute for a, which prevents a from being optimized away. I am currently testing this approach.

real(rk), volatile  :: a

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

Brilliant! I made a commit on my fork of your project with the proposed idea. Testing on a different computer, I saw your BLAS interface in m2 actually being the most performant one, so this seems to vary quite a bit from machine to machine.

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

And I think we should consider reducing the number of tests from do p = 5000_ik, 100000_ik, 5000_ik to avoid having too many bar charts on speedup plots or other plots in the future. Perhaps 3-5 tests could be sufficient.
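For illustration, a reduced sweep could look like the sketch below; the particular five sizes in the array are a hypothetical choice, not a decision from this thread, and run_benchmark is a placeholder name:

```fortran
program size_sweep
   implicit none
   integer, parameter :: ik = selected_int_kind(18)
   ! hypothetical reduced set of 5 test sizes, replacing do p = 5000_ik, 100000_ik, 5000_ik
   integer(ik), parameter :: sizes(5) = [5000_ik, 25000_ik, 50000_ik, 75000_ik, 100000_ik]
   integer :: i
   do i = 1, size(sizes)
      print '(a,i0)', 'benchmarking p = ', sizes(i)
      ! call run_benchmark(sizes(i))  ! placeholder for the actual benchmark call
   end do
end program size_sweep
```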

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

5 sounds reasonable

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

Blas interface in m2 actually being the most performant one.

Yes, I did many tests for matmul. I think it is impossible to match the performance of the BLAS libraries! ;)

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

I changed the number of elements to [500, 1000, 10000, 100000, 1000000], used the volatile attribute for a, and added the average speedup to the readme. Please check out the results. I am not entirely sure about volatile. Perhaps it prevents all optimizations, because the results for BLAS seem somehow strange to me now! Or perhaps performing benchmarks on the GitHub server is not a good idea. Or maybe the benchmarks should run sequentially for each compiler on GitHub.

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

Perhaps it prevents all optimizations because the results for BLAS seem somehow strange to me now! Or perhaps performing benchmarks on the GitHub server is not a good idea

Your new results look like the results I was obtaining with one of the machines I tested, where the BLAS implementation drops in performance at a certain array size. On another machine I got different results, with BLAS being consistently optimal...

I'll recheck later with your new version.

gfortran: [plot: dot_gfortran_perf]

ifx: [plot: dot_ifx_perf]

@gha3mi
Owner Author

gha3mi commented Jan 24, 2024

I couldn't find the system specifications for the GitHub server. I have now tested this approach on both my system and GitHub (I have to push new tests to GitHub to check the results consistently). On my system, I achieved the desired results, but on GitHub, not so much. I believe relying on benchmarking on the GitHub server without a self-hosted runner is not a good idea. I will change the CI to push results from my system. What's your opinion?

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
   call prevent_optimization(a, nl) ! loop-invariant
end do
call bench%stop_benchmark(cmp_gflops)

subroutine prevent_optimization(a, nl)
  real(rk), intent(in) :: a
  integer, intent(in)  :: nl
  if (a == 0.0_rk) print*, nl, 'a = 0.0'
end subroutine prevent_optimization

@jalvesz
Contributor

jalvesz commented Jan 24, 2024

The function to avoid optimization works quite well.

I think that the benchmark results heavily depend on the system's load. I keep obtaining different trends, especially for the BLAS case, between two different computers: one basically running only the benchmark, the other with several applications open. Most probably the GitHub server is also under these circumstances, thus it cannot get optimal performance.

Maybe the bench should be run under two scenarios: (a) a dedicated machine on which a socket can be allocated exclusively for the benchmark; (b) a workstation with more than one load. Those should cover the HPC and "daily" work limit cases. The reference data files should then be tagged with at least the characteristics of the CPU + RAM + OS.

It is always difficult to draw the line of "fairness"/"representativeness" for these kinds of benchmarks.

@gha3mi
Owner Author

gha3mi commented Jan 24, 2024

I think that the benchmark results heavily depend on the system's load.
It is always difficult to draw the line of "fairness"/"representativeness" for these kinds of benchmarks.

Yes, exactly.

Maybe the bench should be run under two scenarios: (a) a dedicated machine on which a socket can be allocated exclusively for the benchmark; (b) a workstation with more than one load. Those should cover the HPC and "daily" work limit cases. The reference data files should then be tagged with at least the characteristics of the CPU + RAM + OS.

Yes, that's a good idea. I think that after gathering and improving the infrastructure, it will be possible to run all benchmarks on a cluster using slurm and sbatch, as you mentioned.

I am also considering, for the next version, running the entire benchmark code multiple times and obtaining average values.
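As a sketch of that repeat-and-average idea (my own illustration using the cpu_time intrinsic, not the project's benchmark module; nrep and n are arbitrary assumed values):

```fortran
program repeat_average
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: nrep = 5          ! assumed number of full repetitions
   integer, parameter :: n = 100000        ! assumed vector size
   real(rk), allocatable :: u(:), v(:)
   real(rk) :: t0, t1, t_sum, a
   integer :: irep

   allocate(u(n), v(n))
   call random_number(u)
   call random_number(v)

   t_sum = 0._rk
   do irep = 1, nrep
      call cpu_time(t0)
      a = dot_product(u, v)               ! kernel under test
      call cpu_time(t1)
      t_sum = t_sum + (t1 - t0)
   end do

   print '(a,es12.4,a)', 'average time: ', t_sum / nrep, ' s'
   print *, a                             ! keep the result live so the loop is not elided
end program repeat_average
```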

I have updated the Python scripts and pushed the latest changes. I believe the 'dot_product' benchmark is working well for now. I will work on the 'matmul' results to generate the same plots.

Thanks again for contributing and testing!

@jalvesz
Contributor

jalvesz commented Jan 27, 2024

Here is just a wild thought to consider: what about benchmarking matmul and dot_product together within a traditional Conjugate Gradient with a fixed number of iterations? These two kernels are usually used together, and it also means the program has to switch kernels quite fast, which might limit optimizations and thus provide a more "realistic" view of the performance of each algorithm in the context of working in tandem on a more complex problem.

@gha3mi
Owner Author

gha3mi commented Jan 27, 2024

Yes, of course! I think this is a good idea. We could create a simple example. I will check if I have any implementation in ForSolver, or if you have some implementation, we could start with that. But what should be our reference? Should we consider other libraries?

@jalvesz
Contributor

jalvesz commented Jan 29, 2024

I've been thinking about this idea. Most of the problems I have at hand are on sparse systems, whereas here the idea is to benchmark matmul on square matrices and the dot_product. So I was also thinking that, since the important thing for the benchmark is performance, we could generate a synthetic problem that enables evaluating the performance of the different algorithms.

Here is one idea: given a matrix size, create a symmetric positive definite matrix by first filling the upper triangular part (U) with random numbers and then
A = c*I + U^T * U, where c is some positive constant such that the spectral radius of the matrix is small.
Then perform a PCCG with Jacobi preconditioning on this system to solve A * X = B (B should also be set so that the problem can converge) with a fixed number of iterations.

All implementations should converge to similar results within a given tolerance. The result itself would be irrelevant to a certain degree; the important point would be to verify consistency and speed-ups.

What do you think?
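A minimal sketch of how this proposal could look (my own illustration, not a tested implementation from either repository; n, niter, and c are arbitrary assumed values):

```fortran
program pcg_synthetic
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: n = 200, niter = 50   ! sizes chosen only for illustration
   real(rk) :: A(n,n), U(n,n), M_inv(n)
   real(rk) :: x(n), b(n), r(n), z(n), p(n), Ap(n)
   real(rk) :: alpha, beta, rz_old, rz_new, c
   integer :: i, j, k

   ! random upper triangular U
   call random_number(U)
   do j = 1, n
      do i = j + 1, n
         U(i,j) = 0._rk
      end do
   end do

   ! A = c*I + U^T * U is symmetric positive definite
   c = 1._rk
   A = matmul(transpose(U), U)
   do i = 1, n
      A(i,i) = A(i,i) + c
   end do

   ! right-hand side and initial guess
   b = 1._rk
   x = 0._rk
   r = b                           ! r = b - A*x with x = 0
   M_inv = [(1._rk / A(i,i), i = 1, n)]  ! Jacobi preconditioner
   z = M_inv * r
   p = z
   rz_old = dot_product(r, z)

   ! fixed number of preconditioned CG iterations, as proposed
   do k = 1, niter
      Ap = matmul(A, p)
      alpha = rz_old / dot_product(p, Ap)
      x = x + alpha * p
      r = r - alpha * Ap
      z = M_inv * r
      rz_new = dot_product(r, z)
      beta = rz_new / rz_old
      p = z + beta * p
      rz_old = rz_new
   end do

   print '(a,es12.4)', 'final residual norm: ', norm2(r)
end program pcg_synthetic
```

With the iteration count fixed, the run time is dominated by the matmul (matrix-vector) and dot_product calls, which is exactly what the benchmark wants to time.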

@gha3mi
Owner Author

gha3mi commented Jan 29, 2024

This is generally a good idea. I have two initial questions:

  • Where is the need for matrix-matrix operations in the algorithm, aside from initializing the symmetric positive definite matrix? It seems that only matrix-vector operations are required, or am I mistaken? If matrix-matrix multiplication is only necessary at the beginning of the algorithm, we already know which matmul implementation is better.
  • What are your expectations? Do you think that using another implementation of dot_product in the algorithm might perform better than BLAS?

@gha3mi
Owner Author

gha3mi commented Jan 29, 2024

If you want to benchmark matmul and dot_product together, wouldn't it be easier to perform matrix-matrix multiplication first, followed by matrix-vector multiplication, and finally the dot product?

a = A_{ij} \cdot B_{jk} \cdot u_k \cdot v_i
a = dot_product(matmul(matmul(A,B),u),v)

@jalvesz
Contributor

jalvesz commented Jan 30, 2024

Yes, you are right! I was overcomplicating things. I would split the operations, though: the idea is to time each operation as they play along.

What are your expectations? Do you think that using another implementation of dot_product in the algorithm might perform better than BLAS?

Not necessarily, but to get an idea of how a naive implementation, the simple intrinsic, or an implementation with some tweaks behaves in comparison against BLAS as the reference.

@gha3mi
Owner Author

gha3mi commented Jan 30, 2024

That doesn't mean I'm against it, though. If you have any implementation of PCCG (I don't have one), we could now easily test it.

Let's check out some new results for dot_product. This works only with ifx. I'm trying to perform the same tests with other compilers and other optimized libraries like MKL and OpenBLAS. (The results are not pushed yet!)
[plot: dot_ifx_speedup]
[plot: dot_ifx_speedup_avg]
