
dot_product #24

Open
gha3mi opened this issue Jan 23, 2024 · 17 comments

Comments

@gha3mi
Owner

gha3mi commented Jan 23, 2024

          This might be more representative:
a = 0._rk
do nl = 1, bench%nloops
   a = a + dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)
print *, a

The cumulative variable plus the print forces the compiler to actually compute the value instead of optimizing the loop away. Removed m2 as it was stagnating:
ifort: [plot: dot_ifort_speedup_avg]

ifx: [plot: dot_ifx_speedup_avg]

gfortran: [plot: dot_gfortran_speedup_avg]

Originally posted by @jalvesz in jalvesz/fast_math#8 (comment)

@gha3mi gha3mi changed the title This might be more representative: dot_product Jan 23, 2024
@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

@jalvesz, thanks for your work. I found another solution for this issue: using the volatile attribute for a, which prevents a from being optimized away. I am currently testing this approach.

real(rk), volatile  :: a

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
end do
call bench%stop_benchmark(cmp_gflops)

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

Brilliant! I made a commit on my fork of your project with the proposed idea. Testing on a different computer, I saw your BLAS interface in m2 actually being the most performant one, so this seems to vary quite a bit from machine to machine.

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

And I think we should consider reducing the number of tests from do p = 5000_ik, 100000_ik, 5000_ik to avoid having too many bar charts on speedup plots or other plots in the future. Perhaps 3-5 tests could be sufficient.
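For illustration, a reduced sweep could look like the sketch below; the particular five sizes in the array are a hypothetical choice, not a decision from this thread, and run_benchmark is a placeholder name:

```fortran
program size_sweep
   implicit none
   integer, parameter :: ik = selected_int_kind(18)
   ! hypothetical reduced set of 5 test sizes, replacing do p = 5000_ik, 100000_ik, 5000_ik
   integer(ik), parameter :: sizes(5) = [5000_ik, 25000_ik, 50000_ik, 75000_ik, 100000_ik]
   integer :: i
   do i = 1, size(sizes)
      print '(a,i0)', 'benchmarking p = ', sizes(i)
      ! call run_benchmark(sizes(i))  ! placeholder for the actual benchmark call
   end do
end program size_sweep
```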

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

5 sounds reasonable

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

Blas interface in m2 actually being the most performant one.

Yes, I did many tests for matmul. I think it is impossible to match the performance of the BLAS libraries! ;)

@gha3mi
Owner Author

gha3mi commented Jan 23, 2024

I changed the number of elements to [500, 1000, 10000, 100000, 1000000], used the volatile attribute for a, and added the average speedup to the readme. Please check out the results. I am not entirely sure about volatile. Perhaps it prevents all optimizations, because the results for BLAS seem somehow strange to me now! Or perhaps performing benchmarks on the GitHub server is not a good idea. Or maybe the benchmarks should run sequentially for each compiler on GitHub.

@jalvesz
Contributor

jalvesz commented Jan 23, 2024

Perhaps it prevents all optimizations because the results for BLAS seem somehow strange to me now! Or perhaps performing benchmarks on the GitHub server is not a good idea

Your new results look like the results I was obtaining with one of the machines I tested, where the BLAS implementation drops in performance at a certain array size. On another machine I got different results, with BLAS being consistently optimal...

I'll recheck later with your new version.

gfortran: [plot: dot_gfortran_perf]

ifx: [plot: dot_ifx_perf]

@gha3mi
Owner Author

gha3mi commented Jan 24, 2024

I couldn't find the system specifications for the GitHub server. I have now tested this approach on both my system and GitHub (I have to push new tests to GitHub to check the results consistently). On my system, I achieved the desired results, but on GitHub, not so much. I believe relying on benchmarking on the GitHub server without a self-hosted runner is not a good idea. I will change the CI to push results from my system. What's your opinion?

call bench%start_benchmark(1,'dot_product','a = dot_product(u,v)',[p])
do nl = 1, bench%nloops
   a = dot_product(u,v)
   call prevent_optimization(a, nl) ! loop-invariant
end do
call bench%stop_benchmark(cmp_gflops)

subroutine prevent_optimization(a, nl)
  real(rk), intent(in) :: a
  integer, intent(in)  :: nl
  if (a == 0.0_rk) print*, nl, 'a = 0.0'
end subroutine prevent_optimization

@jalvesz
Contributor

jalvesz commented Jan 24, 2024

The function to avoid optimization works quite well.

I think that the benchmark results heavily depend on the system's load. I keep obtaining different trends, especially for the BLAS case, between two different computers: one basically running only the benchmark, the other with several applications open. Most probably the GitHub server is also under these circumstances, thus it cannot get optimal performance.

Maybe the bench should be run under two scenarios: (a) a dedicated machine on which a socket can be allocated exclusively for the benchmark; (b) a workstation with more than one load. Those should cover the HPC and "daily" work limit cases. The reference data files should then be tagged with at least the characteristics of the CPU + RAM + OS.

It is always difficult to draw the line of "fairness"/"representativeness" for these kinds of benchmarks.

@gha3mi
Owner Author

gha3mi commented Jan 24, 2024

I think that the benchmark results heavily depend on the system's load.
It is always difficult to draw the line of "fairness"/"representativeness" for these kinds of benchmarks.

Yes, exactly.

Maybe the bench should be run under two scenarios: (a) a dedicated machine on which a socket can be allocated exclusively for the benchmark; (b) a workstation with more than one load. Those should cover the HPC and "daily" work limit cases. The reference data files should then be tagged with at least the characteristics of the CPU + RAM + OS.

Yes, that's a good idea. I think that after gathering and improving the infrastructure, it will be possible to run all benchmarks on a cluster using slurm and sbatch, as you mentioned.

I am also considering, for the next version, running the entire benchmark code multiple times and obtaining average values.
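As a sketch of that repeat-and-average idea (my own illustration using the cpu_time intrinsic, not the project's benchmark module; nrep and n are arbitrary assumed values):

```fortran
program repeat_average
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: nrep = 5          ! assumed number of full repetitions
   integer, parameter :: n = 100000        ! assumed vector size
   real(rk), allocatable :: u(:), v(:)
   real(rk) :: t0, t1, t_sum, a
   integer :: irep

   allocate(u(n), v(n))
   call random_number(u)
   call random_number(v)

   t_sum = 0._rk
   do irep = 1, nrep
      call cpu_time(t0)
      a = dot_product(u, v)               ! kernel under test
      call cpu_time(t1)
      t_sum = t_sum + (t1 - t0)
   end do

   print '(a,es12.4,a)', 'average time: ', t_sum / nrep, ' s'
   print *, a                             ! keep the result live so the loop is not elided
end program repeat_average
```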

I have updated the Python scripts and pushed the latest changes. I believe the 'dot_product' benchmark is working well for now. I will work on the 'matmul' results to generate the same plots.

Thanks again for contributing and testing!

@jalvesz
Contributor

jalvesz commented Jan 27, 2024

Here is just a wild thought to consider: what about benchmarking matmul and dot_product together within a traditional Conjugate Gradient with a fixed number of iterations? These two kernels are usually used together, and it also means the program has to switch kernels quite fast, which might limit optimizations and thus provide a more "realistic" view of the performance of each algorithm in the context of working in tandem on a more complex problem.

@gha3mi
Owner Author

gha3mi commented Jan 27, 2024

Yes, of course! I think this is a good idea. We could create a simple example. I will check if I have any implementation in ForSolver, or if you have some implementation, we could start with that. But what should be our reference? Should we consider other libraries?

@jalvesz
Contributor

jalvesz commented Jan 29, 2024

I've been thinking about this idea. Most of the problems I have at hand are on sparse systems, whereas here the idea is to benchmark matmul on square matrices and the dot_product. So I was also thinking that, since the important thing for the benchmark is performance, we could generate a synthetic problem that enables evaluating the performance of the different algorithms.

Here is one idea: given a matrix size, create a symmetric positive definite matrix by first filling the upper triangular part (U) with random numbers and then
A = c*I + U^T * U, where c is some positive constant such that the spectral radius of the matrix is small.
Then perform a PCCG with Jacobi preconditioning on this system to solve A * X = B (B should also be set so that the problem can converge) with a fixed number of iterations.

All implementations should converge to similar results within a given tolerance. The result itself would be irrelevant to a certain degree; the important point would be to verify consistency and speed-ups.

What do you think?
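A minimal sketch of how this proposal could look (my own illustration, not a tested implementation from either repository; n, niter, and c are arbitrary assumed values):

```fortran
program pcg_synthetic
   implicit none
   integer, parameter :: rk = kind(1.0d0)
   integer, parameter :: n = 200, niter = 50   ! sizes chosen only for illustration
   real(rk) :: A(n,n), U(n,n), M_inv(n)
   real(rk) :: x(n), b(n), r(n), z(n), p(n), Ap(n)
   real(rk) :: alpha, beta, rz_old, rz_new, c
   integer :: i, j, k

   ! random upper triangular U
   call random_number(U)
   do j = 1, n
      do i = j + 1, n
         U(i,j) = 0._rk
      end do
   end do

   ! A = c*I + U^T * U is symmetric positive definite
   c = 1._rk
   A = matmul(transpose(U), U)
   do i = 1, n
      A(i,i) = A(i,i) + c
   end do

   ! right-hand side and initial guess
   b = 1._rk
   x = 0._rk
   r = b                           ! r = b - A*x with x = 0
   M_inv = [(1._rk / A(i,i), i = 1, n)]  ! Jacobi preconditioner
   z = M_inv * r
   p = z
   rz_old = dot_product(r, z)

   ! fixed number of preconditioned CG iterations, as proposed
   do k = 1, niter
      Ap = matmul(A, p)
      alpha = rz_old / dot_product(p, Ap)
      x = x + alpha * p
      r = r - alpha * Ap
      z = M_inv * r
      rz_new = dot_product(r, z)
      beta = rz_new / rz_old
      p = z + beta * p
      rz_old = rz_new
   end do

   print '(a,es12.4)', 'final residual norm: ', norm2(r)
end program pcg_synthetic
```

With the iteration count fixed, the run time is dominated by the matmul (matrix-vector) and dot_product calls, which is exactly what the benchmark wants to time.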

@gha3mi
Owner Author

gha3mi commented Jan 29, 2024

This is generally a good idea. I have two initial questions:

  • Where is the need for matrix-matrix operations in the algorithm, aside from initializing the symmetric positive definite matrix? It seems that only matrix-vector operations are required, or am I mistaken? If matrix-matrix multiplication is only necessary at the beginning of the algorithm, we already know which matmul implementation is better.
  • What are your expectations? Do you think that using another implementation of dot_product in the algorithm might perform better than BLAS?

@gha3mi
Owner Author

gha3mi commented Jan 29, 2024

If you want to benchmark matmul and dot_product together, wouldn't it be easier to perform matrix-matrix multiplication first, followed by matrix-vector multiplication, and finally the dot product?

a = A_{ij} \cdot B_{jk} \cdot u_k \cdot v_i
a = dot_product(matmul(matmul(A,B),u),v)

@jalvesz
Contributor

jalvesz commented Jan 30, 2024

Yes, you are right! I was overcomplicating things. I would split the operations, though: the idea is to time each operation as they play along.

What are your expectations? Do you think that using another implementation of dot_product in the algorithm might perform better than BLAS?

Not necessarily, but to get an idea of how a naive implementation, the simple intrinsic, or an implementation with some tweaks behaves in comparison against BLAS as the reference.

@gha3mi
Owner Author

gha3mi commented Jan 30, 2024

That doesn't mean I'm against it, though. If you have any implementation of PCCG (I don't have one), we could now easily test it.

Let's check out some new results for dot_product. This works only with ifx. I'm trying to perform the same tests with other compilers and other optimized libraries like MKL and OpenBLAS. (The results are not pushed yet!)
[plot: dot_ifx_speedup]
[plot: dot_ifx_speedup_avg]
