
[SYSTEMDS-3393] Implement SIMD usage for basic dense dense MM #1643

Closed

@kev-inn (Contributor) commented Jun 20, 2022

DoubleVector replacement for matrix multiply

JDK 17 adds the incubating Vector API classes for using SIMD instructions. This PR replaces the basic dense-dense matrix multiply with an equivalent DoubleVector implementation. Since it requires JDK 17, we should not merge this yet, but keep it in staging for future reference.
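For reference, below is a minimal sketch of a DoubleVector kernel in this style. It is not the exact code of this PR; the class and method names are illustrative, it assumes dense row-major arrays, and it must be compiled and run with `--add-modules=jdk.incubator.vector`:

```java
import jdk.incubator.vector.DoubleVector;
import jdk.incubator.vector.VectorSpecies;

public class DenseMMSketch {
  // The preferred species picks the widest SIMD width of the current CPU.
  static final VectorSpecies<Double> SPECIES = DoubleVector.SPECIES_PREFERRED;

  // C[n x m] += A[n x k] %*% B[k x m], all dense row-major arrays.
  public static void mmDense(double[] a, double[] b, double[] c, int n, int k, int m) {
    int upper = SPECIES.loopBound(m); // largest multiple of the vector length <= m
    for (int i = 0; i < n; i++) {
      int ci = i * m;
      for (int l = 0; l < k; l++) {
        double aval = a[i * k + l];
        DoubleVector av = DoubleVector.broadcast(SPECIES, aval);
        int bl = l * m;
        int j = 0;
        for (; j < upper; j += SPECIES.length()) {
          DoubleVector bv = DoubleVector.fromArray(SPECIES, b, bl + j);
          DoubleVector cv = DoubleVector.fromArray(SPECIES, c, ci + j);
          av.fma(bv, cv).intoArray(c, ci + j); // c[i,j..] += a[i,l] * b[l,j..]
        }
        for (; j < m; j++) // scalar tail for the remaining columns
          c[ci + j] += aval * b[bl + j];
      }
    }
  }
}
```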

As an experiment, we measure a simple matrix multiply:
`Z = X %*% Y`, with $X \in \mathbb{R}^{n\times k}$ and $Y \in \mathbb{R}^{k\times m}$, so $Z \in \mathbb{R}^{n\times m}$.

The experiment script performs 10 matrix multiplications and saves the times of the later iterations, discarding the first warm-up runs to give the JVM some time to optimize.

Vary rows $n$ ($m$ fixed at 1000)

Alpha Node

[Plots: plot_alpha_n_1000 ($k = 1000$), plot_alpha_n_10000 ($k = 10000$)]

Lima Node

[Plots: plot_lima_n_1000 ($k = 1000$), plot_lima_n_10000 ($k = 10000$)]

Vary cols $m$ ($n$ fixed at 1000)

Alpha Node

[Plots: plot_alpha_m_1000 ($k = 1000$), plot_alpha_m_10000 ($k = 10000$)]

Lima Node

[Plots: plot_lima_m_1000 ($k = 1000$), plot_lima_m_10000 ($k = 10000$)]

Conclusion

The implementation boosts performance in most cases. The case where we vary the number of columns $m$ on the Alpha node needs some more exploration, but we never seem to be worse than the current implementation.

Experiment Script

```
X = read($Xfname);
Y = read($Yfname);

lim = 10;
R = matrix(0, rows=lim, cols=1);
for (i in 1:lim) {
  t1 = time();
  Z = X %*% Y;
  t2 = time();
  R[i,1] = (t2-t1)/1000000; # time() returns ns, convert to ms
}

print(sum(Z)); # use Z so the multiplications are not optimized away
res = R[5:lim,]; # drop the first warm-up iterations
write(res, $fname, format="csv", sep="\t");
```

@phaniarnab (Contributor) commented:
This looks pretty good. I think Alpha has wider SIMD registers, which explains why most configurations perform better on Alpha.
Is there any JVM flag that you needed to enable? If so, can you please mention it as well for documentation purposes? @kev-inn

@kev-inn (Contributor, Author) commented Jun 20, 2022

> This looks pretty good. I think Alpha has wider SIMD registers, which explains why most configurations perform better on Alpha. Is there any JVM flag that you needed to enable? If so, can you please mention it as well for documentation purposes? @kev-inn

Note the varying-columns case for Alpha though: Lima is faster with DoubleVector than Alpha. I will take a closer look at why that might be.

Yes, the flag is `--add-modules=jdk.incubator.vector`, which has to be added when running SystemDS (see the `systemds` run script).
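For example, a direct invocation could look like the following; the main class `org.apache.systemds.api.DMLScript`, the jar name, and the file names are illustrative:

```
java --add-modules=jdk.incubator.vector -cp SystemDS.jar \
  org.apache.systemds.api.DMLScript -f mm.dml \
  -nvargs Xfname=X.bin Yfname=Y.bin fname=res.csv
```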

@Baunsgaard (Contributor) commented:
What if you use some of the other experimental JVMs that should have better support?

@kev-inn kev-inn changed the title Implement SIMD usage for basic dense dense MM [SYSTEMDS-3393] Implement SIMD usage for basic dense dense MM Jun 21, 2022
@kev-inn (Contributor, Author) commented Jun 21, 2022

> What if you use some of the other experimental JVMs that should have better support?

Which ones do you have in mind, and what kind of support do you expect (that the current one does not provide)? To clarify: do you expect better performance, or that we can remove the `--add-modules=jdk.incubator.vector` flag?

@Baunsgaard (Contributor) commented Jun 23, 2022

> > What if you use some of the other experimental JVMs that should have better support?
>
> Which ones do you have in mind, and what kind of support do you expect (that the current one does not provide)? To clarify: do you expect better performance, or that we can remove the `--add-modules=jdk.incubator.vector` flag?

Project Panama: https://openjdk.java.net/projects/panama/
And JDK 19 has the official full support for vectorizing: https://openjdk.org/jeps/426

JDK 17 only officially has the API; this does not guarantee that the instructions are correctly vectorized. Hence I am already looking positively at your improvements.
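One way to check whether the API calls are actually intrinsified into SIMD instructions is to inspect the JIT output with HotSpot's diagnostic flags, e.g. (this assumes an hsdis disassembler plugin is installed; `Main` is a placeholder class):

```
java --add-modules=jdk.incubator.vector \
  -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly Main
```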

@kev-inn (Contributor, Author) commented Jul 1, 2022

I ran the experiments again with JDK 19 (early access).
It still seems to require the same additional flag, and the Vector API is still part of the incubator module; this might be due to the early-access version.
Results are similar, except that occasionally a single iteration of our sample takes ~3-4x the time, probably due to GC. This already existed before, though, and was introduced in my last commit. Or maybe I just got lucky in my first run, before the second commit.

@kev-inn (Contributor, Author) commented Jul 31, 2022

More experiments

Dump of more experiments and updated plots.

All in all, the results look promising, but we can also clearly see some of the weak spots.

Alpha

Variable columns

[Plots: Alpha_1_1000_variable, Alpha_10_1000_variable, Alpha_1000_1_variable, Alpha_1000_10_variable, Alpha_1000_1000_variable, Alpha_1000_10000_variable]

Variable rows

[Plots: Alpha_variable_1_1000, Alpha_variable_10_1000, Alpha_variable_1000_1, Alpha_variable_1000_10, Alpha_variable_1000_1000, Alpha_variable_10000_1000]

Lima

Variable columns

[Plots: Lima_1_1000_variable, Lima_10_1000_variable, Lima_1000_1_variable, Lima_1000_10_variable, Lima_1000_1000_variable, Lima_1000_10000_variable]

Variable rows

[Plots: Lima_variable_1_1000, Lima_variable_10_1000, Lima_variable_1000_1, Lima_variable_1000_10, Lima_variable_1000_1000, Lima_variable_10000_1000]

@kev-inn kev-inn marked this pull request as ready for review July 31, 2022 20:21
@kev-inn (Contributor, Author) commented Jul 31, 2022

Closed by 9bf0a9f (messed up the commit message)

@kev-inn kev-inn closed this Jul 31, 2022
@Baunsgaard Baunsgaard reopened this Jul 31, 2022
@Baunsgaard Baunsgaard closed this Aug 6, 2022