New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Vectorize more work in reductions of vector operations #14254
Conversation
e49e6d1
to
997fdb3
Compare
r0 += r1; | ||
r2 += r3; | ||
result = r0 + r2; | ||
result = (r0 + r1) + (r2 + r3); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Much better :-)
@kronbichler The following test fails on the serial configuration: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good.
Shall we do a full run of the standard configuration of the regression tester on this one before merging?
@@ -31,6 +31,7 @@ main() | |||
{ | |||
initlog(); | |||
deallog << std::setprecision(4); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it make sense to drop this std::setprecision
or to set it to say 10 digits so that numdiff can do something about roundoff errors?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that we print a residual that ought to be zero in exact arithmetic, but turns out to be 1e-3 or something (above the absolute/relative threshold of numdiff), because we've scaled by matrix by 1e10
. I don't see a simple way to get this back to tolerance, apart from disabling the print of the residual to the logstream.
Yes, here it would be good. 👍 |
d95f5d4
to
4a25472
Compare
Addresses one of the issues mentioned in #14251: The reductions for small sizes involve work on up to
n_lanes * 32 - 1
vector items without vectorization. If the vector size is a few hundreds, this scalar work gets clearly visible. We can get this down to less thann_lanes
. While in this code, I also restructured the reductions on the temporaryouter_results
that contain the result of 32 operations: We can the associated pairwise summation by a single loop (rather than two nested loops) by writing the temporary result to the end of a (now 2x larger) array and sum until we've reached the end.These two actions result in a considerable improvement. I tested for vector size 400 and found that around 20% fewer instructions get issued with this patch.
Note: This patch changes the order of additions, so I expect that we might need some updates to the result files that are sensitive to roundoff. I think I identified one, but let us wait for the CI machines.