Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vectorize more work in reductions of vector operations #14254

Merged
merged 2 commits into from Sep 14, 2022

Conversation

kronbichler
Copy link
Member

Addresses one of the issues mentioned in #14251: The reductions for small sizes involve work on up to n_lanes * 32 - 1 vector items without vectorization. If the vector size is a few hundreds, this scalar work gets clearly visible. We can get this down to less than n_lanes. While in this code, I also restructured the reductions on the temporary outer_results that contain the result of 32 operations: We can the associated pairwise summation by a single loop (rather than two nested loops) by writing the temporary result to the end of a (now 2x larger) array and sum until we've reached the end.

These two actions result in a considerable improvement. I tested for vector size 400 and found that around 20% fewer instructions get issued with this patch.

Note: This patch changes the order of additions, so I expect that we might need some updates to the result files that are sensitive to roundoff. I think I identified one, but let us wait for the CI machines.

r0 += r1;
r2 += r3;
result = r0 + r2;
result = (r0 + r1) + (r2 + r3);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better :-)

@tamiko
Copy link
Member

tamiko commented Sep 12, 2022

@kronbichler The following test fails on the serial configuration: 4265 - lac/bicgstab_large.debug (Failed)

Copy link
Member

@tamiko tamiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good.

Shall we do a full run of the standard configuration of the regression tester on this one before merging?

@@ -31,6 +31,7 @@ main()
{
initlog();
deallog << std::setprecision(4);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense to drop this std::setprecision or to set it to say 10 digits so that numdiff can do something about roundoff errors?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we print a residual that ought to be zero in exact arithmetic, but turns out to be 1e-3 or something (above the absolute/relative threshold of numdiff), because we've scaled by matrix by 1e10. I don't see a simple way to get this back to tolerance, apart from disabling the print of the residual to the logstream.

@kronbichler
Copy link
Member Author

Shall we do a full run of the standard configuration of the regression tester on this one before merging?

Yes, here it would be good. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants