
Tensor product operations: Use loop unrolling for slow mat-vec #16984

Merged
merged 1 commit into dealii:master from optimize_matvec_kernel on May 14, 2024

Conversation

kronbichler
Member

This is the optimization I mentioned in #16970: To get good performance of the matrix-free evaluation kernels for simplex elements, we should provide reasonably good code that avoids loop overhead and addition latencies. This PR implements a 4x manual unrolling of the outer loop in the matrix-vector product, together with the respective remainder loops, and adds the option to generate better code for unit-stride array access. In my tests, this PR gives a speedup of 1.6-1.8x for dim=3, degree={2,3} of the simplex matrix-free operator evaluation.
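
For illustration, here is a minimal sketch of the unrolling pattern described above. This is not the actual kernel from this PR; the names `matrix`, `in`, `out`, `n_rows`, and `n_cols` are made up for the example. The idea is to accumulate four output rows at once in independent accumulators and to handle the leftover rows in a scalar remainder loop:

```cpp
// Sketch only: 4x unrolled matrix-vector product with a remainder loop.
// 'Number' stands for the scalar or vectorized number type in use.
template <typename Number>
void
matrix_vector_unrolled(const Number      *matrix,
                       const Number      *in,
                       Number            *out,
                       const unsigned int n_rows,
                       const unsigned int n_cols)
{
  unsigned int row = 0;

  // Main loop: process four output rows at once. Four independent
  // accumulators hide the latency of the chained additions, and each
  // unit-stride load of in[col] is reused for four rows.
  for (; row + 4 <= n_rows; row += 4)
    {
      const Number *matrix_0 = matrix + (row + 0) * n_cols;
      const Number *matrix_1 = matrix + (row + 1) * n_cols;
      const Number *matrix_2 = matrix + (row + 2) * n_cols;
      const Number *matrix_3 = matrix + (row + 3) * n_cols;

      Number sum_0 = {}, sum_1 = {}, sum_2 = {}, sum_3 = {};
      for (unsigned int col = 0; col < n_cols; ++col)
        {
          const Number value = in[col];
          sum_0 += matrix_0[col] * value;
          sum_1 += matrix_1[col] * value;
          sum_2 += matrix_2[col] * value;
          sum_3 += matrix_3[col] * value;
        }
      out[row + 0] = sum_0;
      out[row + 1] = sum_1;
      out[row + 2] = sum_2;
      out[row + 3] = sum_3;
    }

  // Remainder loop: up to three leftover rows, handled one at a time.
  for (; row < n_rows; ++row)
    {
      const Number *matrix_row = matrix + row * n_cols;
      Number        sum        = {};
      for (unsigned int col = 0; col < n_cols; ++col)
        sum += matrix_row[col] * in[col];
      out[row] = sum;
    }
}
```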

The code is still not optimal, because the simplices perform a matrix-vector product, whereas it would be better to perform matrix-matrix multiplications. More precisely, one should combine 2 or 3 cell batches in order to amortize the load of matrix_ptr[0], as the current code is bottlenecked by the load of the matrix, and to hide the latency of the chained additions, which becomes clearly visible on my AMD systems with dim=3, degree=3. I have to think more about what the best strategy could be, and would be happy to discuss the options we have.
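
As a rough sketch of that batching idea (hypothetical function, not part of this PR), applying the same matrix to several cell batches at once means each matrix entry is loaded once and reused for all batches, turning the memory-bound matrix-vector product into a small matrix-matrix product:

```cpp
#include <array>

// Sketch only: apply one matrix to 'n_batches' input/output vectors at once.
// The signature and names are hypothetical and only illustrate the idea of
// amortizing the matrix load over several cell batches.
template <typename Number, unsigned int n_batches>
void
matrix_multi_vector(const Number                                 *matrix,
                    const std::array<const Number *, n_batches> &in,
                    const std::array<Number *, n_batches>       &out,
                    const unsigned int                            n_rows,
                    const unsigned int                            n_cols)
{
  for (unsigned int row = 0; row < n_rows; ++row)
    {
      const Number *matrix_row = matrix + row * n_cols;

      std::array<Number, n_batches> sums = {};
      for (unsigned int col = 0; col < n_cols; ++col)
        {
          // Each matrix entry is loaded once and reused for all batches.
          const Number m = matrix_row[col];
          for (unsigned int b = 0; b < n_batches; ++b)
            sums[b] += m * in[b][col];
        }
      for (unsigned int b = 0; b < n_batches; ++b)
        out[b][row] = sums[b];
    }
}
```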

FYI @dominiktassilostill @nfehn.

@kronbichler kronbichler merged commit 7801e2e into dealii:master May 14, 2024
15 of 16 checks passed
@kronbichler kronbichler deleted the optimize_matvec_kernel branch May 14, 2024 03:49