Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lowering subset of Vector512 methods for avx512. #82953

Merged
merged 3 commits into from
Mar 8, 2023

Conversation

DeepakRajendrakumaran
Copy link
Contributor

@DeepakRajendrakumaran DeepakRajendrakumaran commented Mar 3, 2023

This PR addresses the following issue : #80814. It currently has the following support.

Vector512.Load()
Vector512.LoadUnsafe()
Vector512.LoadAligned()
Vector512.LoadAlignedNonTemporal()

Vector512.Store()
Vector512.StoreUnsafe()
Vector512.StoreAligned()
Vector512.StoreAlignedNonTemporal()

@ghost ghost added the community-contribution Indicates that the PR has been added by a community member label Mar 3, 2023
@dotnet-issue-labeler dotnet-issue-labeler bot added the area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI label Mar 3, 2023
@ghost
Copy link

ghost commented Mar 3, 2023

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak
See info in area-owners.md if you want to be subscribed.

Issue Details

null

Author: DeepakRajendrakumaran
Assignees: -
Labels:

area-CodeGen-coreclr, community-contribution

Milestone: -

@BruceForstall
Copy link
Member

@DeepakRajendrakumaran Does this address a specific GitHub issue (work item) that is listed on #77034?

@DeepakRajendrakumaran
Copy link
Contributor Author

@DeepakRajendrakumaran Does this address a specific GitHub issue (work item) that is listed on #77034?

This PR addresses part of this - #80814

I plan to add a few more including store here. So, this will have further commits this week. Will move this PR from draft -> live once it's ready for review

@DeepakRajendrakumaran DeepakRajendrakumaran changed the title Lowering Vector512.Load* for avx512 Lowering subset of Vector512 methods for avx512. Mar 6, 2023
@tannergooding
Copy link
Member

Looks to be a small TP regression, likely due to the more case statements and that changing how the jump tables or other dispatch can work.

It's possible some of these would be better via flags or something long term, but I don' think that's something we strictly have to handle as part of this PR.

Might be good as a separate PR and then taking this PR after if the up to +0.03% is a concern.

@DeepakRajendrakumaran DeepakRajendrakumaran marked this pull request as ready for review March 7, 2023 22:31
@tannergooding
Copy link
Member

CC. @dotnet/jit-contrib, @dotnet/avx512-contrib for secondary review and merging

@BruceForstall BruceForstall merged commit 272fb4e into dotnet:main Mar 8, 2023
@BruceForstall
Copy link
Member

Looks to be a small TP regression, likely due to the more case statements and that changing how the jump tables or other dispatch can work.

Maybe because all the if (size == 64) checks come first and are always false / "not taken"?

@DeepakRajendrakumaran
Copy link
Contributor Author

Looks to be a small TP regression, likely due to the more case statements and that changing how the jump tables or other dispatch can work.

Maybe because all the if (size == 64) checks come first and are always false / "not taken"?

I can switch this around and check again. I'll have another set of methods to lower pretty soon and this should be easy to check

@BruceForstall
Copy link
Member

Note that our SPMI throughput measurements are based on collections taken on the Helix test machines. I think these are all AVX-512 capable machines so presumably hit the AVX2 paths. Seems like that's the one to optimize for in general, based on expected customer installed base.

Also note that TP measurements are done using a native JIT built without PGO. Theoretically, if all the code paths were hit in the training scenarios (unlikely), the native compiler would rearrange the order of branches.

Final note: the TP measurements are instruction counts, not cycle counts. It's as close a proxy to time-based throughput as we can reliably get.

@DeepakRajendrakumaran
Copy link
Contributor Author

Note that our SPMI throughput measurements are based on collections taken on the Helix test machines. I think these are all AVX-512 capable machines so presumably hit the AVX2 paths. Seems like that's the one to optimize for in general, based on expected customer installed base.

Also note that TP measurements are done using a native JIT built without PGO. Theoretically, if all the code paths were hit in the training scenarios (unlikely), the native compiler would rearrange the order of branches.

Final note: the TP measurements are instruction counts, not cycle counts. It's as close a proxy to time-based throughput as we can reliably get.

I flipped the condition. For eg.
image

image

@BruceForstall BruceForstall added the avx512 Related to the AVX-512 architecture label Mar 27, 2023
@ghost ghost locked as resolved and limited conversation to collaborators Apr 26, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI avx512 Related to the AVX-512 architecture community-contribution Indicates that the PR has been added by a community member
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants