Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose cross-platform helpers for Vector64, Vector128, and Vector256 #53450

Merged
merged 22 commits into from
Sep 30, 2021

Conversation

tannergooding
Copy link
Member

@tannergooding tannergooding commented May 28, 2021

This mostly resolves #49397 by exposes cross platform helpers for core functionality, mirroring Vector.

This PR is mostly the tests and the actual product changes are much much smaller than the sum. As with the other HWIntrinsic PRs, the tests are templated and are largely autogenerated from a small table. It is recommended to give a pass over a couple of the templates and the template input data, and not to try and review every test individually. There aren't any new concepts here, just cloning the existing templates with a few tweaks for the shared types.

In particular this PR is (numbers are slightly off now due to a few fixes, but largely still representative):

  • +3171, -1226 for the JIT - Largely a refactoring to share SimdAsHWIntrinsic logic with the Vector64/128/256 code paths
  • +7494, -3090 for the Libraries - Small refactoring + Software fallback logic for each API
  • +1523, -0001 for the Test Metadata Table - Largely copy/pasted from our existing hwintrinsic scenarios
  • +3856, -0000 for the Test Templates - Largely copy/pasted from our existing hwintrinsic scenarios
  • The remaining 444,136 lines are the generated tests (roughly 10 tests per API exposed, covering all the scenarios we said were important and are covering for the existing HWIntrinsics)

This does not cover Narrow, Widen, ConverTo*, or IsHardwareAccelerated. These remaining APIs will go in a separate PR since it requires a bit more refactoring to make work (Vector<T> still implements these using the "legacy" SIMD support and the PR will need to update that to use SimdAsHWIntrinsic and to have the logic shareable with the Vector64/128/256<T> paths).

This will unblock some work on related issues as well as allow us to merge several code paths on the managed side where the core logic is the same between various platforms. The related issues that will be resolved following this PR include:

@dotnet-issue-labeler
Copy link

Note regarding the new-api-needs-documentation label:

This serves as a reminder for when your PR is modifying a ref *.cs file and adding/modifying public APIs, to please make sure the API implementation in the src *.cs file is documented with triple slash comments, so the PR reviewers can sign off that change.

@dotnet-issue-labeler dotnet-issue-labeler bot added area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation labels May 28, 2021
@tannergooding
Copy link
Member Author

Going to run isa tests and make sure everything is passing before marking this as ready-for-review.

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@dotnet dotnet deleted a comment from azure-pipelines bot Jun 1, 2021
@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding
Copy link
Member Author

/azp run runtime-coreclr jitstress-isas-x86, runtime-coreclr jitstress-isas-arm, runtime-coreclr outerloop

@azure-pipelines
Copy link

Azure Pipelines successfully started running 3 pipeline(s).

@tannergooding tannergooding marked this pull request as ready for review June 1, 2021 23:38
@tannergooding
Copy link
Member Author

Just merged the workaround. Try rebasing again.

Thanks! Just did that, hopefully the AOT legs pass now

Copy link
Contributor

@echesakov echesakov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry for the delay.

Overall, the JIT changes look good. I left couple suggestions and one nit.

src/coreclr/vm/class.cpp Outdated Show resolved Hide resolved
src/coreclr/jit/gtlist.h Show resolved Hide resolved
break;
}

case NI_Vector128_GetUpper:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the most optimal way of implementing Vector128.GetUpper()...

Perhaps, something that does DUP Vd, Vn.D[1] should be used instead.
This would correspond to vector.AsUInt64().GetElement(1).As<T>()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mind if I fix this in a follow up PR (just so we can go ahead and get this one resolved and avoid further downstream conflicts)?

Also noting, this might be a case where we want to recognize the ExtractVector128 pattern and replace it with DUP if we know its always going to be more efficient...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I meant to write that this work can wait when posted the message (but forgot).

Co-authored-by: Egor Chesakov <Egor.Chesakov@microsoft.com>
@kunalspathak
Copy link
Member

Need to double check if these arm64 improvements are related - dotnet/perf-autofiling-issues#1715

@jkotas
Copy link
Member

jkotas commented Oct 10, 2021

This added close to 500,000 lines of auto-generated tests. Auto-generated tests like this are starting to kill us - in the repo footprint, build times and test times.

@tannergooding
Copy link
Member Author

tannergooding commented Oct 10, 2021

Auto-generated tests like this are starting to kill us - in the repo footprint, build times and test times.

@jkotas, We have several thousand APIs here, each of which need to have their basic functionality testing. Additionally, since they add a range of support to various phases of the JIT including new instructions and new encodings they need some validation that various operations, such as memory read/writes get folded into the instruction, CSE, etc work as expected.

Since we can't unit test phases of the JIT, we are stuck with functional tests written in C# that try to ensure trees are generated that cover the relevant patterns. If we could unit test phases, such as emit, we could have a much higher bar of confidence with many fewer tests, as we could directly validate based on the known encoding scenarios.

Many of these could likely be moved to some outerloop and only run on PRs known to be touching/impacting HWIntrinsic codegen (reducing the cost of your average PR/commit); but as is, these tests as is have caught a number of bugs and other issues that wouldn't have been exposed otherwise and so I don't think they should be removed from the repo until we have a suitable replacement.

@jkotas
Copy link
Member

jkotas commented Oct 10, 2021

Some data:

src/tests before this PR: 488,952,913 bytes
src/tests after this PR: 509,473,324 bytes

This PR grew src/tests footprint by more than 4%.

src/tests is about 60% of the dotnet/runtime repo disk size footprint. Yes, src/tests sources are 1.5x more than sources for both runtimes, all libraries and all libraries tests combined.

I understand that we do not have better option currently, but we may need to have to create one if this continues to grow like this.

@tannergooding
Copy link
Member Author

I understand that we do not have better option currently, but we may need to have to create one if this continues to grow like this.

I fully support having a better alternative here. I think the ideal scenario would be:

  • Emitter is unit testable; we can actually write tests for scenarios like INS_addps xmm0, xmm0, INS_addps xmm0 xmm15, and INS_addps xmm0, [addr] directly
    • Such tests would be significantly cheaper to run and validate; the also give a higher bar of confidence with less work for core scenarios
  • All of the existing src/tests/JIT/HardwareIntrinsics jobs are removed and replaced with Library tests that do basic functionality validation (e.g. ensure Sse.Add is behaving as expected and does per-element addition).

This would cover the majority of bugs and issues we've hit so far at a much cheaper cost. There would still be some things, like lowering, containment, and value numbering that wouldn't be covered here; but we've not hit any issues there that weren't immediately caught by general use of the intrinsics in the BCL.

@fanyang-mono
Copy link
Member

This PR caused some regression for TE benchmarks when running with Mono. I've created an issue to track it #61484. @tannergooding Could you take a look?

@tannergooding tannergooding deleted the fix-49397 branch November 11, 2022 15:26
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI new-api-needs-documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Expose cross-platform helpers for Vector64, Vector128, and Vector256
9 participants