Skip to content
This repository has been archived by the owner on Jan 23, 2023. It is now read-only.
/ corefx Public archive

Add ref/in/out overloads of methods for Vectors and Matrices #25388

Closed
wants to merge 23 commits into from

Conversation

eerhardt
Copy link
Member

@eerhardt eerhardt commented Nov 20, 2017

Adding overloads to Vector and Matrix types for better memory and performance characteristics.

Fixes #157

Notes

  1. The EOF newline changes came from merging these old commits into the master branch. I wanted to preserve the commit history, so I didn't try fixing up these newline changes. FIXED
  2. I wasn't excited about all the #if FEATURE_REF_OVERLOADS in the tests. But I didn't see a great way to factor them without the #ifs. I felt keeping the tests contained together was a high priority, that way the "expected" values were only in a single place. I'm open to better ideas here.

@@ -0,0 +1,15 @@
<?xml version="1.0" encoding="utf-8"?>
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, not sure how this file snuck in. I'll remove it.

@@ -52,6 +52,27 @@ public partial struct Matrix3x2 : System.IEquatable<System.Numerics.Matrix3x2>
public static System.Numerics.Matrix3x2 operator -(System.Numerics.Matrix3x2 value) { throw null; }
public static System.Numerics.Matrix3x2 Subtract(System.Numerics.Matrix3x2 value1, System.Numerics.Matrix3x2 value2) { throw null; }
public override string ToString() { throw null; }
#if FEATURE_REF_OVERLOADS
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this needed?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because we build for netfx, and these APIs aren't in desktop (yet). But we wanted to make these APIs available in as many places as possible. So we are starting with netcoreapp and uap.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@weshaggard, in many projects we add to the refs without special-casing them for netfx. What's special about this one and others that still require special-casing?

@eerhardt, personally I'd prefer if these were added in a .netcoreapp.cs specific file rather than using #ifdefs, but I'm ok being overruled if this is the practice we've already established elsewhere.

Copy link
Member Author

@eerhardt eerhardt Nov 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn’t put it in .netcoreapp.cs file was because it gets compiled in for both netcoreapp AND uap. When searching other scenarios, I found the FEATURE pattern being used.

Ex

#if FEATURE_HASHDATA // uap and netcoreapp specific

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason I didn’t put it in .netcoreapp.cs file was because it gets compiled in for both netcoreapp AND uap.

We use "netcoreapp" all over the place when it's also used for uap. That's now the case for the vast majority of new APIs across corefx. The terminology needs to be cleaned up.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We use "netcoreapp" all over the place when it's also used for uap.

This seems like an anti-pattern to me. There are plenty of places where .netcoreapp.cs means "only used on netcoreapp". Here's just a couple I found really quickly:

<Compile Include="System\Reflection\Internal\Utilities\EncodingHelper.netcoreapp.cs" Condition="'$(TargetGroup)' == 'netcoreapp'" />

<ItemGroup Condition="'$(TargetGroup)' == 'netcoreapp'">
<Compile Include="CancellationTokenTests.netcoreapp.cs" />
<Compile Include="Task\TaskStatusTest.netcoreapp.cs" />
</ItemGroup>

<ItemGroup Condition="'$(TargetGroup)'=='netcoreapp'">
<Compile Include="SizeFTests.netcoreapp.cs" />
<Compile Include="ColorTests.netcoreapp.cs" />
<Compile Include="SizeTests.netcoreapp.cs" />
</ItemGroup>


This one even mixes them both into the same project:

<ItemGroup Condition="'$(TargetGroup)'=='netcoreapp'">
<Compile Include="System\Random.netcoreapp.cs" />
</ItemGroup>
<ItemGroup Condition="'$(TargetGroup)'!='netstandard'">
<Compile Include="System\BitConverterSpan.cs" />
<Compile Include="System\BitConverter.netcoreapp.cs" />


I assume the intention of suffixing these files with a TFM was so that it was easily recognizable that this file was only compiled for this TFM. When we use one TFM to mean it is sometimes compiled for 2 TFMs, the only way you can know is to go look in the .csproj file.

I'd be happy to split these out into a separate file, but maybe a .netcoreapp.uap.cs suffix would be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like I said, the terminology needs updating. "netcoreapp" has morphed into also including all new APIs, as all new APIs are being added for all platforms that use corefx/coreclr/corert.

@tannergooding
Copy link
Member

I'm going to actually review this in a bit.

However, my primary concern (from a brief glance) is that this will lead to poorer performance on architectures using the System V ABI, where these are already supposed to be passing around via register (and will now be passed as a reference instead).

This might also lead to poorer performance if Windows ever gets vectorcall ABI support: https://github.com/dotnet/coreclr/issues/12120. That being said, until vectorcall support is implemented on Windows, in/ref/out will likely perform better due to less shadow copying.

@benaadams
Copy link
Member

Had wondered about this as BOTR ABI refers to the "x64 Software Conventions" which does specify use of XMM0:YMM15 registers; but via __vectorcall

result.M22 = 1.0f;

result.M31 = position.X;
result.M32 = position.Y;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than duplicating all of these implementations, would it make sense to implement the existing overload in terms of the new one? e.g.

public static Matrix3x2 CreateTranslation(Vector2 position)
{
    Matrix3x2 result;
    CreateTranslation(in position, out result);
    return result;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will force architectures using System V ABI to have worse performance.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will force architectures using System V ABI to have worse performance.

To what extent? If significant, then ok, but there is a ton of duplicated code being added here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the Jit used __vectorcall then wouldn't need any in params for the Vector types as they are all register sized?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@stephentoub, I believe it would be measurable. It should be whatever the latency/overhead of copying from register to memory would be and back would be.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe it would be measurable

Can we measure then?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've seen various poor CQ issues in the JIT around handling the SysV conventions for structs; things may be better for Vectors/HVAs but I' be kind of surprised if this is the case. In particular the jit seems to spill promoted structs to memory before calls (then reloading into registers) and vice versa on method entry.

So, please look carefully at the actual code the jit generates in any sort of A/B experiment. The results may say more about the abilities or limitations of the jit than any inherent "best" way of passing args.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are limitations or poor CQ with regard to Vectors/HVAs, then I would prefer to see those get fixed over adding new APIs which only patch the issue.

That being said, I will be sure to also look at the codegen when comparing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been able to run some benchmarks today. I'm using the suite that @mellinoe put together in the issue #157.

Machine 1: OS=ubuntu 16.04 Processor=Intel Xeon CPU E5-1620 0 3.60GHz, ProcessorCount=8 .NET Core 2.1.0-preview1-25919-02 (Framework 4.6.25917.03), 64bit RyuJIT

Baseline (none of my changes)

Method Mean Error StdDev
Matrix4x4ByValue 271.720 ns 0.2741 ns 0.2430 ns
QuaterionByValue 206.123 ns 0.0605 ns 0.0566 ns
Vector3ByValue 11.839 ns 0.0379 ns 0.0354 ns
Vector3SimpleAddByValue 2.118 ns 0.0017 ns 0.0016 ns

Current changes (duplicated code)

Method Mean Error StdDev
Matrix4x4ByValue 275.4029 ns 0.5192 ns 0.4335 ns
Matrix4x4ByRef 201.6788 ns 0.9473 ns 0.7910 ns
QuaterionByValue 202.4441 ns 0.1351 ns 0.1128 ns
QuaterionByRef 282.7828 ns 0.1976 ns 0.1848 ns
Vector3ByValue 12.3704 ns 0.0042 ns 0.0035 ns
Vector3ByRef 0.7947 ns 0.0009 ns 0.0007 ns
Vector3SimpleAddByValue 1.8711 ns 0.0018 ns 0.0016 ns
Vector3SimpleAddByRef 0.8011 ns 0.0211 ns 0.0187 ns

Doing some refactoring suggested here

See this commit for the code changes

Method Mean Error StdDev
Matrix4x4ByValue 250.2053 ns 0.2409 ns 0.2135 ns
Matrix4x4ByRef 235.3991 ns 0.1826 ns 0.1619 ns
QuaterionByValue 338.3308 ns 0.3273 ns 0.2733 ns
QuaterionByRef 283.7789 ns 0.4899 ns 0.4343 ns
Vector3ByValue 21.1915 ns 0.0120 ns 0.0106 ns
Vector3ByRef 0.7941 ns 0.0013 ns 0.0012 ns
Vector3SimpleAddByValue 1.8746 ns 0.0022 ns 0.0019 ns
Vector3SimpleAddByRef 0.8000 ns 0.0135 ns 0.0120 ns

Machine 2: OS=Windows 10 Redstone 3 [1709, Fall Creators Update] (10.0.16299.19) Processor=, ProcessorCount=8 .NET Core 2.1.0-preview1-25907-02 (Framework 4.6.25901.06), 64bit RyuJIT

I also ran the same tests on my Windows machine:

Baseline (none of my changes)

Method Mean Error StdDev
Matrix4x4ByValue 205.941 ns 0.8044 ns 0.7524 ns
QuaterionByValue 157.711 ns 0.5429 ns 0.5078 ns
Vector3ByValue 9.050 ns 0.0350 ns 0.0328 ns
Vector3SimpleAddByValue 1.391 ns 0.0098 ns 0.0087 ns

Current changes (duplicated code)

Method Mean Error StdDev
Matrix4x4ByValue 212.2870 ns 1.2065 ns 1.1285 ns
Matrix4x4ByRef 159.5631 ns 0.5604 ns 0.5242 ns
QuaterionByValue 158.7156 ns 0.7299 ns 0.6095 ns
QuaterionByRef 147.2884 ns 0.6121 ns 0.5111 ns
Vector3ByValue 9.0204 ns 0.0310 ns 0.0290 ns
Vector3ByRef 9.8258 ns 0.0334 ns 0.0296 ns
Vector3SimpleAddByValue 1.3929 ns 0.0118 ns 0.0110 ns
Vector3SimpleAddByRef 0.4443 ns 0.0075 ns 0.0066 ns

Doing some refactoring suggested here

Method Mean Error StdDev
Matrix4x4ByValue 181.4270 ns 0.5213 ns 0.4876 ns
Matrix4x4ByRef 172.6201 ns 0.6639 ns 0.5885 ns
QuaterionByValue 181.0366 ns 0.7396 ns 0.6918 ns
QuaterionByRef 148.7591 ns 0.6140 ns 0.5744 ns
Vector3ByValue 19.4549 ns 0.0934 ns 0.0828 ns
Vector3ByRef 9.2186 ns 0.0154 ns 0.0129 ns
Vector3SimpleAddByValue 1.3787 ns 0.0106 ns 0.0099 ns
Vector3SimpleAddByRef 0.4396 ns 0.0110 ns 0.0097 ns

On both Windows and Ubuntu, it appears that the Matrix4x4ByValue test gets better with the refactoring, but both QuaterionByValue and Vector3ByValue get worse.

My thoughts are that we shouldn't do the refactoring at this time, as it adds risk to this change


Another observation is that @tannergooding's initial primary concern holds for the QuaterionByValue vs. QuaterionByRef test on Ubuntu.

Method Mean Error StdDev
QuaterionByValue 202.4441 ns 0.1351 ns 0.1128 ns
QuaterionByRef 282.7828 ns 0.1976 ns 0.1848 ns

In this scenario, on my Ubuntu machine, using the new ref overloads appears to lead to poorer performance. But on my Windows machine, this isn't true and it wasn't true for @mellinoe's reports either.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But on my Windows machine, this isn't true and it wasn't true for @mellinoe's reports either.

On Windows, only the default x64 calling convention is used and there isn't support for __vectorcall today. This means that (on Windows) all of these are passed as normal structs (requiring shadow copying to the stack) and will likely see a perf increase (due to no longer requiring shadow-copying of the values).

On Unix based systems, we use the System V ABI, which allows these values to be passed in register, without shadow copying to the stack. However, as @AndyAyersMS pointed out, the code gen for System V ABI isn't the best and we are still shadow copying the values in some scenarios (which appears to be why you are seeing improvements).

I'm still working on collecting actual codegen data here.

#if FEATURE_REF_OVERLOADS
Vector2.Distance(in a, in b, out actual);
Assert.True(MathHelper.Equal(expected, actual), "Vector2f.Distance did not return the expected value.");
#endif
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've generally separated such tests out into their own .netcoreapp.cs files.

@mikedn
Copy link

mikedn commented Nov 22, 2017

Can someone explain to me why are ref overloads being added to methods that are JIT intrinsics?

@eerhardt
Copy link
Member Author

Can someone explain to me why are ref overloads being added to methods that are JIT intrinsics?

Does @mellinoe's answer here help? https://github.com/dotnet/corefx/issues/157#issuecomment-231807676

Specifically

While it would be nice to rely on great code-gen everywhere, the reality is that we are going to be supporting a wide variety of runtimes for the forseeable future, some of which will be lagging behind others,

@mikedn
Copy link

mikedn commented Nov 22, 2017

While it would be nice to rely on great code-gen everywhere, the reality is that we are going to be supporting a wide variety of runtimes for the forseeable future, some of which will be lagging behind others,

Which runtimes to be precise? .NET Framework x86 (that doesn't use RyuJIT yet)? .NET Framework 2.0-3.5? Unknown runtime nobody uses? I think that such API clutter deserves a less vague explanation.

@CarolEidt
Copy link

@tannergooding writes:

I'm still working on collecting actual codegen data here.

Getting some good scenarios, even as micro-benchmarks if they're known to be representative, would be desirable either way we decide to go on this.

@mellinoe wrote:

While it would be nice to rely on great code-gen everywhere, the reality is that we are going to be supporting a wide variety of runtimes for the forseeable future, some of which will be lagging behind others.

I am of two minds on this. I would really like to see us rely on good codegen for this. And note that, at least for the issues that I've seen, the problem isn't that the JIT isn't optimizing adequately, it's that the code that implements struct arguments and return values is unnecessarily generating conservative code - both in introducing excessive copies as well as marking structs as address-taken when they're not. I don't think that fixing it is a large task, and much of it is outlined in https://github.com/dotnet/coreclr/blob/master/Documentation/design-docs/first-class-structs.md, but it hasn't bubbled to the top priority.

@benaadams
Copy link
Member

As pointed out in dotnet/csharplang#1155 in Vector3 will cause a copy of the Vector3 parameter when its values are accessed .X

Will the Jit elide this copy?

@mikedn
Copy link

mikedn commented Nov 29, 2017

As pointed out in dotnet/csharplang#1155 in Vector3 will cause a copy of the Vector3 parameter when its values are accessed .X

That's a different Vector3, it has property getters and setters.

Will the Jit elide this copy?

That may turn out to be problematic, the JIT will have to figure out that v1 and v2 fields are read before the result is written. It's funny how people tend to think that passing values by refs has only positive consequences.

@CarolEidt
Copy link

I was rather troubled by the large performance differences, so I took a look at the benchmarks. What should have been obvious to me right away is that these benchmarks aren't using their results, so much or all of the code gets optimized away. By adding a static dummy float to the ByRefBenchmarks class, and incrementing it with the first field of each final result, I got these results:

Method Mean Error StdDev
Matrix4x4ByValue 337.461 ns 3.0390 ns 2.8426 ns
Matrix4x4ByRef 247.860 ns 3.5341 ns 3.3058 ns
QuaterionByValue 257.683 ns 3.6805 ns 3.4427 ns
QuaterionByRef 294.836 ns 2.8099 ns 2.4909 ns
Vector3ByValue 15.213 ns 0.2446 ns 0.2168 ns
Vector3ByRef 17.148 ns 0.1312 ns 0.1227 ns
Vector3SimpleAddByValue 5.057 ns 0.0852 ns 0.0797 ns
Vector3SimpleAddByRef 5.089 ns 0.0770 ns 0.0720 ns

There's a separate question about why in the Ref case more code is optimized away, but that's really an orthogonal issue. A more interesting question to me is how much of the Matrix4x4ByRef benefit can be matched with better codegen for the ByValue case. I'll try to look into both of those questions more over the next week.

One more note:

  • The technique of just using a single field of the result is not a great benchmarking strategy, but in this case I don't think the JIT will be able to optimize away the computation that affects only the other fields of the result.

@mikedn
Copy link

mikedn commented Dec 1, 2017

Aha, so as I suspected it's only Matrix4x4 that may significantly benefit from passing by ref. And even for Matrix4x4, by ref is only useful for some methods. Code generated for Matrix4x4ByValue - https://gist.github.com/mikedn/d74dcbe732715d90bfe3e248865d8ca7

  • For CreateScale and CreateTranslation adding out parameters seems completely pointless, at least in this example code. Large structs are returned via hidden out parameters anyway.
  • CreateFromYawPitchRoll gets inlined, the code does make an unnecessary copy of the quaternion produced by Quaternion:CreateFromYawPitchRoll. But that's just 2 instructions, it's unlikely to be a significant problem.
  • Multiply is what's likely to cause most problems, 6 matrix copies are generated for those 3 calls. Though only 2 are really needed, copies of "last use" variables (m1, m2, m3, m4) should not be needed since it doesn't matter if Multiply changes them or not. Interestingly, I recently noticed that the JIT attempts to generate such copies. It seems that it doesn't always succeed.

@mikedn
Copy link

mikedn commented Dec 1, 2017

For CreateScale and CreateTranslation adding out parameters seems completely pointless, at least in this example code. Large structs are returned via hidden out parameters anyway.

Hmm, turns out that these contain a hidden copy - they create the matrix in a local variable and then copy it to the output buffer. It looks like that copy isn't necessary, it's probably only needed if an exception can be thrown and that's not the case here.

@stephentoub
Copy link
Member

@eerhardt, given the feedback, what's the plan here?

@eerhardt
Copy link
Member Author

eerhardt commented Dec 6, 2017

I think the plan is to close this PR and the related issue as 'won't fix'. And then log a coreclr bug for the Matrix4x4 performance issues illustrated above.

@CarolEidt and @tannergooding - does that sound like a good plan?

@CarolEidt
Copy link

I think the plan is to close this PR and the related issue as 'won't fix'. And then log a coreclr bug for the Matrix4x4 performance issues illustrated above.

I think that's a great plan.

@stephentoub stephentoub closed this Dec 6, 2017
@d3x0r
Copy link

d3x0r commented Dec 11, 2017

This PR uses 'in' . Does that (in) actually pass by REference?

https://docs.microsoft.com/en-us/dotnet/csharp/language-reference/keywords/ref Only shows 'ref' and 'out' as methods to pass by reference.

The benchmarks shown above only show a insiginifcant improvement in performance, if it was that slight, I wouldn't have put so much effort into making a case for making new methods that pass values by reference.

My main system is not up for now but I'm sure there's benchmarks that show more notable improvement, especially in the case of matrix 4x4.

@eerhardt
Copy link
Member Author

@d3x0r - the in keyword is a new C# 7.2 feature. Check out https://docs.microsoft.com/en-us/dotnet/csharp/whats-new/csharp-7-2 for more info. I think the link you provided needs to be updated for the new features.

there's benchmarks that show more notable improvement, especially in the case of matrix 4x4.

I've logged https://github.com/dotnet/coreclr/issues/15467 to improve the code gen for this scenario.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Alternative versions of methods for Vectors and Matrices
10 participants