
Optimize Buffer.MemoryCopy #9786

Merged
jkotas merged 1 commit into dotnet:master on Mar 4, 2017

Conversation

@vkvenkat
Contributor

vkvenkat commented Feb 25, 2017

This PR optimizes Buffer.MemoryCopy for lengths up to a threshold. Performance measurements were made on a Skylake desktop (Intel(R) Core(TM) i7-6700K CPU @ 4.00 GHz, 32 GB RAM) running Windows 10 Enterprise and Ubuntu 14.04, using a micro-benchmark that exercises multiple buckets of pseudo-random copy sizes from 0 to 4 KB (details below).

Performance improvements with the optimization:

  • 2% to 73% for x64 Windows
  • -4% to 73% for x86 Windows
  • 0% to 70% for x64 Ubuntu

The charts below show the performance speedup:

[Charts: Buffer.MemoryCopy speedup per copy-size bucket for x64 Windows, x86 Windows, and x64 Ubuntu]

The table below gives absolute times, in ticks, for one million copy operations per bucket:

[Table: absolute clocks in ticks for one million copy operations]

Here's the pseudo-code for the micro-benchmark:

// 10 buckets of copy sizes within specified ranges: [0,8], [9,16], [17,32], ..., [2049,4096]
for (int i = 0; i < MAX_BUCKETS; i++)
{
    Random rnd = new Random(1);                  // Fixed seed: same sequence of sizes for every run

    for (int j = 0; j < COPYLEN; j++)
    {
        // Randomize copy length within the bucket's bounds
        copyLen[j] = (ushort)rnd.Next(0, 9);     // Range changes with bucket
    }

    long begin = DateTime.Now.Ticks;

    for (int k = 0; k < 1000000; k++)
    {
        for (int count = 0; count < COPYLEN; count++)
        {
            Buffer.MemoryCopy(pSource, pTarget, 8, copyLen[count]); // Destination max size changes with bucket
        }
    }

    long time = DateTime.Now.Ticks - begin;      // Time reported for the bucket
}
@vkvenkat
Contributor

vkvenkat commented Feb 25, 2017

@jkotas PTAL

private struct Block16 { }
[StructLayout(LayoutKind.Sequential, Size = 64)]
private struct Block64 { }
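
For readers without the surrounding diff: a minimal sketch of how fixed-size dummy structs like these are used to move large blocks. This is illustrative only and not the exact code from the PR; dest and src are assumed to be byte* locals and len the remaining byte count.

// Assigning a Block64 copies 64 bytes in one struct move, which RyuJIT can lower to wide/SIMD stores.
while (len >= 64)
{
    *(Block64*)dest = *(Block64*)src;
    dest += 64;
    src += 64;
    len -= 64;
}
if (len >= 16)
{
    *(Block16*)dest = *(Block16*)src;
    dest += 16;
    src += 16;
    len -= 16;
}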

@jkotas

jkotas Feb 25, 2017

Member

Does it work well / will it work well on ARM64?

@vkvenkat

vkvenkat Feb 25, 2017

Contributor

Shall special-case the usage of custom structs to AMD64 @jamesqo

@mikedn

mikedn Feb 26, 2017

Contributor

x86 CoreCLR switched to RyuJIT, so there shouldn't be any differences between x86 and x64 here; both should use Block64.

I don't know what ARM64 does, but it would make sense for it to use the best available method for moving a 64-byte block; that should be the same as or better than moving 4 bytes at a time.

It may be better to handle this by adding a specific define at the top of the file so it's easy to change:

#if AMD64 || X86
#define HAS_BLOCK64
#endif

@jkotas

jkotas Feb 28, 2017

Member

BTW: Have you done the perf testing with a crossgened corelib? (corelib is always crossgened in shipping configs.) I do not think the 64-byte blocks will make a difference for a crossgened corelib because it is limited to SSE2; it should not hurt, though.

@vkvenkat

vkvenkat Mar 4, 2017

Contributor

@jkotas Yes, all of the results posted are with crossgened corelib.

@jamesqo

jamesqo Feb 12, 2018

Contributor

@vkvenkat Great work optimizing this. I didn't have much time to look over the code, but one thing I noticed -- why are there Block16 / Block64 structs but no Block32?

@jamesqo

jamesqo Feb 12, 2018

Contributor

Lol I just realized you made that comment a year ago. Still asking if the code is still there, though.

@jkotas
Member

jkotas commented Feb 25, 2017

@jamesqo

Nice changes! I have my own PR to modify MemoryCopy that I haven't gotten around to finishing yet: #9143. If this is merged I will gladly close mine in favor of yours, however.

*(int*)(dest + 48) = *(int*)(src + 48);
*(int*)(dest + 52) = *(int*)(src + 52);
*(int*)(dest + 56) = *(int*)(src + 56);
*(int*)(dest + 60) = *(int*)(src + 60);

@jamesqo

jamesqo Feb 25, 2017

Contributor

Is it beneficial to loop unroll this much? In my experiments last year I recall 64 being slightly worse than doing chunks of 48.

@vkvenkat

vkvenkat Feb 25, 2017

Contributor

Yes, 64 performs better than 48. We see a regression with 48 for all three copy-size buckets involved here.

@jamesqo
Contributor

jamesqo commented Feb 25, 2017

@DrewScoggins

@DrewScoggins
Member

DrewScoggins commented Feb 26, 2017

test Windows_NT_X64 perf
test Ubuntu14.04_x64 perf

@DrewScoggins
Member

DrewScoggins commented Feb 26, 2017

test Ubuntu14.04 perf

@vkvenkat
Contributor

vkvenkat commented Feb 28, 2017

@jkotas @jamesqo Addressed all of the CR suggestions and pushed my latest. PTAL.

*(int*)(dest + 4) = *(int*)(src + 4);
*(int*)(dest + 8) = *(int*)(src + 8);
*(int*)(dest + 12) = *(int*)(src + 12);
*(int*)dest = *(int*)src;

@jkotas

jkotas Feb 28, 2017

Member

These should be done as long for BIT64

*(int*)(dest + i + 4) = *(int*)(src + i + 4);
*(int*)(dest + i + 8) = *(int*)(src + i + 8);
*(int*)(dest + i + 12) = *(int*)(src + i + 12);
*(int*)dest = *(int*)src;

@jkotas

jkotas Feb 28, 2017

Member

use long for BIT64
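
For context, a minimal sketch of the kind of change being requested here, gated on the BIT64 define already used in Buffer.cs (illustrative only, not the exact code that landed): on 64-bit targets the same 16 bytes can be moved with two 8-byte long stores instead of four 4-byte int stores.

#if BIT64
*(long*)(dest + i) = *(long*)(src + i);
*(long*)(dest + i + 8) = *(long*)(src + i + 8);
#else
*(int*)(dest + i) = *(int*)(src + i);
*(int*)(dest + i + 4) = *(int*)(src + i + 4);
*(int*)(dest + i + 8) = *(int*)(src + i + 8);
*(int*)(dest + i + 12) = *(int*)(src + i + 12);
#endif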

@vkvenkat vkvenkat changed the title from "Extending optimized JIT helpers to Buffer.MemoryCopy" to "Optimize Buffer.MemoryCopy" on Mar 1, 2017

@jkotas
Member

jkotas commented Mar 2, 2017

@dotnet-bot test Windows_NT x64 Release Priority 1 Build and Test please

@jkotas
Member

jkotas commented Mar 2, 2017

@vkvenkat Several tests are consistently failing. Could you please take a look?

CoreMangLib._cti_system_text_stringbuilder_StringBuilderReplace4_StringBuilderReplace4_._cti_system_text_stringbuilder_StringBuilderReplace4_StringBuilderReplace4_cmd
CoreMangLib._cti_system_version_VersionToString2_VersionToString2_._cti_system_version_VersionToString2_VersionToString2_cmd
CoreMangLib._cti_system_version_VersionToString1_VersionToString1_._cti_system_version_VersionToString1_VersionToString1_cmd
@vkvenkat
Contributor

vkvenkat commented Mar 4, 2017

@jkotas Addressed latest CR feedback, fixed test failures & updated results. PTAL

@jkotas
Member

jkotas commented Mar 4, 2017

Looks great - thank you!

@jkotas jkotas merged commit c6372c5 into dotnet:master Mar 4, 2017

13 checks passed

CentOS7.1 x64 Debug Build and Test Build finished.
FreeBSD x64 Checked Build Build finished.
Linux ARM Emulator Cross Debug Build Build finished.
Linux ARM Emulator Cross Release Build Build finished.
OSX x64 Checked Build and Test Build finished.
Ubuntu x64 Checked Build and Test Build finished.
Ubuntu x64 Formatting Build finished.
Windows_NT arm Cross Debug Build Build finished.
Windows_NT arm Cross Release Build Build finished.
Windows_NT x64 Debug Build and Test Build finished.
Windows_NT x64 Formatting Build finished.
Windows_NT x64 Release Priority 1 Build and Test Build finished.
Windows_NT x86 Checked Build and Test Build finished.

kouvel added a commit to kouvel/coreclr that referenced this pull request Mar 7, 2017

Improve span copy of pointers and structs containing pointers
Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

kouvel added a commit to kouvel/coreclr that referenced this pull request Mar 8, 2017

Improve span copy of pointers and structs containing pointers
Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

kouvel added a commit that referenced this pull request Mar 9, 2017

Improve span copy of pointers and structs containing pointers (#9999)

Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

YongseopKim added a commit to YongseopKim/coreclr that referenced this pull request Mar 10, 2017

jorive added a commit to guhuro/coreclr that referenced this pull request May 4, 2017

jorive added a commit to guhuro/coreclr that referenced this pull request May 4, 2017

Improve span copy of pointers and structs containing pointers (#9999)

Fixes #9161

PR #9786 fixes perf of span copy of types that don't contain references

sergiy-k added a commit to sergiy-k/corert that referenced this pull request May 8, 2017

sergiy-k added a commit to sergiy-k/corert that referenced this pull request May 9, 2017

sergiy-k added a commit to dotnet/corert that referenced this pull request May 9, 2017

Port optimization in Buffer.Memmove from CoreCLR (#3551)

This is just a copy of changes made in
dotnet/coreclr#9786

* Code review feedback

@karelz karelz modified the milestone: 2.0.0 Aug 28, 2017

@glenn-slayden

glenn-slayden commented Feb 11, 2018

I posted a comment for c6372c5 here, but I don't think that was the right place, so I apologize for repeating the remarks here...

Just curious why all these improvements to Buffer.Memmove (plus more recent work) aren't deployed in the IL cpblk instruction instead? It appears that commit c6372c5 removes a comment to the effect that cpblk is where all this effort naturally belongs (in case the proliferation of conditional #if code here wasn't already somewhat suggestive), and I agree.

I know everyone wants there to be one central nexus for obtaining an optimal memmove, with proper alignment, chunking, tuning, and everything else done right; shouldn't cpblk be that place?

I also note that this commit removes all address alignment behavior. While it's true that the alignment code in ea9bee5 was a bit meek in only considering the destination address, at the possible expense of source alignment:

if (((int)dest & 3) != 0)
{
    if (((int)dest & 1) != 0)
    {
        *(dest + i) = *(src + i);
        i += 1;
        if (((int)dest & 2) != 0)
            goto IntAligned;
    }
    *(short*)(dest + i) = *(short*)(src + i);
    i += 2;
}

IntAligned:

#if BIT64
// On 64-bit IntPtr.Size == 8, so we want to advance to the next 8-aligned address. If
// (int)dest % 8 is 0, 5, 6, or 7, we will already have advanced by 0, 3, 2, or 1
// bytes to the next aligned address (respectively), so do nothing. On the other hand,
// if it is 1, 2, 3, or 4 we will want to copy-and-advance another 4 bytes until
// we're aligned.
// The thing 1, 2, 3, and 4 have in common that the others don't is that if you
// subtract one from them, their 3rd lsb will not be set. Hence, the below check.
if ((((int)dest - 1) & 4) == 0)
{
    *(int*)(dest + i) = *(int*)(src + i);
    i += 4;
}
#endif // BIT64

...at least this was something, and a reasonably motivated approach, since misaligned store penalties are typically higher than misaligned read penalties.

But in this commit there's neither. Did the experiments in #9786 empirically determine that physically aligning these operations is always a waste of time? For severe misalignment between source and target, it would be possible to write a split-access loop, albeit a complex one, that marshals quad-aligned reads to quad-aligned writes. Was this type of thing evaluated before the decision to abandon all efforts at alignment?

Finally, note that by insisting on alignment to the destination address exclusively, the alignment code (removed in this commit, see above) may have been ruining a naturally ideal source alignment, with the overall result of making things worse. This could have unfairly penalized test results for the existing code, when it would be a trivial matter to only do that adjustment when src/dest are jointly aligned to some appropriate level, e.g., by wrapping the above code with:

if ((((int)src ^ (int)dest) & 0x7) == 0)        // (or some other alignment value)
{
    // ...

As far as I can tell, there has never been code in Buffer.cs that considers the source and destination physical alignment jointly, in order to then proceed sensibly on the basis of three broadly defined cases:

  • Input src/dst physical addresses already jointly aligned (or len less than the maximum alignment size);
  • Joint alignment can be improved (to some degree) prior to starting the gang operation (code removed in c6372c5);
  • Perverse src/dst alignment detected; proceed as-is or, if worthwhile, use a split-access loop to fully remediate.
@glenn-slayden

glenn-slayden commented May 4, 2018

Commit c6372c5 changes the detection of overlap between the source and destination in Buffer.Memmove:

if ((nuint)dest - (nuint)src < len) goto PInvoke;

...becomes...

if (((nuint)dest - (nuint)src < len) || ((nuint)src - (nuint)dest < len)) goto PInvoke;

I must admit I puzzled over the original code, since the unsigned subtraction and compare "incorrectly" fail to identify source/destination overlap in the case where the destination address precedes the source address in memory. Indeed, c6372c5 clarifies this and would also seem to fix a bug, but how did the original work anyway?

Then I realized, of course, that the answer is that the original code stealthily incorporates a strong assumption about, and dependency upon, the copy direction of the managed Buffer.Memmove implementation that follows it.

The unfortunate upshot is that, while c6372c5 certainly makes the code clearer and more readable, it now excludes all overlap cases and thus no longer allows the performance benefit of fully managed operation (i.e., non-P/Invoke) in the 50% of cases where the overlap happens to be compatible with the copy direction implemented in Buffer.cs. I blame the original code for not documenting its entirely non-obvious design and operation.

Anyway, can we revisit this code to restore the optimization that was lost here? Namely, once again allow the managed code path to prevail (still subject to overall size considerations, etc.) instead of P/Invoke for the half of the detected overlaps that comport with the copy direction hard-coded into Buffer.Memmove?
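
To make the direction dependence concrete, here is a small standalone sketch. It is not code from the PR; for simplicity it uses ulong where the 64-bit build's internal nuint alias would be, and the addresses and length are made up for illustration.

// Compile with /unsafe. Shows why the old one-sided check only diverts
// "dest above src" overlaps to PInvoke: when dest is below src, the unsigned
// difference wraps to a huge value, the check passes, and the managed forward
// (low-to-high) copy runs, which is still correct for that overlap direction.
using System;

unsafe class OverlapCheckDemo
{
    static void Main()
    {
        byte* src = (byte*)0x1000;
        ulong len = 0x100;

        byte* destAbove = src + 0x40;   // overlapping, dest at a higher address than src
        byte* destBelow = src - 0x40;   // overlapping, dest at a lower address than src

        // Old check: only the "dest above src" overlap is sent to PInvoke.
        Console.WriteLine((ulong)destAbove - (ulong)src < len);   // True  -> PInvoke
        Console.WriteLine((ulong)destBelow - (ulong)src < len);   // False -> managed forward copy

        // New check in c6372c5: overlap in either direction is sent to PInvoke.
        Console.WriteLine(((ulong)destBelow - (ulong)src < len) ||
                          ((ulong)src - (ulong)destBelow < len)); // True  -> PInvoke
    }
}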
