Tail Call bug in RyuJIT - Incorrect parameters passed #4391

NickCraver · 2015-07-27T13:42:21Z

The following critical bug is known and fixed internally (not released) by Microsoft (per private discussions while security was evaluated). In this bug "we" refers to Marc Gravell (@mgravell) and I who tracked down the issue on the Stack Overflow side on the fence. I am posting it here for public visibility and to help users hitting the same issue. I have also created a blog post addressing discovery and severity here: Why you should wait on upgrading to .Net 4.6.

There is an issue in the .Net 4.6 RTM RyuJIT implementation that incorrectly hooks up parameters during some tail call optimizations. The net result of this is that tail call methods can get the wrong parameters passed. Here's our code exhibiting the issue:

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, val, null, absolute, sliding, priority, null);
}

The RawSet() is the last in the chain, and only when optimizations are enabled, is incorrectly optimized. If for example we call Set<T> with a durationSecs of 3600, we would expect that value to go all the way down the method chain. This doesn't happen. Instead, the value passed to the tail method (RawSet) is seemingly random (our assumption is it pulls some other value from the stack). For example, it may be 30, or 57, or null.

The net result for us is that items are getting cached for drastically shorter durations, or not at all (null often gets passed). On Stack Overflow this causes unpredictable local HTTP cache usage and more hits and heavier load to anything populating it. This is a production blocker on deployment for us, and (we believe) should be for anyone else.

We have collapsed the reproduction of this bug into a project you can run locally, available on GitHub. Here's an example test run (see the repo README for a full description):

I can't stress how serious of a bug this is, due to the subtleness of the occurrence, the "only production" likelihood given the RELEASE-only nature, and the scary examples we can easily come up with. When the parameters you're passing aren't the ones the method is getting, all sanity goes out the window.

What if your method says how much stock to buy? What if it gives dosing information to a patient? What if it tells a plane what altitude to climb to?

This bug is critical, I am posting the issue for several reasons:

We're not jerks
To raise awareness
To prevent as many people as possible from deploying affected code
To usher a fix and deployment from Microsoft as soon as possible

If you have already deployed .Net 4.6, our recommendation at this point is to disable RyuJIT immediately.

The text was updated successfully, but these errors were encountered:

vcsjones · 2015-07-27T14:17:30Z

I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6.

richlander · 2015-07-27T14:21:09Z

Thanks @NickCraver for raising visibility.

I'll be talking with the team later today.

@vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload?

/cc @CarolEidt, @cmckinsey

YuvalItzchakov · 2015-07-27T14:22:29Z

Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down.

NickCraver · 2015-07-27T14:26:17Z

@richlander not to be difficult, but given the nature of the bug...how would you (as a developer) know? This is such a subtle production-only bug for most cases I think most wouldn't find it, even in testing.

@YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package.

vcsjones · 2015-07-27T14:29:59Z

@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on.

mgravell · 2015-07-27T14:31:43Z

@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profilier. So: the issue didn't actually happen in mini-profiler, but at the same time: I can't say that mini-profiler isn't affected. Just: don't focus on the mention of mini-profiler.

vcsjones · 2015-07-27T14:35:23Z

@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications and it has severe consequences.

mgravell · 2015-07-27T14:37:07Z

Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving different in subtle, hard to spot, yet dangerous (whether to business well-being, or to personal safety) ways... Plus of course the VS2015 release prompting server-side installs.

masonwheeler · 2015-07-27T14:37:35Z

How likely is this to affect current work in progress to get RyuJit working for the 32- and 64-bit versions of ARM? Is this bug AMD64 specific or platform agnostic?

richlander · 2015-07-27T14:46:09Z

The team is taking this very seriously. We're going to talk about it later today as folks get into the office.

Definitely appreciate all the attention on the issue.

ChrisMcKee · 2015-07-27T15:01:40Z

@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find

vcsjones · 2015-07-27T15:03:50Z

@ChrisMcKee Yes this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569

JamesThorpe · 2015-07-27T15:07:18Z

@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even things targetting .NET 2, but running under the 4.x CLR?

masonwheeler · 2015-07-27T15:10:34Z

Wait, this will glitch out on 4.5 as well?

NickCraver · 2015-07-27T15:12:24Z

@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively.

ChrisMcKee · 2015-07-27T15:45:55Z

@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail.

NotMyself · 2015-07-27T15:58:25Z

Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now.

latkin · 2015-07-27T16:04:29Z

Is this the same tailcall bug as raised here? dotnet/fsharp#536

AndreyAkinshin · 2015-07-27T16:06:31Z

I have found some troubles with asm code.
Part 1.

The call of OnLogDuration from SetWithPriority:

mov         r8,5BFFFCC3E8h
mov         r8,qword ptr [r8]        ; "LocalCache.SetWithPriority"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+70h]  ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that durationSecs stores at the address [rsp+70h]. So far, everything looks fine.

Part 2.

The call of OnLogDuration from RawSet:

mov         r8,5BFFFCC3F0h  
mov         r8,qword ptr [r8]        ; "RawSet"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+0C8h] ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that RyuJIT tries to read durationSecs from [rsp+0C8h]. But:

RSP                  = 0x0000005787AAD6C0
durationSecs address = 0x0000005787AAD790
durationSecs offset  = 0x00000000000000D0

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

mgravell · 2015-07-27T16:08:15Z

@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values.

mgravell · 2015-07-27T16:09:35Z

@AndreyAkinshin nice bit of demonstration, thanks

mmitche · 2015-07-27T16:23:48Z

@latkin I'll try and confirm whether it is the same or not.

viboux · 2015-07-27T16:24:21Z

Which parameter is causing the trouble? The "object val" or the "int? durationSecs"?

The int? needs some boxing/unboxing above the real 32 bit memory integer. What is the size of the boxing object (with the hasValue)?

AndreyAkinshin · 2015-07-27T16:26:53Z

@vbouret, the parameter is int? durationSecs. It stores on the stack in unboxed form, the size is 8 byte.

mmitche · 2015-07-27T16:27:26Z

@masonwheeler You should only see this on the clr running in 64-bit mode.

masonwheeler · 2015-07-27T17:14:05Z

@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.)

mmitche · 2015-07-27T17:24:53Z

@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR.

masonwheeler · 2015-07-27T17:33:43Z

@mgravell

I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks to me like it could be.

mgravell · 2015-07-27T18:02:19Z

@masonwheeler fair enough; that's a good point

On Mon, 27 Jul 2015 18:34 masonwheeler notifications@github.com wrote:

@mgravell https://github.com/mgravell

I don't think so; my reasoning: the bug you link to threw a fatal
execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways
that data corruption could lead to an exception being raised. For example,
if one of the affected arguments was a delegate, you could easily see
something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks
to me like it could be.

—
Reply to this email directly or view it on GitHub
https://github.com/dotnet/coreclr/issues/1296#issuecomment-125281869.

masonwheeler · 2015-07-28T11:40:41Z

It was already plenty small. Are we playing Bug Golf? :P

AndreyAkinshin · 2015-07-28T12:11:14Z

@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug.

mgravell · 2015-07-28T12:31:49Z

Nice; we did try a console repro, but it didn't show initially, so we stuck
with IIS to make sure we had something concrete to show. The fact that it
can be repro'd so minimally is fun.

On 28 July 2015 at 13:11, Andrey Akinshin notifications@github.com wrote:

@masonwheeler https://github.com/masonwheeler, in such issues, it is
very important to produce a minimal reproducible example, because every
excess instruction complicates the analysis of the bug.

—
Reply to this email directly or view it on GitHub
https://github.com/dotnet/coreclr/issues/1296#issuecomment-125578663.

Regards,

Marc

trayburn · 2015-07-28T13:00:56Z

For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

ayende · 2015-07-28T13:03:43Z

There is no way for the process to be just randomly reading some other
process memory.
You need ReadMemory() calls to do that, and those don't just happen, even
with this issue.

*Hibernating Rhinos Ltd *

Oren Eini* l CEO l *Mobile: + 972-52-548-6969

Office: +972-4-622-7811 *l *Fax: +972-153-4-622-7811

On Tue, Jul 28, 2015 at 4:00 PM, Tim Rayburn notifications@github.com
wrote:

For those who have been deep into the internals for this bug, can anyone
confirm that the random memory address that is being read to get the
"incorrect" value is 100% Absolutely Without A Doubt coming from within
the memory space of the .NET Process? If not, as horrible as this is
already, this becomes much worse as an angle of attack to read and/or write
random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could
introduce instability into systems running 4.6, and break the process
isolation of most shared hosting environments.

—
Reply to this email directly or view it on GitHub
https://github.com/dotnet/coreclr/issues/1296#issuecomment-125597778.

JamesThorpe · 2015-07-28T13:05:49Z

@trayburn Nick addressed this point on his blog:

The exact line we got from Microsoft (when discussing security) was: "The only data that could be erroneously passed would be locals from the current frame". We can't confirm this without seeing the fix applied (not in the public repo yet), but we have no reason to believe otherwise. That being said, we couldn't relate which locals we were seeing as values here (they may not be your locals, for example).

Sebazzz · 2015-07-28T13:24:26Z

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

It will not break process isolation as Ayende said, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other since an app domain in an application pool cannot provide isolation for this, I think.

trayburn · 2015-07-28T13:25:48Z

@JamesThorpe I agree I've seen the comment where he passed that along, but given the delving done by @AndreyAkinshin above, his last line:

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

We see durationSecs being read, via offset, and being undershot from the correct value (0C8h < 0D0h) but if that could ever be an overshoot instead things get much worse, hence I think it worth asking.

Also given the utterly minimal repo we now have from @angelsl and @AndreyAkinshin I'm challenged to think where the value 32763 could possibly be a local, unless we're reading the middle of a pointer, which if we can assign means we could break memory boundaries by simply returning to the original object/pointer.

YuvalItzchakov · 2015-07-28T14:40:22Z

@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.

Wikipedia:

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls).

masonwheeler · 2015-07-28T14:40:30Z

@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though.

jamesmanning · 2015-07-28T14:46:05Z

@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄

Sorry for the noise!

AndreyAkinshin · 2015-07-28T17:37:06Z

@mgravell, @angelsl, @trayburn, I have new interesting results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Hack hack = new Hack();
        hack.NullableInt = durationSecs;
        Console.WriteLine("{0:X}", hack.Long);
    }

    [StructLayout(LayoutKind.Explicit)]
    struct Hack
    {
        [FieldOffset(0)]
        public int? NullableInt;
        [FieldOffset(0)]
        public long Long;
    }
}

Expected value is 7FFFFFFF00000001, but with RyuJIT I have 7FFBAC784303 (you can have another value). We need to go deeper.

Part 1

Position: before the call of SetWithPriority. The address of the next line after the call line is 7FFBAC784303 (return address).

Part 2

Position: the last line of SetWithPriority, jmp instruction. On the picture you can see the correct value of durationSecs: 7FFFFFFF00000001. Ok, let's press F11.

Part 3

Position: the first line of RawSet. As we can see, the jmp instruction puts the return address (7FFBAC784303) before the value of durationSecs. As I wrote earlier, the durationSecs offset error equals to 8 bytes. Thus, RyuJIT uses the return address instead of durationSecs. My Hack structure helps me to get the value of the return address and print it.

mmitche · 2015-07-28T23:29:05Z

Hey all. @richlander has posted about this on the .NET blog: http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx

brandonagr · 2015-07-29T00:13:01Z

@mmitche Am I correct in thinking that the fix for dotnet/coreclr#1299 is a different issue than the fix here? Just adding in the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project.

Eyas · 2015-07-29T00:31:29Z

Yup @schellap indicates its a bug in copy propagation-- a completely different compiler optimization. I don't think there's a public fix for that yet.

forki · 2015-07-29T06:25:45Z

Very cool. Can we expect similar blog posts for dotnet/coreclr#1299 and dotnet/fsharp#536?

RichardD2 · 2015-07-29T15:30:57Z

Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!

An unhandled exception occurred and the process was terminated.

Exception: System.Threading.ThreadAbortException
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Followed by:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18847, time stamp: 0x554d7df1
Exception code: 0xe0434352
Fault offset: 0x000000000000b3dd
Faulting process id: 0xe58
Faulting application start time: 0x01d0ca0eb6823e3e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll

And:

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.18847
P6: 554d7df1
P7: e0434352
P8: 000000000000b3dd
P9: 
P10:

And finally:

An unhandled win32 exception occurred in w3wp.exe [3672]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.

Check the documentation index for 'Just-in-time debugging, errors' for more information.

After that error was logged five times, the AppPool was disabled, and the site returned a 503 error.

After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal.

This was on Windows Server 2008 R2 SP1 with all updates installed.

mmitche · 2015-07-29T15:47:25Z

@RichardD2 This is interesting. Can you file a bug on Connect: http://connect.microsoft.com/VisualStudio

RichardD2 · 2015-07-29T15:56:25Z

@mmitche Done: https://connect.microsoft.com/VisualStudio/feedback/details/1605438/disabling-ryujit-breaks-asp-net

mmitche · 2015-07-29T16:04:01Z

@RichardD2 Thanks!

Rutix · 2015-07-29T21:13:19Z

@mmitche Now that this is merged when can we expect the hotfix/patch to be rolled out?

mmitche · 2015-07-29T21:31:46Z

@Rutix. We are still working that out. Watch the blog post for details :) http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx

clivetong · 2015-07-31T14:57:56Z

@mmitche I put in a repro for what looks like this 64bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514
Is it possible to check that this is fixed by this change?

forki · 2015-07-31T15:07:29Z

@clivetong your link is 404 for me

clivetong · 2015-07-31T15:32:17Z

@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other options about visibility and other issues I have submitted as public are visible to others, so I guess it's game over.

forki · 2015-07-31T15:39:11Z

so you say this bug was known before release?!

leecow · 2015-07-31T17:17:10Z

@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread.

dotnet-bot assigned mmitche Jul 27, 2015

mmitche closed this as completed in dotnet/coreclr#1298 Jul 29, 2015

msftgits transferred this issue from dotnet/coreclr Jan 30, 2020

ghost locked as resolved and limited conversation to collaborators Jan 5, 2021

Tail Call bug in RyuJIT - Incorrect parameters passed #4391

Tail Call bug in RyuJIT - Incorrect parameters passed #4391

Comments

NickCraver commented Jul 27, 2015

vcsjones commented Jul 27, 2015

richlander commented Jul 27, 2015

YuvalItzchakov commented Jul 27, 2015

NickCraver commented Jul 27, 2015

vcsjones commented Jul 27, 2015

mgravell commented Jul 27, 2015

vcsjones commented Jul 27, 2015

mgravell commented Jul 27, 2015

masonwheeler commented Jul 27, 2015

richlander commented Jul 27, 2015

ChrisMcKee commented Jul 27, 2015

vcsjones commented Jul 27, 2015

JamesThorpe commented Jul 27, 2015

masonwheeler commented Jul 27, 2015

NickCraver commented Jul 27, 2015

ChrisMcKee commented Jul 27, 2015

NotMyself commented Jul 27, 2015

latkin commented Jul 27, 2015

AndreyAkinshin commented Jul 27, 2015

mgravell commented Jul 27, 2015

mgravell commented Jul 27, 2015

mmitche commented Jul 27, 2015

viboux commented Jul 27, 2015

AndreyAkinshin commented Jul 27, 2015

mmitche commented Jul 27, 2015

masonwheeler commented Jul 27, 2015

mmitche commented Jul 27, 2015

masonwheeler commented Jul 27, 2015

mgravell commented Jul 27, 2015

masonwheeler commented Jul 28, 2015

AndreyAkinshin commented Jul 28, 2015

mgravell commented Jul 28, 2015

trayburn commented Jul 28, 2015

ayende commented Jul 28, 2015

JamesThorpe commented Jul 28, 2015

Sebazzz commented Jul 28, 2015

trayburn commented Jul 28, 2015

YuvalItzchakov commented Jul 28, 2015

masonwheeler commented Jul 28, 2015

jamesmanning commented Jul 28, 2015

AndreyAkinshin commented Jul 28, 2015

Part 1

Part 2

Part 3

mmitche commented Jul 28, 2015

brandonagr commented Jul 29, 2015

Eyas commented Jul 29, 2015

forki commented Jul 29, 2015

RichardD2 commented Jul 29, 2015

mmitche commented Jul 29, 2015

RichardD2 commented Jul 29, 2015

mmitche commented Jul 29, 2015

Rutix commented Jul 29, 2015

mmitche commented Jul 29, 2015

clivetong commented Jul 31, 2015

forki commented Jul 31, 2015

clivetong commented Jul 31, 2015

forki commented Jul 31, 2015

leecow commented Jul 31, 2015