Tail Call bug in RyuJIT - Incorrect parameters passed #1296

Closed
NickCraver opened this Issue Jul 27, 2015 · 63 comments

Projects

None yet
@NickCraver

The following critical bug is known and fixed internally (not released) by Microsoft (per private discussions while security was evaluated). In this bug "we" refers to Marc Gravell (@mgravell) and I who tracked down the issue on the Stack Overflow side on the fence. I am posting it here for public visibility and to help users hitting the same issue. I have also created a blog post addressing discovery and severity here: Why you should wait on upgrading to .Net 4.6.

There is an issue in the .Net 4.6 RTM RyuJIT implementation that incorrectly hooks up parameters during some tail call optimizations. The net result of this is that tail call methods can get the wrong parameters passed. Here's our code exhibiting the issue:

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, val, null, absolute, sliding, priority, null);
}

The RawSet() is the last in the chain, and only when optimizations are enabled, is incorrectly optimized. If for example we call Set<T> with a durationSecs of 3600, we would expect that value to go all the way down the method chain. This doesn't happen. Instead, the value passed to the tail method (RawSet) is seemingly random (our assumption is it pulls some other value from the stack). For example, it may be 30, or 57, or null.

The net result for us is that items are getting cached for drastically shorter durations, or not at all (null often gets passed). On Stack Overflow this causes unpredictable local HTTP cache usage and more hits and heavier load to anything populating it. This is a production blocker on deployment for us, and (we believe) should be for anyone else.

We have collapsed the reproduction of this bug into a project you can run locally, available on GitHub. Here's an example test run (see the repo README for a full description):

Results Window

I can't stress how serious of a bug this is, due to the subtleness of the occurrence, the "only production" likelihood given the RELEASE-only nature, and the scary examples we can easily come up with. When the parameters you're passing aren't the ones the method is getting, all sanity goes out the window.

What if your method says how much stock to buy? What if it gives dosing information to a patient? What if it tells a plane what altitude to climb to?

This bug is critical, I am posting the issue for several reasons:

  • We're not jerks
  • To raise awareness
  • To prevent as many people as possible from deploying affected code
  • To usher a fix and deployment from Microsoft as soon as possible

If you have already deployed .Net 4.6, our recommendation at this point is to disable RyuJIT immediately.

@vcsjones
Contributor

I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6.

@richlander
Member

Thanks @NickCraver for raising visibility.

I'll be talking with the team later today.

@vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload?

/cc @CarolEidt, @cmckinsey

@YuvalItzchakov

Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down.

@NickCraver

@richlander not to be difficult, but given the nature of the bug...how would you (as a developer) know? This is such a subtle production-only bug for most cases I think most wouldn't find it, even in testing.

@YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package.

@vcsjones
Contributor

@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on.

@mgravell

@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profilier. So: the issue didn't actually happen in mini-profiler, but at the same time: I can't say that mini-profiler isn't affected. Just: don't focus on the mention of mini-profiler.

@vcsjones
Contributor

@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications and it has severe consequences.

@mgravell

Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving different in subtle, hard to spot, yet dangerous (whether to business well-being, or to personal safety) ways... Plus of course the VS2015 release prompting server-side installs.

@masonwheeler

How likely is this to affect current work in progress to get RyuJit working for the 32- and 64-bit versions of ARM? Is this bug AMD64 specific or platform agnostic?

@richlander
Member

The team is taking this very seriously. We're going to talk about it later today as folks get into the office.

Definitely appreciate all the attention on the issue.

@ChrisMcKee

@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find

@vcsjones
Contributor

@ChrisMcKee Yes this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569

@JamesThorpe

@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even things targetting .NET 2, but running under the 4.x CLR?

@masonwheeler

Wait, this will glitch out on 4.5 as well?

@NickCraver

@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively.

@ChrisMcKee

@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail.

@NotMyself

Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now.

@latkin
Member
latkin commented Jul 27, 2015

Is this the same tailcall bug as raised here? Microsoft/visualfsharp#536

@AndreyAkinshin
Contributor

I have found some troubles with asm code.
Part 1.
screen1
The call of OnLogDuration from SetWithPriority:

mov         r8,5BFFFCC3E8h
mov         r8,qword ptr [r8]        ; "LocalCache.SetWithPriority"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+70h]  ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that durationSecs stores at the address [rsp+70h]. So far, everything looks fine.

Part 2.
screen3
The call of OnLogDuration from RawSet:

mov         r8,5BFFFCC3F0h  
mov         r8,qword ptr [r8]        ; "RawSet"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+0C8h] ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that RyuJIT tries to read durationSecs from [rsp+0C8h]. But:

RSP                  = 0x0000005787AAD6C0
durationSecs address = 0x0000005787AAD790
durationSecs offset  = 0x00000000000000D0

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

@mgravell

@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values.

@mgravell

@AndreyAkinshin nice bit of demonstration, thanks

@mmitche
Member
mmitche commented Jul 27, 2015

@latkin I'll try and confirm whether it is the same or not.

@vbouret
vbouret commented Jul 27, 2015

Which parameter is causing the trouble? The "object val" or the "int? durationSecs"?

The int? needs some boxing/unboxing above the real 32 bit memory integer. What is the size of the boxing object (with the hasValue)?

@AndreyAkinshin
Contributor

@vbouret, the parameter is int? durationSecs. It stores on the stack in unboxed form, the size is 8 byte.

@mmitche
Member
mmitche commented Jul 27, 2015

@masonwheeler You should only see this on the clr running in 64-bit mode.

@masonwheeler

@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.)

@mmitche
Member
mmitche commented Jul 27, 2015

@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR.

@masonwheeler

@mgravell

I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks to me like it could be.

@mgravell

@masonwheeler fair enough; that's a good point

On Mon, 27 Jul 2015 18:34 masonwheeler notifications@github.com wrote:

@mgravell https://github.com/mgravell

I don't think so; my reasoning: the bug you link to threw a fatal
execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways
that data corruption could lead to an exception being raised. For example,
if one of the affected arguments was a delegate, you could easily see
something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks
to me like it could be.


Reply to this email directly or view it on GitHub
#1296 (comment).

@mmitche mmitche was assigned by dotnet-bot Jul 27, 2015
@brandonagr

Not sure if this is the same bug, but we also found a scenario where ryujit produces code that returns incorrect results that deals with nullable booleans. See sample code at: http://stackoverflow.com/questions/31570521/ryujit-producing-incorrect-results

@NickCraver

@mmitche Awesome to see a PR in today, thanks! Are there any thoughts from the team on a release schedule yet?

@mmitche
Member
mmitche commented Jul 27, 2015

@brandonagr We are taking a look at your repro. /cc @CarolEidt

@CarolEidt
Member

@brandonagr I believe we have a fix for this. It is a bug in copy propagation.

@schellap
Contributor

@brandonagr, I've updated the other issue here:
#1299

@angelsl
angelsl commented Jul 28, 2015

A bit late, but I made a very short repro based on the one given above (that doesn't need IIS).

// Compile with
// csc /o /debug- /platform:x64 TailCallBug.cs

using System;

namespace TailCallBug {
    internal class Program {
        private static void Main(string[] args)
        {
            var p = new CacheThing();
            for(int i = 0; i < 5; ++i)
                p.SetWithPriority<string>(null, null, null, true);
        }
    }
    internal class CacheThing {
        private string KeyInContext(string key) {
            return key;
        }

        public void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding) {
            key = KeyInContext(key);

            RawSet(key, val, durationSecs, isSliding);
        }

        private void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding) {
            Console.Write("{0}", durationSecs);
        }
    }
}

The expected output is nothing, but (for me) it outputs 3276332763327633276332763.

@AndreyAkinshin
Contributor

@angelsl, great job! I have reduced the size of your repro:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, null, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Console.WriteLine("{0}", durationSecs); // 32763 for RyuJIT
    }
}
@masonwheeler

It was already plenty small. Are we playing Bug Golf? :P

@AndreyAkinshin
Contributor

@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug.

@mgravell

Nice; we did try a console repro, but it didn't show initially, so we stuck
with IIS to make sure we had something concrete to show. The fact that it
can be repro'd so minimally is fun.

On 28 July 2015 at 13:11, Andrey Akinshin notifications@github.com wrote:

@masonwheeler https://github.com/masonwheeler, in such issues, it is
very important to produce a minimal reproducible example, because every
excess instruction complicates the analysis of the bug.


Reply to this email directly or view it on GitHub
#1296 (comment).

Regards,

Marc

@trayburn

For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

@ayende
ayende commented Jul 28, 2015

There is no way for the process to be just randomly reading some other
process memory.
You need ReadMemory() calls to do that, and those don't just happen, even
with this issue.

*Hibernating Rhinos Ltd *

Oren Eini* l CEO l *Mobile: + 972-52-548-6969

Office: +972-4-622-7811 *l *Fax: +972-153-4-622-7811

On Tue, Jul 28, 2015 at 4:00 PM, Tim Rayburn notifications@github.com
wrote:

For those who have been deep into the internals for this bug, can anyone
confirm that the random memory address that is being read to get the
"incorrect" value is 100% Absolutely Without A Doubt coming from within
the memory space of the .NET Process? If not, as horrible as this is
already, this becomes much worse as an angle of attack to read and/or write
random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could
introduce instability into systems running 4.6, and break the process
isolation of most shared hosting environments.


Reply to this email directly or view it on GitHub
#1296 (comment).

@JamesThorpe

@trayburn Nick addressed this point on his blog:

The exact line we got from Microsoft (when discussing security) was: "The only data that could be erroneously passed would be locals from the current frame". We can't confirm this without seeing the fix applied (not in the public repo yet), but we have no reason to believe otherwise. That being said, we couldn't relate which locals we were seeing as values here (they may not be your locals, for example).

@Sebazzz
Sebazzz commented Jul 28, 2015

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

It will not break process isolation as Ayende said, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other since an app domain in an application pool cannot provide isolation for this, I think.

@trayburn

@JamesThorpe I agree I've seen the comment where he passed that along, but given the delving done by @AndreyAkinshin above, his last line:

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

We see durationSecs being read, via offset, and being undershot from the correct value (0C8h < 0D0h) but if that could ever be an overshoot instead things get much worse, hence I think it worth asking.

Also given the utterly minimal repo we now have from @angelsl and @AndreyAkinshin I'm challenged to think where the value 32763 could possibly be a local, unless we're reading the middle of a pointer, which if we can assign means we could break memory boundaries by simply returning to the original object/pointer.

@YuvalItzchakov

@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.

Wikipedia:

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls).

@masonwheeler

@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though.

@jamesmanning

@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄

Sorry for the noise!

@AndreyAkinshin
Contributor

@mgravell, @angelsl, @trayburn, I have new interesting results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Hack hack = new Hack();
        hack.NullableInt = durationSecs;
        Console.WriteLine("{0:X}", hack.Long);
    }

    [StructLayout(LayoutKind.Explicit)]
    struct Hack
    {
        [FieldOffset(0)]
        public int? NullableInt;
        [FieldOffset(0)]
        public long Long;
    }
}

Expected value is 7FFFFFFF00000001, but with RyuJIT I have 7FFBAC784303 (you can have another value). We need to go deeper.

Part 1

screen1
Position: before the call of SetWithPriority. The address of the next line after the call line is 7FFBAC784303 (return address).

Part 2

screen2
Position: the last line of SetWithPriority, jmp instruction. On the picture you can see the correct value of durationSecs: 7FFFFFFF00000001. Ok, let's press F11.

Part 3

screen3
Position: the first line of RawSet. As we can see, the jmp instruction puts the return address (7FFBAC784303) before the value of durationSecs. As I wrote earlier, the durationSecs offset error equals to 8 bytes. Thus, RyuJIT uses the return address instead of durationSecs. My Hack structure helps me to get the value of the return address and print it.

@brandonagr

@mmitche Am I correct in thinking that the fix for #1299 is a different issue than the fix here? Just adding in the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project.

@Eyas
Eyas commented Jul 29, 2015

Yup @schellap indicates its a bug in copy propagation-- a completely different compiler optimization. I don't think there's a public fix for that yet.

@forki
forki commented Jul 29, 2015

Very cool. Can we expect similar blog posts for #1299 and Microsoft/visualfsharp#536?

@mmitche mmitche closed this in #1298 Jul 29, 2015
@RichardD2

Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!

An unhandled exception occurred and the process was terminated.

Exception: System.Threading.ThreadAbortException
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Followed by:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18847, time stamp: 0x554d7df1
Exception code: 0xe0434352
Fault offset: 0x000000000000b3dd
Faulting process id: 0xe58
Faulting application start time: 0x01d0ca0eb6823e3e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll

And:

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.18847
P6: 554d7df1
P7: e0434352
P8: 000000000000b3dd
P9: 
P10: 

And finally:

An unhandled win32 exception occurred in w3wp.exe [3672]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.

Check the documentation index for 'Just-in-time debugging, errors' for more information.

After that error was logged five times, the AppPool was disabled, and the site returned a 503 error.

After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal.

This was on Windows Server 2008 R2 SP1 with all updates installed.

@mmitche
Member
mmitche commented Jul 29, 2015

@RichardD2 This is interesting. Can you file a bug on Connect: http://connect.microsoft.com/VisualStudio

@mmitche
Member
mmitche commented Jul 29, 2015

@RichardD2 Thanks!

@Rutix
Rutix commented Jul 29, 2015

@mmitche Now that this is merged when can we expect the hotfix/patch to be rolled out?

@mmitche
Member
mmitche commented Jul 29, 2015

@Rutix. We are still working that out. Watch the blog post for details :) http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx

@clivetong

@mmitche I put in a repro for what looks like this 64bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514
Is it possible to check that this is fixed by this change?

@forki
forki commented Jul 31, 2015

@clivetong your link is 404 for me

@clivetong

@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other options about visibility and other issues I have submitted as public are visible to others, so I guess it's game over.
connect

@forki
forki commented Jul 31, 2015

so you say this bug was known before release?!

@leecow
Member
leecow commented Jul 31, 2015

@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment