New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tail Call bug in RyuJIT - Incorrect parameters passed #1296

Closed
NickCraver opened this Issue Jul 27, 2015 · 63 comments

Comments

Projects
None yet
@NickCraver
Member

NickCraver commented Jul 27, 2015

The following critical bug is known and fixed internally (not released) by Microsoft (per private discussions while security was evaluated). In this bug "we" refers to Marc Gravell (@mgravell) and I who tracked down the issue on the Stack Overflow side on the fence. I am posting it here for public visibility and to help users hitting the same issue. I have also created a blog post addressing discovery and severity here: Why you should wait on upgrading to .Net 4.6.

There is an issue in the .Net 4.6 RTM RyuJIT implementation that incorrectly hooks up parameters during some tail call optimizations. The net result of this is that tail call methods can get the wrong parameters passed. Here's our code exhibiting the issue:

void Set<T>(string key, T val, int? durationSecs, bool sliding, bool broadcastRemoveFromCache = false)
{
    SetWithPriority<T>(key, val, durationSecs, sliding, CacheItemPriority.Default);
}

void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    key = KeyInContext(key);

    RawSet(key, val, durationSecs, isSliding, priority);
}

void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding, CacheItemPriority priority)
{
    var absolute = !isSliding && durationSecs.HasValue 
                   ? DateTime.UtcNow.AddSeconds(durationSecs.Value) 
                   : Cache.NoAbsoluteExpiration;
    var sliding = isSliding && durationSecs.HasValue 
                  ? TimeSpan.FromSeconds(durationSecs.Value) 
                  : Cache.NoSlidingExpiration;

    HttpRuntime.Cache.Insert(cacheKey, val, null, absolute, sliding, priority, null);
}

The RawSet() is the last in the chain, and only when optimizations are enabled, is incorrectly optimized. If for example we call Set<T> with a durationSecs of 3600, we would expect that value to go all the way down the method chain. This doesn't happen. Instead, the value passed to the tail method (RawSet) is seemingly random (our assumption is it pulls some other value from the stack). For example, it may be 30, or 57, or null.

The net result for us is that items are getting cached for drastically shorter durations, or not at all (null often gets passed). On Stack Overflow this causes unpredictable local HTTP cache usage and more hits and heavier load to anything populating it. This is a production blocker on deployment for us, and (we believe) should be for anyone else.

We have collapsed the reproduction of this bug into a project you can run locally, available on GitHub. Here's an example test run (see the repo README for a full description):

Results Window

I can't stress how serious of a bug this is, due to the subtleness of the occurrence, the "only production" likelihood given the RELEASE-only nature, and the scary examples we can easily come up with. When the parameters you're passing aren't the ones the method is getting, all sanity goes out the window.

What if your method says how much stock to buy? What if it gives dosing information to a patient? What if it tells a plane what altitude to climb to?

This bug is critical, I am posting the issue for several reasons:

  • We're not jerks
  • To raise awareness
  • To prevent as many people as possible from deploying affected code
  • To usher a fix and deployment from Microsoft as soon as possible

If you have already deployed .Net 4.6, our recommendation at this point is to disable RyuJIT immediately.

@vcsjones

This comment has been minimized.

Show comment
Hide comment
@vcsjones

vcsjones Jul 27, 2015

Contributor

I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6.

Contributor

vcsjones commented Jul 27, 2015

I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6.

@richlander

This comment has been minimized.

Show comment
Hide comment
@richlander

richlander Jul 27, 2015

Member

Thanks @NickCraver for raising visibility.

I'll be talking with the team later today.

@vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload?

/cc @CarolEidt, @cmckinsey

Member

richlander commented Jul 27, 2015

Thanks @NickCraver for raising visibility.

I'll be talking with the team later today.

@vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload?

/cc @CarolEidt, @cmckinsey

@YuvalItzchakov

This comment has been minimized.

Show comment
Hide comment
@YuvalItzchakov

YuvalItzchakov Jul 27, 2015

Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down.

YuvalItzchakov commented Jul 27, 2015

Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down.

@NickCraver

This comment has been minimized.

Show comment
Hide comment
@NickCraver

NickCraver Jul 27, 2015

Member

@richlander not to be difficult, but given the nature of the bug...how would you (as a developer) know? This is such a subtle production-only bug for most cases I think most wouldn't find it, even in testing.

@YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package.

Member

NickCraver commented Jul 27, 2015

@richlander not to be difficult, but given the nature of the bug...how would you (as a developer) know? This is such a subtle production-only bug for most cases I think most wouldn't find it, even in testing.

@YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package.

@vcsjones

This comment has been minimized.

Show comment
Hide comment
@vcsjones

vcsjones Jul 27, 2015

Contributor

@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on.

Contributor

vcsjones commented Jul 27, 2015

@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 27, 2015

@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profilier. So: the issue didn't actually happen in mini-profiler, but at the same time: I can't say that mini-profiler isn't affected. Just: don't focus on the mention of mini-profiler.

mgravell commented Jul 27, 2015

@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profilier. So: the issue didn't actually happen in mini-profiler, but at the same time: I can't say that mini-profiler isn't affected. Just: don't focus on the mention of mini-profiler.

@vcsjones

This comment has been minimized.

Show comment
Hide comment
@vcsjones

vcsjones Jul 27, 2015

Contributor

@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications and it has severe consequences.

Contributor

vcsjones commented Jul 27, 2015

@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications and it has severe consequences.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 27, 2015

Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving different in subtle, hard to spot, yet dangerous (whether to business well-being, or to personal safety) ways... Plus of course the VS2015 release prompting server-side installs.

mgravell commented Jul 27, 2015

Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving different in subtle, hard to spot, yet dangerous (whether to business well-being, or to personal safety) ways... Plus of course the VS2015 release prompting server-side installs.

@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 27, 2015

How likely is this to affect current work in progress to get RyuJit working for the 32- and 64-bit versions of ARM? Is this bug AMD64 specific or platform agnostic?

masonwheeler commented Jul 27, 2015

How likely is this to affect current work in progress to get RyuJit working for the 32- and 64-bit versions of ARM? Is this bug AMD64 specific or platform agnostic?

@richlander

This comment has been minimized.

Show comment
Hide comment
@richlander

richlander Jul 27, 2015

Member

The team is taking this very seriously. We're going to talk about it later today as folks get into the office.

Definitely appreciate all the attention on the issue.

Member

richlander commented Jul 27, 2015

The team is taking this very seriously. We're going to talk about it later today as folks get into the office.

Definitely appreciate all the attention on the issue.

@ChrisMcKee

This comment has been minimized.

Show comment
Hide comment
@ChrisMcKee

ChrisMcKee Jul 27, 2015

@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find

ChrisMcKee commented Jul 27, 2015

@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find

@vcsjones

This comment has been minimized.

Show comment
Hide comment
@vcsjones

vcsjones Jul 27, 2015

Contributor

@ChrisMcKee Yes this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569

Contributor

vcsjones commented Jul 27, 2015

@ChrisMcKee Yes this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569

@JamesThorpe

This comment has been minimized.

Show comment
Hide comment
@JamesThorpe

JamesThorpe Jul 27, 2015

@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even things targetting .NET 2, but running under the 4.x CLR?

JamesThorpe commented Jul 27, 2015

@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even things targetting .NET 2, but running under the 4.x CLR?

@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 27, 2015

Wait, this will glitch out on 4.5 as well?

masonwheeler commented Jul 27, 2015

Wait, this will glitch out on 4.5 as well?

@NickCraver

This comment has been minimized.

Show comment
Hide comment
@NickCraver

NickCraver Jul 27, 2015

Member

@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively.

Member

NickCraver commented Jul 27, 2015

@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively.

@ChrisMcKee

This comment has been minimized.

Show comment
Hide comment
@ChrisMcKee

ChrisMcKee Jul 27, 2015

@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail.

ChrisMcKee commented Jul 27, 2015

@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail.

@NotMyself

This comment has been minimized.

Show comment
Hide comment
@NotMyself

NotMyself Jul 27, 2015

Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now.

NotMyself commented Jul 27, 2015

Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now.

@latkin

This comment has been minimized.

Show comment
Hide comment
@latkin

latkin Jul 27, 2015

Is this the same tailcall bug as raised here? Microsoft/visualfsharp#536

latkin commented Jul 27, 2015

Is this the same tailcall bug as raised here? Microsoft/visualfsharp#536

@AndreyAkinshin

This comment has been minimized.

Show comment
Hide comment
@AndreyAkinshin

AndreyAkinshin Jul 27, 2015

Member

I have found some troubles with asm code.
Part 1.
screen1
The call of OnLogDuration from SetWithPriority:

mov         r8,5BFFFCC3E8h
mov         r8,qword ptr [r8]        ; "LocalCache.SetWithPriority"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+70h]  ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that durationSecs stores at the address [rsp+70h]. So far, everything looks fine.

Part 2.
screen3
The call of OnLogDuration from RawSet:

mov         r8,5BFFFCC3F0h  
mov         r8,qword ptr [r8]        ; "RawSet"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+0C8h] ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that RyuJIT tries to read durationSecs from [rsp+0C8h]. But:

RSP                  = 0x0000005787AAD6C0
durationSecs address = 0x0000005787AAD790
durationSecs offset  = 0x00000000000000D0

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

Member

AndreyAkinshin commented Jul 27, 2015

I have found some troubles with asm code.
Part 1.
screen1
The call of OnLogDuration from SetWithPriority:

mov         r8,5BFFFCC3E8h
mov         r8,qword ptr [r8]        ; "LocalCache.SetWithPriority"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+70h]  ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that durationSecs stores at the address [rsp+70h]. So far, everything looks fine.

Part 2.
screen3
The call of OnLogDuration from RawSet:

mov         r8,5BFFFCC3F0h  
mov         r8,qword ptr [r8]        ; "RawSet"
mov         rcx,rsi                  ; key
mov         rdx,qword ptr [rsp+0C8h] ; durationSecs
call        00007FFA7292EF80         ; OnLogDuration

We can see that RyuJIT tries to read durationSecs from [rsp+0C8h]. But:

RSP                  = 0x0000005787AAD6C0
durationSecs address = 0x0000005787AAD790
durationSecs offset  = 0x00000000000000D0

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 27, 2015

@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values.

mgravell commented Jul 27, 2015

@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 27, 2015

@AndreyAkinshin nice bit of demonstration, thanks

mgravell commented Jul 27, 2015

@AndreyAkinshin nice bit of demonstration, thanks

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche

mmitche Jul 27, 2015

Member

@latkin I'll try and confirm whether it is the same or not.

Member

mmitche commented Jul 27, 2015

@latkin I'll try and confirm whether it is the same or not.

@vbouret

This comment has been minimized.

Show comment
Hide comment
@vbouret

vbouret Jul 27, 2015

Which parameter is causing the trouble? The "object val" or the "int? durationSecs"?

The int? needs some boxing/unboxing above the real 32 bit memory integer. What is the size of the boxing object (with the hasValue)?

vbouret commented Jul 27, 2015

Which parameter is causing the trouble? The "object val" or the "int? durationSecs"?

The int? needs some boxing/unboxing above the real 32 bit memory integer. What is the size of the boxing object (with the hasValue)?

@AndreyAkinshin

This comment has been minimized.

Show comment
Hide comment
@AndreyAkinshin

AndreyAkinshin Jul 27, 2015

Member

@vbouret, the parameter is int? durationSecs. It stores on the stack in unboxed form, the size is 8 byte.

Member

AndreyAkinshin commented Jul 27, 2015

@vbouret, the parameter is int? durationSecs. It stores on the stack in unboxed form, the size is 8 byte.

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche

mmitche Jul 27, 2015

Member

@masonwheeler You should only see this on the clr running in 64-bit mode.

Member

mmitche commented Jul 27, 2015

@masonwheeler You should only see this on the clr running in 64-bit mode.

@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 27, 2015

@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.)

masonwheeler commented Jul 27, 2015

@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.)

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche

mmitche Jul 27, 2015

Member

@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR.

Member

mmitche commented Jul 27, 2015

@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR.

@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 27, 2015

@mgravell

I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks to me like it could be.

masonwheeler commented Jul 27, 2015

@mgravell

I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks to me like it could be.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 27, 2015

@masonwheeler fair enough; that's a good point

On Mon, 27 Jul 2015 18:34 masonwheeler notifications@github.com wrote:

@mgravell https://github.com/mgravell

I don't think so; my reasoning: the bug you link to threw a fatal
execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways
that data corruption could lead to an exception being raised. For example,
if one of the affected arguments was a delegate, you could easily see
something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks
to me like it could be.


Reply to this email directly or view it on GitHub
#1296 (comment).

mgravell commented Jul 27, 2015

@masonwheeler fair enough; that's a good point

On Mon, 27 Jul 2015 18:34 masonwheeler notifications@github.com wrote:

@mgravell https://github.com/mgravell

I don't think so; my reasoning: the bug you link to threw a fatal
execution exception. It was broken in a way that was detectable.

Are you sure? It's not difficult to imagine any number of different ways
that data corruption could lead to an exception being raised. For example,
if one of the affected arguments was a delegate, you could easily see
something like the scenario described in that bug report.

Not saying it is, or even probably is the same bug; just that it looks
to me like it could be.


Reply to this email directly or view it on GitHub
#1296 (comment).

@brandonagr

This comment has been minimized.

Show comment
Hide comment
@brandonagr

brandonagr Jul 27, 2015

Not sure if this is the same bug, but we also found a scenario where ryujit produces code that returns incorrect results that deals with nullable booleans. See sample code at: http://stackoverflow.com/questions/31570521/ryujit-producing-incorrect-results

brandonagr commented Jul 27, 2015

Not sure if this is the same bug, but we also found a scenario where ryujit produces code that returns incorrect results that deals with nullable booleans. See sample code at: http://stackoverflow.com/questions/31570521/ryujit-producing-incorrect-results

@NickCraver

This comment has been minimized.

Show comment
Hide comment
@NickCraver

NickCraver Jul 27, 2015

Member

@mmitche Awesome to see a PR in today, thanks! Are there any thoughts from the team on a release schedule yet?

Member

NickCraver commented Jul 27, 2015

@mmitche Awesome to see a PR in today, thanks! Are there any thoughts from the team on a release schedule yet?

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche

mmitche Jul 27, 2015

Member

@brandonagr We are taking a look at your repro. /cc @CarolEidt

Member

mmitche commented Jul 27, 2015

@brandonagr We are taking a look at your repro. /cc @CarolEidt

@CarolEidt

This comment has been minimized.

Show comment
Hide comment
@CarolEidt

CarolEidt Jul 27, 2015

Member

@brandonagr I believe we have a fix for this. It is a bug in copy propagation.

Member

CarolEidt commented Jul 27, 2015

@brandonagr I believe we have a fix for this. It is a bug in copy propagation.

@schellap

This comment has been minimized.

Show comment
Hide comment
@schellap

schellap Jul 27, 2015

@brandonagr, I've updated the other issue here:
#1299

schellap commented Jul 27, 2015

@brandonagr, I've updated the other issue here:
#1299

@angelsl

This comment has been minimized.

Show comment
Hide comment
@angelsl

angelsl Jul 28, 2015

A bit late, but I made a very short repro based on the one given above (that doesn't need IIS).

// Compile with
// csc /o /debug- /platform:x64 TailCallBug.cs

using System;

namespace TailCallBug {
    internal class Program {
        private static void Main(string[] args)
        {
            var p = new CacheThing();
            for(int i = 0; i < 5; ++i)
                p.SetWithPriority<string>(null, null, null, true);
        }
    }
    internal class CacheThing {
        private string KeyInContext(string key) {
            return key;
        }

        public void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding) {
            key = KeyInContext(key);

            RawSet(key, val, durationSecs, isSliding);
        }

        private void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding) {
            Console.Write("{0}", durationSecs);
        }
    }
}

The expected output is nothing, but (for me) it outputs 3276332763327633276332763.

angelsl commented Jul 28, 2015

A bit late, but I made a very short repro based on the one given above (that doesn't need IIS).

// Compile with
// csc /o /debug- /platform:x64 TailCallBug.cs

using System;

namespace TailCallBug {
    internal class Program {
        private static void Main(string[] args)
        {
            var p = new CacheThing();
            for(int i = 0; i < 5; ++i)
                p.SetWithPriority<string>(null, null, null, true);
        }
    }
    internal class CacheThing {
        private string KeyInContext(string key) {
            return key;
        }

        public void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding) {
            key = KeyInContext(key);

            RawSet(key, val, durationSecs, isSliding);
        }

        private void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding) {
            Console.Write("{0}", durationSecs);
        }
    }
}

The expected output is nothing, but (for me) it outputs 3276332763327633276332763.

@AndreyAkinshin

This comment has been minimized.

Show comment
Hide comment
@AndreyAkinshin

AndreyAkinshin Jul 28, 2015

Member

@angelsl, great job! I have reduced the size of your repro:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, null, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Console.WriteLine("{0}", durationSecs); // 32763 for RyuJIT
    }
}
Member

AndreyAkinshin commented Jul 28, 2015

@angelsl, great job! I have reduced the size of your repro:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, null, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Console.WriteLine("{0}", durationSecs); // 32763 for RyuJIT
    }
}
@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 28, 2015

It was already plenty small. Are we playing Bug Golf? :P

masonwheeler commented Jul 28, 2015

It was already plenty small. Are we playing Bug Golf? :P

@AndreyAkinshin

This comment has been minimized.

Show comment
Hide comment
@AndreyAkinshin

AndreyAkinshin Jul 28, 2015

Member

@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug.

Member

AndreyAkinshin commented Jul 28, 2015

@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug.

@mgravell

This comment has been minimized.

Show comment
Hide comment
@mgravell

mgravell Jul 28, 2015

Nice; we did try a console repro, but it didn't show initially, so we stuck
with IIS to make sure we had something concrete to show. The fact that it
can be repro'd so minimally is fun.

On 28 July 2015 at 13:11, Andrey Akinshin notifications@github.com wrote:

@masonwheeler https://github.com/masonwheeler, in such issues, it is
very important to produce a minimal reproducible example, because every
excess instruction complicates the analysis of the bug.


Reply to this email directly or view it on GitHub
#1296 (comment).

Regards,

Marc

mgravell commented Jul 28, 2015

Nice; we did try a console repro, but it didn't show initially, so we stuck
with IIS to make sure we had something concrete to show. The fact that it
can be repro'd so minimally is fun.

On 28 July 2015 at 13:11, Andrey Akinshin notifications@github.com wrote:

@masonwheeler https://github.com/masonwheeler, in such issues, it is
very important to produce a minimal reproducible example, because every
excess instruction complicates the analysis of the bug.


Reply to this email directly or view it on GitHub
#1296 (comment).

Regards,

Marc

@trayburn

This comment has been minimized.

Show comment
Hide comment
@trayburn

trayburn Jul 28, 2015

For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

trayburn commented Jul 28, 2015

For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

@ayende

This comment has been minimized.

Show comment
Hide comment
@ayende

ayende Jul 28, 2015

There is no way for the process to be just randomly reading some other
process memory.
You need ReadMemory() calls to do that, and those don't just happen, even
with this issue.

*Hibernating Rhinos Ltd *

Oren Eini* l CEO l *Mobile: + 972-52-548-6969

Office: +972-4-622-7811 *l *Fax: +972-153-4-622-7811

On Tue, Jul 28, 2015 at 4:00 PM, Tim Rayburn notifications@github.com
wrote:

For those who have been deep into the internals for this bug, can anyone
confirm that the random memory address that is being read to get the
"incorrect" value is 100% Absolutely Without A Doubt coming from within
the memory space of the .NET Process? If not, as horrible as this is
already, this becomes much worse as an angle of attack to read and/or write
random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could
introduce instability into systems running 4.6, and break the process
isolation of most shared hosting environments.


Reply to this email directly or view it on GitHub
#1296 (comment).

ayende commented Jul 28, 2015

There is no way for the process to be just randomly reading some other
process memory.
You need ReadMemory() calls to do that, and those don't just happen, even
with this issue.

*Hibernating Rhinos Ltd *

Oren Eini* l CEO l *Mobile: + 972-52-548-6969

Office: +972-4-622-7811 *l *Fax: +972-153-4-622-7811

On Tue, Jul 28, 2015 at 4:00 PM, Tim Rayburn notifications@github.com
wrote:

For those who have been deep into the internals for this bug, can anyone
confirm that the random memory address that is being read to get the
"incorrect" value is 100% Absolutely Without A Doubt coming from within
the memory space of the .NET Process? If not, as horrible as this is
already, this becomes much worse as an angle of attack to read and/or write
random memory locations on boxes running .NET Framework 4.6.

Not being limited to the memory space of the process means this could
introduce instability into systems running 4.6, and break the process
isolation of most shared hosting environments.


Reply to this email directly or view it on GitHub
#1296 (comment).

@JamesThorpe

This comment has been minimized.

Show comment
Hide comment
@JamesThorpe

JamesThorpe Jul 28, 2015

@trayburn Nick addressed this point on his blog:

The exact line we got from Microsoft (when discussing security) was: "The only data that could be erroneously passed would be locals from the current frame". We can't confirm this without seeing the fix applied (not in the public repo yet), but we have no reason to believe otherwise. That being said, we couldn't relate which locals we were seeing as values here (they may not be your locals, for example).

JamesThorpe commented Jul 28, 2015

@trayburn Nick addressed this point on his blog:

The exact line we got from Microsoft (when discussing security) was: "The only data that could be erroneously passed would be locals from the current frame". We can't confirm this without seeing the fix applied (not in the public repo yet), but we have no reason to believe otherwise. That being said, we couldn't relate which locals we were seeing as values here (they may not be your locals, for example).

@Sebazzz

This comment has been minimized.

Show comment
Hide comment
@Sebazzz

Sebazzz Jul 28, 2015

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

It will not break process isolation as Ayende said, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other since an app domain in an application pool cannot provide isolation for this, I think.

Sebazzz commented Jul 28, 2015

Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments.

It will not break process isolation as Ayende said, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other since an app domain in an application pool cannot provide isolation for this, I think.

@trayburn

This comment has been minimized.

Show comment
Hide comment
@trayburn

trayburn Jul 28, 2015

@JamesThorpe I agree I've seen the comment where he passed that along, but given the delving done by @AndreyAkinshin above, his last line:

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

We see durationSecs being read, via offset, and being undershot from the correct value (0C8h < 0D0h) but if that could ever be an overshoot instead things get much worse, hence I think it worth asking.

Also given the utterly minimal repo we now have from @angelsl and @AndreyAkinshin I'm challenged to think where the value 32763 could possibly be a local, unless we're reading the middle of a pointer, which if we can assign means we could break memory boundaries by simply returning to the original object/pointer.

trayburn commented Jul 28, 2015

@JamesThorpe I agree I've seen the comment where he passed that along, but given the delving done by @AndreyAkinshin above, his last line:

As we can see, the calculated durationSecs offset is 0C8h, but the actual offset is 0D0h. In this case, the TCO calculation error is 8 bytes.

We see durationSecs being read, via offset, and being undershot from the correct value (0C8h < 0D0h) but if that could ever be an overshoot instead things get much worse, hence I think it worth asking.

Also given the utterly minimal repo we now have from @angelsl and @AndreyAkinshin I'm challenged to think where the value 32763 could possibly be a local, unless we're reading the middle of a pointer, which if we can assign means we could break memory boundaries by simply returning to the original object/pointer.

@YuvalItzchakov

This comment has been minimized.

Show comment
Hide comment
@YuvalItzchakov

YuvalItzchakov Jul 28, 2015

@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.

Wikipedia:

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls).

YuvalItzchakov commented Jul 28, 2015

@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.

Wikipedia:

Tail calls can be implemented without adding a new stack frame to the call stack. Most of the frame of the current procedure is not needed any more, and it can be replaced by the frame of the tail call, modified as appropriate (similar to overlay for processes, but for function calls).

@masonwheeler

This comment has been minimized.

Show comment
Hide comment
@masonwheeler

masonwheeler Jul 28, 2015

@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though.

masonwheeler commented Jul 28, 2015

@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though.

@jamesmanning

This comment has been minimized.

Show comment
Hide comment
@jamesmanning

jamesmanning Jul 28, 2015

@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄

Sorry for the noise!

jamesmanning commented Jul 28, 2015

@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄

Sorry for the noise!

@AndreyAkinshin

This comment has been minimized.

Show comment
Hide comment
@AndreyAkinshin

AndreyAkinshin Jul 28, 2015

Member

@mgravell, @angelsl, @trayburn, I have new interesting results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Hack hack = new Hack();
        hack.NullableInt = durationSecs;
        Console.WriteLine("{0:X}", hack.Long);
    }

    [StructLayout(LayoutKind.Explicit)]
    struct Hack
    {
        [FieldOffset(0)]
        public int? NullableInt;
        [FieldOffset(0)]
        public long Long;
    }
}

Expected value is 7FFFFFFF00000001, but with RyuJIT I have 7FFBAC784303 (you can have another value). We need to go deeper.

Part 1

screen1
Position: before the call of SetWithPriority. The address of the next line after the call line is 7FFBAC784303 (return address).

Part 2

screen2
Position: the last line of SetWithPriority, jmp instruction. On the picture you can see the correct value of durationSecs: 7FFFFFFF00000001. Ok, let's press F11.

Part 3

screen3
Position: the first line of RawSet. As we can see, the jmp instruction puts the return address (7FFBAC784303) before the value of durationSecs. As I wrote earlier, the durationSecs offset error equals to 8 bytes. Thus, RyuJIT uses the return address instead of durationSecs. My Hack structure helps me to get the value of the return address and print it.

Member

AndreyAkinshin commented Jul 28, 2015

@mgravell, @angelsl, @trayburn, I have new interesting results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

class Program
{
    static void Main()
    {
        new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
    }

    void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
    {
        key = key;
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Hack hack = new Hack();
        hack.NullableInt = durationSecs;
        Console.WriteLine("{0:X}", hack.Long);
    }

    [StructLayout(LayoutKind.Explicit)]
    struct Hack
    {
        [FieldOffset(0)]
        public int? NullableInt;
        [FieldOffset(0)]
        public long Long;
    }
}

Expected value is 7FFFFFFF00000001, but with RyuJIT I have 7FFBAC784303 (you can have another value). We need to go deeper.

Part 1

screen1
Position: before the call of SetWithPriority. The address of the next line after the call line is 7FFBAC784303 (return address).

Part 2

screen2
Position: the last line of SetWithPriority, jmp instruction. On the picture you can see the correct value of durationSecs: 7FFFFFFF00000001. Ok, let's press F11.

Part 3

screen3
Position: the first line of RawSet. As we can see, the jmp instruction puts the return address (7FFBAC784303) before the value of durationSecs. As I wrote earlier, the durationSecs offset error equals to 8 bytes. Thus, RyuJIT uses the return address instead of durationSecs. My Hack structure helps me to get the value of the return address and print it.

@mmitche

This comment has been minimized.

Show comment
Hide comment
@brandonagr

This comment has been minimized.

Show comment
Hide comment
@brandonagr

brandonagr Jul 29, 2015

@mmitche Am I correct in thinking that the fix for #1299 is a different issue than the fix here? Just adding in the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project.

brandonagr commented Jul 29, 2015

@mmitche Am I correct in thinking that the fix for #1299 is a different issue than the fix here? Just adding in the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project.

@Eyas

This comment has been minimized.

Show comment
Hide comment
@Eyas

Eyas Jul 29, 2015

Yup @schellap indicates its a bug in copy propagation-- a completely different compiler optimization. I don't think there's a public fix for that yet.

Eyas commented Jul 29, 2015

Yup @schellap indicates its a bug in copy propagation-- a completely different compiler optimization. I don't think there's a public fix for that yet.

@forki

This comment has been minimized.

Show comment
Hide comment
@forki

forki Jul 29, 2015

Very cool. Can we expect similar blog posts for #1299 and Microsoft/visualfsharp#536?

forki commented Jul 29, 2015

Very cool. Can we expect similar blog posts for #1299 and Microsoft/visualfsharp#536?

@RichardD2

This comment has been minimized.

Show comment
Hide comment
@RichardD2

RichardD2 Jul 29, 2015

Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!

An unhandled exception occurred and the process was terminated.

Exception: System.Threading.ThreadAbortException
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Followed by:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18847, time stamp: 0x554d7df1
Exception code: 0xe0434352
Fault offset: 0x000000000000b3dd
Faulting process id: 0xe58
Faulting application start time: 0x01d0ca0eb6823e3e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll

And:

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.18847
P6: 554d7df1
P7: e0434352
P8: 000000000000b3dd
P9: 
P10: 

And finally:

An unhandled win32 exception occurred in w3wp.exe [3672]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.

Check the documentation index for 'Just-in-time debugging, errors' for more information.

After that error was logged five times, the AppPool was disabled, and the site returned a 503 error.

After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal.

This was on Windows Server 2008 R2 SP1 with all updates installed.

RichardD2 commented Jul 29, 2015

Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!

An unhandled exception occurred and the process was terminated.

Exception: System.Threading.ThreadAbortException
at System.Web.HttpRuntime.ProcessRequestNotificationPrivate(IIS7WorkerRequest wr, HttpContext context)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotificationHelper(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)
   at System.Web.Hosting.PipelineRuntime.ProcessRequestNotification(IntPtr rootedObjectsPointer, IntPtr nativeRequestContext, IntPtr moduleData, Int32 flags)

Followed by:

Faulting application name: w3wp.exe, version: 7.5.7601.17514, time stamp: 0x4ce7afa2
Faulting module name: KERNELBASE.dll, version: 6.1.7601.18847, time stamp: 0x554d7df1
Exception code: 0xe0434352
Fault offset: 0x000000000000b3dd
Faulting process id: 0xe58
Faulting application start time: 0x01d0ca0eb6823e3e
Faulting application path: c:\windows\system32\inetsrv\w3wp.exe
Faulting module path: C:\Windows\system32\KERNELBASE.dll

And:

Problem signature:
P1: w3wp.exe
P2: 7.5.7601.17514
P3: 4ce7afa2
P4: KERNELBASE.dll
P5: 6.1.7601.18847
P6: 554d7df1
P7: e0434352
P8: 000000000000b3dd
P9: 
P10: 

And finally:

An unhandled win32 exception occurred in w3wp.exe [3672]. Just-In-Time debugging this exception failed with the following error: Debugger could not be started because no user is logged on.

Check the documentation index for 'Just-in-time debugging, errors' for more information.

After that error was logged five times, the AppPool was disabled, and the site returned a 503 error.

After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal.

This was on Windows Server 2008 R2 SP1 with all updates installed.

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche

mmitche Jul 29, 2015

Member

@RichardD2 This is interesting. Can you file a bug on Connect: http://connect.microsoft.com/VisualStudio

Member

mmitche commented Jul 29, 2015

@RichardD2 This is interesting. Can you file a bug on Connect: http://connect.microsoft.com/VisualStudio

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche
Member

mmitche commented Jul 29, 2015

@RichardD2 Thanks!

@Rutix

This comment has been minimized.

Show comment
Hide comment
@Rutix

Rutix Jul 29, 2015

@mmitche Now that this is merged when can we expect the hotfix/patch to be rolled out?

Rutix commented Jul 29, 2015

@mmitche Now that this is merged when can we expect the hotfix/patch to be rolled out?

@mmitche

This comment has been minimized.

Show comment
Hide comment
@mmitche
Member

mmitche commented Jul 29, 2015

@Rutix. We are still working that out. Watch the blog post for details :) http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx

@clivetong

This comment has been minimized.

Show comment
Hide comment
@clivetong

clivetong Jul 31, 2015

@mmitche I put in a repro for what looks like this 64bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514
Is it possible to check that this is fixed by this change?

clivetong commented Jul 31, 2015

@mmitche I put in a repro for what looks like this 64bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514
Is it possible to check that this is fixed by this change?

@forki

This comment has been minimized.

Show comment
Hide comment
@forki

forki Jul 31, 2015

@clivetong your link is 404 for me

forki commented Jul 31, 2015

@clivetong your link is 404 for me

@clivetong

This comment has been minimized.

Show comment
Hide comment
@clivetong

clivetong Jul 31, 2015

@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other options about visibility and other issues I have submitted as public are visible to others, so I guess it's game over.
connect

clivetong commented Jul 31, 2015

@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other options about visibility and other issues I have submitted as public are visible to others, so I guess it's game over.
connect

@forki

This comment has been minimized.

Show comment
Hide comment
@forki

forki Jul 31, 2015

so you say this bug was known before release?!

forki commented Jul 31, 2015

so you say this bug was known before release?!

@leecow

This comment has been minimized.

Show comment
Hide comment
@leecow

leecow Jul 31, 2015

Member

@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread.

Member

leecow commented Jul 31, 2015

@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment