Tail Call bug in RyuJIT - Incorrect parameters passed #4391
Comments
I understand from various sources that given the normal patch workflow, the next hotfix window is September. For the appropriate people at Microsoft, please consider an out-of-band emergency fix for this. This issue is a show-stopper for us in moving to .NET 4.6. |
Thanks @NickCraver for raising visibility. I'll be talking with the team later today. @vcsjones - For clarity, is this a show-stopper because you are concerned or because you have a repro w/your workload? /cc @CarolEidt, @cmckinsey |
Has anyone been able to understand the root cause of this? Narrowing the problem to certain use-cases can help "calm" people down. |
@richlander not to be difficult, but given the nature of the bug...how would you (as a developer) know? This is such a subtle production-only bug for most cases I think most wouldn't find it, even in testing. @YuvalItzchakov the issue here is it can affect any library as well, not just your code. Even if you had specific patterns, I'm not sure that would help any here. In our case it was a RELEASE library distributed by our internal NuGet server...so really like any other NuGet package. |
@richlander @NickCraver pointed out that this was originally discovered while investigating an issue in MiniProfiler, which we actively use. We are re-opening a few incidents we had on an application to determine if this is the cause, or if it is related. At the time we did not have a mini-dump available to investigate, but that is the next thing we are working on. |
@vcsjones the "mini profiler" thing here was just what was visible; the actual glitch manifested in a different library that was providing a custom storage provider for mini profiler. So: the issue didn't actually happen in mini-profiler, but at the same time: I can't say that mini-profiler isn't affected. Just: don't focus on the mention of mini-profiler. |
@mgravell Agree. My point being, this was seen in a real library. It isn't just some mocked together code that uses tricks to make the issue manifest. It doesn't take special work to hit it. Thus it is very likely this affects other real-world applications and it has severe consequences. |
Yes; absolutely; that's what scares us. Especially if Windows 10 deploys .NET 4.6... Wednesday could see random eager people installing, and potentially have their desktop business apps start behaving differently in subtle, hard to spot, yet dangerous (whether to business well-being, or to personal safety) ways... Plus of course the VS2015 release prompting server-side installs. |
How likely is this to affect current work in progress to get RyuJit working for the 32- and 64-bit versions of ARM? Is this bug AMD64 specific or platform agnostic? |
The team is taking this very seriously. We're going to talk about it later today as folks get into the office. Definitely appreciate all the attention on the issue. |
@mgravell would the release have any effect if the apps are targeting < 4.6 though? Probably too late to slipstream RyuJIT as disabled considering the mode of distribution. 👍 Good find |
@ChrisMcKee Yes this affects all applications that target 4.5.x (possibly earlier?) through 4.6. Just having RyuJIT installed is enough to introduce the problem. http://nickcraver.com/blog/2015/07/27/why-you-should-wait-on-dotnet-46/#comment-2159337569 |
@vcsjones My understanding is that the same JIT compiler would be used in any 4.x environment, possibly even things targeting .NET 2, but running under the 4.x CLR? |
Wait, this will glitch out on 4.5 as well? |
@masonwheeler Correct, if .Net 4.6 is installed the JIT appears to be used for both. Our repro case targets .Net 4.5 and .Net 4.5.2 in the CorruptionRepro and StackRedis projects respectively. |
@NickCraver even bigger 'good catch'. I was sat whistling, ah well they'll fix 4.6. Nobody expects the regression fail. |
Thanks for the heads up @NickCraver, was going to push to get 4.6 on our build server this week. Going to hold back till a patch is issued now. |
Is this the same tailcall bug as raised here? dotnet/fsharp#536 |
@latkin I don't think so; my reasoning: the bug you link to threw a fatal execution exception. It was broken in a way that was detectable. Contrast: silently substituting parameter values. |
@AndreyAkinshin nice bit of demonstration, thanks |
@latkin I'll try and confirm whether it is the same or not. |
Which parameter is causing the trouble? The "object val" or the "int? durationSecs"? The int? needs some boxing/unboxing above the real 32 bit memory integer. What is the size of the boxing object (with the hasValue)? |
@vbouret, the parameter is |
@masonwheeler You should only see this on the clr running in 64-bit mode. |
@mmitche That's good to know. I know that if you want to load a 32-bit native DLL into the process, it will force everything into 32-bit mode, but what if you're calling an out-of-process COM server that's implemented in a 32-bit DLL? (Yes, this question is based on a very specific real-world use case.) |
@masonwheeler I don't believe that would force it into 32-bit mode. However, IIRC, targeting your app to x86 instead of AnyCPU should cause it to run on the 32-bit CLR. |
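Since the comments above narrow the problem to the CLR running in 64-bit mode, a quick way to see which mode a given process ended up in is to ask the runtime. This is only a minimal sketch: the class name `JitBitness` is made up for illustration, and it reports process bitness only, not which JIT is actually active.

```csharp
using System;

// Hypothetical helper (not from this thread): reports whether the current
// process is running on the 64-bit CLR, which is where the bug was reported.
static class JitBitness
{
    // True when the current process is a 64-bit process.
    public static bool IsSixtyFourBitClr()
    {
        return Environment.Is64BitProcess;
    }

    // Human-readable summary of process and OS bitness.
    public static string Describe()
    {
        return string.Format(
            "64-bit process: {0}, pointer size: {1} bytes, 64-bit OS: {2}",
            Environment.Is64BitProcess,
            IntPtr.Size,
            Environment.Is64BitOperatingSystem);
    }
}
```

Calling `Console.WriteLine(JitBitness.Describe());` at startup shows whether an AnyCPU build landed in a 64-bit process on that machine; an x86 build should report a 4-byte pointer size.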
Are you sure? It's not difficult to imagine any number of different ways that data corruption could lead to an exception being raised. For example, if one of the affected arguments was a delegate, you could easily see something like the scenario described in that bug report. Not saying it is, or even probably is the same bug; just that it looks to me like it could be. |
@masonwheeler fair enough; that's a good point |
It was already plenty small. Are we playing Bug Golf? :P |
@masonwheeler, in such issues, it is very important to produce a minimal reproducible example, because every excess instruction complicates the analysis of the bug. |
Nice; we did try a console repro, but it didn't show initially, so we stuck
Regards, Marc |
For those who have been deep into the internals for this bug, can anyone confirm that the random memory address that is being read to get the "incorrect" value is 100% Absolutely Without A Doubt coming from within the memory space of the .NET Process? If not, as horrible as this is already, this becomes much worse as an angle of attack to read and/or write random memory locations on boxes running .NET Framework 4.6. Not being limited to the memory space of the process means this could introduce instability into systems running 4.6, and break the process isolation of most shared hosting environments. |
There is no way for the process to be just randomly reading some other
Oren Eini, CEO, Hibernating Rhinos Ltd |
@trayburn Nick addressed this point on his blog:
|
It will not break process isolation as Ayende said, because reading from another process like that is not allowed in Windows. However, applications in the same application pool can affect each other since an app domain in an application pool cannot provide isolation for this, I think. |
@JamesThorpe I agree I've seen the comment where he passed that along, but given the delving done by @AndreyAkinshin above, his last line:
We see Also given the utterly minimal repo we now have from @angelsl and @AndreyAkinshin I'm challenged to think where the value |
@jamesmanning You're talking about tail-recursion, which is a special case of tail-call.
|
@jamesmanning: Tail recursion is a specific case of tail-call optimization where the tail call is to the same function. It's not the only use for tail-call optimization, though. |
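To make the distinction concrete, here is a small sketch with both shapes side by side. The names (`TailCallShapes`, `SumTo`, `Describe`, `Format`) are invented for illustration, and whether a given JIT actually turns either call into a jump is up to the JIT; the code only shows where the calls sit.

```csharp
using System;

// Illustrative only: shows the two call shapes discussed above.
static class TailCallShapes
{
    // Tail recursion: the last thing the method does is call itself.
    public static long SumTo(int n, long acc)
    {
        if (n == 0) return acc;
        return SumTo(n - 1, acc + n); // tail call to the same method
    }

    // General tail call: the last thing the method does is call a
    // different method and return its result unchanged.
    public static string Describe(int? durationSecs)
    {
        return Format(durationSecs); // tail call, but not recursion
    }

    static string Format(int? durationSecs)
    {
        return durationSecs.HasValue ? durationSecs.Value + "s" : "no expiry";
    }
}
```

The second shape is the one relevant to this issue: a method that simply forwards its arguments to another method as its final action.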
@YuvalItzchakov thanks - I had never done TCO outside of the context of recursive calls, but now it makes sense. 😄 Sorry for the noise! |
@mgravell, @angelsl, @trayburn, I have interesting new results: the bug allows us to get the return address. Let's discuss it in detail. My new code:

    using System;
    using System.Runtime.InteropServices;

    class Program
    {
        static void Main()
        {
            new Program().SetWithPriority<string>(null, null, int.MaxValue, true);
        }

        void SetWithPriority<U>(string key, U val, int? durationSecs, bool isSliding)
        {
            key = key; // self-assignment is part of the repro
            RawSet(key, val, durationSecs, isSliding); // call in tail position
        }

        void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
        {
            // Print the raw bytes of the received int? as a 64-bit hex value.
            Hack hack = new Hack();
            hack.NullableInt = durationSecs;
            Console.WriteLine("{0:X}", hack.Long);
        }

        // Overlays an int? and a long at the same offset so the nullable's
        // raw memory can be read back as a long.
        [StructLayout(LayoutKind.Explicit)]
        struct Hack
        {
            [FieldOffset(0)]
            public int? NullableInt;
            [FieldOffset(0)]
            public long Long;
        }
    }

Expected value is
Part 1
Part 2
Part 3
|
Hey all. @richlander has posted about this on the .NET blog: http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx |
@mmitche Am I correct in thinking that the fix for dotnet/coreclr#1299 is a different issue than the fix here? Just adding in the TailCallOpt registry key does not seem to fix 1299 for me locally, but it did fix the regression example you committed earlier for this issue when I ran both locally in the same project. |
Yup, @schellap indicates it's a bug in copy propagation, a completely different compiler optimization. I don't think there's a public fix for that yet. |
Very cool. Can we expect similar blog posts for dotnet/coreclr#1299 and dotnet/fsharp#536? |
Disabling RyuJIT caused our 64-bit ASP.NET 4.5 AppPool to die!
Followed by:
And:
And finally:
After that error was logged five times, the AppPool was disabled, and the site returned a 503 error. After removing the "EnableLegacyJIT" value and restarting the AppPool, the site seems to be back to normal. This was on Windows Server 2008 R2 SP1 with all updates installed. |
@RichardD2 This is interesting. Can you file a bug on Connect: http://connect.microsoft.com/VisualStudio |
@RichardD2 Thanks! |
@mmitche Now that this is merged when can we expect the hotfix/patch to be rolled out? |
@Rutix. We are still working that out. Watch the blog post for details :) http://blogs.msdn.com/b/dotnet/archive/2015/07/28/ryujit-bug-advisory-in-the-net-framework-4-6.aspx |
@mmitche I put in a repro for what looks like this 64bit RyuJIT bug some time ago - https://connect.microsoft.com/VisualStudio/feedback/details/1372514 |
@clivetong your link is 404 for me |
@forki Thanks for the notification. I originally submitted it as private, but then changed it to public several weeks ago. I can't find any other options about visibility and other issues I have submitted as public are visible to others, so I guess it's game over. |
so you say this bug was known before release?! |
@clivetong - The issue you reported through Connect is the same as what we're talking about in this thread. |
The following critical bug is known and fixed internally (not released) by Microsoft (per private discussions while security was evaluated). In this bug report, "we" refers to Marc Gravell (@mgravell) and me, who tracked down the issue on the Stack Overflow side of the fence. I am posting it here for public visibility and to help users hitting the same issue. I have also created a blog post addressing discovery and severity here: Why you should wait on upgrading to .Net 4.6.
There is an issue in the .Net 4.6 RTM RyuJIT implementation that incorrectly hooks up parameters during some tail call optimizations. The net result of this is that tail call methods can get the wrong parameters passed. Here's our code exhibiting the issue:
The RawSet() is the last in the chain, and only when optimizations are enabled, is incorrectly optimized. If for example we call Set<T> with a durationSecs of 3600, we would expect that value to go all the way down the method chain. This doesn't happen. Instead, the value passed to the tail method (RawSet) is seemingly random (our assumption is it pulls some other value from the stack). For example, it may be 30, or 57, or null.

The net result for us is that items are getting cached for drastically shorter durations, or not at all (null often gets passed). On Stack Overflow this causes unpredictable local HTTP cache usage and more hits and heavier load to anything populating it. This is a production blocker on deployment for us, and (we believe) should be for anyone else.

We have collapsed the reproduction of this bug into a project you can run locally, available on GitHub. Here's an example test run (see the repo README for a full description):
I can't stress enough how serious a bug this is, due to the subtleness of the occurrence, the "only production" likelihood given the RELEASE-only nature, and the scary examples we can easily come up with. When the parameters you're passing aren't the ones the method is getting, all sanity goes out the window.
What if your method says how much stock to buy? What if it gives dosing information to a patient? What if it tells a plane what altitude to climb to?
This bug is critical. I am posting the issue for several reasons:
If you have already deployed .Net 4.6, our recommendation at this point is to disable RyuJIT immediately.
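For readers who want a feel for the failure mode described in this report, the sketch below forwards a `durationSecs` value down a short method chain and checks that the last method received what was sent. It is not the code from the Stack Overflow repro project: `CacheSketch`, `Received`, and `RoundTrips` are hypothetical names, the method shapes are borrowed from the signatures quoted in the comments, and this simplified chain is not guaranteed to trigger the bug even on an affected runtime.

```csharp
using System;

// Hypothetical harness (not the original repro): forwards a value through
// a chain of calls in tail position and records what arrives at the end.
class CacheSketch
{
    // What the last method in the chain actually received.
    public int? Received;

    public void Set<T>(string key, T val, int? durationSecs, bool isSliding)
    {
        SetWithPriority<T>(key, val, durationSecs, isSliding);
    }

    void SetWithPriority<T>(string key, T val, int? durationSecs, bool isSliding)
    {
        RawSet(key, val, durationSecs, isSliding);
    }

    void RawSet(string cacheKey, object val, int? durationSecs, bool isSliding)
    {
        Received = durationSecs;
    }

    // Returns true when the value reached RawSet unchanged.
    public static bool RoundTrips(int? sent)
    {
        var cache = new CacheSketch();
        cache.Set<string>("key", "value", sent, false);
        return cache.Received == sent;
    }
}
```

On a correctly behaving runtime, `CacheSketch.RoundTrips(3600)` returns true. A check of this kind only means something when run from an optimized (RELEASE) build in a 64-bit process, since that is where the report says the problem appears.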