KB4487017 wreaking havoc on CoreCLR #12038

Closed
redknightlois opened this issue Feb 14, 2019 · 20 comments
Labels
tracking-external-issue The issue is caused by external problem (e.g. OS) - nothing we can do to fix it directly
Comments

@redknightlois

Long story short. We got some isolated reports 2 days ago of our product not starting with an error like this:

image

Essentially the server started, and died. No information on logs or anything... on the event viewer this was found:

Application: Raven.Server.exe
CoreCLR Version: 4.6.27129.4
Description: The process was terminated due to an internal error in the .NET Runtime at IP 00007FF8E8693B8D (00007FF8E84F0000) with exit code c0000005.
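
For readers triaging the same event log entry, here is a minimal sketch of how the description line can be decoded. It assumes the value in parentheses is the load address of the module containing the faulting instruction (so the difference is an offset into that module); the helper name `decode_crash` is hypothetical. The exit code c0000005 is the NTSTATUS value for an access violation.

```python
import re

# NTSTATUS codes worth naming; 0xC0000005 is STATUS_ACCESS_VIOLATION.
STATUS_NAMES = {0xC0000005: "STATUS_ACCESS_VIOLATION"}

# Matches the tail of the event log description quoted above.
PATTERN = re.compile(
    r"at IP (?P<ip>[0-9A-Fa-f]+) \((?P<base>[0-9A-Fa-f]+)\) "
    r"with exit code (?P<code>[0-9A-Fa-f]+)"
)


def decode_crash(description):
    """Hypothetical helper: extract the module-relative offset and status name."""
    match = PATTERN.search(description)
    if match is None:
        raise ValueError("unrecognized crash description")
    ip = int(match["ip"], 16)
    base = int(match["base"], 16)  # assumed to be the module load address
    code = int(match["code"], 16)
    return {
        "offset": ip - base,
        "exit_code": code,
        "status": STATUS_NAMES.get(code, "unknown"),
    }


description = (
    "The process was terminated due to an internal error in the .NET Runtime "
    "at IP 00007FF8E8693B8D (00007FF8E84F0000) with exit code c0000005."
)
info = decode_crash(description)
print(hex(info["offset"]), info["status"])  # 0x1a3b8d STATUS_ACCESS_VIOLATION
```

The offset is only useful together with matching symbols for the exact runtime build, so treat it as a starting point for a debugger session rather than a diagnosis.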

At first we thought it was our fault, but then overnight one of our environments started failing. The cause was KB4487017, which was killing the process with an access violation. Some of our devs uninstalled it and we could run normally again. But lightning struck twice: after reinstalling it to double-check, we got this:

image

This issue was then labelled critical on our side, to the point that we are issuing a notice to all our clients to delay security patching until we can figure out the issue.

Our whole team has been investigating the issue for the last couple of hours. We got the following information:

  • Error disappears after uninstalling KB4487017, therefore both are linked
  • Different CoreCLR versions are being affected.
  • We caught the error under the debugger in random locations, in both managed and unmanaged calls.
  • Apparently we are not the only ones suffering from it https://www.reddit.com/r/Warframe/comments/aqj7n9/crash_to_desktop_on_login_pc/
  • Only version 1803 is affected. The fall 2018 update (version 1809) doesn't include this security patch (it got a different one) and works fine.
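
The findings above suggest a simple check for whether a machine is exposed. The following is a rough sketch only: the function names are hypothetical, and collecting the Windows release ID and the installed update list (for example from PowerShell's `Get-HotFix`) is left to the caller.

```python
# Values taken from the findings above.
AFFECTED_RELEASE = "1803"
PROBLEM_KB = "KB4487017"


def normalize_kb(kb_id):
    """Normalize spellings such as 'kb 4487017' or '4487017' to 'KB4487017'."""
    compact = kb_id.upper().replace(" ", "")
    return compact if compact.startswith("KB") else "KB" + compact


def is_affected(release_id, installed_kbs):
    """Hypothetical check: True when the release and the update both match."""
    kbs = {normalize_kb(kb) for kb in installed_kbs}
    return release_id == AFFECTED_RELEASE and PROBLEM_KB in kbs


print(is_affected("1803", ["KB 4487017"]))  # True
print(is_affected("1809", ["KB4487017"]))   # False
```

This only encodes what has been observed so far; it says nothing about other Windows versions or later updates.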

Will update this post with any new information we are able to uncover.

@redknightlois
Author

Not directly related but could be a lead...

One of our customers had a similar issue discovered yesterday when an update released the same day (not sure which though) removed System.Threading.Tasks.Extensions v4.1.0.0 from the machine and it was an indirect dependency of something he was using.

It just crashed the process and didn't cause a blue screen.

He had to create a binding redirect to a newer version to get round it.

Not the same issue but possibly related and so may help point you in the right direction.

@karelz
Member

karelz commented Feb 14, 2019

@redknightlois sorry for the trouble.
Is it a .NET Core or a .NET Framework problem?
Will you be able to share some dumps (privately) for investigation if we need them?

@redknightlois
Author

redknightlois commented Feb 14, 2019

@karelz .NET Core, and yes. I am on it :)

@karelz
Member

karelz commented Feb 14, 2019

Which version of .NET Core? 2.1 or 2.2?
To submit (private) dumps, etc., please use https://developercommunity.visualstudio.com (when needed); it lets you upload data for MS eyes only. Just give us the report link once you have created it and uploaded some.

@redknightlois
Author

redknightlois commented Feb 14, 2019

We made it fail on both. 2.2.2 and 2.2.1 for sure. I am checking if we tried on 2.1 (because I don't remember what RDB 4.0 is running on). It is fair to say though, that it fails on all our CoreCLR versions in use.

EDIT: Confirmed with the team that it fails on 2.1.6, 2.1.7 and 2.1.8.

@redknightlois
Author

redknightlois commented Feb 14, 2019

I activated gflags.exe with silent monitoring. This is the dump. It is an empty instance (no user data).
Mini Dump: RDB.Empty.zip
Heap Dump: Raven.Server.exe-(PID-35924)-150431375.zip
Heap Dump with tiered compilation disabled: Raven.Server.exe-(PID-20984)-755937.zip

@AndyAyersMS
Member

In the full dump, thread 32 is hitting some kind of fatal error during jitting.

00 00007ffc`9d07b153 : coreclr!EEPolicy::HandleFatalError+0x7a [e:\a\_work\62\s\src\vm\eepolicy.cpp @ 1522] 
01 00007ffd`3fc5f7dd : coreclr!ProcessCLRException+0x1081c3 [e:\a\_work\62\s\src\vm\exceptionhandling.cpp @ 1029] 
02 00007ffd`3fbcd856 : ntdll!RtlpExecuteHandlerForException+0xd [minkernel\ntos\rtl\amd64\xcptmisc.asm @ 131] 
03 00007ffd`3fbcbe9a : ntdll!RtlDispatchException+0x3c6 [minkernel\ntos\rtl\amd64\exdsptch.c @ 569] 
04 00007ffd`3c21a388 : ntdll!RtlRaiseException+0x31a [minkernel\ntos\rtl\amd64\raise.c @ 178] 
05 00007ffc`9cfe44e1 : KERNELBASE!RaiseException+0x68 [minkernel\kernelbase\xcpt.c @ 922] 
06 00007ffd`3fc5ed63 : coreclr!__CxxCallCatchBlock+0x151 [f:\dd\vctools\crt\vcruntime\src\eh\frame.cpp @ 1186] 
07 00007ffc`9cf154c6 : ntdll!RcFrameConsolidation+0x3 [minkernel\ntos\rtl\amd64\capture.asm @ 653] 
08 00007ffc`9d04cc7f : coreclr!MethodDesc::JitCompileCodeLocked+0x212 [e:\a\_work\62\s\src\vm\prestub.cpp @ 841] 

@redknightlois
Author

OK, an update so far: we built a version of the executable with PrefetchVirtualMemory disabled and it doesn't crash. At least we are onto something.
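
For anyone wanting to run a similar experiment in their own code, here is a minimal sketch of gating a `PrefetchVirtualMemory` call behind a switch. This is not how our build does it; the environment variable `APP_DISABLE_PREFETCH` and the function names are made up for illustration. The Win32 signature used is `BOOL PrefetchVirtualMemory(HANDLE, ULONG_PTR, PWIN32_MEMORY_RANGE_ENTRY, ULONG)` from kernel32, and on non-Windows platforms the sketch does nothing.

```python
import ctypes
import os
import sys


class Win32MemoryRangeEntry(ctypes.Structure):
    """Mirrors WIN32_MEMORY_RANGE_ENTRY: a start address and a byte count."""
    _fields_ = [
        ("VirtualAddress", ctypes.c_void_p),
        ("NumberOfBytes", ctypes.c_size_t),
    ]


def prefetch_enabled(env=os.environ):
    """Hypothetical switch: APP_DISABLE_PREFETCH=1 turns prefetching off."""
    return env.get("APP_DISABLE_PREFETCH", "0") != "1"


def prefetch(address, length, env=os.environ):
    """Ask Windows to prefetch one range; returns False when skipped or failed."""
    if sys.platform != "win32" or not prefetch_enabled(env):
        return False
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    kernel32.GetCurrentProcess.restype = ctypes.c_void_p
    fn = kernel32.PrefetchVirtualMemory
    fn.argtypes = [
        ctypes.c_void_p,                        # hProcess
        ctypes.c_size_t,                        # NumberOfEntries (ULONG_PTR)
        ctypes.POINTER(Win32MemoryRangeEntry),  # VirtualAddresses
        ctypes.c_ulong,                         # Flags, must be 0
    ]
    fn.restype = ctypes.c_int
    entry = Win32MemoryRangeEntry(address, length)
    return bool(fn(kernel32.GetCurrentProcess(), 1, ctypes.byref(entry), 0))


# With the switch set, the call is skipped on every platform.
print(prefetch(0, 0, {"APP_DISABLE_PREFETCH": "1"}))  # False
```

A runtime switch like this makes it easy to compare behaviour with and without the call on the same build, which is useful for narrowing down a crash but is not by itself evidence of a safe workaround.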

@redknightlois
Author

redknightlois commented Feb 14, 2019

Repro steps:

  • Windows 10 Version 1803
  • Install KB 4487017
  • Download latest stable 1.4.1 from https://ravendb.net/download
  • Execute run.ps1 or Raven.Server.exe

@leculver
Contributor

@redknightlois Thank you! We really appreciate the detailed bug report. As a result of your last post, my teammate Chris Ahna has successfully reproduced this issue locally. We are working as quickly as possible to figure out the root cause of this issue. We currently suspect something has gone wrong in the Windows memory manager (but that's just a best-guess right now).

I will post updates to this thread as I have them. It sounds like you are unblocked but please let me know if there's anything we can do to help lower the impact for you and your customers as we chase this issue down.

@redknightlois
Author

redknightlois commented Feb 15, 2019

@leculver we will issue a hotfix for those who have the issue, at the expense of performance. I would say that not pushing KB4487017 to Windows Update until it is fixed would be a good idea :) ... we got the problem on one of our machines (which was good, as we couldn't reproduce it before) because Azure force-updated the VM.

@redknightlois
Author

@leculver after careful consideration, we are not going to issue the workaround. If the error (as far as we know right now) is deep in the memory manager, there is no guarantee that the workaround works rather than just making the crash less likely, or that it doesn't break other memory guarantees required to ensure data consistency and safety. For now, our recommendation to pull the plug on the security patch until the real impact is clearer is the safe course of action.

@PureKrome
Contributor

@redknightlois Just to clarify. Does:

our recommendation to pull the plug on the security patch

mean: uninstall KB4487017?

@arekpalinski

You can either uninstall KB4487017 or upgrade Windows to version 1809 (October 2018 Update)

@redknightlois
Author

@PureKrome yes, though my meaning there was that the KB should be retired from compulsory Windows Update installation altogether.

@ayende
Contributor

ayende commented Feb 20, 2019

Any news on this?

@ayende
Contributor

ayende commented Feb 25, 2019

Current status: the KB was re-installed silently and I ended up with
image

@redknightlois
Author

Bump?

@redknightlois
Author

redknightlois commented Mar 24, 2019

Another update, another 3 AM call with a production server going down for 2 hours because Azure decided it was a good idea to push a KB to a server on a Sunday. Not fun. Any idea what the OS and Azure guys are doing with this? A client opened a ticket with Azure and has had no response about the issue in 3 weeks.

@karelz
Member

karelz commented Mar 26, 2019

This seems to be addressed in the 3B OS patch, KB 4489868: https://support.microsoft.com/en-us/help/4489868/windows-10-update-kb4489868

@redknightlois confirmed the problem does not reproduce on Windows Update 1809 (it was reproducing on 1803).

Closing as addressed.

@karelz karelz closed this as completed Mar 26, 2019
@msftgits msftgits transferred this issue from dotnet/coreclr Jan 31, 2020
@msftgits msftgits added this to the 3.0 milestone Jan 31, 2020
@dotnet dotnet locked as resolved and limited conversation to collaborators Dec 14, 2020