Flaky tests: ImageList randomly crashes tests and results in AccessViolationException #3358
If you want to get minidumps on crashes, there is a registry setting for that which you could set in a PowerShell script, but it's probably all just AccessViolations after some memory corruption, so the stack traces probably won't be too useful. (If an AccessViolation happens outside managed code it just terminates the process.) I'll update my master branch and see if these are reproducible locally.
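For reference, the registry setting being referred to is most likely the standard Windows Error Reporting LocalDumps mechanism. A minimal C# sketch of enabling it follows; which setting was actually meant is an assumption, and the dump folder path and dump count are arbitrary examples, not what CI uses:

```csharp
using Microsoft.Win32;

// Sketch (assumption): enable WER LocalDumps so crashing test processes
// leave a dump behind. Needs to run elevated; values are illustrative.
class EnableCrashDumps
{
    static void Main()
    {
        using RegistryKey key = Registry.LocalMachine.CreateSubKey(
            @"SOFTWARE\Microsoft\Windows\Windows Error Reporting\LocalDumps");

        key.SetValue("DumpFolder", @"C:\CrashDumps", RegistryValueKind.ExpandString);
        key.SetValue("DumpCount", 10, RegistryValueKind.DWord);
        key.SetValue("DumpType", 2, RegistryValueKind.DWord); // 2 = full dump
    }
}
```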
I'll try to get some time to plug the screenshot capture and memory dump functionality into the CI pipeline.
Not getting any AVs or process crashes locally.
One thing I didn't notice before: only the x86 tests appear to fail, which suggests a bitness-related issue.
Another IOE in Release x64:
https://dev.azure.com/dnceng/public/_build/results?buildId=672867&view=results
I had that same thought: these failures make no sense, which makes memory corruption the most likely cause.
Also a failure in
I can't seem to repro this issue locally... I think it fails here: winforms/src/System.Windows.Forms/src/System/Windows/Forms/ImageList.cs, lines 402 to 415 in d4cfc2d.
If I found the right code for
This is really weird. I know it's not related to your theory here, but could @weltkante's comment here #3324 (comment) cause this? Perhaps there's some odd/incorrect/overlapping memory copying happening?
I really doubt it's a bug in the native ImageList implementation itself. QueryInterface failing would mean the object is corrupted (playing into the memory corruption idea) or already released (which I'd expect to be handled gracefully by returning an error; e.g. the image list handle should also be invalid and should have immediately complained about that).
Sure, you can cause arbitrary damage by bad memory copying. But I don't think that particular linked comment was right; later in the discussion it became clear that the caller made/provided a lot of assumptions that should have made the code safe.
I had some crashes locally, but very rarely (one in several dozen runs), and never under a debugger. I think the best chance is getting some dumps (probably a full dump is needed, not a minidump). I'll look into what options we have; I don't think we can set the usual machine-wide setting to capture dumps on process crash, so to trigger a dump being written we might need a native DLL stub that writes out the dump (SetUnhandledExceptionFilter/MiniDumpWriteDump), or a secondary helper process.
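As a rough illustration of the MiniDumpWriteDump approach mentioned above, here is a hedged C# P/Invoke sketch; a real stub would more likely live in a native DLL hooked up via SetUnhandledExceptionFilter, and the class/method names, dump path handling, and omitted error handling are assumptions for illustration only:

```csharp
using System;
using System.Diagnostics;
using System.IO;
using System.Runtime.InteropServices;

static class CrashDumper
{
    // P/Invoke for dbghelp!MiniDumpWriteDump.
    [DllImport("dbghelp.dll", SetLastError = true)]
    static extern bool MiniDumpWriteDump(
        IntPtr hProcess, uint processId, SafeHandle hFile, uint dumpType,
        IntPtr exceptionParam, IntPtr userStreamParam, IntPtr callbackParam);

    // Sketch: write a full dump of the current process to the given path.
    public static void WriteFullDump(string path)
    {
        using var process = Process.GetCurrentProcess();
        using var file = File.Create(path);

        MiniDumpWriteDump(
            process.Handle,
            (uint)process.Id,
            file.SafeFileHandle,
            dumpType: 0x2, // MiniDumpWithFullMemory
            exceptionParam: IntPtr.Zero,
            userStreamParam: IntPtr.Zero,
            callbackParam: IntPtr.Zero);
    }
}
```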
@MattGal we seem to have "progressed" to level 2 of test nightmares: tests randomly fail with no clear indication of the underlying cause.
Looking at a log for a failed build in #3224 (comment), it appears we may have a minidump (snip):
A question is whether we get a minidump when a test runner crashes.
I thought
We disabled it, but this is a new test being developed that hasn't been skipped. |
Do you have access to it? It's not in the artifact zip you can publicly download (that just has log and binlog files). Also, it's not guaranteed that there will be any relation to this issue; RemoteExecutor failures are likely to be different from this one. On second thought, the dump will probably just say the process is waiting for RemoteExecutor (a separate process), so probably not much use; sorry to get your hopes up.
No, not with the existing tooling; you need to let the OS take a dump or insert your own code via SetUnhandledExceptionFilter. I'm not aware of other notifications that a process crashed; maybe PowerShell has some hook to be notified of crashes in processes it started, but I doubt it. Dumps taken on a timeout are only helpful for hangs.
Turned on the dump collection flags locally and have been repeating test runs for a few hours, and now I got a full crash dump. There is no entry in the log, but the crashing test and its error/stacktrace are:
Now the hard part will be figuring out what could be causing the corruption 😓 I'll be digging around a bit in the dump in the hope the offender isn't already long gone. [edit] Not much I can gather from a sample size of one dump, but there are surprisingly few tests running compared to other dumps I've seen. I don't have time for further test runs today, but I guess I'll keep running and gathering dumps over the next few days; if I can get a few different instances of these failures we may be able to cross-relate which tests are running in each. Still running:
Recently terminated:
Not only is it a possibility, we've already done it for you! In fact, for the run in question I think this is a dump (it doesn't seem to have your specific failure?) where the exception is "The thread tried to read from or write to a virtual address for which it does not have the appropriate access", and from looking at the various threads running in this dump I wonder if it's just that you're stressing Windows with a ton of native P/Invokes that may not be intended to be called concurrently in the same process. Just an idea, it's been a long time since I've been on the Windows team :) I've got two ideas to try to help you. Note I haven't done much more than review your code and builds, open up the dmp file above, and look at the AV in it.
Finally, I can (if you can get a corpnet jump box or use mine) provide you with a machine running our Server 2019 image, though I expect simply using the latest version from the Azure Gallery is likely to fail the same way these build machines do, if that turns out to be the problem.
I'm not sure if this is release-specific - see the crash in https://dev.azure.com/dnceng/public/_build/results?buildId=678166&view=logs&jobId=1f132584-ab08-5fbf-e5d0-5685de454342
This is interesting because these tests are (relatively) new. There may be a chance that bad interop definitions could be causing this. For example, we did see that
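(The specific interop example referenced in the comment above is elided. As a purely hypothetical illustration of how a bad interop definition can corrupt memory, consider a struct declared smaller than its native counterpart; the BadRect type below is deliberately wrong and is not taken from the WinForms code base:)

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical bug: the native RECT is four ints (16 bytes), but this
// struct declares only two, so GetWindowRect writes past the end of the
// buffer the marshaler provides and corrupts adjacent memory.
[StructLayout(LayoutKind.Sequential)]
struct BadRect
{
    public int Left;
    public int Top;
    // Right and Bottom are missing; the native side still writes them.
}

static class InteropBugExample
{
    [DllImport("user32.dll")]
    public static extern bool GetWindowRect(IntPtr hWnd, out BadRect rect);
}
```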
@RussKie might want to learn where to find those ;)
I think that's a misunderstanding; it's just failing very unreliably. It took me hours of local test runs to fail once with dumps enabled, but then it failed twice in a row (second dump here, I didn't get around to uploading it earlier).
Definitely not release-only; I've had it 4 times in debug build test runs so far, two of them with dumps (linked above).
It's easy to misread this, but really there isn't that much going on. There is a lot of congestion at thread startup/shutdown, and there is no throttling, so it looks like a lot of tests are running in parallel, but they really aren't; most are still starting up or winding down. There's certainly potential for optimization, e.g. throttling the number of UI threads running at a time to the number of cores or something, but I'd rather first fix the memory corruption while we can still reproduce it.
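If that kind of throttling were tried, a minimal sketch could gate the test UI threads on a semaphore sized to the core count; the names here (UiTestThrottle, runTestOnUiThread) are illustrative and not part of the actual test infrastructure:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: cap how many UI test threads run concurrently at the core count.
static class UiTestThrottle
{
    private static readonly SemaphoreSlim s_gate = new(Environment.ProcessorCount);

    public static async Task RunThrottledAsync(Func<Task> runTestOnUiThread)
    {
        await s_gate.WaitAsync();
        try
        {
            await runTestOnUiThread();
        }
        finally
        {
            s_gate.Release();
        }
    }
}
```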
There is no specific failure; these are random test failures, most likely memory corruption. In fact, the crashing thread in the dump you linked seems to have its stack corrupted. So far I've seen three dumps (two from me, linked above, and the one you linked). I guess I've been lucky that my first dump had so few active threads; the other two are much more work to look through. The three active tests (WebBrowser, HtmlDocument, ImageList related) are common to all the dumps, so that doesn't help so far. I agree with @hughbe that the WebBrowser/Html code could have picked up a mistake during the interop refactoring, and the changesets are worth looking at again. Unfortunately the WebBrowser engine is one of the largest infrastructures in WinForms (it includes ActiveX hosting), so it might not be easy to spot by review alone. I'll try to build reduced test suites which include just the set of three test classes identified above and see if they corrupt memory when run repeatedly; maybe that can narrow down what we have to look at. Of course, feel free to run your own investigations if you have the time to do so, just wanted to let you know I'm on it.
Since these are builds, you'll have to get Kusto access and use the AzDO API to figure out the "orchestration id" of your job to get this working. It's a bit tricky (it'd be free if your tests ran on Helix test machines). @RussKie let me know if you want me to send some instructions (and whether you have Kusto read access to engsrvprod).
Observed yet another heap corruption linked to |
Yes, we had hopes that #3526 would help, but obviously we were wrong. We'll have to continue digging...
Did that PR help? It seems like there are about five different problems here: web browser, disposal, sharing of image lists, etc.
To me it looks pretty clear. While a lot of secondary issues were found while investigating, there are only two sources of crashes identified:
Everything else has been split off into separate issues and has not been the source of the crashes seen here. Maybe it was not clear that turning on CheckBoxes causes the crashes? It will trigger a double-free of the underlying native ImageList, because there is a small chance that the same numeric handle value is reused again for an independent ImageList. Fixing this requires handling ImageList ownership correctly when turning checkbox mode on: the native code assumes it owns the ImageList and releases it, so WinForms must not retain its copy of the handle. Since you don't control who else is using it, you have to duplicate it before giving it to the ListView. Alternatively, it might be possible to work around this by resetting the ImageList on the native control before turning checkbox mode on.
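A rough sketch of the duplication approach described above (not the actual WinForms fix): hand the native ListView its own copy of the image list via comctl32's ImageList_Duplicate, so the control can destroy its copy without invalidating the handle WinForms keeps. The helper name and the use of the state image list slot are assumptions:

```csharp
using System;
using System.Runtime.InteropServices;

static class ImageListInterop
{
    // Creates an independent copy of an image list; the new handle is
    // owned by whoever receives it and can be destroyed independently.
    [DllImport("comctl32.dll")]
    static extern IntPtr ImageList_Duplicate(IntPtr himl);

    [DllImport("user32.dll")]
    static extern IntPtr SendMessageW(IntPtr hWnd, int msg, IntPtr wParam, IntPtr lParam);

    const int LVM_SETIMAGELIST = 0x1000 + 3; // LVM_FIRST + 3
    const int LVSIL_STATE = 2;               // state image list (used for checkboxes)

    // Sketch: give the native ListView a private copy of a shared image
    // list, so toggling checkbox mode frees only the control's own copy.
    public static void AssignStateImageList(IntPtr listViewHwnd, IntPtr sharedImageList)
    {
        IntPtr privateCopy = ImageList_Duplicate(sharedImageList);
        SendMessageW(listViewHwnd, LVM_SETIMAGELIST, (IntPtr)LVSIL_STATE, privateCopy);
    }
}
```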
I don't believe WebBrowser has anything to do with these failures. I haven't observed it in any of the dumps I have. Here's one of the latest AV crashes:
This crash also results in a popup that hangs builds:
As far as our research went, this is the most likely cause, but so far we have been unable to get to the bottom of the issue...
They are reproducible though, even without any ImageList in play; see my other repro scenario in the repository. It's not always easy to separate dumps caused by memory corruption; in hindsight you may be right, and the CI dumps are probably mostly ImageList handle collisions. The WebBrowser and ImageList tests always happened to run at roughly the same time, and corruption from WebBrowser was much more frequent, at least for me running tests locally, so it reproduced more easily and was discovered first. I don't think there is any point separating them into an issue of their own, considering the WinForms side of the investigation is already over and the workaround is in place, but if you want to track the response of the external investigation so you can reconsider removing the sequential attributes from the tests, I can create a separate issue.
Made a PR which disables ImageList ownership sharing and instead manages ownership explicitly, by creating unique ImageList handles when assigning to the native control. It's your choice whether you want to take it or live with the broken tests for now (until you can fix it); I have no strong opinion, and I don't think this happens much in practice outside the unit tests.
It would be really interesting to see if there's some report of this tucked away on Stack Overflow or in forums somewhere.
For general reference, I stumbled across this post that alleges an ImageList may be double-destroyed if it is still bound to a control.
I wouldn't put too much weight on that; he just saw the same thing I did. In particular, it's no indication of where the bug is; without a repro to follow along you can't tell who is responsible for the off-by-one refcount. The incorrect release could have happened anywhere; it only manifests as a bug when the last reference is released, so it will almost always blame the wrong location in the callstack, as the refcount could have been off by one for a long time.
Revert "Explicit ImageList ownership management. (dotnet#3601)" This reverts commit 03db3fb. Revert "Fix `ListView` no longer displays images (dotnet#4184)" This reverts commit d0608e7. We have observed an instability of tests under stress (and reasonably high degree of concurrency) presumably caused by ImageList lifetime handling (notably dotnet#3358). The changes introduced in dotnet#3601 have helped with tests stability, however resulted in a number of regressions, such as dotnet#4169 and dotnet#4275. Restore the original implementation given it worked reasonably well for the past so many years.
Revert "Explicit ImageList ownership management. (#3601)" This reverts commit 03db3fb. Revert "Fix `ListView` no longer displays images (#4184)" This reverts commit d0608e7. We have observed an instability of tests under stress (and reasonably high degree of concurrency) presumably caused by ImageList lifetime handling (notably #3358). The changes introduced in #3601 have helped with tests stability, however resulted in a number of regressions, such as #4169 and #4275. Restore the original implementation given it worked reasonably well for the past so many years.
Revert "Explicit ImageList ownership management. (dotnet#3601)" This reverts commit 03db3fb. Revert "Fix `ListView` no longer displays images (dotnet#4184)" This reverts commit d0608e7. We have observed an instability of tests under stress (and reasonably high degree of concurrency) presumably caused by ImageList lifetime handling (notably dotnet#3358). The changes introduced in dotnet#3601 have helped with tests stability, however resulted in a number of regressions, such as dotnet#4169 and dotnet#4275. Restore the original implementation given it worked reasonably well for the past so many years. (cherry picked from commit b0ee5da)
Revert "Explicit ImageList ownership management. (#3601)" This reverts commit 03db3fb. Revert "Fix `ListView` no longer displays images (#4184)" This reverts commit d0608e7. We have observed an instability of tests under stress (and reasonably high degree of concurrency) presumably caused by ImageList lifetime handling (notably #3358). The changes introduced in #3601 have helped with tests stability, however resulted in a number of regressions, such as #4169 and #4275. Restore the original implementation given it worked reasonably well for the past so many years. (cherry picked from commit b0ee5da)
For the past few days I have been observing tests randomly failing in Release builds, with some builds providing no logs of the failures.
A few other logs I found are below: