pkg/dds/test/dap/integration tests are flaky #55313
Is this consistently failing or was it just once? Is it only on ARM? (I can't repro with my x86 Windows PC.)
It doesn't appear to be a flaky failure as the bot has flagged it as an error.
I think it only failed once; the test is back to passing now on all configurations (https://dart-current-results.web.app/#/filter=pkg/dds/test/dap/integration/debug_attach_test&showAll).
Closing this based on comment above.
While investigating #55705 I'm seeing this issue locally quite reliably, so I'm re-opening this to move the discussion here. The error is caused by the test teardown trying to send a terminate() request to the debug adapter in case tests didn't close it down. However, if the test did close it and the debug adapter was shutting down, there is a race that can result in this error. What's odd is that no matter what I try, I cannot catch this error. I've tried try/catch for sync code, around the …

Copying my comment from #55705 (comment) (cc @aam):
That's the conclusion I arrived at yesterday, though I was unable to handle this error. I can "fix" it by changing the test to not auto-resume-on-exit, but I feel like the teardown should be more robust. I tried adding try/catch for the synchronous + async parts, and the write to the stream is even in a zone, but the error always seems to cause the test failure. Do you know why I don't seem to be able to handle this (the …)?

Here I inlined the write call (removing the zone) and added try/catch, but it still fails (these are the only write calls).
So, I can reproduce the issue seen here in an isolated script. It runs `dart --version` and tries to write to the stdin stream even though the process is exiting:

```dart
import 'dart:convert';
import 'dart:io';

Future<void> main(List<String> arguments) async {
  final process =
      await Process.start(Platform.resolvedExecutable, ['--version']);
  process.stdout.transform(utf8.decoder).listen(print);
  process.stderr.transform(utf8.decoder).listen(print);
  final exitFuture =
      process.exitCode.then((code) => print('Exited with code $code'));
  for (int i = 0; i < 100; i++) {
    try {
      process.stdin.add(ascii.encode('Hello'));
    } catch (e) {
      print('caught write error');
    }
    await Future.delayed(Duration.zero);
  }
  await exitFuture;
}
```

On most runs, it will fail with an uncaught write error, although it seemed like the try/catch would catch it. Changing to a zone does not help:

```dart
try {
  runZonedGuarded(
    () {
      process.stdin.add(ascii.encode('Hello'));
    },
    (error, stack) {
      print('caught write error via zone');
    },
  );
} catch (e) {
  print('caught write error');
}
```

Is this a bug, or is there some other way I should be able to handle this error?
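As an aside (not from the thread): dart:io's `IOSink` also surfaces write failures through its `done` future, so another way to observe these errors is to attach a handler there. A minimal sketch, assuming the asynchronous write error is delivered to `done`:

```dart
import 'dart:convert';
import 'dart:io';

Future<void> main() async {
  final process =
      await Process.start(Platform.resolvedExecutable, ['--version']);
  // IOSink reports asynchronous write errors via its `done` future;
  // handling it here is intended to stop them surfacing as uncaught errors.
  process.stdin.done.catchError((Object e) {
    print('stdin error observed via done: $e');
  });
  process.stdin.add(ascii.encode('Hello'));
  process.stdout.transform(utf8.decoder).listen(print);
  process.stderr.transform(utf8.decoder).listen(print);
  final code = await process.exitCode;
  print('Exited with code $code');
}
```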
Maybe related to #48501. Seems there was an attempted fix there to allow calling …

I'm not sure how we can handle this, but I'll ping on that issue to see if there are any current options (otherwise I may just change this test to not auto-resume, which will avoid the race for this test, though it might show up in others).
If you want to catch async exceptions you need to ensure the relevant futures are created in the zone with exception handler.
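To illustrate the point above with a minimal, self-contained sketch: `runZonedGuarded` only sees errors from futures created inside its callback, so the future that ultimately reports the write error must be created in the guarded zone for the handler to fire.

```dart
import 'dart:async';

void main() {
  runZonedGuarded(() {
    // This future is created inside the guarded zone, so its
    // asynchronous error is routed to the zone's error handler below.
    Future<void>.error(StateError('async failure'));
  }, (error, stack) {
    print('zone handler caught: $error');
  });
}
```

A future created before entering the zone would keep its original error handling, which is why wrapping only the `add()` call in a zone did not help in the repro above.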
Ah, I see. I'll see if I can easily make it work that way. It's not super convenient to have to handle all the exceptions in one handler away from the calling code, but fortunately the surface here is quite small and just this test code. It would be nice if the changes discussed in #48501 could be made to work (my understanding is that then we could just call …).
@aam I tried implementing handling of the error there (and some other tweaks to handle cases where things shut down), but sometimes I see the test just hang. I attached DevTools and I see this isolate, which seems to have this "open port" here. It's not clear to me why it's waiting here. The very bottom stack frame shown in the screenshot is in protocol_stream.dart.
There's no … Any idea what might be happening here?
I'm not sure whether this "Allocation Location" points to the place where it's waiting or the place where the receive port was created. @bkonyi Perhaps you can add debugging prints to confirm it's indeed hanging there. Otherwise it might be that the receive port created there simply remains open, and that's what prevents the app from shutting down. In that case you would want to confirm that the code that closes it is indeed executed.
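For background on why an unclosed port would matter here: any open `ReceivePort` keeps the Dart VM alive waiting for messages, and the VM can only exit once every port is closed. A tiny sketch of that behaviour:

```dart
import 'dart:isolate';

void main() {
  final port = ReceivePort();
  // While `port` is open, the VM stays alive waiting for messages
  // (commenting out the close() below would make this program hang).
  port.close();
  print('port closed; VM can exit');
}
```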
Sorry, I think you're right. Although it's not clear to me what I need to close here; the underlying stream was closed by the target process, so …

Here's an isolated repro. This hangs at the end, but I can't see anything else I could try to close:

```dart
import 'dart:async';
import 'dart:convert';
import 'dart:io';

Future<void> main(List<String> arguments) async {
  final process = await runZonedGuarded(
    () => Process.start(Platform.resolvedExecutable, ['--version']),
    (e, st) {
      print('OutOfProcess caught error: $e $st - ignoring');
    },
  )!;
  final s1 = process.stdout.transform(utf8.decoder).listen(print);
  final s2 = process.stderr.transform(utf8.decoder).listen(print);
  final exitFuture =
      process.exitCode.then((code) => print('Exited with code $code'));
  for (int i = 0; i < 100; i++) {
    try {
      runZonedGuarded(
        () {
          process.stdin.add(ascii.encode('Hello'));
        },
        (error, stack) {
          print('caught write error via zone');
        },
      );
    } catch (e) {
      print('caught write error');
    }
    await Future.delayed(Duration.zero);
  }
  await exitFuture;
  print('Done!');

  // None of this helps, we still hang at the end.
  s1.cancel();
  s2.cancel();
  process.kill();
  print('Done2!');
}
```
(If it makes sense to open a new issue about this, let me know. It's probably not entirely related to the issue here, but it may prevent fixing it reliably.)
@DanTup wrote
How are you running this? Are you running this as part of debug_attach_test.dart?
Looking at the sample code, doesn't awaiting a future returned from …?
My understanding of that is that the … However, if I move all the code into the callback and remove that await, it doesn't change the behaviour: the script still completes but the VM never terminates at the end.
That makes sense. I just wanted to make sure it wasn't something related to that. As for the allocation location of the remaining open port, it looks like there are more entries in that stack based on the presence of the scroll bar. Can you confirm?
Here's the complete stack (using the original repro above):
Thanks - I've opened a CL to remove the zone. Although we got a bit sidetracked: I don't think that was the original issue here, which is an exception when trying to write to a closed stream. I don't believe this issue is fixed yet, although it hasn't reproduced for me locally today (after being quite reproducible two weeks ago when I wrote #55313 (comment)). I still don't know what the fix is for the original issue. I can put a zone around the spawning of the process (even though it feels like the wrong place to me) to catch the error, but it results in the app never terminating (see repro above at #55313 (comment)), which feels like a bug. @bkonyi does the stack trace above reveal anything? Should I open a new issue with this repro to avoid confusing this one further?
Nothing obvious stands out to me. I'm guessing this is related to the issues around writing to a closed stdin sink?
I'm not sure... catching the error being a bit awkward is one thing, but that repro results in a VM that doesn't terminate, so I'm not sure if that's a different issue?
I'm not currently sure how we can fix this. The ideas suggested above can result in a VM that never terminates, so if we did that here, this test would presumably just timeout instead. (Repro is in my comment above at #55313 (comment).)
@DanTup wrote
Does this repro still hang for you?
@aam sorry, I missed your reply here! I just tested, and on 3.5.0 it never terminates, but on a 3.6 build it does. I will try to apply this to the tests, thanks!
Struggling to repro the issue locally atm, though I notice in the logs this test is failing frequently with a different error too (which I'm also failing to reproduce, but needs fixing in addition to the above):
I've had the test running thousands of times in-proc, out-of-proc and with various delays inserted in places I thought might lead to the two issues above, but I've failed to reproduce it even once today. I'm going to open a change that enables verbose logging for the tests temporarily to see if a copy of the DAP traffic might help understand the order that things are happening (or if the process is terminating for an unexpected reason, maybe there will be some useful output logged). Edit: Opened https://dart-review.googlesource.com/c/sdk/+/383640
This comment was marked as off-topic.
@FMorschel is this trying to run the DAP tests in the SDK, or your own tests?
This comment was marked as off-topic.
@FMorschel can you still reproduce it? If so, could you file a separate issue (with steps to repro), as this issue is specifically about the flaky DAP tests (and the issue is believed to be specific to the tests, so fixes here likely won't solve any other problems). Thanks!
This may help track down the flakes discussed in #55313 (see #55313 (comment)).

Change-Id: I1a2bc2996ab45498c9b34834df692d5a51982fff
Cq-Include-Trybots: luci.dart.try:pkg-win-release-arm64-try,pkg-win-release-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/383640
Commit-Queue: Helin Shiah <helinx@google.com>
Reviewed-by: Ben Konyi <bkonyi@google.com>
Reviewed-by: Helin Shiah <helinx@google.com>
This test failed a few times since I added the verbose logging, though for a different reason to the above. It appears that the DAP client calls …

I'll try to repro, or review the code to see why we might be calling two resumes.
Struggling to repro locally, but I think I understand the problem. The two resumes are actually correct. The first (…

I think there are two possibilities here:
@bkonyi does (2) seem feasible? We do handle a lot of errors when trying to resume; however, we don't currently handle a "Service connection disposed"-style error. I feel that we probably should though - for a resume from stopped-on-exit, it seems like there is a good chance that the VM could vanish before we handle the response to the resume?
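The handling proposed in (2) might look roughly like this sketch (the names and error strings are illustrative, not the actual package:dds code): wrap the resume request and discard only errors that match known shutdown messages, letting anything else propagate.

```dart
// Hypothetical sketch: swallow only known VM-shutdown errors when
// resuming from pause-on-exit; any other error still propagates.
bool isKnownShutdownError(Object error) {
  final message = '$error';
  return message.contains('Service connection disposed') ||
      message.contains('Service has disappeared');
}

Future<void> resumeIgnoringShutdown(Future<void> Function() sendResume) async {
  try {
    await sendResume();
  } catch (e) {
    if (!isKnownShutdownError(e)) rethrow;
  }
}

Future<void> main() async {
  // Simulate the VM vanishing before the resume response is handled.
  await resumeIgnoringShutdown(
      () async => throw StateError('Service connection disposed'));
  print('shutdown error discarded');
}
```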
If the test target exits immediately after being resumed, (2) seems like the most likely case. I think your CL should fix this issue.
This will hopefully fix the flaky tests noted in #55313. I believe when we resume from pause-on-exit, the VM might shut down fast enough that the response is not handled and throws "Service connection disposed" (see #55313 (comment)). This change will discard any known shutdown errors while sending the request instead.

Change-Id: I7c10aa3eddd6e651114ddba948b0a17688352720
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/386806
Reviewed-by: Helin Shiah <helinx@google.com>
Commit-Queue: Helin Shiah <helinx@google.com>
Reviewed-by: Ben Konyi <bkonyi@google.com>
All of the recent events (since 26th Sept, when my change above landed) on https://dart-ci.firebaseapp.com/#showLatestFailures=false&test=pkg/dds/test/dap/integration/debug_attach_test seem to be the status changing to passed. I don't know if that means the fix completely solved this though; in particular, I don't know why it took until Oct 1st for it to change from flaky -> Pass for unittest-asserts-release-win-x64. Does that mean it was still flaky from 26th Sept to 1st Oct? Is there a way to find the failure logs for one of the runs between those dates? (I tried the "Failures" tab, but it doesn't list any results at all.)
The test needs to pass some number of times in a row to change from flaky -> Pass. I'm not sure what the exact number is; I think it's at least 100.
The best way I know of is to go to the builder page, in this case https://ci.chromium.org/ui/p/dart/builders/ci.sandbox/pkg-win-release, use the "created before" filter, and then open a build page and look at the "ignored flaky test failure logs" under the "test results" step.
Ah, that explains the delay. I think this is resolved then - but I'll check back in around 1wk and ensure none of them have gone back flaky from my link above, and if so close this. Thanks!
@derekxu16 looking through the link you posted and the one I posted earlier, I can't find any failures/flakes for this test. I think we can close this, but I'd like a second opinion in case I've missed something 😄 Thanks!
Hmm, not sure - let's keep this open and I'll dig through those. Thanks!
The tests
are failing on configurations
https://ci.chromium.org/ui/p/dart/builders/ci.sandbox/pkg-win-release-arm64/1155/overview
This could be a Windows pipe problem, or it could be a problem with how DAP is using pipes.
cc @brianquinlan and @DanTup