Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/dds/test/dap/integration tests are flaky #55313

Open
derekxu16 opened this issue Mar 27, 2024 · 43 comments
Open

pkg/dds/test/dap/integration tests are flaky #55313

derekxu16 opened this issue Mar 27, 2024 · 43 comments
Assignees
Labels
area-tools A meta category for issues that should be addressed by tooling (prefer more concrete areas). dds-dap DDS issues related to the Debug Adapter Protocol (DAP) implementation gardening

Comments

@derekxu16
Copy link
Member

derekxu16 commented Mar 27, 2024

The tests

pkg/dds/test/dap/integration/debug_attach_test RuntimeError (expected Pass)

are failing on configurations

unittest-asserts-release-win-arm64
00:01 �[32m+0�[0m�[31m -1�[0m: debug mode can attach to a simple script using vmServiceUri �[1m�[31m[E]�[0m�[0m

  SocketException: Write failed (OS Error: The pipe is being closed.

  , errno = 232), port = 0

  dart:io-patch/socket_patch.dart 1246:34       _NativeSocket.write
  dart:io-patch/socket_patch.dart 2004:15       _RawSocket.write
  dart:io-patch/socket_patch.dart 2481:18       _Socket._write
  dart:io-patch/socket_patch.dart 2216:28       _SocketStreamConsumer.write
  dart:io-patch/socket_patch.dart 2168:11       _SocketStreamConsumer.addStream.<fn>
  dart:async/zone.dart 1407:47                  _rootRunUnary
  dart:async/zone.dart 1308:19                  _CustomZone.runUnary
  dart:async/zone.dart 1217:7                   _CustomZone.runUnaryGuarded
  dart:async/stream_impl.dart 365:11            _BufferingStreamSubscription._sendData
  dart:async/stream_impl.dart 541:14            _DelayedData.perform
  dart:async/stream_impl.dart 646:11            _PendingEvents.handleNext
  dart:async/stream_impl.dart 617:7             _PendingEvents.schedule.<fn>
  dart:async/zone.dart 1391:47                  _rootRun
  dart:async/zone.dart 1301:19                  _CustomZone.run
  dart:async/zone.dart 1209:7                   _CustomZone.runGuarded
  dart:async/zone.dart 1249:23                  _CustomZone.bindCallbackGuarded.<fn>
  dart:async/zone.dart 1399:13                  _rootRun
  dart:async/zone.dart 1301:19                  _CustomZone.run
  dart:async/zone.dart 1209:7                   _CustomZone.runGuarded
  dart:async/zone.dart 1249:23                  _CustomZone.bindCallbackGuarded.<fn>
  dart:async/schedule_microtask.dart 40:21      _microtaskLoop
  dart:async/schedule_microtask.dart 49:5       _startMicrotaskLoop
  dart:isolate-patch/isolate_patch.dart 118:13  _runPendingImmediateCallback
  dart:isolate-patch/isolate_patch.dart 185:5   _RawReceivePort._handleMessage

https://ci.chromium.org/ui/p/dart/builders/ci.sandbox/pkg-win-release-arm64/1155/overview

This could be a Windows pipe problem, or it could be a problem with how DAP is using pipes.

cc @brianquinlan and @DanTup

@DanTup
Copy link
Collaborator

DanTup commented Mar 28, 2024

Is this consistently failing or was it just once? Is it only on ARM? (I can't repro with my x86 Windows PC).

@a-siva a-siva added area-tools A meta category for issues that should be addressed by tooling (prefer more concrete areas). dds-dap DDS issues related to the Debug Adapter Protocol (DAP) implementation labels Mar 28, 2024
@a-siva
Copy link
Contributor

a-siva commented Mar 28, 2024

Is this consistently failing or was it just once? Is it only on ARM? (I can't repro with my x86 Windows PC).

It doesn't appear to be a flaky failure as the bot has flagged it as an error.
If you wanted to try out some logging it is possible to use the CI to run that code on a Windows ARM64 machine.

@derekxu16
Copy link
Member Author

Is this consistently failing or was it just once? Is it only on ARM? (I can't repro with my x86 Windows PC).

I think it only failed once, the test is back to passing now on all configurations (https://dart-current-results.web.app/#/filter=pkg/dds/test/dap/integration/debug_attach_test&showAll).

@a-siva
Copy link
Contributor

a-siva commented May 10, 2024

Closing this based on comment above.

@a-siva a-siva closed this as completed May 10, 2024
@DanTup
Copy link
Collaborator

DanTup commented May 15, 2024

While investigating #55705 I'm seeing this issue locally quite reliably. Re-opening this to move the discussion here.

The error is caused by the test teardown trying to send a terminate() request to the debug adapter in case tests didn't close it down. However if the test did and the debug adapter was shutting down, there is a race that can result in this error.

What's odd, is that no matter what I try to do to catch this error, I cannot. I've tried try/catch for sync code, around the await, and the code is already in a runZoneGuarded to handle any errors.

Copying my comment from #55705 (comment) (cc @aam):


This one is due to how during test session tearDown, client queues command "terminate" to the server's stdin, which races with server process itself shutting down first closing stdin socket, before that "terminate" command has a chance of being written into that server's stdin.

That's the conclusion I arrived at yesterday, though I was unable to handle this error. I can "fix" it by changing the test to not auto-resume-on-exit, but I feel like the teardown should be more robust. I tried adding try/catch for the synchronous + async parts, and the write to the stream is even in a zone, but the error always seems to cause the test failure. Do you know why I don't seem to be able to handle this (the terminate call seems to synchronously write to the stream, so I would've expected a try/catch, or the zone to handle it).

Here I inlined the write call (removing the zone) and added try/catch but it still fails (these are the only write calls).

image

@DanTup DanTup reopened this May 15, 2024
@DanTup DanTup changed the title pkg/dds/test/dap/integration/debug_attach_test is failing on unittest-asserts-release-win-arm64 pkg/dds/test/dap/integration/debug_attach_test is failing on Windows May 15, 2024
@DanTup
Copy link
Collaborator

DanTup commented May 15, 2024

So, I can reproduce the issue seen here in an isolated script.. It runs "dart --version" and tries to write to the stdin stream even though the process will is exiting:

import 'dart:convert';
import 'dart:io';

Future<void> main(List<String> arguments) async {
  final process =
      await Process.start(Platform.resolvedExecutable, ['--version']);

  process.stdout.transform(utf8.decoder).listen(print);
  process.stderr.transform(utf8.decoder).listen(print);

  final exitFuture =
      process.exitCode.then((code) => print('Exited with code $code'));

  for (int i = 0; i < 100; i++) {
    try {
      process.stdin.add(ascii.encode('Hello'));
    } catch (e) {
      print('caught write error');
    }
    await Future.delayed(Duration.zero);
  }

  await exitFuture;
}

On most runs, it will fail like this:

image

Although it seemed like the try/catch would work.

Changing to a zone does not help:

try {
  runZonedGuarded(
    () {
      process.stdin.add(ascii.encode('Hello'));
    },
    (error, stack) {
      print('caught write error via zone');
    },
  );
} catch (e) {
  print('caught write error');
}

image

Is this a bug, or is there some other way I should be able to handle this error?

@DanTup
Copy link
Collaborator

DanTup commented May 15, 2024

Maybe related to #48501

Seems there was an attempted fix to allow calling flush() to trigger the exception so it can be caught, but that was reverted in 97d8445

I'm not sure how we can handle this, but I'll ping on that issue to see if there are any current options (otherwise I may just change this test to not auto-resume which will avoid the race for this test - though it might show up in others).

@aam
Copy link
Contributor

aam commented May 15, 2024

If you want to catch async exceptions you need to ensure the relevant futures are created in the zone with exception handler.
So for example to catch failures with attempts to write to pipe of spawned process in the test you can do the following:

diff --git a/pkg/dds/test/dap/integration/test_server.dart b/pkg/dds/test/dap/integration/test_server.dart
index 4e0d2345e36..c4268dce7e6 100644
--- a/pkg/dds/test/dap/integration/test_server.dart
+++ b/pkg/dds/test/dap/integration/test_server.dart
@@ -137,7 +137,9 @@ class OutOfProcessDapTestServer extends DapTestServer {
       ...?additionalArgs,
     ];

-    final process = await Process.start(Platform.resolvedExecutable, args);
+    final process = await runZonedGuarded(
+        () => Process.start(Platform.resolvedExecutable, args),
+        (e, st) { print('OutOfProcess caught error: $e $st - ignoring'); })!;

@DanTup
Copy link
Collaborator

DanTup commented May 15, 2024

Ah, I see. I'll see if I can easily make it work that way. It's not super convenient to have to handle all the exceptions in one handler away from the calling code, but fortunately the surface here is quite small and just this test code.

It would be nice if the changes discussed in #48501 could be made to work (my understanding is that then we could just call flush() after add() and catch the exception on that).

@DanTup
Copy link
Collaborator

DanTup commented May 20, 2024

@aam I tried implementing handling the error there (and some other tweaks to handle cases where things shut down), but sometimes I see the test just hang. I attached DevTools and I see this isolate which seems to have this "open port" here:

image

It's not clear to me why it's waiting here. The very bottom stack frame shown here in the screenshot is in protocol_stream.dart ByteStreamServerChannel._write which looks like this:

/// Send [bytes] to [_output].
void _write(List<int> bytes) {
  runZonedGuarded(
    () => _output.add(bytes),
    (e, s) => close(),
  );
}

There's no await here so it's not clear to me why this seems to be stalled (as far as I can tell, it's on the call to _output.add(). As far as I can tell, the isolate is not paused on an exception either.

Any idea what might be happening here?

@aam
Copy link
Contributor

aam commented May 20, 2024

I attached DevTools and I see this isolate which seems to have this "open port" here:
It's not clear to me why it's waiting here.

I'm not sure whether this "Allocation Location" points to a place where it's waiting or the place where the receive port was created. @bkonyi Perhaps you can add debugging prints to confirm it's indeed hanging there.

Otherwise it might be that receive port created there simply remains opened and that what prevents app from shutting down. In that case you would want to confirm that the code that closes it is indeed executed.

@DanTup
Copy link
Collaborator

DanTup commented May 21, 2024

I'm not sure whether this "Allocation Location" points to a place where it's waiting or the place where the receive port was created.

Sorry, I think you're right. Although it's not clear to me what I need to close here - the underlying stream was closed by the target process, so .add() was throwing, which I handled by creating a process in an error zone as you suggested in the code example above.

Here's an isolated repro... this hangs at the end but I can't see anything else I could try to close:

import 'dart:async';
import 'dart:convert';
import 'dart:io';

Future<void> main(List<String> arguments) async {
  final process = await runZonedGuarded(
    () => Process.start(Platform.resolvedExecutable, ['--version']),
    (e, st) {
      print('OutOfProcess caught error: $e $st - ignoring');
    },
  )!;

  final s1 = process.stdout.transform(utf8.decoder).listen(print);
  final s2 = process.stderr.transform(utf8.decoder).listen(print);

  final exitFuture =
      process.exitCode.then((code) => print('Exited with code $code'));

  for (int i = 0; i < 100; i++) {
    try {
      runZonedGuarded(
        () {
          process.stdin.add(ascii.encode('Hello'));
        },
        (error, stack) {
          print('caught write error via zone');
        },
      );
    } catch (e) {
      print('caught write error');
    }
    await Future.delayed(Duration.zero);
  }

  await exitFuture;
  print('Done!');

  // None of this helps, we still hang at the end.
  s1.cancel();
  s2.cancel();
  process.kill();
  print('Done2!');
}

image

@DanTup
Copy link
Collaborator

DanTup commented May 21, 2024

(if it makes sense to open a new issue about this, let me know - it's probably not entirely related to the issue here, but it may prevent fixing it reliably)

@aam
Copy link
Contributor

aam commented May 21, 2024

@DanTup wrote

Here's an isolated repro... this hangs at the end but I can't see anything else I could try to close

How are you running this? Are you running this as part of debug_attach_test.dart?

@DanTup
Copy link
Collaborator

DanTup commented May 22, 2024

How are you running this? Are you running this as part of debug_attach_test.dart?

Just in a standalone Dart script with dart foo.dart (on Windows). Happens at least with a previous Dart 3.3 and also a recent 3.5 build.

image

@bkonyi
Copy link
Contributor

bkonyi commented May 22, 2024

Looking at the sample code, doesn't awaiting a future returned from runZonedGuarded sometimes result in unexpected behavior since the returned Future is crossing zones?

@DanTup
Copy link
Collaborator

DanTup commented May 22, 2024

My understanding of that is that the Future returned by runZonedGuarded isn't guaranteed to complete. In this case it is, because the rest of the program is running.

However, if I move all the code into the callback and remove that await, it doesn't change the behaviour - the script still completes but the VM never terminates at the end:

image

@bkonyi
Copy link
Contributor

bkonyi commented May 22, 2024

That makes sense. I just wanted to make sure it wasn't something related to that.

As for the allocation location of the remaining open port, it looks like there's more entries in that stack based on the presence of the scroll bar. Can you confirm?

@DanTup
Copy link
Collaborator

DanTup commented May 22, 2024

Here's the complete stack (using the original repro above):


#0      new _RawReceivePort._ (dart:isolate-patch/isolate_patch.dart:138:55)
#1      new _RawReceivePort (dart:isolate-patch/isolate_patch.dart:132:34)
#2      new RawReceivePort (dart:isolate-patch/isolate_patch.dart:65:34)
#3      _NativeSocket.connectToEventHandler (dart:io-patch/socket_patch.dart:1599:23)
#4      _NativeSocket.sendToEventHandler (dart:io-patch/socket_patch.dart:1592:5)
#5      _NativeSocket.setListening (dart:io-patch/socket_patch.dart:1537:7)
#6      _RawSocket._resume (dart:io-patch/socket_patch.dart:2056:13)
#7      _RawSocket._onSubscriptionStateChange (dart:io-patch/socket_patch.dart:2069:7)
#8      _runGuarded (dart:async/stream_controller.dart:824:24)
#9      _StreamController._subscribe.<anonymous closure> (dart:async/stream_controller.dart:702:7)
#10     _BufferingStreamSubscription._guardCallback (dart:async/stream_impl.dart:441:13)
#11     _StreamController._subscribe (dart:async/stream_controller.dart:701:18)
#12     _ControllerStream._createSubscription (dart:async/stream_controller.dart:837:19)
#13     _StreamImpl.listen (dart:async/stream_impl.dart:497:9)
#14     _RawSocket.listen (dart:io-patch/socket_patch.dart:1976:31)
#15     _Socket._ensureRawSocketSubscription (dart:io-patch/socket_patch.dart:2412:27)
#16     _SocketStreamConsumer.addStream (dart:io-patch/socket_patch.dart:2159:12)
#17     _StreamSinkImpl._controller (dart:io/io_sink.dart:242:15)
#18     _StreamSinkImpl.add (dart:io/io_sink.dart:154:5)
#19     _Socket.add (dart:io-patch/socket_patch.dart:2323:38)
#20     _StdSink.add (dart:io/stdio.dart:442:11)
#21     main.<anonymous closure> (file:///C:/Dev/Test%20Projects/write_to_exited_process/bin/write_to_exited_process.dart:23:25)
#22     _rootRun (dart:async/zone.dart:1399:13)
#23     _CustomZone.run (dart:async/zone.dart:1301:19)
#24     _runZoned (dart:async/zone.dart:1826:10)
#25     runZonedGuarded (dart:async/zone.dart:1814:12)
#26     main (file:///C:/Dev/Test%20Projects/write_to_exited_process/bin/write_to_exited_process.dart:21:7)
#27     _FutureListener.handleValue (dart:async/future_impl.dart:163:18)
#28     Future._propagateToListeners.handleValueCallback (dart:async/future_impl.dart:847:45)
#29     Future._propagateToListeners (dart:async/future_impl.dart:876:13)
#30     Future._completeWithValue (dart:async/future_impl.dart:652:5)
#31     Future._asyncCompleteWithValue.<anonymous closure> (dart:async/future_impl.dart:722:7)
#32     _rootRun (dart:async/zone.dart:1399:13)
#33     _CustomZone.run (dart:async/zone.dart:1301:19)
#34     _CustomZone.runGuarded (dart:async/zone.dart:1209:7)
#35     _CustomZone.bindCallbackGuarded.<anonymous closure> (dart:async/zone.dart:1249:23)
#36     _microtaskLoop (dart:async/schedule_microtask.dart:40:21)
#37     _startMicrotaskLoop (dart:async/schedule_microtask.dart:49:5)
#38     _runPendingImmediateCallback (dart:isolate-patch/isolate_patch.dart:118:13)
#39     _Timer._runTimers (dart:isolate-patch/timer_impl.dart:405:11)
#40     _Timer._handleMessage (dart:isolate-patch/timer_impl.dart:429:5)
#41     _RawReceivePort._handleMessage (dart:isolate-patch/isolate_patch.dart:184:12)

@derekxu16
Copy link
Member Author

@DanTup in case you haven't seen it yet, I've landed a fix for #55731. So it should be possible to make this test work by removing the runZonedGuarded call that you added to DAP.

@DanTup
Copy link
Collaborator

DanTup commented May 29, 2024

Thanks - I've opened a CL to remove the zone.

Although, we got a bit sidetracked - I don't think that was the original issue here, which is an exception trying to write to a closed stream. I don't believe this issue is fixed yet, although it hasn't reproduced for me locally today (after being quite reproducible two weeks ago when I wrote #55313 (comment)).

I still don't know what the fix is for the original issue - I can put a zone around the spawning of the process (even though it feels like the wrong place to me) to catch the error, but it results in the app never terminating (see repro above at #55313 (comment)) which feels like a bug.

@bkonyi does the stack trace above reveal anything? Should I open a new issue with this repro to avoid confusing this one further?

@bkonyi
Copy link
Contributor

bkonyi commented May 30, 2024

@bkonyi does the stack trace above reveal anything? Should I open a new issue with this repro to avoid confusing this one further?

Nothing obvious stands out to me. I'm guessing this is related to the issues related to writing to a closed stdin sink?

@DanTup
Copy link
Collaborator

DanTup commented May 31, 2024

Nothing obvious stands out to me. I'm guessing this is related to the issues related to writing to a closed stdin sink?

I'm not sure... catching the error being a bit awkward is one thing, but that repro results in a VM that doesn't terminate, so I'm not sure if that's a different issue?

@DanTup
Copy link
Collaborator

DanTup commented Jul 19, 2024

I'm not currently sure how we can fix this. The ideas suggested above can result in a VM that never terminates so if we did that here, this test would presumably just timeout instead.

(repro is in my comment above at #55313 (comment))

@aam
Copy link
Contributor

aam commented Jul 29, 2024

@aam
Copy link
Contributor

aam commented Jul 29, 2024

@DanTup wrote

(repro is in my comment above at #55313 (comment))

Does this repro still hang for you?
I just tried it from command line dart and from vs code and the sample app finishes fine, doesn't hang:

$ out/DebugX64/dart hangproc.dart 
OutOfProcess caught error: SocketException: Write failed (OS Error: Broken pipe, errno = 32), port = 0 #0      _NativeSocket.write (dart:io-patch/socket_patch.dart:1252:34)
#1      _RawSocket.write (dart:io-patch/socket_patch.dart:2010:15)
#2      _Socket._write (dart:io-patch/socket_patch.dart:2487:18)
#3      _SocketStreamConsumer.write (dart:io-patch/socket_patch.dart:2222:28)
#4      _SocketStreamConsumer.addStream.<anonymous closure> (dart:io-patch/socket_patch.dart:2174:11)
#5      _rootRunUnary (dart:async/zone.dart:1407:47)
#6      _CustomZone.runUnary (dart:async/zone.dart:1308:19)
#7      _CustomZone.runUnaryGuarded (dart:async/zone.dart:1217:7)
#8      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:365:11)
#9      _BufferingStreamSubscription._add (dart:async/stream_impl.dart:297:7)
#10     _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:784:19)
#11     _StreamController._add (dart:async/stream_controller.dart:658:7)
#12     _StreamController.add (dart:async/stream_controller.dart:606:5)
#13     _StreamSinkImpl.add (dart:io/io_sink.dart:154:17)
#14     _Socket.add (dart:io-patch/socket_patch.dart:2329:38)
#15     _StdSink.add (dart:io/stdio.dart:442:11)
#16     main.<anonymous closure> (file:///usr/local/google/home/aam/p/d/d2/sdk/hangproc.dart:23:25)
#17     _rootRun (dart:async/zone.dart:1399:13)
#18     _CustomZone.run (dart:async/zone.dart:1301:19)
#19     _runZoned (dart:async/zone.dart:1826:10)
#20     runZonedGuarded (dart:async/zone.dart:1814:12)
#21     main (file:///usr/local/google/home/aam/p/d/d2/sdk/hangproc.dart:21:7)
<asynchronous suspension>
 - ignoring
Exited with code 0
Dart SDK version: 3.6.0-edge (main) (Unknown timestamp) on "linux_x64"

Done!
Done2!

@DanTup
Copy link
Collaborator

DanTup commented Sep 4, 2024

@aam sorry, I missed your reply here! I just tested, and on 3.5.0 it never terminates, but on a 3.6 build it does. I will try to apply to the tests, thanks!

@DanTup
Copy link
Collaborator

DanTup commented Sep 4, 2024

Struggling to repro the issue locally atm, though I notice in the logs this test is failing frequently with a different error too (which I'm also failing to reproduce, but needs fixing in addition to the above):

00:05 �[32m+1�[0m�[31m -1�[0m: debug mode can attach to a simple script using vmServiceInfoFile �[1m�[31m[E]�[0m�[0m

  Request "continue" failed:
  {
      "seq": 20,
      "type": "response",
      "body": {
          "error": {
              "format": "{message}",
              "id": 1,
              "showUser": false,
              "variables": {
                  "message": "resume: (-32000) Service connection disposed",
                  "stack": "#0      new _OutstandingRequest (package:vm_service/src/vm_service.dart:261:34)\n#1      VmService._call.<anonymous closure> (package:vm_service/src/vm_service.dart:1778:25)\n#2      VmService._call (package:vm_service/src/vm_service.dart:1790:8)\n#3      VmService.resume (package:vm_service/src/vm_service.dart:1480:7)\n#4      IsolateManager._resume (package:dds/src/dap/isolate_manager.dart:363:33)\n#5      IsolateManager.resumeThread (package:dds/src/dap/isolate_manager.dart:291:11)\n#6      DartDebugAdapter.continueRequest (package:dds/src/dap/adapters/dart.dart:943:26)\n#7      BaseDebugAdapter.handle (package:dds/src/dap/base_debug_adapter.dart:141:20)\n#8      BaseDebugAdapter.handleIncomingRequest (package:dds/src/dap/base_debug_adapter.dart:400:7)\n#9      BaseDebugAdapter._handleIncomingMessage (package:dds/src/dap/base_debug_adapter.dart:311:7)\n#10     ByteStreamServerChannel._readMessage (package:dds/src/dap/protocol_stream.dart:82:18)\n#11     ByteStreamServerChannel.listen.<anonymous closure> (package:dds/src/dap/protocol_stream.dart:53:24)\n#12     _RootZone.runUnaryGuarded (dart:async/zone.dart:1609:10)\n#13     _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:366:11)\n#14     _DelayedData.perform (dart:async/stream_impl.dart:542:14)\n#15     _PendingEvents.handleNext (dart:async/stream_impl.dart:647:11)\n#16     _PendingEvents.schedule.<anonymous closure> (dart:async/stream_impl.dart:618:7)\n#17     _microtaskLoop (dart:async/schedule_microtask.dart:40:21)\n#18     _startMicrotaskLoop (dart:async/schedule_microtask.dart:49:5)\n#19     _runPendingImmediateCallback (dart:isolate-patch/isolate_patch.dart:118:13)\n#20     _RawReceivePort._handleMessage (dart:isolate-patch/isolate_patch.dart:185:5)\n"
              }
          }
      },
      "command": "continue",
      "message": "resume: (-32000) Service connection disposed",
      "request_seq": 7,
      "success": false
  }

@DanTup
Copy link
Collaborator

DanTup commented Sep 4, 2024

I've had the test running thousands of times in-proc, out-of-proc and with various delays inserted in places I thought might lead to the two issues above, but I've failed to reproduce it even once today.

I'm going to open a change that enables verbose logging for the tests temporarily to see if a copy of the DAP traffic might help understand the order that things are happening (or if the process is terminating for an unexpected reason, maybe there will be some useful output logged).

Edit: Opened https://dart-review.googlesource.com/c/sdk/+/383640

@FMorschel

This comment was marked as off-topic.

@DanTup
Copy link
Collaborator

DanTup commented Sep 13, 2024

@FMorschel is this trying to run the DAP tests in the SDK, or your own tests?

@FMorschel

This comment was marked as off-topic.

@DanTup
Copy link
Collaborator

DanTup commented Sep 13, 2024

@FMorschel can you still reproduce it? If so, could you file a separate issue (with steps to repro) as this issue is specifically about the flaky DAP tests (and the issue is believed to be specific to the tests, so fixes here likely won't solve any other problems). Thanks!

copybara-service bot pushed a commit that referenced this issue Sep 13, 2024
This may help track down the flakes discussed in #55313 (see #55313 (comment)).

Change-Id: I1a2bc2996ab45498c9b34834df692d5a51982fff
Cq-Include-Trybots: luci.dart.try:pkg-win-release-arm64-try,pkg-win-release-try
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/383640
Commit-Queue: Helin Shiah <helinx@google.com>
Reviewed-by: Ben Konyi <bkonyi@google.com>
Reviewed-by: Helin Shiah <helinx@google.com>
@DanTup
Copy link
Collaborator

DanTup commented Sep 26, 2024

This test failed a few times since I added the verbose logging, though for a different reason to the above. It appears that the DAP client calls continue when the thread stops-on-exit, and then we send both a readyToResume and a resume for the same isolate (this seems wrong?) and then get the error below.

I'll try to repro, or review the code to see why we might be calling two resumes.

00:00 �[32m+0�[0m: debug mode can attach to a simple script using vmServiceUri�[0m
<== [DAP] {"seq":89,"type":"event","body":{"reason":"entry","threadId":1},"event":"stopped"}
==> [DAP] {"seq":5,"type":"request","arguments":{"levels":1,"startFrame":0,"threadId":1},"command":"stackTrace"}
<== [DAP] {"seq":90,"type":"event","body":{"message":"==> [VM] {\"jsonrpc\":\"2.0\",\"id\":\"37\",\"method\":\"readyToResume\",\"params\":{\"isolateId\":\"isolates/6673880339557787\"}}"},"event":"dart.log"}
<== [DAP] {"seq":91,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"result\":{\"type\":\"Success\"},\"id\":\"37\"}"},"event":"dart.log"}
<== [DAP] {"seq":92,"type":"event","body":{"message":"==> [VM] {\"jsonrpc\":\"2.0\",\"id\":\"38\",\"method\":\"getStack\",\"params\":{\"isolateId\":\"isolates/6673880339557787\",\"limit\":1}}"},"event":"dart.log"}
<== [DAP] {"seq":93,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"result\":{\"type\":\"Stack\",\"frames\":[],\"truncated\":false,\"messages\":[{\"type\":\"Message\",\"name\":\"Isolate Message (2752eb4cda0)\",\"messageObjectId\":\"messages/2752eb4cda0\",\"size\":0,\"index\":0,\"_destinationPort\":\"8074002093852787\",\"_priority\":\"Normal\",\"handler\":{\"type\":\"@Function\",\"fixedId\":true,\"id\":\"libraries/@1026248/closures/7\",\"name\":\"<anonymous closure>\",\"owner\":{\"type\":\"@Function\",\"fixedId\":true,\"id\":\"libraries/@1026248/functions/_delayEntrypointInvocation%401026248\",\"name\":\"_delayEntrypointInvocation\",\"_vmName\":\"_delayEntrypointInvocation@1026248\",\"owner\":{\"type\":\"@Library\",\"fixedId\":true,\"id\":\"libraries/@1026248\",\"name\":\"dart.isolate\",\"uri\":\"dart:isolate\"},\"_kind\":\"RegularFunction\",\"static\":true,\"const\":false,\"implicit\":false,\"abstract\":false,\"_intrinsic\":false,\"_native\":false,\"isGetter\":false,\"isSetter\":false,\"location\":{\"type\":\"SourceLocation\",\"script\":{\"type\":\"@Script\",\"fixedId\":true,\"id\":\"libraries/@1026248/scripts/dart%3Aisolate-patch%2Fisolate_patch.dart/0\",\"uri\":\"dart:isolate-patch/isolate_patch.dart\",\"_kind\":\"kernel\"},\"tokenPos\":8277,\"endTokenPos\":8813,\"line\":286,\"column\":1}},\"_kind\":\"ClosureFunction\",\"static\":true,\"const\":false,\"implicit\":false,\"abstract\":false,\"_intrinsic\":false,\"_native\":false,\"isGetter\":false,\"isSetter\":false,\"location\":{\"type\":\"SourceLocation\",\"script\":{\"type\":\"@Script\",\"fixedId\":true,\"id\":\"libraries/@1026248/scripts/dart%3Aisolate-patch%2Fisolate_patch.dart/0\",\"uri\":\"dart:isolate-patch/isolate_patch.dart\",\"_kind\":\"kernel\"},\"tokenPos\":8451,\"endTokenPos\":8783,\"line\":289,\"column\":18}},\"location\":{\"type\":\"SourceLocation\",\"script\":{\"type\":\"@Script\",\"fixedId\":true,\"id\":\"libraries/@1026248/scripts/dart%3Aisolate-patch%2Fisolate_patch.dart/0\",\"uri\":\"dart:isolate-patch/isolate_patch.dart\",\"_kind\":\"kernel\"},\"tokenPos\":8451,\"endTokenPos\":8783,\"line\":289,\"column\":18}}]},\"id\":\"38\"}"},"event":"dart.log"}
<== [DAP] {"seq":94,"type":"response","body":{"stackFrames":[],"totalFrames":0},"command":"stackTrace","request_seq":5,"success":true}
==> [DAP] {"seq":6,"type":"request","arguments":{"threadId":1},"command":"continue"}
<== [DAP] {"seq":95,"type":"event","body":{"message":"==> [VM] {\"jsonrpc\":\"2.0\",\"id\":\"39\",\"method\":\"resume\",\"params\":{\"isolateId\":\"isolates/6673880339557787\"}}"},"event":"dart.log"}
<== [DAP] {"seq":96,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"method\":\"streamNotify\",\"params\":{\"streamId\":\"Debug\",\"event\":{\"type\":\"Event\",\"kind\":\"Resume\",\"isolateGroup\":{\"type\":\"@IsolateGroup\",\"id\":\"isolateGroups/4223293308404791\",\"name\":\"test_file.dart\",\"number\":\"4223293308404791\",\"isSystemIsolateGroup\":false},\"isolate\":{\"type\":\"@Isolate\",\"id\":\"isolates/6673880339557787\",\"name\":\"main\",\"number\":\"6673880339557787\",\"isSystemIsolate\":false,\"isolateGroupId\":\"isolateGroups/4223293308404791\"},\"timestamp\":1727321975311}}}"},"event":"dart.log"}
<== [DAP] {"seq":97,"type":"event","body":{"threadId":1},"event":"continued"}
<== [DAP] {"seq":98,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"result\":{\"type\":\"Success\"},\"id\":\"39\"}"},"event":"dart.log"}
<== [DAP] {"seq":99,"type":"response","body":{"allThreadsContinued":false},"command":"continue","request_seq":6,"success":true}
<== [DAP] {"seq":100,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"method\":\"streamNotify\",\"params\":{\"streamId\":\"Stdout\",\"event\":{\"type\":\"Event\",\"kind\":\"WriteEvent\",\"isolateGroup\":{\"type\":\"@IsolateGroup\",\"id\":\"isolateGroups/4223293308404791\",\"name\":\"test_file.dart\",\"number\":\"4223293308404791\",\"isSystemIsolateGroup\":false},\"isolate\":{\"type\":\"@Isolate\",\"id\":\"isolates/6673880339557787\",\"name\":\"main\",\"number\":\"6673880339557787\",\"isSystemIsolate\":false,\"isolateGroupId\":\"isolateGroups/4223293308404791\"},\"timestamp\":1727321975311,\"bytes\":\"SGVsbG8hCg==\"}}}"},"event":"dart.log"}
<== [DAP] {"seq":101,"type":"event","body":{"category":"stdout","output":"Hello!\n"},"event":"output"}
<== [DAP] {"seq":102,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"method\":\"streamNotify\",\"params\":{\"streamId\":\"Stdout\",\"event\":{\"type\":\"Event\",\"kind\":\"WriteEvent\",\"isolateGroup\":{\"type\":\"@IsolateGroup\",\"id\":\"isolateGroups/4223293308404791\",\"name\":\"test_file.dart\",\"number\":\"4223293308404791\",\"isSystemIsolateGroup\":false},\"isolate\":{\"type\":\"@Isolate\",\"id\":\"isolates/6673880339557787\",\"name\":\"main\",\"number\":\"6673880339557787\",\"isSystemIsolate\":false,\"isolateGroupId\":\"isolateGroups/4223293308404791\"},\"timestamp\":1727321975311,\"bytes\":\"V29ybGQhCg==\"}}}"},"event":"dart.log"}
<== [DAP] {"seq":103,"type":"event","body":{"category":"stdout","output":"World!\n"},"event":"output"}
<== [DAP] {"seq":104,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"method\":\"streamNotify\",\"params\":{\"streamId\":\"Stdout\",\"event\":{\"type\":\"Event\",\"kind\":\"WriteEvent\",\"isolateGroup\":{\"type\":\"@IsolateGroup\",\"id\":\"isolateGroups/4223293308404791\",\"name\":\"test_file.dart\",\"number\":\"4223293308404791\",\"isSystemIsolateGroup\":false},\"isolate\":{\"type\":\"@Isolate\",\"id\":\"isolates/6673880339557787\",\"name\":\"main\",\"number\":\"6673880339557787\",\"isSystemIsolate\":false,\"isolateGroupId\":\"isolateGroups/4223293308404791\"},\"timestamp\":1727321975327,\"bytes\":\"YXJnczogW29uZSwgdHdvXQo=\"}}}"},"event":"dart.log"}
<== [DAP] {"seq":105,"type":"event","body":{"category":"stdout","output":"args: [one, two]\n"},"event":"output"}
<== [DAP] {"seq":106,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"method\":\"streamNotify\",\"params\":{\"streamId\":\"Debug\",\"event\":{\"type\":\"Event\",\"kind\":\"PauseExit\",\"isolateGroup\":{\"type\":\"@IsolateGroup\",\"id\":\"isolateGroups/4223293308404791\",\"name\":\"test_file.dart\",\"number\":\"4223293308404791\",\"isSystemIsolateGroup\":false},\"isolate\":{\"type\":\"@Isolate\",\"id\":\"isolates/6673880339557787\",\"name\":\"main\",\"number\":\"6673880339557787\",\"isSystemIsolate\":false,\"isolateGroupId\":\"isolateGroups/4223293308404791\"},\"timestamp\":1727321975328}}}"},"event":"dart.log"}
<== [DAP] {"seq":107,"type":"event","body":{"reason":"exit","threadId":1},"event":"stopped"}
==> [DAP] {"seq":7,"type":"request","arguments":{"threadId":1},"command":"continue"}
<== [DAP] {"seq":108,"type":"event","body":{"message":"==> [VM] {\"jsonrpc\":\"2.0\",\"id\":\"40\",\"method\":\"readyToResume\",\"params\":{\"isolateId\":\"isolates/6673880339557787\"}}"},"event":"dart.log"}
<== [DAP] {"seq":109,"type":"event","body":{"message":"<== [VM] {\"jsonrpc\":\"2.0\",\"result\":{\"type\":\"Success\"},\"id\":\"40\"}"},"event":"dart.log"}
<== [DAP] {"seq":110,"type":"event","body":{"message":"==> [VM] {\"jsonrpc\":\"2.0\",\"id\":\"41\",\"method\":\"resume\",\"params\":{\"isolateId\":\"isolates/6673880339557787\"}}"},"event":"dart.log"}
<== [DAP] {
    "seq": 111,
    "type": "response",
    "body": {
        "error": {
            "format": "{message}",
            "id": 1,
            "showUser": false,
            "variables": {
                "message": "resume: (-32000) Service connection disposed",
                "stack": "
#0      new _OutstandingRequest (package:vm_service/src/vm_service.dart:265:34)
#1      VmService._call.<anonymous closure> (package:vm_service/src/vm_service.dart:1921:25)
#2      VmService._call (package:vm_service/src/vm_service.dart:1933:8)
#3      VmService.resume (package:vm_service/src/vm_service.dart:1623:7)
#4      IsolateManager._resume (package:dds/src/dap/isolate_manager.dart:363:33)
#5      IsolateManager.resumeThread (package:dds/src/dap/isolate_manager.dart:291:11)
#6      DartDebugAdapter.continueRequest (package:dds/src/dap/adapters/dart.dart:943:26)
#7      BaseDebugAdapter.handle (package:dds/src/dap/base_debug_adapter.dart:141:20)
#8      BaseDebugAdapter.handleIncomingRequest (package:dds/src/dap/base_debug_adapter.dart:400:7)
#9      BaseDebugAdapter._handleIncomingMessage (package:dds/src/dap/base_debug_adapter.dart:311:7)
#10     ByteStreamServerChannel._readMessage (package:dds/src/dap/protocol_stream.dart:82:18)
#11     ByteStreamServerChannel.listen.<anonymous closure> (package:dds/src/dap/protocol_stream.dart:53:24)
#12     _RootZone.runUnaryGuarded (dart:async/zone.dart:1609:10)
#13     _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:366:11)
#14     _DelayedData.perform (dart:async/stream_impl.dart:542:14)
#15     _PendingEvents.handleNext (dart:async/stream_impl.dart:647:11)
#16     _PendingEvents.schedule.<anonymous closure> (dart:async/stream_impl.dart:618:7)
#17     _microtaskLoop (dart:async/schedule_microtask.dart:40:21)
#18     _startMicrotaskLoop (dart:async/schedule_microtask.dart:49:5)
#19     _runPendingImmediateCallback (dart:isolate-patch/isolate_patch.dart:118:13)
#20     _RawReceivePort._handleMessage (dart:isolate-patch/isolate_patch.dart:185:5)
"
            }
        }
    },
    "command": "continue",
    "message": "resume: (-32000) Service connection disposed",
    "request_seq": 7,
    "success": false
}

@DanTup
Copy link
Collaborator

DanTup commented Sep 26, 2024

Struggling to repro locally, but I think I understand the problem..

The two resumes are actually correct. The first (readyToResume) is the debug adapter stating that it is ready to resume, and the resume is the user(test)-initiated resume. Both of these are required for the VM to resume.

I think there are two possibilities here:

  1. The first call (readyToResume) is actually causing the VM to resume (which then exits), however requireUserPermissionToResume is being set correctly and in testing and in my testing that works correctly
  2. The second call (resume) is correctly what's causing the VM to resume, but it exits so quickly, that the response to the resume call is not received by the client before the VM service connection drops, which results in this error

@bkonyi does (2) seem feasible? We do handle a lot of errors when trying to resume, however we don't currently handle a "Service connection disposed"-style error. I feel that we probably should though - for a resume from stopped-on-exit, it seems like there is a good chance that the VM could vanish before we handle the response to the resume?

@bkonyi
Copy link
Contributor

bkonyi commented Sep 26, 2024

If the test target exits immediately after being resumed, 2. seems like the most likely case. I think your CL should fix this issue.

copybara-service bot pushed a commit that referenced this issue Sep 26, 2024
This will hopefully fix the flaky tests noted in #55313. I believe when we resume from pause-on-exit, the VM might shut down fast enough that the response is not handled and throws "Service connection disposed" (see #55313 (comment)).

This change will discard any known shutdown errors while sending the request instead.

Change-Id: I7c10aa3eddd6e651114ddba948b0a17688352720
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/386806
Reviewed-by: Helin Shiah <helinx@google.com>
Commit-Queue: Helin Shiah <helinx@google.com>
Reviewed-by: Ben Konyi <bkonyi@google.com>
@DanTup
Copy link
Collaborator

DanTup commented Oct 9, 2024

All of the recent events (since 26th Sept when my change above landed) on https://dart-ci.firebaseapp.com/#showLatestFailures=false&test=pkg/dds/test/dap/integration/debug_attach_test seem to be the status changing to passed.

I don't know if that means the fix completely solved this though - in particular I don't know why it took until Oct 1st for it to change from flakey -> Pass for unittest-asserts-release-win-x64. Does that mean it was still flaky from 26th Sept to 1st Oct? Is there a way to find the failure logs for one of the runs between those dates? (I tried the "Failures" tab, but it doesn't list any results at all).

@derekxu16
Copy link
Member Author

In particular I don't know why it took until Oct 1st for it to change from flakey -> Pass for unittest-asserts-release-win-x64.

The test needs to pass some number of times in a row to change from flaky -> Pass. I'm not sure what the exact number is, I think it's at least 100.

Is there a way to find the failure logs for one of the runs between those dates?

The best way I know of is to go to the builder page, in this case: https://ci.chromium.org/ui/p/dart/builders/ci.sandbox/pkg-win-release, use the "created before" filter, and then open a build page and look at the "ignored flaky test failure logs" under the "test results" step.

@DanTup
Copy link
Collaborator

DanTup commented Oct 9, 2024

The test needs to pass some number of times in a row to change from flaky -> Pass. I'm not sure what the exact number is, I think it's at least 100.

Ah, that explains the delay. I think this is resolved then - but I'll check back in around 1wk and ensure none of them have gone back flaky from my link above, and if so close this. Thanks!

@DanTup
Copy link
Collaborator

DanTup commented Oct 21, 2024

@derekxu16 looking through the link you posted and the one I posted earlier, I can't find any failures/flakes for this test. I think we can close this, but I'd like a second opinion in case I've missed something 😄

Thanks!

@a-siva
Copy link
Contributor

a-siva commented Oct 21, 2024

Screenshot 2024-10-21 at 10 02 16 AM

Does not seem to show flakes for this test.

I do see some flakes for other dap tests, are they also related to the same issue ?

Screenshot 2024-10-21 at 10 06 26 AM

@DanTup
Copy link
Collaborator

DanTup commented Oct 21, 2024

I do see some flakes for other dap tests, are they also related to the same issue ?

Hmm, not sure - let's keep this open and I'll dig through those. Thanks!

@DanTup DanTup self-assigned this Oct 21, 2024
@DanTup DanTup changed the title pkg/dds/test/dap/integration/debug_attach_test is failing on Windows pkg/dds/test/dap/integration tests are flaky Oct 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area-tools A meta category for issues that should be addressed by tooling (prefer more concrete areas). dds-dap DDS issues related to the Debug Adapter Protocol (DAP) implementation gardening
Projects
None yet
Development

No branches or pull requests

7 participants