Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mac tool_integration_tests_3_4 is 2.02% flaky #146879

Open
fluttergithubbot opened this issue Apr 17, 2024 · 13 comments
Open

Mac tool_integration_tests_3_4 is 2.02% flaky #146879

fluttergithubbot opened this issue Apr 17, 2024 · 13 comments
Assignees
Labels
c: flake Tests that sometimes, but not always, incorrectly pass P1 High-priority issues at the top of the work list team-tool Owned by Flutter Tool team triaged-tool Triaged by Flutter Tool team

Comments

@fluttergithubbot
Copy link
Contributor

The post-submit test builder Mac tool_integration_tests_3_4 had a flaky ratio 2.02% for the past (up to) 100 commits, which is above our 2.00% threshold.

One recent flaky example for a same commit: https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16001
Commit: c3445dc

Flaky builds:
https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16001
https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/15916

Recent test runs:
https://flutter-dashboard.appspot.com/#/build?taskFilter=Mac%20tool_integration_tests_3_4

Please follow https://github.com/flutter/flutter/wiki/Reducing-Test-Flakiness#fixing-flaky-tests to fix the flakiness and enable the test back after validating the fix (internal dashboard to validate: go/flutter_test_flakiness).

@fluttergithubbot fluttergithubbot added c: flake Tests that sometimes, but not always, incorrectly pass P0 Critical issues such as a build break or regression team-tool Owned by Flutter Tool team labels Apr 17, 2024
@christopherfujino
Copy link
Member

These look very similar to #145812 (comment), but

19:25 +8 -1: test/integration.shard/vmservice_integration_test.dart: Flutter Tool VMService method hotRestart can be called [E]
  TimeoutException after 0:15:00.000000: Test timed out after 15 minutes.

https://logs.chromium.org/logs/flutter/buildbucket/cr-buildbucket/8750489356623213857/+/u/run_test.dart_for_tool_integration_tests_shard_and_subshard_3_4/stdout

@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-04-23 and 2024-04-30 is 1.01%. Flaky number: 1; total number: 99.
One recent flaky example for a same commit: https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16208
Commit: 241078a
Flaky builds:
https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16208

Recent test runs:
https://flutter-dashboard.appspot.com/#/build?taskFilter=Mac%20tool_integration_tests_3_4

auto-submit bot pushed a commit that referenced this issue May 2, 2024
In service of #146879 and #145812. In these issues, we see what appears to be the flutter tool getting stuck somewhere during hot reload. It may help if we knew were exactly where we are getting stuck (preparing assets, writing them to device, etc.).

This PR adds a new parameter to `FlutterTestDriver::run`, `verbose`. When verbose is set, `FlutterTestDriver` will run `flutter` with `--verbose` in its tests. Keep in mind that `FlutterTestDriver` only prints logs from `flutter` when a test fails, so this shouldn't spam the logs of passing tests.

This PR sets the parameter when invoking the flaky tests from #146879 and #145812, so we should see more detailed logs in future flakes.

While this is a low risk PR, you can verify the change by intentionally breaking hot reload code, clearing the cached tool binaries, and then running either of these tests.
@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-05-01 and 2024-05-07 is 1.01%. Flaky number: 1; total number: 99.
One recent flaky example for a same commit: https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16274
Commit: abbaaea
Flaky builds:
https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16274

Recent test runs:
https://flutter-dashboard.appspot.com/#/build?taskFilter=Mac%20tool_integration_tests_3_4

@andrewkolos andrewkolos added the triaged-tool Triaged by Flutter Tool team label May 8, 2024
@andrewkolos
Copy link
Contributor

Here is a recent failure (5/4):

https://ci.chromium.org/ui/p/flutter/builders/prod/Mac%20tool_integration_tests_3_4/16348/overview

and a quick link to one of the flaky tests:

testWithoutContext('hotRestart can be called', () async {

From the logs, it seems we get to stuck somewhere in HotRunner::_restartFromSources.

Interestingly, if I'm reading things correctly, it appears that the app restart just fine since we see the app print twice:

<=stdout=  [  +34 ms] topLevelFunction

109s <=stdout=  [ +497 ms] topLevelFunction

I'll be adding some more traces very soon to try to narrow down where we get stuck.

auto-submit bot pushed a commit that referenced this issue May 8, 2024
In service of #146879 and #145812.

This PR adds some more traces to help us figure out where we are getting blocked. See #146879 (comment) for more context.
@andrewkolos
Copy link
Contributor

andrewkolos commented May 12, 2024

With #147997 landing, we are getting some new information here. Here is an example (quick link to logs).

In the logs, we see we get far enough to print Finished waiting on operations. (callsite), but we don't see the Hot restart performed in Xms that should come shortly after. Given this, I believe the tool is getting stuck somewhere in HotRunner::_launchFromDevFS. The only long-running/async operation I see within is _launchInView, which in turn makes two VM service calls, _flutter.listViews and _flutter.runInView. However, I don't see either of these outgoing calls the logs1, which is really confusing.

Footnotes

  1. From what I can tell, any communications to and from the VM service should appear in the log. Documentation. We see that we listen to communications by listening to some broadcast streams exposed by the VM service here, and we see that any VM service call should write to the onSend stream here. Are we never getting to this line?

@andrewkolos
Copy link
Contributor

The only long-running/async operation I see within is _launchInView, which in turn makes two VM service calls, _flutter.listViews and _flutter.runInView. However, I don't see either of these outgoing calls the logs1, which is really confusing.

I tried running the test locally with a working emulator. Even then, _flutter.listViews or _flutter.runInView don't appear in the logs, so I simply don't understand something there. Regardless, this means that these calls might be occurring as they should, but they might be getting stuck or that listViews is yielding an empty list.

auto-submit bot pushed a commit that referenced this issue May 13, 2024
In service of #146879. See #146879 (comment) in particular for the purpose of this PR. This adds a few more `printTrace` calls to try to see where we are getting stuck.
@andrewkolos
Copy link
Contributor

From the logs of a recent failure, I see
Finished _flutter.runInView, but I do not see Finished running <path here>/main.dart.dill in view _flutterView/0x7f8fc781e420.... I believe the tool is getting stuck here in FlutterVmService::runInView:

Here is the definition of this future for quick reference:

final Future<void> onRunnable = service.onIsolateEvent.firstWhere((vm_service.Event event) {
return event.kind == vm_service.EventKind.kIsolateRunnable;
});

The next step is to figure out why we are rarely not getting an IsolateRunnable event from the VM service. Perhaps we aren't listening properly?

auto-submit bot pushed a commit that referenced this issue May 14, 2024
In service of #146879. See #146879 (comment). I wonder if something is going wrong in the process of subscribing to the isolate event stream in the VM service. This adds a print trace to the `catch` clause of this subscription call.
@andrewkolos andrewkolos added P1 High-priority issues at the top of the work list and removed P0 Critical issues such as a build break or regression labels May 14, 2024
@andrewkolos
Copy link
Contributor

Downgrading to P1 since flakiness is under the 2% treshold.

@andrewkolos
Copy link
Contributor

andrewkolos commented May 18, 2024

Another more recent log with more prints. Here we see Error: streamListen: (103) Stream already subscribed, which means that the tool is properly listening for isolate events.

Reading the logs more carefully, I see the IsolateRunnable event:

<=vm=      {"jsonrpc":"2.0","method":"streamNotify","params":{"streamId":"Isolate","event":{"type":"Event","kind":"IsolateRunnable","isolateGroup":{"type":"@IsolateGroup","id":"isolateGroups/2422844065420505","name":"main.dart","number":"2422844065420505","isSystemIsolateGroup":false},"isolate":{"type":"@Isolate","id":"isolates/762944440870147","name":"main","number":"762944440870147","isSystemIsolate":false,"isolateGroupId":"isolateGroups/2422844065420505"},"timestamp":1715853445413}}}

Here is what is printing this:

_vmService!.onReceive.listen((String s) => _debugPrint(s, topic: '<=vm='));

This appears after we make the _flutter.runInView call, so it seems like the event is occurring, yet onRunnable never completes.

final Future<void> onRunnable = service.onIsolateEvent.firstWhere((vm_service.Event event) {
return event.kind == vm_service.EventKind.kIsolateRunnable;
});

This begs the question: is something wrong with VmService::onIsolateEvent? What if we listened to onReceive (and parsed for IsolateRunnable) instead of listening to onIsolateEvent? I think this is worth continued investigation in case there is some bug in package:vm_service or, even worse, Dart streams.

auto-submit bot pushed a commit that referenced this issue May 20, 2024
…48596)

In service of #146879. Please see #146879 (comment).

In summary, when the test flakes, `onRunnable` is not completing despite the VmService object receiving an `IsolateRunnable` event. 

This PR adds some logging code to print all events received inside of `FlutterVmService::runInView`.
@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-05-14 and 2024-05-21 is 0.00%. Flaky number: 0; total number: 100.

@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-05-21 and 2024-05-28 is 0.00%. Flaky number: 0; total number: 98.

@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-05-28 and 2024-06-04 is 0.00%. Flaky number: 0; total number: 99.

auto-submit bot pushed a commit that referenced this issue Jun 5, 2024
…logged incoming messages (#149756)

In service of #146879. In summary, a bunch of `printTrace` calls were added to try to troubleshoot this flake, but the flake has stopped happening.

This PR more-or-less reverts #148596, since this was the only change that could have _theoretically_ (somehow maybe) stopped the flake. I'll consider reverting the rest of the `printTrace` calls if the flake does not reappear after this PR lands.
@fluttergithubbot
Copy link
Contributor Author

[prod pool] flaky ratio for the past (up to) 100 commits between 2024-06-05 and 2024-06-11 is 0.00%. Flaky number: 0; total number: 95.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
c: flake Tests that sometimes, but not always, incorrectly pass P1 High-priority issues at the top of the work list team-tool Owned by Flutter Tool team triaged-tool Triaged by Flutter Tool team
Projects
None yet
Development

No branches or pull requests

3 participants