Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: fix flaking e2e test (DRAFT) #901

Merged
merged 35 commits into from Mar 27, 2024
Merged

test: fix flaking e2e test (DRAFT) #901

merged 35 commits into from Mar 27, 2024

Conversation

gkc
Copy link
Contributor

@gkc gkc commented Mar 25, 2024

- What I did

  • fix the flakiness of the npt_to_port_22 test
  • Fixes e2e test 'npt_to_port_22' is flaking #900
  • There is more to do in the future with regards to making diagnosis easier. The changes in this PR have made it much easier to diagnose problems which are srv-related, but we need to do more. I will noodle on this for a while.

- How I did it

  • Actual fix is in this commit
  • Everything else in this PR is enhancements of programmes' log output and of the test framework itself to make it possible to even diagnose what the actual problem was
    • Significant changes were made to the logging approach of packages/dart/sshnoports/bin/srv.dart - it now logs to a tmp file as well as (initially) to stderr; later on it stops logging to stderr and only to the tmp file. If it exits cleanly then it removes the tmp file; if not it leaves the tmp file in place so we have some hope of doing diagnosis.

- How to verify it

  • Tests should not fail unless there is a genuine problem

@gkc
Copy link
Contributor Author

gkc commented Mar 25, 2024

Problem identified, cause not yet clear. Log excerpt from this test run

2024-03-25 12:29:57.515 | Test 27 of 72 | Test script: npt_to_port_22 Client: Dart  (branch) Daemon: Dart  (branch)
2024-03-25 12:29:57.546 | Executing /home/ubuntu/noports/tests/e2e_all/runtime/29d8e7d7/binaries/branch/npt -f @apricoteventual -d 29d8e7d7dcf  -t @combinedbelgian -r @rv_am  --root-domain root.atsign.org --remote-port 22 --exit-when-connected
2024-03-25 12:30:01.571 | Executing ssh -p 35233 -o StrictHostKeyChecking=accept-new -o IdentitiesOnly=yes ubuntu@localhost -i /home/ubuntu/.ssh/e2e_all.29d8e7d7
    test execution's stdout: 
        2024-03-25 12:29:57.546 | Executing /home/ubuntu/noports/tests/e2e_all/runtime/29d8e7d7/binaries/branch/npt -f @apricoteventual -d 29d8e7d7dcf  -t @combinedbelgian -r @rv_am  --root-domain root.atsign.org --remote-port 22 --exit-when-connected

        2024-03-25 12:30:01.546 | Running ps for the spawned srv process BEFORE running ssh
        ==> NOTE THAT THERE IS NO srv process running for the port number which npt generated (35233)

        2024-03-25 12:30:01.571 | Executing ssh -p 35233 -o StrictHostKeyChecking=accept-new -o IdentitiesOnly=yes ubuntu@localhost -i /home/ubuntu/.ssh/e2e_all.29d8e7d7
        2024-03-25 12:30:01.602 | Running ps for the spawned srv process AFTER running ssh
    test execution's stderr: 
Connecting ... Connected
        2024-03-25 12:29:58.883168 : Sending daemon feature check request
        2024-03-25 12:29:58.885847 : Fetching host and port from srvd
        2024-03-25 12:30:00.123866 : Received host and port from srvd
        2024-03-25 12:30:00.123932 : Waiting for daemon feature check response
        2024-03-25 12:30:00.123941 : Received daemon feature check response
        2024-03-25 12:30:00.125941 : Required daemon features are supported
        2024-03-25 12:30:00.653588 : Sending session request to the device daemon
        2024-03-25 12:30:00.976026 : Waiting for response from the device daemon
        2024-03-25 12:30:01.429323 : Received response from the device daemon
        2024-03-25 12:30:01.429723 : Finding an available local port
        2024-03-25 12:30:01.429778 : Creating connection to socket rendezvous
        2024-03-25 12:30:01.533381 : requested localPort 0 ;  actual localPort 35233
        2024-03-25 12:30:01.533473 : Cleaning up temporary files

        ==> FAILING because the srv is not running! ssh: connect to host localhost port 35233: Connection refused

@gkc
Copy link
Contributor Author

gkc commented Mar 27, 2024

By wrapping the srv in a shell script and having the client execute that shell script instead of the srv binary, I have determined that there is indeed an unhandled exception somewhere (because the exit code is 255). What I don't understand yet is why the runZonedGuarded error handler does not appear to be handling that exception

For reference, the shell script

#!/bin/bash

# Sometimes while testing it's handy to execute this script rather than executing the srv directly

binDir="$(dirname -- "$0")"

mkdir -p /tmp/noports

# "${binDir}/srv" "$@" 2>&1 | tee -a /tmp/noports/srv.sh.$$.log
"${binDir}/srv" "$@"
echo "Exit code was $? " >> /tmp/noports/srv.sh.$$.log

gkc added 4 commits March 27, 2024 14:41
…rr when spawned by a parent which has exited

- Have bin/srv.dart set a LoggingHandler which will write to both a temporary file and stderr while the parent should still be alive but will only write to the temporary file once the parent can be expected to have exited.
- The temporary file is placed in the platform-appropriate place (/tmp on unix, %TEMP% on Windows)
- The temporary file is removed by the srv process itself if it exits normally, but is not removed if it exits abnormally. This is to enable diagnosis if problems occur.

Yes, this is all a bit annoying.
@gkc
Copy link
Contributor Author

gkc commented Mar 27, 2024

I've finally got a smoking gun after making some substantial changes to how the logging works in the srv process.

The core problem is that the parent process (sshnp or npt) reads what it needs from the child process's stderr and then exits but somehow, sometimes, this happens before the child process has finished using stderr.

SHOUT|2024-03-27 14:55:40.236002| srv.main |Unhandled exception FileSystemException: writeFrom failed, path = '' (OS Error: Broken pipe, errno = 32); stackTrace follows
#0      _RandomAccessFile.writeFromSync (dart:io/file_impl.dart:905)
#1      _StdConsumer.addStream.<anonymous closure> (dart:io/stdio.dart:309)
#2      _rootRunUnary (dart:async/zone.dart:1407)
#3      _CustomZone.runUnary (dart:async/zone.dart:1308)
#4      _CustomZone.runUnaryGuarded (dart:async/zone.dart:1217)
#5      _BufferingStreamSubscription._sendData (dart:async/stream_impl.dart:339)
#6      _BufferingStreamSubscription._add (dart:async/stream_impl.dart:271)
#7      _SyncStreamControllerDispatch._sendData (dart:async/stream_controller.dart:784)
#8      _StreamController._add (dart:async/stream_controller.dart:658)
#9      _StreamController.add (dart:async/stream_controller.dart:606)
#10     _StreamSinkImpl.add (dart:io/io_sink.dart:154)
#11     _IOSinkImpl.write (dart:io/io_sink.dart:287)
#12     _IOSinkImpl.writeln (dart:io/io_sink.dart:308)
#13     _StdSink.writeln (dart:io/stdio.dart:342)
#14     SrvImplDart.run (package:noports_core/src/srv/srv_impl.dart:421)

srv_impl.dart line 421 is : stderr.writeln(Srv.startedString);

gkc added 4 commits March 27, 2024 16:03
- in SrvImplExec, add a grace period of 100 milliseconds after detecting what the child process has written to its stderr
- in SrvImplDart, catch any exception thrown by stderr.writeln
…etries). atServers need to get more reliable
@gkc gkc marked this pull request as ready for review March 27, 2024 17:01
await rvPortBound.future.timeout(Duration(seconds: 2));
await rvPortBound.future.timeout(Duration(seconds: 3));

await Future.delayed(Duration(milliseconds: 100));
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A short grace period which should mean that the child process doesn't encounter an exception from stderr.writeln

' to stderr: ${e.toString()} ;'
' stackTrace follows:\n'
'$st');
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the child process does encounter an exception from stderr.writeln, then catch it, log it, but continue to execute

@gkc gkc merged commit 555b304 into trunk Mar 27, 2024
7 checks passed
@gkc gkc deleted the gkc/fix-flaking-test branch March 27, 2024 22:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

e2e test 'npt_to_port_22' is flaking
2 participants