Housekeeping disconnect #62
Summary of log file
Relevant information from the test
In general, I think a good bailout pattern when finding bad data on receiving is:

```cpp
if (error_condition) {
    std::vector<uint8_t> error_reply(4096); // or other large value to catch all input.
    size_t error_reply_size = TransportLayerMachine::read_some(socket, error_reply, sys_man);
    return return_value;
}
```

This could be used wherever a reply is expected but a zero-length reply is received. It should not be called immediately after finding a zero-length reply; there should be a short wait first so that late data has time to come in.
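To make the "wait a little, then drain" idea concrete, here is a minimal sketch using plain POSIX sockets rather than the project's `TransportLayerMachine::read_some`. The helper name `drain_stale_input`, its parameters, and the use of `poll`/`recv` are assumptions for illustration, not the code in this repository.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

// Hypothetical helper (not part of this project): wait up to `wait` for late
// data on `fd`, then read whatever is pending so it cannot be mistaken for
// the reply to the next request.
std::vector<uint8_t> drain_stale_input(int fd,
                                       std::chrono::milliseconds wait,
                                       std::size_t max_bytes = 4096) {
    assert(max_bytes > 0);
    std::vector<uint8_t> stale(max_bytes);
    std::size_t total = 0;

    pollfd pfd{};
    pfd.fd = fd;
    pfd.events = POLLIN;

    // The first poll gives late data time to arrive. Later polls use a zero
    // timeout, so the loop only picks up bytes that are already queued.
    int timeout_ms = static_cast<int>(wait.count());
    while (total < max_bytes &&
           poll(&pfd, 1, timeout_ms) > 0 &&
           (pfd.revents & POLLIN)) {
        ssize_t n = recv(fd, stale.data() + total, max_bytes - total, MSG_DONTWAIT);
        if (n <= 0) {
            break; // peer closed the connection, or nothing left to read
        }
        total += static_cast<std::size_t>(n);
        timeout_ms = 0;
    }

    stale.resize(total);
    return stale;
}
```

In the bailout branch, a call such as `drain_stale_input(fd, std::chrono::milliseconds(50))` would take the place of the immediate `read_some`. The 50 ms value is only a placeholder; a suitable wait depends on how late the Housekeeping replies actually arrive.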
This example demonstrates the same pattern I currently use for |
Something else to consider. If a timeout occurs, could it be due to shared resource conflicts between the TCP sockets in e.g. Or in |
Feb 6 2024 ran system a few times, never encountered this issue on first run after power cycle. Stopped those tests ~1 hr + in. Saw issue between 2 and 30 minutes after starting run.
Feb 9 2024 saw this issue many times during sequence test. Occurred near turn on of CdTe or CMOS systems, or biasing CdTe. Need to check the run log files from this day for better understanding of timing. When it occurs, the timeout is caught by the new |
Added |
One less invasive option for fixing: add |
Feb 10 2024 made progress after troubled sequence and vibe tests. Built two versions of the main software:
I replicated the problem both times with in 2 of 2 runs of |
…ts a socket argument, and tcp_local_receive_swap is not used in favor of a local variable.
Feb 12 2024 able to operate HK system enough to power on. There are still dropped packets, which in the running version (caaf803) causes the whole HK to be |
Observed behavior
During long data taking runs, after some point the data requests to the Housekeeping system all start to time out. A long time (several minutes? can calculate from saved terminal) later, Formatter hangs when requesting from Housekeeping.
I believe the hang occurs because the TCP connection finally fails.
Sometimes when this issue occurs, I am unable to ping Housekeeping afterwards. Sometimes I can still ping, but never connect.
For flight, it is critical to detect the disconnect before it becomes problematic. Because this seems unrecoverable without a power cycle, it is better to preserve Formatter command forwarding functionality to shut down systems than to try to retain the ability to command power off. Uplink can always cut all power at flight end.
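One way to detect the disconnect early is to count consecutive request timeouts and stop polling Housekeeping once a limit is reached, leaving the rest of the Formatter loop (including command forwarding) running. The sketch below is a hypothetical illustration of that idea; the class name `TimeoutWatchdog`, its methods, and the threshold are assumptions, not code from this repository.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical watchdog: tracks consecutive request timeouts for one system
// and latches a "disconnected" state once the limit is reached.
class TimeoutWatchdog {
public:
    explicit TimeoutWatchdog(std::size_t max_consecutive_timeouts)
        : limit_(max_consecutive_timeouts) {
        assert(limit_ > 0);
    }

    // Call after a request that returned a valid reply.
    void record_reply() {
        if (!disconnected()) {
            consecutive_ = 0;
        }
    }

    // Call after a request that timed out.
    void record_timeout() {
        if (consecutive_ < limit_) {
            ++consecutive_;
        }
    }

    // True once `limit_` timeouts have occurred in a row. The state stays
    // latched until reset(), since the issue reports that recovery seems to
    // need a power cycle.
    bool disconnected() const { return consecutive_ >= limit_; }

    // Call after the system has been power cycled and reconnected.
    void reset() { consecutive_ = 0; }

private:
    std::size_t limit_;
    std::size_t consecutive_ = 0;
};
```

The main loop could check `disconnected()` before each Housekeeping request and skip the request when it returns true, so a dead TCP connection is never touched again and cannot hang the Formatter. The right limit would have to come from the run logs, for example from how many timeouts typically precede the hang.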