Housekeeping disconnect #62

thanasipantazides · 2024-02-04T04:23:20Z

Observed behavior

During long data-taking runs, data requests to the Housekeeping system eventually all start to time out. Some time later (possibly several minutes; the exact delay can be calculated from the saved terminal output), Formatter hangs when requesting data from Housekeeping.

I believe the hang occurs because the TCP connection finally fails.

Sometimes when this issue occurs, I am unable to ping Housekeeping afterwards. Other times I can still ping it, but can never connect.

For flight, it is critical to detect the disconnect before it becomes problematic. Because this seems unrecoverable without a power cycle, it is better to preserve Formatter's command-forwarding functionality (to shut down systems) than to try to retain the ability to command power off. Uplink can always cut all power at the end of flight.

thanasipantazides · 2024-02-05T05:40:13Z

Summary of log file

Checking for this error in feb3/run12/formatter_terminal.txt
Housekeeping system is hit 2187 times (number of main manage_systems() loop iterations)
- Timeout on read for the AD7490 packet in loop iteration 2046 (line 887976 in formatter_terminal.txt). I am still getting status, clock, and error packets from the Housekeeping board. RTD data receive/downlink information is not printed to the terminal in this version, so it is unclear whether the RTD data is OK for this loop iteration.
- On next iteration (2047), I get values from Housekeeping system:

managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x09b614982a263a36491358a0688b7899888c99afa8aab8adc8a9d8abe8cdf8d0
stt:	0x0000
clk:	0x19ee
err:	0x0000

Compare these to the Housekeeping data received in iteration 2046, in which no AD7490 data was received:

managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
TransportLayerMachine::read() attempt 0 failed.
All TransportLayerMachine::read() attempts failed!
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x
stt:	0xa203
clk:	0x003c
err:	0x0026

And to the subsequent iteration 2047:

managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x00006020010004a8010040f00000418201001aff000040c00000420109b61499
stt:	0x3a38
clk:	0x2a26
err:	0x4918

By iteration 2048, I am receiving no data from Housekeeping.
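The iteration numbers above can be recovered from the terminal log mechanically. The following is a minimal sketch, assuming each Housekeeping block starts with the line "managing housekeeping system" and ends with the "err:" line, as in the excerpts above. The names `HkLogSummary` and `summarize_hk_log` are hypothetical, and iterations are counted from 1 here, which may be offset from the numbering used in this issue.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary type: not part of the Formatter code.
struct HkLogSummary {
    std::size_t iterations = 0;                  // Housekeeping blocks seen
    std::vector<std::size_t> failed_iterations;  // 1-based blocks with an exhausted read
};

// Count Housekeeping blocks and record which ones contain a failed read.
// Failure lines outside a Housekeeping block (other systems) are ignored.
HkLogSummary summarize_hk_log(const std::vector<std::string>& lines) {
    const std::string block_start = "managing housekeeping system";
    const std::string block_end = "err:";
    const std::string read_failed = "All TransportLayerMachine::read() attempts failed!";

    HkLogSummary summary;
    bool in_block = false;
    for (const std::string& line : lines) {
        if (line.find(block_start) != std::string::npos) {
            ++summary.iterations;
            in_block = true;
        } else if (in_block && line.find(read_failed) != std::string::npos) {
            // Record each failing block once, even if several reads fail in it.
            if (summary.failed_iterations.empty() ||
                summary.failed_iterations.back() != summary.iterations) {
                summary.failed_iterations.push_back(summary.iterations);
            }
        } else if (in_block && line.compare(0, block_end.size(), block_end) == 0) {
            in_block = false;  // "err:" is the last line printed for a block
        }
    }
    return summary;
}
```

Reading formatter_terminal.txt into a vector of lines and passing it to `summarize_hk_log` would give the total number of Housekeeping iterations and the first iteration with a read failure.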

Relevant information from the test

No uplink commands received between 2045 and 2047 (to any system).
Last good exchange with Housekeeping system starts line 887558 in log.
In the iteration 2047 response shown above, the byte pattern is reminiscent of an RTD response. Here it is again, separated into four-byte groups (note the leading 0x01s and 0x00s):

0x00006020 010004a8 010040f0 00004182 01001aff 000040c0 00004201 09b61499

There are some unrealistic temperature values present though (16 °C, 1 °C). Recall this test was done with the focal plane at -10 °C and the rest at room temperature.
The iteration at which the read failure appears is 2046, which is eerily close to 2047/2048 (2^11, the point at which a counter first needs 12 bits).
No uplink commands were sent to Housekeeping system during this run at all.
I do not print received RTD data from Housekeeping system in the terminal. This information would be helpful to have (to determine if the data received in iteration 2047 is RTD data).
There is not a GSE log file for this run.
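The four-byte grouping used above is easy to automate for future log checks. This is a small sketch with hypothetical helper names (`group_hex`, `count_leading_00_or_01`). It only splits the printed hex string and counts groups whose first byte is 0x00 or 0x01; it does not assume anything else about the RTD packet format.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split a hex string (optionally prefixed with "0x") into groups of
// bytes_per_group bytes. A short trailing group is kept as-is.
std::vector<std::string> group_hex(const std::string& hex, std::size_t bytes_per_group) {
    std::vector<std::string> groups;
    const std::size_t chars = bytes_per_group * 2;  // two hex digits per byte
    if (chars == 0) return groups;
    const std::size_t start = (hex.compare(0, 2, "0x") == 0) ? 2 : 0;
    for (std::size_t i = start; i < hex.size(); i += chars) {
        groups.push_back(hex.substr(i, chars));
    }
    return groups;
}

// Count groups whose first byte is 0x00 or 0x01, the pattern noted above.
std::size_t count_leading_00_or_01(const std::vector<std::string>& groups) {
    std::size_t count = 0;
    for (const std::string& g : groups) {
        if (g.compare(0, 2, "00") == 0 || g.compare(0, 2, "01") == 0) ++count;
    }
    return count;
}
```

Applied to the iteration 2047 `adc:` value with a group size of 4, this reproduces the eight groups listed above; seven of them start with 0x00 or 0x01, and the last (09b61499) does not.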

Next

Add printing (or logging) of all received Housekeeping data on the Formatter.
Run Formatter + Housekeeping readout, while logging, until this issue is observed again.
Consider using ::read_some() instead of ::read() for Housekeeping data, or calling ::read_some() after a read error (to flush input buffer).
Use spare Housekeeping board and a Raspberry Pi to do stress test, logging.

thanasipantazides · 2024-02-05T05:48:08Z

In general, I think a good bailout pattern when finding bad data on receiving is:

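if (error_condition) {
    // Read and discard whatever is waiting, to flush the input buffer.
    // 4096 is an arbitrary large size, chosen to catch all pending input.
    std::vector<uint8_t> error_reply(4096);
    size_t error_reply_size = TransportLayerMachine::read_some(socket, error_reply, sys_man);
    (void)error_reply_size; // the drained bytes are not used
    return return_value;
}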

This could be used wherever a reply is expected but a zero-length reply is received. It should not be called immediately after finding a zero-length reply; there should be a short wait first so that late data can arrive.
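The wait-then-drain idea can be sketched independently of the socket code. In this sketch, `ReadSome` is a hypothetical stand-in for a call like TransportLayerMachine::read_some(), and it is assumed to return 0 (rather than block indefinitely) when nothing is waiting. The settle time, chunk size, and pass limit are placeholder values, not measured ones.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical stand-in for a "read whatever is available" call: fills the
// buffer and returns the number of bytes read (0 if nothing is waiting).
using ReadSome = std::function<std::size_t(std::vector<std::uint8_t>&)>;

// Wait briefly for late data, then read and discard input until the source
// is empty or max_passes reads have been made. Returns the bytes discarded.
std::size_t drain_stale_input(const ReadSome& read_some,
                              std::chrono::milliseconds settle_time,
                              std::size_t chunk_size = 4096,
                              unsigned max_passes = 8) {
    std::this_thread::sleep_for(settle_time);  // give a late reply time to arrive
    std::vector<std::uint8_t> scratch(chunk_size);
    std::size_t discarded = 0;
    for (unsigned pass = 0; pass < max_passes; ++pass) {
        const std::size_t n = read_some(scratch);
        if (n == 0) break;  // nothing left to flush
        discarded += n;
    }
    return discarded;
}
```

Bounding the number of passes keeps the main loop from stalling if the peer keeps streaming data. The returned count could be logged, since a non-zero value means a late or misaligned reply was thrown away.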

thanasipantazides · 2024-02-06T08:09:34Z

This example demonstrates the same pattern I currently use for read timeouts, but for writes. I expect a write to time out if the TCP connection is bad. This could be implemented for Housekeeping communication.
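As an illustration of a write with a deadline, here is a minimal sketch using POSIX poll() and send() on a raw file descriptor instead of Boost.Asio. It is not the Formatter's implementation; `write_with_timeout` and `WriteResult` are hypothetical names, and the code assumes Linux (it uses MSG_NOSIGNAL).

```cpp
#include <cerrno>
#include <cstddef>
#include <poll.h>
#include <sys/socket.h>
#include <sys/types.h>

enum class WriteResult { Ok, Timeout, Error };

// Write len bytes to fd. Each wait for the socket to become writable is
// limited to timeout_ms, so the total time can exceed timeout_ms if the
// data goes out in several pieces.
WriteResult write_with_timeout(int fd, const unsigned char* data,
                               std::size_t len, int timeout_ms) {
    std::size_t sent = 0;
    while (sent < len) {
        pollfd pfd{};
        pfd.fd = fd;
        pfd.events = POLLOUT;
        const int ready = poll(&pfd, 1, timeout_ms);
        if (ready == 0) return WriteResult::Timeout;  // never became writable
        if (ready < 0) {
            if (errno == EINTR) continue;  // interrupted by a signal: retry
            return WriteResult::Error;
        }
        if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) return WriteResult::Error;

        // Non-blocking send so a full buffer cannot stall past the deadline.
        const ssize_t n = send(fd, data + sent, len - sent,
                               MSG_DONTWAIT | MSG_NOSIGNAL);
        if (n < 0) {
            if (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR) continue;
            return WriteResult::Error;
        }
        sent += static_cast<std::size_t>(n);
    }
    return WriteResult::Ok;
}
```

One caveat: send() on a TCP socket returns as soon as the data is copied into the kernel send buffer, so a write timeout only fires once that buffer fills or the kernel reports an error. For the short Housekeeping commands, a dead link may therefore still show up first as a read timeout.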

thanasipantazides · 2024-02-06T08:14:23Z

Something else to consider: if a timeout occurs, could it be due to shared-resource conflicts between the TCP sockets, e.g. in ::run_tcp_context()?

Or in ::tcp_local_receive_swap?

thanasipantazides · 2024-02-07T07:26:56Z

Feb 6 2024: ran the system a few times and never encountered this issue on the first run after a power cycle. Stopped those tests a little over 1 hr in. When the issue did occur, it appeared between 2 and 30 minutes after starting a run.

thanasipantazides · 2024-02-10T01:03:23Z

Feb 9 2024: saw this issue many times during the sequence test. It occurred near turn-on of the CdTe or CMOS systems, or while biasing CdTe. I need to check the run log files from this day for a better understanding of the timing. When it occurs, the timeout is caught by the new if in line 566, but the socket reconnection never returns.

thanasipantazides · 2024-02-10T03:10:22Z

Added ABANDONment to the aforementioned if, but the socket object is still touched during timeout operations. So the nominal plan moving forward is to re-divorce the Housekeeping socket from the SPMU socket in all TransportLayer methods.

thanasipantazides · 2024-02-10T18:00:35Z

One less invasive option for fixing this: add a boost::asio::ip::tcp::socket argument to TransportLayerMachine::run_tcp_context() (or the equivalent for the serial port functions), then just call ::cancel() on the passed socket object. The socket information required for this is passed to the calling context of run_tcp_context() anyway, just not into the inner function.

thanasipantazides · 2024-02-11T04:03:49Z

Feb 10 2024: made progress after troubled sequence and vibe tests. Built two versions of the main software:

formatter_rtdhk, which queries only the RTDs in the HK system (no introspection data);
formatter_nointrohk, which queries both the RTDs and the power board ADC (also no introspection data).

I replicated the problem in 2 of 2 runs of formatter_nointrohk and in 0 of 2 runs of formatter_rtdhk. So I suspect the issue is in the interface with the AD7490 via the Housekeeping board.

thanasipantazides · 2024-02-12T17:32:36Z

Feb 12 2024: able to operate the HK system well enough to power on. There are still dropped packets, which in the running version (caaf803) cause the whole HK system to be ABANDONed. I should check whether it is possible to handle packet loss without cutting out the whole HK system, i.e. whether RTD non-response is recoverable or not.
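One way to tolerate occasional dropped packets is to track missed replies per data source and only give up on a source after several misses in a row. The sketch below is an assumption about how that could look, not the behavior of caaf803: `ChannelHealth`, the channel names, and the threshold are all hypothetical, and the threshold is expected to be at least 1.

```cpp
#include <map>
#include <string>

enum class ChannelState { Ok, Degraded, Abandoned };

// Tracks consecutive missed replies per channel (e.g. "rtd", "adc").
class ChannelHealth {
public:
    explicit ChannelHealth(unsigned abandon_after) : abandon_after_(abandon_after) {}

    // Record the outcome of one request and return the channel's new state.
    ChannelState record(const std::string& channel, bool reply_received) {
        unsigned& misses = misses_[channel];
        if (reply_received) {
            misses = 0;  // any good reply clears the miss count
        } else if (misses < abandon_after_) {
            ++misses;
        }
        return state(channel);
    }

    ChannelState state(const std::string& channel) const {
        const auto it = misses_.find(channel);
        const unsigned misses = (it == misses_.end()) ? 0 : it->second;
        if (misses == 0) return ChannelState::Ok;
        return misses >= abandon_after_ ? ChannelState::Abandoned
                                        : ChannelState::Degraded;
    }

private:
    unsigned abandon_after_;                  // consecutive misses before giving up
    std::map<std::string, unsigned> misses_;  // per-channel miss counters
};
```

With something like this, the main loop could skip only the channel that reports Abandoned (for example the AD7490 readout) and keep polling the RTDs. Whether an abandoned channel should ever be retried is a separate decision left to the caller.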

thanasipantazides pushed a commit that referenced this issue Feb 7, 2024

adds hk readout for all detector systems; configures uplink UART; add…

5508c71

…s an unproven attempt for #62; closes #60

thanasipantazides pushed a commit that referenced this issue Feb 11, 2024

fixes to TransportLayerMachine related to #62. ::run_tcp_context() ge…

caaf803

…ts a socket argument, and tcp_local_receive_swap is not used in favor of a local variable.

thanasipantazides mentioned this issue Oct 8, 2024

Fix lockup foxsi/foxsi4-hk#3

Open
