Housekeeping disconnect #62

Open
thanasipantazides opened this issue Feb 4, 2024 · 10 comments

Comments

@thanasipantazides
Contributor

Observed behavior

During long data-taking runs, data requests to the Housekeeping system eventually all start to time out. Some time later (several minutes? This can be calculated from the saved terminal output), Formatter hangs when requesting data from Housekeeping.

I believe the hang occurs because the TCP connection finally fails.

Sometimes when this issue occurs, I am unable to ping Housekeeping afterwards. Other times I can still ping it, but can never connect.

For flight, it is critical to detect the disconnect before it becomes problematic. Because this seems unrecoverable without a power cycle, it is better to preserve Formatter's command-forwarding functionality, so that systems can still be shut down, than to try to retain the ability to command power off. Uplink can always cut all power at flight end.

@thanasipantazides
Contributor Author

thanasipantazides commented Feb 5, 2024

Summary of log file

  • Checking for this error in feb3/run12/formatter_terminal.txt
  • Housekeeping system is hit 2187 times (number of main manage_systems() loop iterations)
    • Timeout on read for the AD7490 packet in loop iteration 2046, line 887976 in formatter_terminal.txt. Still getting status, clock, and error packets from the Housekeeping board. RTD data receive/downlink information is not printed in the terminal for this version, so it is unclear whether it is OK for this loop iteration.
    • On next iteration (2047), I get values from Housekeeping system:
managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x09b614982a263a36491358a0688b7899888c99afa8aab8adc8a9d8abe8cdf8d0
stt:	0x0000
clk:	0x19ee
err:	0x0000
  • Compare these to the Housekeeping data received in iteration 2046, in which no AD7490 data was received:
managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
TransportLayerMachine::read() attempt 0 failed.
All TransportLayerMachine::read() attempts failed!
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x
stt:	0xa203
clk:	0x003c
err:	0x0026
  • And to the subsequent iteration 2047:
managing housekeeping system
in sync_send_buffer_commands_to_system()
no commands in queue
in sync_send_command_to_system(), sending 0x01 f2 
in sync_send_command_to_system(), sending 0x02 f2 
in sync_send_command_to_system(), sending 0x04 20 
in sync_send_command_to_system(), sending 0x07 0b 
in sync_send_command_to_system(), sending 0x07 0f 
in sync_send_command_to_system(), sending 0x07 0e 
adc:	0x00006020010004a8010040f00000418201001aff000040c00000420109b61499
stt:	0x3a38
clk:	0x2a26
err:	0x4918
  • By iteration 2048, I am receiving no data from Housekeeping.

Relevant information from the test

  • No uplink commands received between 2045 and 2047 (to any system).
  • Last good exchange with Housekeeping system starts line 887558 in log.
  • In iteration 2047, the byte pattern is reminiscent of an RTD response. Here it is again, separated into four-byte groups (note the leading 0x01s and 0x00s):
0x00006020 010004a8 010040f0 00004182 01001aff 000040c0 00004201 09b61499
  • There are some unrealistic temperature values present though (16 °C, 1 °C). Recall this test was done with the focal plane at -10 °C and the rest at room temperature.
  • The read failure first appears in iteration 2046, which is eerily close to 2047/2048 (the rollover into a 12th bit).
  • No uplink commands were sent to Housekeeping system during this run at all.
  • I do not print received RTD data from Housekeeping system in the terminal. This information would be helpful to have (to determine if the data received in iteration 2047 is RTD data).
  • There is not a GSE log file for this run.

Next

  • Add printing (or logging) of all received Housekeeping data on the Formatter.
  • Run Formatter + Housekeeping readout, while logging, until this issue is observed again.
  • Consider using ::read_some() instead of ::read() for Housekeeping data, or calling ::read_some() after a read error (to flush input buffer).
  • Use spare Housekeeping board and a Raspberry Pi to do stress test, logging.

@thanasipantazides
Contributor Author

In general, I think a good bailout pattern on receiving bad data is:

if (error_condition) {
    // Read and discard whatever is still pending, to flush the input buffer.
    std::vector<uint8_t> error_reply(4096); // or another value large enough to catch all input.
    size_t error_reply_size = TransportLayerMachine::read_some(socket, error_reply, sys_man);
    return return_value;
}

This could be used wherever a reply is expected but a zero-length reply is received. It should not be called immediately after finding a zero-length reply; there should be a short wait first to let data come in.

@thanasipantazides
Contributor Author

This example demonstrates the same pattern I currently use for read timeouts, but applied to writes. I expect a write to time out if the TCP connection is bad. This could be implemented for Housekeeping communication.

@thanasipantazides
Contributor Author

Something else to consider: if a timeout occurs, could it be due to shared-resource conflicts between the TCP sockets in, e.g., ::run_tcp_context()?

Or in ::tcp_local_receive_swap?

thanasipantazides pushed a commit that referenced this issue Feb 7, 2024
@thanasipantazides
Contributor Author

Feb 6 2024 ran the system a few times and never encountered this issue on the first run after a power cycle. Stopped those tests ~1+ hr in. When the issue did appear, it was between 2 and 30 minutes after starting a run.

@thanasipantazides
Contributor Author

thanasipantazides commented Feb 10, 2024

Feb 9 2024 saw this issue many times during the sequence test. It occurred near turn-on of the CdTe or CMOS systems, or when biasing CdTe. Need to check that day's run log files for a better understanding of the timing. When it occurs, the timeout is caught by the new if in line 566, but the socket reconnection never returns.

@thanasipantazides
Contributor Author

Added ABANDONment to the aforementioned if, but the socket object is still touched during timeout operations. So the nominal plan moving forward is to re-divorce the Housekeeping socket from the SPMU socket in all TransportLayer methods.

@thanasipantazides
Contributor Author

One less invasive option for a fix: add a boost::asio::ip::tcp::socket argument to TransportLayerMachine::run_tcp_context (or the equivalent for the serial port functions), then just call ::cancel() on the passed socket object. The socket information required for this is already passed to the calling context of run_tcp_context(), just not into the inner function.

@thanasipantazides
Contributor Author

Feb 10 2024 made progress after troubled sequence and vibe tests. Built two versions of the main software:

  1. formatter_rtdhk, which queries only the RTDs in the HK system (no introspection data);
  2. formatter_nointrohk, which queries both the RTDs and the power board ADC (also no introspection data).

I replicated the problem in 2 of 2 runs of formatter_nointrohk and in 0 of 2 runs of formatter_rtdhk. So I suspect the issue is in the interface with the AD7490 via the Housekeeping board.

thanasipantazides pushed a commit that referenced this issue Feb 11, 2024
…ts a socket argument, and tcp_local_receive_swap is not used in favor of a local variable.
@thanasipantazides
Contributor Author

Feb 12 2024 able to operate the HK system enough to power on. There are still dropped packets, which in the running version (caaf803) cause the whole HK to be ABANDONed. Should check whether it is possible to handle packet loss without cutting out the whole HK system, i.e. whether RTD non-response is recoverable or not.
