Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ESP-IDF esp_modem.c data corruption (IDFGH-5435) #7176

Closed
zvasicek56 opened this issue Jun 21, 2021 · 14 comments
Closed

ESP-IDF esp_modem.c data corruption (IDFGH-5435) #7176

zvasicek56 opened this issue Jun 21, 2021 · 14 comments
Labels
Awaiting Response awaiting a response from the author Resolution: Done Issue is done internally Status: Resolved Issue is done internally

Comments

@zvasicek56
Copy link

zvasicek56 commented Jun 21, 2021

Environment

  • Development Kit: [ESP32-Wrover-Kit]
  • Kit version (for WroverKit/PicoKit/DevKitC): [v4]
  • Module or chip used: [ESP32-WROVER-E]
  • IDF version (run git describe --tags to find it):
    // v4.4-dev-1594-g1d7068e4b
  • Example: pppos_client

Problem Description

There is a bug in the recent version of examples\protocols\pppos_client\components\modem\src\esp_modem.c introduced probably in
55d41. The data from the UART's RX buffer are consumed in two events: UART_DATA and UART_PATTERN_DET. This causes duplicate events which manifests in data corruption. Data received in UART_DATA event handler can be overwritten in UART_PATTERN_DET handler under some circumstances. The issue is that both handlers wrongly supposes the esp_dte->buffer is empty despite of the fact that it can contain data written in a previous event.

Debug Log

This issue is hard to debug, so please see the isolated sequence of events (bold), states and actions causing the data corruption:

image

Data (two characters 'OK' added in the fourth event) are overwritten in the last event.

Solution

It is not clear why the data are processed in UART_DATA handler in command mode.

  1. Is it safe to modify the handler as follows?
static void esp_handle_uart_data(esp_modem_dte_t *esp_dte)
{
    if (esp_dte->parent.dce->mode != MODEM_PPP_MODE) return;
  ...
}

  1. Should we rewrite both handlers, i.e. esp_handle_uart_data and esp_handle_uart_pattern, and read data from UART RX buffer as uart_read_bytes(...,esp_dte->buffer + strlen(esp_dte->buffer),...) instead of uart_read_bytes(...,esp_dte->buffer,...)?
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jun 21, 2021
@github-actions github-actions bot changed the title ESP-IDF esp_modem.c data corruption ESP-IDF esp_modem.c data corruption (IDFGH-5435) Jun 21, 2021
@david-cermak
Copy link
Collaborator

Hi @zvasicek56

How often do you see this happening, does it only happen in the COMMAND_MODE?
I've quickly checked the commit you referenced and it works as expected. Getting only UART_PATTERN_DET events in the command mode, since the rx events (UART_DATA events) should be disabled:

https://github.com/espressif/esp-idf/blob/1d7068e4be/examples/protocols/pppos_client/components/modem/src/esp_modem.c#L410-410

It might be possible, however, that we could get these two events in the transition mode, but again in the transition DATA_MODE -> COMMAND_MODE, the pattern detection is disabled, so we should get only data events. Perhaps if some older event was triggered before the transition and processed after it, but doesn't look like this from your log, as you're getting multiple different events.

@zvasicek56
Copy link
Author

zvasicek56 commented Jun 23, 2021

Hi @david-cermak

thank you for your response. The issue is observed in the COMMAND_MODE. According to your description I traced the root cause of that behavior. It is caused by the function uart_flush which is called on line 242. This part of the code is activated if the modem is powering up and emits RDY lines. But this is not the only place where uart_flush or uart_flush_input is called in the esp_modem.c code so there may be different situations when the observed issue happens.

According to the source codes, the function uart_flush is an alias for uart_flush_input. The implementation of uart_flush_input is probably buggy as its execution causes side effects that are not mentioned in the documentation. The function internally calls uart_enable_rx_intr independently on the previous state of the interrupt mask.

It looks that also uart_read_bytes called within esp_handle_uart_pattern can activate rx interrupt as a side effect of uart_check_buf_full call.

@david-cermak
Copy link
Collaborator

@zvasicek56 Thanks for the investigation and sharing more details. I still wasn't able to reproduce the issue, but I think I understand the potential cause of the trouble.

It might be possible, that the pattern detection in the uart driver generates more events before the modem thread gets to receive them and thus reading data behind the currently processing event, causing the potential corruption you described above.
If he assumption is correct, then this WIP fix below should help:
PPPoS-Client-Fix-race-pattern-event-race-condition.patch.txt

Could you please apply the patch and see if this fixes the issue?

@zvasicek56
Copy link
Author

@david-cermak, the patch actually does not address the root of the problem. Data loss still persists and moreover, the patch leads to new errors: a) partial line is given to handle_line, b) "uart read bytes failed" error occurs. This is because the value of event.size is invalid at the time of the execution of the corresponding event handler.

The issue is that both UART and PATTERN interrupts are enabled at the same time as a consequence of the uart_flush call. This causes duplicate events that interfere together.

Please have look at the latest log after patch. The first issue is here: Event I (18244) UART: [UART_PATTERN_DET DATA] ... should be ignored, there is no 0a character in the buffer. This character was already processed in the previous handler, i.e. I (18114) UART: [UART DATA] ... . The previous implementation (before patching) could handle that because the handlers ignored the value of event.size.

Could you please comment the first solution proposed in the initial submission. Is it safe to just ignore all the UART DATA events if the modem driver is in command mode? Anyway, UART driver needs to be patched to fix uart_flush which enables interrupt even if it was disabled before.

E (17004) bg96: bg96_init(185): sync failed
D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17594) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17604) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17614) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17614) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17624) esp-modem: Ignore UART event 7 for DTE with no DCE attached
D (17634) esp-modem: Ignore UART event 7 for DTE with no DCE attached
I (18004) pppos_example: Init modem
I (18004) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0
D (18014) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18024) UART: [UART_PATTERN_DET DATA]: event.size: 4 ; modem mode: 0
D (18044) esp-modem: handle_line : 0x3ffba458 4f 4b 0d 0a |OK..|
D (18054) dce_service: sync ok

I (18054) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0
D (18074) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18084) UART: [UART_PATTERN_DET DATA]: event.size: 4 ; modem mode: 0
D (18104) esp-modem: handle_line : 0x3ffba458 4f 4b 0d 0a |OK..|
D (18114) dce_service: disable echo ok

I (18114) UART: [UART DATA]: event.size: 1 ; modem mode: 0
V (18124) esp-modem:: esp_handle_uart_data, length :1
V (18134) esp-modem: debug_data: uart_pattern_get_pos :-1
V (18224) esp-modem: debug_data: Continuous read in non-data mode : length: 2 char: 0xa
D (18234) esp-modem: handle_line : 0x3ffba458 0d 0a |..|

I (18244) UART: [UART_PATTERN_DET DATA]: event.size: 1 ; modem mode: 0
D (18244) esp-modem esp_handle_uart_pattern: debug_data: 0x3ffba458 30 |0|
D (18254) esp-modem: handle_line : 0x3ffba458 30 |0|

I (18264) UART: [UART DATA]: event.size: 15 ; modem mode: 0
V (18274) esp-modem:: esp_handle_uart_data, length :14
V (18294) esp-modem: debug_data: uart_pattern_get_pos :-1
V (18334) esp-modem: debug_data: Continuous read in non-data mode : length: 15 char: 0xa
D (18344) esp-modem: handle_line : 0x3ffba458 47 20 44 75 6d 6d 79 20 4d 6f 64 65 6c 0d 0a |G Dummy Model..|

I (18354) UART: [UART_PATTERN_DET DATA]: event.size: 1 ; modem mode: 0
D (18374) esp-modem: handle_line : 0x3ffba458 4f |O|

I (18384) UART: [UART DATA]: event.size: 2 ; modem mode: 0
V (18384) esp-modem:: esp_handle_uart_data, length :1
V (18404) esp-modem: debug_data: uart_pattern_get_pos :-1
V (18444) esp-modem: debug_data: Continuous read in non-data mode : length: 2 char: 0xd
V (18444) esp-modem: debug_data: Continuous read in non-data mode : length: 3 char: 0xa
D (18454) esp-modem: handle_line : 0x3ffba458 4b 0d 0a |K..|

I (18464) UART: [UART_PATTERN_DET DATA]: event.size: 2 ; modem mode: 0
E (18574) esp-modem: uart read bytes failed
E (18614) esp-modem: esp_modem_dte_send_cmd(322): process command timeout
E (18614) dce_service: esp_modem_dce_get_module_name(311): send command failed
E (18614) bg96: bg96_init(189): get module name failed
I (19624) pppos_example: Init modem

@david-cermak
Copy link
Collaborator

david-cermak commented Jun 28, 2021

@zvasicek56 Thanks for testing the patch. It looks like my assumption was wrong. In this case, I'm afraid, I'm running out of ideas. I've been testing this exact scenario quite extensively, simulating both events at the same time, but no luck in understanding what was happening in your case.
Even if I made both interrupts enabled and both events fired, the data got processed correctly. Are you running only one client or multiple? I this happening after startup, or have you been switching between modes?

Could you please comment the first solution proposed in the initial submission.

I don't like the idea of skipping the data processing based on the operational mode. Both handlers must work in both modes, as they seem to work in my case. Another reason is support for the transition mode, where we could receive both data and commands at once. Have seen issues with processing commands (already in the command mode), just after switching from PPP mode. therefore the data event has to work correctly in a plain command mode.

Have you tried disabling the pattern detection interrupts altogether? Could you please check once? This should work the same way.

@david-cermak
Copy link
Collaborator

Hi @zvasicek56

I wasn't able to reproduce the issue, but I believe there is a potential problem of these two events competing and causing data races.

I was thinking of processing the pattern detection manually for some time already (I've done so in the esp_modem component to easily switch the underlying DTE: uart -> usb...), but haven't really gotten to use only the UART_DATA event, for both PPP data and AT commands.
Here a WIP patch removing the uart pattern detection. Could you please check
Examples-pppos_client-Remove-uart-pattern-detection.patch.txt
if this solves the problems you report? Thanks.

@Alvin1Zhang Alvin1Zhang added the Awaiting Response awaiting a response from the author label Sep 3, 2021
@Alvin1Zhang
Copy link
Collaborator

Thanks for reporting, would you please help share if any updates for the issue? Thanks.

@zvasicek56
Copy link
Author

Thank you very much @david-cermak. I can confirm that your patch, which only uses the uart_data event, works correctly, at least for SIM7600. The issue with two competing events has been partly solved by fixing uart_flush_input, but your solution is safer.

@david-cermak
Copy link
Collaborator

@zvasicek56 Thanks for the feedback! I also think this is a good change. I will clean the patch up and get it merged. This issue is to be closed once merged and synced.

@espressif-bot espressif-bot added Status: In Progress Work is in progress Status: Reviewing Issue is being reviewed and removed Status: Opened Issue is new Status: In Progress Work is in progress labels Sep 13, 2021
@AxelLin
Copy link
Contributor

AxelLin commented Sep 24, 2021

@david-cermak

Just curious why do you use a union keep the "line_buffer_size" since it's no longer used?
https://github.com/espressif/esp-idf/blob/master/examples/protocols/pppos_client/components/modem/include/esp_modem.h#L62-L65

@david-cermak
Copy link
Collaborator

@AxelLin

why do you use a union keep the "line_buffer_size" since it's no longer used?

Not used, but still a kind of public API, so wanted to keep it compatible so that both the new option and the old one is accepted.

@AxelLin
Copy link
Contributor

AxelLin commented Sep 24, 2021

@david-cermak

I think I hit a bug with the change in a90817c.

In esp_handle_uart_data()
esp_dte_handle_line()

I check the uart data (It's actually multiple lines):
I got below, the "+COPS: 0" is unexpected data so esp_dte_handle_line() return error.
+COPS: 0

OK

However, it then stop working.

Remove the MODEM_CHECK from this line, then it works.
https://github.com/espressif/esp-idf/blob/master/examples/protocols/pppos_client/components/modem/src/esp_modem.c#L124

Looks like the MODEM_CHECK causes the second line "OK" never received.

@david-cermak
Copy link
Collaborator

Thanks for the early feedback, I'll look into it and fix it (I've tested this multiline scenario, expecting that the line handlers return ESP_OK to indicate also read more, but ESP_FAIL to stop parsing immediately. Probably missed the fact that this unexpected error is reported but reading/parsing might still continue)

@espressif-bot espressif-bot added Resolution: Done Issue is done internally Status: Resolved Issue is done internally and removed Status: Reviewing Issue is being reviewed labels Sep 24, 2021
igrr pushed a commit that referenced this issue Oct 12, 2021
Regression from a90817c that caused
parsing exit in case of an unexpected result found (line handler
reported ESP_FAIL). Prevously if any error found in the line handler we
just posted an unexpected event, but continued with parsing. Fixed by
allowing for further parsing.

Related: #7176
@xdevelnet
Copy link

Hey! I'm sorry to rise up this issue, but can you tell me in which version of esp-idf this issue has been resolved?

I suppose it would be enough to use

platform = https://github.com/platformio/platform-espressif32.git

right?

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Awaiting Response awaiting a response from the author Resolution: Done Issue is done internally Status: Resolved Issue is done internally
Projects
None yet
Development

No branches or pull requests

6 participants