mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

szmodz · 2020-04-09T23:50:32Z

Code to reproduce is based on a modified https_mbedtls example from esp-idf:
https://github.com/szmodz/esp-idf/commits/mbedtls_crash

The code always ends up causing a WDT timeout sooner or later (timeout set to 60s) in UNICORE mode. CONFIG_MBEDTLS_HARDWARE_xxx are disabled, but it doesn't really matter.

Seems to be MUCH more stable with dupldst rather than memw workaround (not sure if it's completely stable though - too soon to tell). No other changes were made, and the effect on stability is obvious.

Perhaps it would be a good idea to also supply newlib and sdk libs compiled with dupldst?

The handshake is using nonblocking sockets and resets the WDT, so I don't think it should ever exceed the 60s timeout. That, plus it seems stable (so far) with dupldst.

It's a pretty intensive load, but that way the problem can be reproduced quickly. It also occurs under normal load conditions, but not as often. I noticed it in my actual application code.

Development Kit: ESP32-DevKitC (but also custom hardware with ethernet)
Kit version (for WroverKit/PicoKit/DevKitC): v4
Module or chip used: ESP32-WROVER-B
Build System: Make
Compiler version (run xtensa-esp32-elf-gcc --version to find it): 2020r1
Operating System: macOS
Using an IDE?: No
Power Supply: USB (custom HW has external power)

example output:

E (116633) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt: Tasks currently running:
E (116633) task_wdt: CPU 0: https_get_task
E (116633) task_wdt: Aborting.

abort() was called at PC 0x400d2e04 on core 0
0x400d2e04: esp_task_wdt_init at /Users/szmodz/xxx/esp-idf/components/esp_common/src/task_wdt.c:213


Backtrace:0x40087359:0x3ffb1350 0x400877dd:0x3ffb1370 0x40095c11:0x3ffb1390 0x400d2e04:0x3ffb1410 0x40082e8e:0x3ffb1430 0x400953a8:0x3fff2210 0x400e817e:0x3fff2220 0x400e8a17:0x3fff2240 0x400e8b1a:0x3fff2270 0x400ddab1:0x3fff2290 0x400de16b:0x3fff2320 0x400df2ba:0x3fff24c0 0x400eace9:0x3fff2650 0x400eaeaf:0x3fff26c0 0x400da5c1:0x3fff26f0 0x400d72e5:0x3fff27b0 0x400d7304:0x3fff27d0 0x400d732a:0x3fff27f0 0x400d4eb4:0x3fff2810
0x40087359: print_illegal_instruction_details at /Users/szmodz/xxx/esp-idf/components/esp_system/port/panic_handler.c:73

0x400877dd: esp_system_abort at /Users/szmodz/xxx/esp-idf/components/esp_system/system_api.c:68

0x40095c11: abort at /Users/szmodz/xxx/esp-idf/components/newlib/abort.c:46

0x400d2e04: esp_task_wdt_init at /Users/szmodz/xxx/esp-idf/components/esp_common/src/task_wdt.c:213

0x40082e8e: _xt_lowint1 at /Users/szmodz/xxx/esp-idf/components/freertos/xtensa/xtensa_vectors.S:1105

0x400953a8: strrchr at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/string/strrchr.c:38

0x400e817e: mbedtls_internal_aes_decrypt at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:961

0x400e8a17: mbedtls_aes_crypt_xts at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:1246 (discriminator 3)

0x400e8b1a: mbedtls_aes_crypt_xts at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:1222

0x400ddab1: ecp_normalize_jac at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1105

0x400de16b: ecp_normalize_jac_many at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1173

0x400df2ba: mbedtls_ecp_curve_info_from_tls_id at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:490
 (inlined by) mbedtls_ecp_tls_read_group_id at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:953

0x400eace9: mbedtls_ecdsa_read_signature_restartable at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecdsa.c:834

0x400eaeaf: mbedtls_asn1_write_len at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/asn1write.c:58

0x400da5c1: ssl_write_client_key_exchange at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:3160
 (inlined by) mbedtls_ssl_handshake_client_step at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:3586

0x400d72e5: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8837

0x400d7304: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8866

0x400d732a: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8893

0x400d4eb4: https_get_task at /Users/szmodz/xxx/esp-idf/examples/protocols/mbedtls_crash/main/https_mbedtls_example_main.c:202

The text was updated successfully, but these errors were encountered:

Alvin1Zhang · 2020-04-10T02:20:07Z

Thanks for reporting, we will look into. @szmodz

szmodz · 2020-04-10T02:28:48Z

updated the test code - noticed a bug, but I don't think it's relevant to the problem.

szmodz · 2020-04-13T15:27:02Z

maybe related: #5112

szmodz · 2020-04-14T07:23:59Z

@iggr @Spritetm
Shouldn't the memw fix look something like this (volatile fix by jcmvbkbc)?
https://github.com/szmodz/gcc/commits/esp_based_on_8_2_0

Currently testing with the above example, seems to work much better, will leave it a while.

szmodz · 2020-04-14T13:23:15Z

I've also pushed a modified version of the nop fix (not tested).

Note the dupldst fix turned out not to be completely stable in my application (I think I know why).

Spritetm · 2020-04-14T15:45:36Z

Can you expand a bit more on what you tried to fix in those modifications? I can reverse engineer them from your code probably, but if you can write down a line or two, that's probably faster.

szmodz · 2020-04-14T16:58:01Z

@Spritetm mostly the crashes as described in the linked example. You can reproduce them, right?

Mostly i think this is a bug:
https://github.com/szmodz/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2338
It doesn't matter if the store is volatile, as the memw is inserted before the store, not after. I indicated this in a comment. I think the assignment last_hiqi_store=insn should be unconditional.

https://github.com/szmodz/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2342
Without the volatile fix there's actually a bug here. The're won't be any memw before the (8 or 16 bit) load, and it won't work as expected (unprotected store-load sequence). I also think the volatile test should be inside if (store_insn) , store_insn=NULL should be executed in case of volatile loads, as the load itself should provide the barrier.

Other changes mostly aimed at improving readability and preventing some redundant memws. Do you really need two memws for a 8/16 bit store near an unconditional jump?

Let me know if something is still not clear - I'll try to expand or rephrase.

szmodz · 2020-04-14T20:39:26Z

@igrr @Spritetm the original nop-based workaround has a bug too.

insns_since_store is incremented only here:
https://github.com/espressif/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2297

it should also be incremented here:
https://github.com/espressif/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2292

if (attr_condjmp == CONDJMP_UNCOND) {
...
} else {
insns_since_store++
}

[edit] shouldn't make a difference though as more nops would actually be put in.

I actually tested my version of the nop workaround with neoniousTR's original test case, and it seems to work ok?

https://github.com/szmodz/gcc/blob/3d06d4d9451bcc14f9af63e95d83741231269347/gcc/config/xtensa/xtensa.c#L2269

szmodz · 2020-04-14T23:20:47Z

pre-built linux toolchain:
https://github.com/szmodz/gcc/releases/tag/2020r1-fixed

I've also made a minor forced-push to the repo.

szmodz · 2020-04-15T11:16:52Z

Wait, doesn't the esp32 have a 7 stage pipeline? it says so in the DS.

The nop patch seems to assume a 5 stage pipeline.

Spritetm · 2020-04-15T12:26:13Z

Yes, it does, but at that time experiments showed that the first two stages (or the last two, or the first and the last; I don't know exactly) seemingly weren't involved in the load/store inversion.

Spritetm · 2020-04-16T07:21:59Z

Also, a bit late perhaps but I can verify the bug and the fact that this is likely a PSRAM bug. On my ESP32 ECO1, the program hangs; on an ESP32 ECO3, the program continues to run and crashes eventually, possibly because an out-of-memory error.

Wrt the bugs: they indeed look like valid fixes/optimizations to me. Unless you think there's more work to be done there, we can take them and run them through our paces in our CI system. It's not perfect (as seemingly it didn't catch the previous bugs in the code) but it should at least tell us if there's some bug the mods may tickle.

szmodz · 2020-04-16T09:47:09Z

Good to go I suppose. Obviously I don't know the whole story behind the PSRAM issues. I didn't touch the dupldst workaround though, similar issues still apply.

ECO3 crash - was that with the original https_mbedtls example, or with the modified mbedtls_crash?

szmodz · 2020-04-16T11:57:06Z

I don't think there are any memory leaks in mbedtls_crash. It's been running for days with the fixed memw workaround. I added printing of free internal and external RAM just now and it also looks ok.

Spritetm · 2020-04-16T12:35:13Z

The ECO3 crash was with your modified code. As far as I can see, it spins up a lot of threads creating a TLS connection in parallel, so I assumed it would be an out-of-memory error.

szmodz · 2020-04-16T12:39:14Z

It does spin up threads, but the connection is closed as soon as the TLS handshake is done, it doesn't download any actual content.

Should probably investigate. I'm not getting any crashes with the fixed toolchain:
https://github.com/szmodz/gcc/releases/download/2020r1-fixed/xtensa-esp32-elf.tar.xz

Spritetm · 2020-04-16T12:54:01Z

It could possibly be an Internet speed thing. If the connection is made fast enough, the tasks end pretty soon and they don't 'heap up'. My connection here isn't that great, and as such the tasks do get the chance to all be active at the same time, which presumably leads to the out-of-memory condition.

szmodz · 2020-04-16T13:01:12Z

The TLS handshake is actually CPU bound. It always takes around 1s to complete (or longer in case of parallel connections).

I also noticed this:
3feffdf

or it might be related to the volatile bug (fixed in the patched toolchain):
#5112

szmodz · 2020-04-16T13:57:12Z

Also, the number of TLS tasks is a constant 15. Each has an infinite loop.

DRAM usage is constant, doesn't change at all during operation. Only PSRAM is freed / allocated.

szmodz · 2020-04-17T10:51:52Z

updated the example to include free memory information:
https://github.com/szmodz/esp-idf/tree/mbedtls_crash

the updated example is running 23 hours (fixed toolchain), no crashes / hangs so far.

szmodz · 2020-04-28T22:22:56Z

Made a change to the patch. Don't ask me why it matters, but it does.
https://github.com/szmodz/gcc/commits/2020r1-fixed

antmak · 2020-05-14T07:14:01Z

@szmodz Hi! https://github.com/szmodz/esp-idf/tree/mbedtls_crash is 404 Could you show the example?

szmodz · 2020-05-14T09:00:16Z

@antmak it's back up.
https://github.com/szmodz/esp-idf/tree/mbedtls_crash

Fixes PSRAM issues, volatile loads with -O2, std::locale, C++ exception crashs, FDE sorting switchable; Includes a 64-bit toolchain for Windows (win64); Closes #5090 Closes #5112 Closes #5150

github-actions bot changed the title ~~mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround)~~ mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) Apr 9, 2020

szmodz mentioned this issue Apr 9, 2020

PSRAM Cache Issue stills exist (IDFGH-31) #2892

Open

igrr closed this as completed in aa25b16 Jun 12, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

szmodz commented Apr 9, 2020 •

edited

Loading

Alvin1Zhang commented Apr 10, 2020

szmodz commented Apr 10, 2020

szmodz commented Apr 13, 2020

szmodz commented Apr 14, 2020

szmodz commented Apr 14, 2020

Spritetm commented Apr 14, 2020

szmodz commented Apr 14, 2020 •

edited

Loading

szmodz commented Apr 14, 2020 •

edited

Loading

szmodz commented Apr 14, 2020

szmodz commented Apr 15, 2020

Spritetm commented Apr 15, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020 •

edited

Loading

szmodz commented Apr 16, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020 •

edited

Loading

szmodz commented Apr 16, 2020 •

edited

Loading

szmodz commented Apr 17, 2020

szmodz commented Apr 28, 2020

antmak commented May 14, 2020

szmodz commented May 14, 2020

mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

Comments

szmodz commented Apr 9, 2020 • edited Loading

Alvin1Zhang commented Apr 10, 2020

szmodz commented Apr 10, 2020

szmodz commented Apr 13, 2020

szmodz commented Apr 14, 2020

szmodz commented Apr 14, 2020

Spritetm commented Apr 14, 2020

szmodz commented Apr 14, 2020 • edited Loading

szmodz commented Apr 14, 2020 • edited Loading

szmodz commented Apr 14, 2020

szmodz commented Apr 15, 2020

Spritetm commented Apr 15, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020 • edited Loading

szmodz commented Apr 16, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020

Spritetm commented Apr 16, 2020

szmodz commented Apr 16, 2020 • edited Loading

szmodz commented Apr 16, 2020 • edited Loading

szmodz commented Apr 17, 2020

szmodz commented Apr 28, 2020

antmak commented May 14, 2020

szmodz commented May 14, 2020

szmodz commented Apr 9, 2020 •

edited

Loading

szmodz commented Apr 14, 2020 •

edited

Loading

szmodz commented Apr 14, 2020 •

edited

Loading

szmodz commented Apr 16, 2020 •

edited

Loading

szmodz commented Apr 16, 2020 •

edited

Loading

szmodz commented Apr 16, 2020 •

edited

Loading