Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) #5090

Closed
szmodz opened this issue Apr 9, 2020 · 23 comments
Closed

Comments

@szmodz
Copy link

szmodz commented Apr 9, 2020

Code to reproduce is based on a modified https_mbedtls example from esp-idf:
https://github.com/szmodz/esp-idf/commits/mbedtls_crash

The code always ends up causing a WDT timeout sooner or later (timeout set to 60s) in UNICORE mode. CONFIG_MBEDTLS_HARDWARE_xxx are disabled, but it doesn't really matter.

Seems to be MUCH more stable with dupldst rather than memw workaround (not sure if it's completely stable though - too soon to tell). No other changes were made, and the effect on stability is obvious.

Perhaps it would be a good idea to also supply newlib and sdk libs compiled with dupldst?

The handshake is using nonblocking sockets and resets the WDT, so I don't think it should ever exceed the 60s timeout. That, plus it seems stable (so far) with dupldst.

It's a pretty intensive load, but that way the problem can be reproduced quickly. It also occurs under normal load conditions, but not as often. I noticed it in my actual application code.

  • Development Kit: ESP32-DevKitC (but also custom hardware with ethernet)
  • Kit version (for WroverKit/PicoKit/DevKitC): v4
  • Module or chip used: ESP32-WROVER-B
  • Build System: Make
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): 2020r1
  • Operating System: macOS
  • Using an IDE?: No
  • Power Supply: USB (custom HW has external power)

example output:

E (116633) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt:  - https_get_task (CPU 0/1)
E (116633) task_wdt: Tasks currently running:
E (116633) task_wdt: CPU 0: https_get_task
E (116633) task_wdt: Aborting.

abort() was called at PC 0x400d2e04 on core 0
0x400d2e04: esp_task_wdt_init at /Users/szmodz/xxx/esp-idf/components/esp_common/src/task_wdt.c:213


Backtrace:0x40087359:0x3ffb1350 0x400877dd:0x3ffb1370 0x40095c11:0x3ffb1390 0x400d2e04:0x3ffb1410 0x40082e8e:0x3ffb1430 0x400953a8:0x3fff2210 0x400e817e:0x3fff2220 0x400e8a17:0x3fff2240 0x400e8b1a:0x3fff2270 0x400ddab1:0x3fff2290 0x400de16b:0x3fff2320 0x400df2ba:0x3fff24c0 0x400eace9:0x3fff2650 0x400eaeaf:0x3fff26c0 0x400da5c1:0x3fff26f0 0x400d72e5:0x3fff27b0 0x400d7304:0x3fff27d0 0x400d732a:0x3fff27f0 0x400d4eb4:0x3fff2810
0x40087359: print_illegal_instruction_details at /Users/szmodz/xxx/esp-idf/components/esp_system/port/panic_handler.c:73

0x400877dd: esp_system_abort at /Users/szmodz/xxx/esp-idf/components/esp_system/system_api.c:68

0x40095c11: abort at /Users/szmodz/xxx/esp-idf/components/newlib/abort.c:46

0x400d2e04: esp_task_wdt_init at /Users/szmodz/xxx/esp-idf/components/esp_common/src/task_wdt.c:213

0x40082e8e: _xt_lowint1 at /Users/szmodz/xxx/esp-idf/components/freertos/xtensa/xtensa_vectors.S:1105

0x400953a8: strrchr at /builds/idf/crosstool-NG/.build/HOST-x86_64-apple-darwin12/xtensa-esp32-elf/src/newlib/newlib/libc/string/strrchr.c:38

0x400e817e: mbedtls_internal_aes_decrypt at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:961

0x400e8a17: mbedtls_aes_crypt_xts at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:1246 (discriminator 3)

0x400e8b1a: mbedtls_aes_crypt_xts at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/aes.c:1222

0x400ddab1: ecp_normalize_jac at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1105

0x400de16b: ecp_normalize_jac_many at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:1173

0x400df2ba: mbedtls_ecp_curve_info_from_tls_id at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:490
 (inlined by) mbedtls_ecp_tls_read_group_id at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecp.c:953

0x400eace9: mbedtls_ecdsa_read_signature_restartable at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ecdsa.c:834

0x400eaeaf: mbedtls_asn1_write_len at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/asn1write.c:58

0x400da5c1: ssl_write_client_key_exchange at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:3160
 (inlined by) mbedtls_ssl_handshake_client_step at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_cli.c:3586

0x400d72e5: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8837

0x400d7304: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8866

0x400d732a: mbedtls_ssl_handshake_free at /Users/szmodz/xxx/esp-idf/components/mbedtls/mbedtls/library/ssl_tls.c:8893

0x400d4eb4: https_get_task at /Users/szmodz/xxx/esp-idf/examples/protocols/mbedtls_crash/main/https_mbedtls_example_main.c:202

@github-actions github-actions bot changed the title mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) mbedtls_ssl_handshake crash (PSRAM unicore + memw workaround) (IDFGH-3068) Apr 9, 2020
@Alvin1Zhang
Copy link
Collaborator

Thanks for reporting, we will look into. @szmodz

@szmodz
Copy link
Author

szmodz commented Apr 10, 2020

updated the test code - noticed a bug, but I don't think it's relevant to the problem.

@szmodz
Copy link
Author

szmodz commented Apr 13, 2020

maybe related: #5112

@szmodz
Copy link
Author

szmodz commented Apr 14, 2020

@iggr @Spritetm
Shouldn't the memw fix look something like this (volatile fix by jcmvbkbc)?
https://github.com/szmodz/gcc/commits/esp_based_on_8_2_0

Currently testing with the above example, seems to work much better, will leave it a while.

@szmodz
Copy link
Author

szmodz commented Apr 14, 2020

I've also pushed a modified version of the nop fix (not tested).

Note the dupldst fix turned out not to be completely stable in my application (I think I know why).

@Spritetm
Copy link
Member

Can you expand a bit more on what you tried to fix in those modifications? I can reverse engineer them from your code probably, but if you can write down a line or two, that's probably faster.

@szmodz
Copy link
Author

szmodz commented Apr 14, 2020

@Spritetm mostly the crashes as described in the linked example. You can reproduce them, right?

Mostly i think this is a bug:
https://github.com/szmodz/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2338
It doesn't matter if the store is volatile, as the memw is inserted before the store, not after. I indicated this in a comment. I think the assignment last_hiqi_store=insn should be unconditional.

https://github.com/szmodz/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2342
Without the volatile fix there's actually a bug here. The're won't be any memw before the (8 or 16 bit) load, and it won't work as expected (unprotected store-load sequence). I also think the volatile test should be inside if (store_insn) , store_insn=NULL should be executed in case of volatile loads, as the load itself should provide the barrier.

Other changes mostly aimed at improving readability and preventing some redundant memws. Do you really need two memws for a 8/16 bit store near an unconditional jump?

Let me know if something is still not clear - I'll try to expand or rephrase.

@szmodz
Copy link
Author

szmodz commented Apr 14, 2020

@igrr @Spritetm the original nop-based workaround has a bug too.

insns_since_store is incremented only here:
https://github.com/espressif/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2297

it should also be incremented here:
https://github.com/espressif/gcc/blob/9ec9bb29ecbc935015e487ae4d9397a8976099e7/gcc/config/xtensa/xtensa.c#L2292

if (attr_condjmp == CONDJMP_UNCOND) {
...
} else {
insns_since_store++
}

[edit] shouldn't make a difference though as more nops would actually be put in.

I actually tested my version of the nop workaround with neoniousTR's original test case, and it seems to work ok?

https://github.com/szmodz/gcc/blob/3d06d4d9451bcc14f9af63e95d83741231269347/gcc/config/xtensa/xtensa.c#L2269

@szmodz
Copy link
Author

szmodz commented Apr 14, 2020

pre-built linux toolchain:
https://github.com/szmodz/gcc/releases/tag/2020r1-fixed

I've also made a minor forced-push to the repo.

@szmodz
Copy link
Author

szmodz commented Apr 15, 2020

Wait, doesn't the esp32 have a 7 stage pipeline? it says so in the DS.

The nop patch seems to assume a 5 stage pipeline.

@Spritetm
Copy link
Member

Yes, it does, but at that time experiments showed that the first two stages (or the last two, or the first and the last; I don't know exactly) seemingly weren't involved in the load/store inversion.

@Spritetm
Copy link
Member

Also, a bit late perhaps but I can verify the bug and the fact that this is likely a PSRAM bug. On my ESP32 ECO1, the program hangs; on an ESP32 ECO3, the program continues to run and crashes eventually, possibly because an out-of-memory error.

Wrt the bugs: they indeed look like valid fixes/optimizations to me. Unless you think there's more work to be done there, we can take them and run them through our paces in our CI system. It's not perfect (as seemingly it didn't catch the previous bugs in the code) but it should at least tell us if there's some bug the mods may tickle.

@szmodz
Copy link
Author

szmodz commented Apr 16, 2020

Good to go I suppose. Obviously I don't know the whole story behind the PSRAM issues. I didn't touch the dupldst workaround though, similar issues still apply.

ECO3 crash - was that with the original https_mbedtls example, or with the modified mbedtls_crash?

@szmodz
Copy link
Author

szmodz commented Apr 16, 2020

I don't think there are any memory leaks in mbedtls_crash. It's been running for days with the fixed memw workaround. I added printing of free internal and external RAM just now and it also looks ok.

@Spritetm
Copy link
Member

The ECO3 crash was with your modified code. As far as I can see, it spins up a lot of threads creating a TLS connection in parallel, so I assumed it would be an out-of-memory error.

@szmodz
Copy link
Author

szmodz commented Apr 16, 2020

It does spin up threads, but the connection is closed as soon as the TLS handshake is done, it doesn't download any actual content.

Should probably investigate. I'm not getting any crashes with the fixed toolchain:
https://github.com/szmodz/gcc/releases/download/2020r1-fixed/xtensa-esp32-elf.tar.xz

@Spritetm
Copy link
Member

It could possibly be an Internet speed thing. If the connection is made fast enough, the tasks end pretty soon and they don't 'heap up'. My connection here isn't that great, and as such the tasks do get the chance to all be active at the same time, which presumably leads to the out-of-memory condition.

@szmodz
Copy link
Author

szmodz commented Apr 16, 2020

The TLS handshake is actually CPU bound. It always takes around 1s to complete (or longer in case of parallel connections).

I also noticed this:
3feffdf

or it might be related to the volatile bug (fixed in the patched toolchain):
#5112

@szmodz
Copy link
Author

szmodz commented Apr 16, 2020

Also, the number of TLS tasks is a constant 15. Each has an infinite loop.

DRAM usage is constant, doesn't change at all during operation. Only PSRAM is freed / allocated.

@szmodz
Copy link
Author

szmodz commented Apr 17, 2020

updated the example to include free memory information:
https://github.com/szmodz/esp-idf/tree/mbedtls_crash

the updated example is running 23 hours (fixed toolchain), no crashes / hangs so far.

@szmodz
Copy link
Author

szmodz commented Apr 28, 2020

Made a change to the patch. Don't ask me why it matters, but it does.
https://github.com/szmodz/gcc/commits/2020r1-fixed

@antmak
Copy link
Collaborator

antmak commented May 14, 2020

@szmodz Hi! https://github.com/szmodz/esp-idf/tree/mbedtls_crash is 404 Could you show the example?

@szmodz
Copy link
Author

szmodz commented May 14, 2020

@antmak it's back up.
https://github.com/szmodz/esp-idf/tree/mbedtls_crash

@igrr igrr closed this as completed in aa25b16 Jun 12, 2020
projectgus pushed a commit that referenced this issue Jul 22, 2020
Fixes PSRAM issues, volatile loads with -O2, std::locale, C++ exception crashs, FDE sorting switchable;

Includes a 64-bit toolchain for Windows (win64);

Closes #5090
Closes #5112
Closes #5150
espressif-bot pushed a commit that referenced this issue Jul 28, 2020
Fixes PSRAM issues, volatile loads with -O2, std::locale, C++ exception crashs, FDE sorting switchable;

Includes a 64-bit toolchain for Windows (win64);

Closes #5090
Closes #5112
Closes #5150
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants