Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Stochastic OTA Behavior (IDFGH-1198) #3504

Closed
gregjesl opened this issue May 20, 2019 · 13 comments
Closed

Stochastic OTA Behavior (IDFGH-1198) #3504

gregjesl opened this issue May 20, 2019 · 13 comments

Comments

@gregjesl
Copy link
Contributor

Environment

  • Development Kit: ESP32-Wrover-Kit
  • Module or chip used: ESP32-WROOM-32D
  • IDF version: v4.0-dev-548-g21cf6579e
  • Build System: Make
  • Compiler version: 1.22.0-80-g6c4433a5
  • Operating System: Windows
  • Power Supply: USB

Problem Description

HTTPS OTA behavior is stochastic. ~40% of the time the OTA works:

I (505802) esp_https_ota: Starting OTA...
I (505803) esp_https_ota: Writing to partition subtype 17 at offset 0x2b0000
I (510346) esp_https_ota: esp_ota_begin succeeded
I (510346) esp_https_ota: Please Wait. This may take time
I (585177) esp_https_ota: Connection closed, all data received
I (585182) esp_image: segment 0: paddr=0x002b0020 vaddr=0x3f400020 size=0x55340 (348992) map
I (585381) esp_image: segment 1: paddr=0x00305368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (585390) esp_image: segment 2: paddr=0x00308408 vaddr=0x40080000 size=0x00400 (  1024)
I (585392) esp_image: segment 3: paddr=0x00308810 vaddr=0x40080400 size=0x07800 ( 30720)
I (585416) esp_image: segment 4: paddr=0x00310018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (585939) esp_image: segment 5: paddr=0x003eb9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (585964) esp_image: segment 6: paddr=0x003f5a8c vaddr=0x400c0000 size=0x00064 (   100)
I (585967) esp_image: segment 0: paddr=0x002b0020 vaddr=0x3f400020 size=0x55340 (348992) map
I (586175) esp_image: segment 1: paddr=0x00305368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (586184) esp_image: segment 2: paddr=0x00308408 vaddr=0x40080000 size=0x00400 (  1024)
I (586186) esp_image: segment 3: paddr=0x00308810 vaddr=0x40080400 size=0x07800 ( 30720)
I (586210) esp_image: segment 4: paddr=0x00310018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (586727) esp_image: segment 5: paddr=0x003eb9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (586752) esp_image: segment 6: paddr=0x003f5a8c vaddr=0x400c0000 size=0x00064 (   100)
I (586756) esp_https_ota: esp_ota_set_boot_partition succeeded

~50% of the time I receive the following error:

I (86288) esp_https_ota: Starting OTA...
I (86289) esp_https_ota: Writing to partition subtype 16 at offset 0x160000
I (161599) esp_https_ota: Connection closed, all data received
I (161600) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (161802) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (161811) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (161814) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (161838) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (162356) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (162381) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
I (162392) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (162597) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (162606) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (162608) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (162632) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (163158) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (163183) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
CORRUPT HEAP: multi_heap.c:477 detected at 0x3ffbcca4
abort() was called at PC 0x4008bf43 on core 0

ELF file SHA256: a1b70eafd0ff3eb42602cc1d2c26ea2c45a69540f1c86df00f22adef56fb14e3

Backtrace: 0x40088aaf:0x3fff1af0 0x40088dc9:0x3fff1b10 0x4008bf43:0x3fff1b30 0x4008c34d:0x3fff1b50 0x4008252d:0x3fff1                                                                                                                        b70 0x40088961:0x3fff1b90 0x401a5a8a:0x3fff1bb0 0x401a540a:0x3fff1be0

Rebooting...

~10% of the time I receive the following error:

I (21228) esp_https_ota: Starting OTA...
I (21229) esp_https_ota: Writing to partition subtype 16 at offset 0x160000
I (100529) esp_https_ota: Connection closed, all data received
I (100530) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (100733) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (100742) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (100745) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (100768) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (101286) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (101311) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
I (101322) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (101527) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (101535) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (101538) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (101562) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (102084) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (102109) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
assertion "semaphore not initialized" failed: file "C:/esp-idf/components/lwip/lwip/src/api/tcpip.c", line 373, funct                                                                                                                        ion: tcpip_send_msg_wait_sem
abort() was called at PC 0x400d72af on core 0

ELF file SHA256: a1b70eafd0ff3eb42602cc1d2c26ea2c45a69540f1c86df00f22adef56fb14e3

Backtrace: 0x40088aaf:0x3ffde8d0 0x40088dc9:0x3ffde8f0 0x400d72af:0x3ffde910 0x40165a47:0x3ffde940 0x40177035:0x3ffde                                                                                                                        970 0x40177585:0x3ffde990 0x40174bd0:0x3ffde9d0 0x40175286:0x3ffdea00 0x401753aa:0x3ffdea20 0x400f233d:0x3ffdea40 0x4                                                                                                                        000bd83:0x3ffdea60 0x4000182a:0x3ffdea80 0x4019f8e5:0x3ffdeaa0 0x401918c2:0x3ffdead0 0x40191c9e:0x3ffdeaf0 0x40191d36                                                                                                                        :0x3ffdeb90 0x40198835:0x3ffdebb0 0x40192962:0x3ffdec60 0x4019298d:0x3ffdec80 0x4018cbcb:0x3ffdeca0 0x4018cd1c:0x3ffd                                                                                                                        ed50 0x40187e73:0x3ffded70 0x401aa8bb:0x3ffded90 0x4017b2c5:0x3ffdedb0

I've also noticed that the OTA succeeds when writing to partition subtype 17 and fails when writing to partition subtype 16. It is not clear why the subtype is stochastic.

The message I (510346) esp_https_ota: Please Wait. This may take time also appears in successful attempts and does not appear when the OTA fails.

@github-actions github-actions bot changed the title Stochastic OTA Behavior Stochastic OTA Behavior (IDFGH-1198) May 20, 2019
@gregjesl
Copy link
Contributor Author

The following error condition has also appeared:

I (21073) esp_https_ota: Starting OTA...
I (21074) esp_https_ota: Writing to partition subtype 16 at offset 0x160000
I (100343) esp_https_ota: Connection closed, all data received
I (100344) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (100547) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (100556) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (100559) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (100582) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (101099) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (101124) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
I (101135) esp_image: segment 0: paddr=0x00160020 vaddr=0x3f400020 size=0x55340 (348992) map
I (101341) esp_image: segment 1: paddr=0x001b5368 vaddr=0x3ffb0000 size=0x03098 ( 12440)
I (101350) esp_image: segment 2: paddr=0x001b8408 vaddr=0x40080000 size=0x00400 (  1024)
I (101353) esp_image: segment 3: paddr=0x001b8810 vaddr=0x40080400 size=0x07800 ( 30720)
I (101377) esp_image: segment 4: paddr=0x001c0018 vaddr=0x400d0018 size=0xdb984 (899460) map
I (101901) esp_image: segment 5: paddr=0x0029b9a4 vaddr=0x40087c00 size=0x0a0e0 ( 41184)
I (101928) esp_image: segment 6: paddr=0x002a5a8c vaddr=0x400c0000 size=0x00064 (   100)
assertion "next > (intptr_t)block" failed: file "C:/esp-idf/components/heap/multi_heap.c", line 124, function: get_next_block
abort() was called at PC 0x400d72af on core 0

ELF file SHA256: a1b70eafd0ff3eb42602cc1d2c26ea2c45a69540f1c86df00f22adef56fb14e3

Backtrace: 0x40088aaf:0x3fff19f0 0x40088dc9:0x3fff1a10 0x400d72af:0x3fff1a30 0x4008bf26:0x3fff1a60 0x4008c1c3:0x3fff1a80 0x4008c33d:0x3fff1aa0 0x4008252d:0x3fff1ac0 0x40088961:0x3fff1ae0 0x401a5a8a:0x3fff1b00 0x401a540a:0x3fff1b30

@gregjesl
Copy link
Contributor Author

It turns out the successful update log listed above is from an older firmware (the one pulled via the OTA). Despite the resets, it appears that the OTA may be working, although the reset is ungraceful.

@mahavirj
Copy link
Member

@gregjesl Thanks for reporting, which example from IDF you are using, also do you have any changes done on your side?

@gregjesl
Copy link
Contributor Author

I'm using a custom application but I can try the simple OTA example. I do not have any changes to the IDF on my side.

I just pulled the latest from the master branch. I now receive the following error:

I (88646) esp_https_ota: Starting OTA...
I (88647) esp_https_ota: Writing to partition subtype 16 at offset 0x160000
I (93036) esp_https_ota: Connection closed, all data received
CORRUPT HEAP: multi_heap.c:477 detected at 0x3ffd8644
abort() was called at PC 0x4008bf43 on core 0
0x4008bf43: multi_heap_assert at C:/esp-idf/components/heap/multi_heap.c:699


ELF file SHA256: 672b3d31ba027fcda8903bdb619a4ccbaf4694a4207fac41d69c6169ee630f77

Backtrace: 0x40088aaf:0x3fff2280 0x40088dc9:0x3fff22a0 0x4008bf43:0x3fff22c0 0x4008c34d:0x3fff22e0 0x4008252d:0x3fff2300 0x40088961:0x3fff2320 0x401a6186:0x3fff2340 0x401a5b06:0x3fff2370
0x40088aaf: invoke_abort at C:/esp-idf/components/esp32/panic.c:716

0x40088dc9: abort at C:/esp-idf/components/esp32/panic.c:716

0x4008bf43: multi_heap_assert at C:/esp-idf/components/heap/multi_heap.c:699

0x4008c34d: multi_heap_free_impl at C:/esp-idf/components/heap/multi_heap.c:699

0x4008252d: heap_caps_free at C:/esp-idf/components/heap/heap_caps.c:232

0x40088961: free at C:/esp-idf/components/newlib/heap.c:47

0x401a6186: esp_https_ota at C:/esp-idf/components/esp_https_ota/src/esp_https_ota.c:344

@gregjesl
Copy link
Contributor Author

I disabled assertions and the OTA worked; however, the MQTT client is failing to establish a SSL connection with the broker after the OTA (we connect to the broker, send a message saying the update was successful, and then disconnect).

E (165775) esp-tls: mbedtls_ssl_handshake returned -0x2700
I (165776) esp-tls: Failed to verify peer certificate!
I (165777) esp-tls: verification info:   ! The certificate Common Name (CN) does not match with the expected CN

E (165791) esp-tls: Failed to open new connection
E (165793) TRANS_SSL: Failed to open a new connection
E (165799) MQTT_CLIENT: Error transport connect

We are using the same certificate (in the same memory location in code) for the MQTT client and the OTA HTTPS client. It looks like the OTA client may be trying to free the certificate memory...

@gregjesl
Copy link
Contributor Author

The issue is that free() is called on the update handle twice:

@gregjesl
Copy link
Contributor Author

I've created a pull request to resolve the issue: #3513

@chegewara
Copy link
Contributor

@gregjesl the problem is that https_ota_handle is mangled in this line:
https://github.com/espressif/esp-idf/pull/3513/files#diff-5245a11751d812b2e5354506d239433eR330

Thats why you have free(https_ota_handle) in line 344. It would be better to change naming rather than delete this line, because either way this will cause issues (multi_heap free or memory leak).

@gregjesl
Copy link
Contributor Author

@chegewara

the problem is that https_ota_handle is mangled in this line

I'm not tracking what you mean by mangled. Are you referring to the recast here?

esp_https_ota_t *handle = (esp_https_ota_t *)https_ota_handle;

@chegewara
Copy link
Contributor

Sorry, my bad. In PR i had to scroll too high and i thought this is beginning of function:

int esp_https_ota_get_image_len_read(esp_https_ota_handle_t https_ota_handle)

and this parameter name is the same as variable in here:

esp_https_ota_handle_t https_ota_handle = NULL;

So, ignore my previous comment please. Sorry.

@gregjesl
Copy link
Contributor Author

@chegewara No problem. Is removing free(https_ota_handle) at line 344 the right solution then?

@jitin17
Copy link
Contributor

jitin17 commented May 22, 2019

@gregjesl I think removing free(https_ota_handle) at line 344 should work.

@gregjesl
Copy link
Contributor Author

@jitin17 👍 I tested it for my pull request (#3513) and it appeared to work.

@igrr igrr closed this as completed in 7fbedb1 Jun 26, 2019
trombik pushed a commit to trombik/esp-idf that referenced this issue Aug 9, 2019
Closes espressif#3504

Merges espressif#3513

Signed-off-by: Jitin George <jitin@espressif.com>
espressif-bot pushed a commit that referenced this issue Nov 19, 2019
Closes #3504

Merges #3513

Signed-off-by: Jitin George <jitin@espressif.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants