Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when repeatedly creating and destroying busio.I2C object #3846

Closed
jepler opened this issue Dec 18, 2020 · 6 comments · Fixed by #4387
Closed

Crash when repeatedly creating and destroying busio.I2C object #3846

jepler opened this issue Dec 18, 2020 · 6 comments · Fixed by #4387
Assignees
Labels
bug crash espressif applies to multiple Espressif chips

Comments

@jepler
Copy link
Member

jepler commented Dec 18, 2020

Running the following sequence twice in the same REPL causes CircuitPython to reset (not into safe mode, just a "normal" reset!)

Tested on a kaluga with DEBUG=1 at f2204d7

(tio is my terminal program)

>>> import busio, board; bus = busio.I2C(board.IO10, board.IO11); bus.try_lock(); bus.scan(); bus.deinit()
True
[11, 112]
>>> import busio, board; bus = busio.I2C(board.IO10, board.IO11); bus.try_lock(); bus.scan(); bus.deinit()
True

[tio 10:05:07] Disconnected
[tio 10:05:41] Connected
Auto-reload is on. Simply save files over USB to run them or enter REPL to disable.

Press any key to enter the REPL. Use CTRL-D to reload.

gdb debugging doesn't show anything interesting. hardware debug console doesn't show anything interesting. I don't know how to continue debugging this sort of reset (e.g., to set a breakpoint to "catch" it)

I (15316) gpio: GPIO[11]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (15316) gpio: GPIO[10]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (20846) gpio: GPIO[11]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
I (20846) gpio: GPIO[10]| InputEn: 0| OutputEn: 0| OpenDrain: 0| Pullup: 0| Pulldown: 0| Intr:0 
<----- ( second bus.scan() call happens here) ---->
ESP-ROM:esp32s2-rc4-20191025
Build:Oct 25 2019
rst:0x8 (TG1WDT_SYS_RST),boot:0x8 (SPI_FAST_FLASH_BOOT)
Saved PC:0x40028200
SPIWP:0xee
mode:DIO, clock div:2
load:0x3ffe6100,len:0x8
load:0x3ffe6108,len:0x1838
load:0x4004c000,len:0x944
load:0x40050000,len:0x2cec
entry 0x4004c1bc
I (51) boot: ESP-IDF v4.3-dev-1198-gd06744f5e 2nd stage bootloader
@jepler jepler added bug crash espressif applies to multiple Espressif chips labels Dec 18, 2020
@jepler
Copy link
Member Author

jepler commented Dec 21, 2020

@igrr We've been puzzled by multiple i2c problems in CircuitPython. This one is pretty simple to reproduce, can you take a look and see if you can spot what we're doing wrong with the esp-idf APIs? Thanks!

@igrr
Copy link

igrr commented Dec 21, 2020

I'll need to set up circuitpython build environment again to debug this, but upon cursory look i don't see anything wrong with I2C APIs usage. Just a few notes, i doubt any of this is the cause of this issue, though:

  1. if (xSemaphoreCreateBinaryStatic(&self->semaphore) != &self->semaphore) {
    Binary semaphore is used to implement exclusive access to a resource, instead of a mutex semaphore. Mutex semaphores in FreeRTOS implement priority inheritance, whereas regular semaphores do not. Normal semaphores should be used in a producer/consumer scenarios, such as e.g. notifying a task from an ISR that data has been received. Mutex semaphores should be used to implement mutual exclusion scenarios.
  2. Same line: looks like xSemaphoreCreateBinaryStatic is called multiple times on subsequent calls to common_hal_busio_i2c_construct.
  3. Possible use-after-free in
    esp_err_t result = i2c_master_cmd_begin(self->i2c_num, cmd, 10);
    i2c_cmd_link_delete(cmd);
    . If for any reason i2c_master_cmd_begin times out (which is likely in a probing scenario), I am not sure whether the IDF I2C driver will still access the cmd object in case an interrupt occurs. I think it shouldn't, since that would make the API hard to use — how is the user supposed to know whether they can call i2c_cmd_link_delete or not? However looking at the IDF code I am not sure it actually doesn't access cmd from an interrupt that might occur after the timeout, in some odd case. Suggest adding some delay after the failed command (if err == ESP_ERR_TIMEOUT) and see if that makes any difference in terms of making the issue harder or easier to reproduce.

@igrr
Copy link

igrr commented Dec 21, 2020

Saved PC:0x40028200

Indicates the value of CPU program counter at the time when the watchdog triggered. 0x40028200 seems to be part of the vector table, specifically the Level4InterruptVector. This interrupt level is associated with various debugging features. On the ESP32-S2 these are interrupt watchdog interrupt and memory protection interrupt. Normally this level 4 interrupt handler invokes the panic handler, which prints a nice human-readable panic message to the UART console. This doesn't seem to be the case here, which indicates that either some memory got corrupted, and the Level4 interrupt and/or the panic handler do not do their job anymore; or that there is some other system-level issue.
If JTAG is available, I would recommend setting a breakpoint on 0x40028200, and when it triggers, looking at EPC4 register — it should contain the PC of the code that was interrupted, possibly giving an indication on what was going on.

@tannewt tannewt added this to the 6.x.x - Bug Fixes milestone Dec 22, 2020
@jepler
Copy link
Member Author

jepler commented Dec 22, 2020

Thank you for the pointers @igrr!

  • I eliminated the semaphore (it is used to implement Python-level locking of the device, and is not relevant to the code that causes the problem) without affecting the outcome
  • I did not have success setting a gdb breakpoint at *0x40028200, it was never triggered.
  • I did observe that during a "normal" failed I2C probe, i2c_master_cmd_begin() returns -1. However, when the I2C device is re-used, it is returning ESP_ERR_TIMEOUT.
  • If I introduce i2c_driver_delete in our deinit code, then the scenario of creating and deleting the I2C bus multiple times within one session works. However, it reintroduces a problem where importing wifi and repeating the same steps causes a lock-up. This was the original scenario we investigated and tried to work around by NOT deleting the i2c driver.
  • before the watchdog triggers(?) I can ctrl-c break in gdb and see that i2c_isr_handler_default is being called repeatedly, with p_i2c->status == 4 (I2C_STATUS_ACK_ERROR)
  • With added code to NOT call i2c_cmd_link_delete after ESP_ERR_TIMEOUT it sometimes recovers but not reliably.

@dhalbert dhalbert self-assigned this Mar 4, 2021
@dhalbert
Copy link
Collaborator

dhalbert commented Mar 7, 2021

Progress notes: I looked more carefully at how to do i2c_driver_install() only once, as we are currently doing, and then to reuse the installed drivers as needed. I found there was no clean way to reassign the pins based on the current API. It is possible to deassign the pins using negative pin numbers, but that caused other things not to be done properly.

I have gone back to the original scheme of installing and deleting I2C drivers on demand. I think have fixed the problem of installing and deleting the driver multiple times. But I am still looking at why deleting an I2C driver then causes the import wifi to crash. I am suspicious that some storage returned by i2c_driver_delete() is being reused by the wifi code, and is getting smashed by some leftover I2C work. I've been commenting out various pieces of code in the IDF and our code to see what's critical to the crash and what's not. Continuing on this.

Typical test that will cause a crash with the old scheme:

import board,busio

i2c = busio.I2C(board.SCL, board.SDA)
i2c.deinit()

import wifi

@dhalbert
Copy link
Collaborator

dhalbert commented Mar 8, 2021

@igrr I narrowed down the issue to the treatment of a single semaphore in i2c_driver_delete() by some divide and conquer work (comment out code until it stops breaking and then add it back piece by piece). Here is the code in esp_idf/components/driver/i2c.c:

esp_err_t i2c_driver_delete(i2c_port_t i2c_num)
{
    I2C_CHECK(i2c_num < I2C_NUM_MAX, I2C_NUM_ERROR_STR, ESP_ERR_INVALID_ARG);
    I2C_CHECK(p_i2c_obj[i2c_num] != NULL, I2C_DRIVER_ERR_STR, ESP_FAIL);

    i2c_obj_t *p_i2c = p_i2c_obj[i2c_num];
    i2c_hal_disable_intr_mask(&(i2c_context[i2c_num].hal), I2C_INTR_MASK);
    esp_intr_free(p_i2c->intr_handle);
    p_i2c->intr_handle = NULL;

    if (p_i2c->cmd_mux) {
        xSemaphoreTake(p_i2c->cmd_mux, portMAX_DELAY);    //<--------- commenting this out prevents the crash
        vSemaphoreDelete(p_i2c->cmd_mux);
    }
...

This cmd_mux semaphore is used for locking in i2c_master_cmd_begin() and nowhere else. If the xSemaphoreTake() is removed, then there's no crash. I guess the idea here is to wait for i2c_master_cmd_begin() to finish if it's still doing something?

This seems related to your case 3. above. Is this a bug? This code is the same at the tip of master.

Another test I did was not to actually free the p_i2c object, in case the storage was being reused and the I2C ISR or whatever was referencing the freed storage. That did not remove the crash.

Thanks for your help with all this.

EDIT: Some more research. The CircuitPython code above does not do any I2C transactions. I put some logging in i2c_master_cmd_begin() to confirm that and it's never called, so the semaphore appears not to be touched after creation. So I'm confused, because it seems like i2c_driver_delete can just take the semaphore immediately and then delete it. But for some reason doing that is fatal.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug crash espressif applies to multiple Espressif chips
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants