[SDMMC Mount] fix infinite loop when SD card is not responsive (IDFGH-9132) #10532

chipweinberger · 2023-01-12T11:05:28Z

Related Issue: #10531

When the SD card is not responsive, we need reasonable timeouts to prevent infinite loops.

atanisoft · 2023-01-12T14:44:34Z

components/driver/sdmmc_host.c

    while (SDMMC.ctrl.controller_reset || SDMMC.ctrl.fifo_reset || SDMMC.ctrl.dma_reset) {
-        ;
+        if (esp_log_timestamp() - t0 > 1000) {


perhaps use ESP_RETURN_ON_FALSE (from esp_check.h) for this sort of check?

Additionally, I wouldn't rely on esp_log_timestamp, it is not guaranteed to be monotonic. I would suggest esp_timer_get_time, instead.

I looked at ESP_RETURN_ON_FALSE, but I noticed it logs, which we don't need here.

@chipweinberger Fair point on logging; I'm on the fence about whether the timeout should be logged. It would be really nice if ESP_RETURN_ON_FALSE had a way to pass in a log level or a flag to say "don't log this".
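A minimal sketch of the pattern this thread converges on: a bounded busy-wait that reads a monotonic microsecond clock and returns a timeout code without logging, so the caller decides whether to log. The names wait_while_busy, fake_now_us and fake_busy are hypothetical stand-ins so the sketch builds off-target; in the driver the clock would be esp_timer_get_time() and the condition would be the SDMMC.ctrl reset bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins, not ESP-IDF APIs. */
typedef int64_t (*clock_us_fn)(void);
typedef bool (*busy_fn)(void *ctx);

enum { WAIT_OK = 0, WAIT_TIMEOUT = 1 };

/* Spin while busy(ctx) is true, giving up once timeout_us has elapsed. */
static int wait_while_busy(busy_fn busy, void *ctx,
                           clock_us_fn now_us, int64_t timeout_us)
{
    assert(busy != NULL && now_us != NULL);
    const int64_t t0 = now_us();   /* 64-bit: microseconds since boot */
    while (busy(ctx)) {
        if (now_us() - t0 > timeout_us) {
            return WAIT_TIMEOUT;   /* silent: the caller may log if it wants */
        }
    }
    return WAIT_OK;
}

/* Fake clock for host-side experiments: advances 100 us per read. */
static int64_t fake_now_us(void)
{
    static int64_t t;
    return t += 100;
}

/* Fake "reset bit": reports busy for the first *remaining polls. */
static bool fake_busy(void *ctx)
{
    int *remaining = ctx;
    return (*remaining)-- > 0;
}
```

With the fake clock, a condition that clears after three polls returns WAIT_OK, while one that never clears returns WAIT_TIMEOUT after roughly 1000 fake microseconds.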

atanisoft · 2023-01-12T14:45:21Z

components/driver/sdmmc_host.c

@@ -152,8 +158,20 @@ static void sdmmc_host_clock_update_command(int slot)
    };
    bool repeat = true;
    while(repeat) {
-        sdmmc_host_start_command(slot, cmd_val, 0);
+
+        esp_err_t err = sdmmc_host_start_command(slot, cmd_val, 0);


perhaps use ESP_RETURN_ON_ERROR for this and similar usages.

igrr · 2023-01-12T15:30:17Z

components/driver/sdmmc_host.c

    while (SDMMC.ctrl.controller_reset || SDMMC.ctrl.fifo_reset || SDMMC.ctrl.dma_reset) {
-        ;
+        if (esp_log_timestamp() - t0 > 1000) {


Additionally, I wouldn't rely on esp_log_timestamp, it is not guaranteed to be monotonic. I would suggest esp_timer_get_time, instead.

igrr · 2023-01-12T15:31:59Z

components/driver/sdmmc_host.c

-    sdmmc_host_clock_update_command(slot);
+    err = sdmmc_host_clock_update_command(slot);
+    if (err != ESP_OK) {
+        ESP_LOGE(TAG, "set clk div failed");


Probably not necessary to log here, since it's not something that can happen in practice. (Clock update commands don't depend on the state of the Card Interface Unit, and as such shouldn't be blocked by the bus state.)

(Same comment for other sdmmc_host_clock_update_command error checks.)

You must be mistaken? In my issue #10531, we get stuck in

I (5109) sdmmc_periph: sdmmc_host_clock_update_command() - SDMMC.clkena.cclk_enable &= ~BIT(slot);

And this other commenter hit sdmmc_host_clock_update_command() as well : #2986 (comment)

We should probably just leave it in.

Sorry, I overlooked that. In that case the fix might be a bit different from ignoring the fact that the clock update didn't happen. We might be failing to reset some part of the peripheral. I'll check it and come back with more details.

Yes would be great to fix the underlying issue, in addition to having reasonable timeouts.

lmk how things go, Ivan. Thanks.

chipweinberger · 2023-01-12T21:50:30Z

@igrr , updated the PR with the given feedback.

chipweinberger · 2023-01-18T23:24:12Z

@igrr , I audited all SDMMC code for potential infinite loops & added reasonable timeouts.

Please re-review.

igrr

The timeout constants are defined with _MS suffix (e.g. SDMMC_INIT_WAIT_DATA_READY_TIMEOUT_MS) but you are comparing them against the return value of esp_timer_get_time(), which is in microseconds.
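To illustrate the unit mismatch: esp_timer_get_time() returns microseconds, so a constant expressed in milliseconds has to be scaled before the comparison. The helper and constant names below are hypothetical and not taken from the PR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_TIMEOUT_MS 1000 /* hypothetical constant, in milliseconds */

/* Returns true once more than timeout_ms has elapsed between two
 * microsecond timestamps (the unit esp_timer_get_time() reports). */
static bool example_timed_out(int64_t t0_us, int64_t now_us, int64_t timeout_ms)
{
    assert(now_us >= t0_us);   /* monotonic clock assumed */
    return (now_us - t0_us) > timeout_ms * 1000;
}
```

Without the multiplication by 1000, a 1000 ms constant would expire after only 1 ms.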

igrr · 2023-01-19T02:02:21Z

components/driver/sdmmc_transaction.c

@@ -153,7 +154,11 @@ esp_err_t sdmmc_host_do_transaction(int slot, sdmmc_command_t* cmdinfo)
    cmdinfo->error = ESP_OK;
    sdmmc_req_state_t state = SDMMC_SENDING_CMD;
    sdmmc_event_t unhandled_events = { 0 };
+    int t0 = esp_timer_get_time();


The return type of esp_timer_get_time is int64_t, suggest using that for the variable type.

(applies to a bunch of other assignments, as well)
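A quick check of why the variable type matters, assuming a 32-bit int: a microsecond counter exceeds INT32_MAX after roughly 35.8 minutes of uptime, so an int can no longer hold the real value from that point on. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* First microsecond count that no longer fits in a 32-bit signed int:
 * 2^31 us, i.e. about 2147 seconds. */
#define EXAMPLE_US_PAST_INT32 ((int64_t)INT32_MAX + 1)

/* True if a microsecond timestamp can be stored in an int32_t unchanged. */
static bool example_fits_in_int32(int64_t us)
{
    return us >= INT32_MIN && us <= INT32_MAX;
}

/* Elapsed time computed entirely in 64-bit arithmetic. */
static int64_t example_elapsed_us(int64_t t0_us, int64_t now_us)
{
    assert(now_us >= t0_us);   /* monotonic clock assumed */
    return now_us - t0_us;
}
```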

igrr · 2023-01-19T02:05:26Z

components/driver/sdmmc_transaction.c

    while (state != SDMMC_IDLE) {
+        if (esp_timer_get_time() - t0 > SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS) {


The name of this is confusing (to me, at least), since it's not really the transaction timeout being implemented here. Transaction timeout is defined by cmdinfo->timeout_ms. This is more like a state machine timeout? What kind of situation does it protect against?

Also, you can't return at this point, since the request mutex is still held (see the out: bail-out path below).
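The point about the request mutex can be sketched as follows. This is an illustration under assumptions, not the driver's code: the lock is modeled as a hypothetical depth counter, and do_transaction_sketch is an invented name. The idea is that an error raised while the lock is held jumps to a single exit label instead of returning directly.

```c
#include <assert.h>
#include <stdbool.h>

enum { EX_OK = 0, EX_ERR_TIMEOUT = 1 };

/* Hypothetical stand-in for the request mutex: a simple depth counter. */
static int s_lock_depth;

static void lock_take(void)
{
    s_lock_depth++;
}

static void lock_give(void)
{
    assert(s_lock_depth > 0);   /* giving a lock that is not held is a bug */
    s_lock_depth--;
}

static int do_transaction_sketch(bool simulate_timeout)
{
    int ret = EX_OK;
    lock_take();
    if (simulate_timeout) {
        ret = EX_ERR_TIMEOUT;
        goto out;               /* not `return`: the lock must be released first */
    }
    /* ... normal processing would go here ... */
out:
    lock_give();
    return ret;
}
```

Whichever path is taken, the lock depth is back to zero when the function returns.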

Correct me if I'm wrong, but it is not just a state machine. It seems like it does active communication with the SD card, which is why I added a timeout.

For example:

    static esp_err_t process_events(sdmmc_event_t evt, sdmmc_command_t* cmd,
            sdmmc_req_state_t* pstate, sdmmc_event_t* unhandled_events)
    {
        while (next_state != state) {
            state = next_state; // note: this might still be the same state as before!
            switch (state) {
                case SDMMC_SENDING_DATA:
                    .....
                    // it looks like this code will loop until some
                    // stop condition is set in the peripheral
                    break;
            }

The waiting-for-the-peripheral should happen outside of process_events, in the handle_event -> sdmmc_host_wait_for_event, which does know how to timeout.

process_events is only updating the sdmmc_req_state_t* pstate based on events recorded in sdmmc_event_t* unhandled_events. The loop in process_events has a condition while (next_state != state) {, which basically means that we are only looping while the state is changing. If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event until the next event arrives.
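A simplified sketch of the loop shape described above, assuming a reduced set of states and events with hypothetical names (the real process_events handles more cases). Each pass either consumes a recorded event and changes state, or leaves the state unchanged and ends the loop, so the number of passes is bounded by the number of pending events plus one.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, reduced state and event sets for illustration only. */
typedef enum { EX_NONE = -1, EX_IDLE, EX_SENDING_CMD, EX_SENDING_DATA } ex_state_t;
enum { EX_EV_CMD_DONE = 1 << 0, EX_EV_DATA_DONE = 1 << 1 };

/* Advance *pstate using already-recorded events; returns the pass count. */
static int example_process_events(ex_state_t *pstate, unsigned *pending)
{
    assert(pstate != NULL && pending != NULL);
    ex_state_t next = *pstate;
    ex_state_t state = EX_NONE;   /* sentinel so the body runs at least once */
    int passes = 0;
    while (next != state) {       /* loop only while the state keeps changing */
        state = next;
        passes++;
        switch (state) {
        case EX_SENDING_CMD:
            if (*pending & EX_EV_CMD_DONE) {
                *pending &= ~(unsigned)EX_EV_CMD_DONE;
                next = EX_SENDING_DATA;
            }
            break;
        case EX_SENDING_DATA:
            if (*pending & EX_EV_DATA_DONE) {
                *pending &= ~(unsigned)EX_EV_DATA_DONE;
                next = EX_IDLE;
            }
            break;
        default:
            break;
        }
    }
    *pstate = state;
    return passes;
}
```

With no pending events the function makes a single pass and returns, which corresponds to going back to wait for the next event.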

Regarding "If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event": could a broken SD card prevent us from exiting, i.e. by returning new events infinitely?

lmk if I should remove this check. From my understanding it seems like it could be hit with a broken SD card.

Not really, it's more like if the SDMMC peripheral suddenly stops following its documented state machine. I mean, it's okay to not trust the hardware, but if we sprinkle all the drivers with checks that the chip peripheral is following its spec, the code will be a mess to read and the code size will be impacted. So I do think this one is unnecessary.

(I agree with the other ones which are card behavior dependent, though.)

Okay. removed!

igrr · 2023-01-19T02:06:18Z

components/sdmmc/CMakeLists.txt

@@ -5,7 +5,7 @@ idf_component_register(SRCS "sdmmc_cmd.c"
                            "sdmmc_mmc.c"
                            "sdmmc_sd.c"
                    INCLUDE_DIRS include
-                    REQUIRES driver
+                    REQUIRES driver esp_timer


esp_timer could be in PRIV_REQUIRES

igrr · 2023-01-19T02:10:21Z

components/sdmmc/sdmmc_cmd.c

    /* SD mode: wait for the card to become idle based on R1 status */
    while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
-        // TODO: add some timeout here
+        if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {


Suggested change

-        if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {
+        if (esp_timer_get_time() - t0 > SDMMC_READY_FOR_DATA_TIMEOUT_MS) {

(the data has already been written at this point, we are waiting for the card to become idle)

igrr · 2023-01-19T02:10:54Z

components/sdmmc/sdmmc_cmd.c

    while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
-        // TODO: add some timeout here
+        if (esp_timer_get_time() - t0 > SDMMC_READ_SECTORS_DMA_TIMEOUT_MS) {


likewise, suggest SDMMC_READY_FOR_DATA_TIMEOUT_MS

igrr · 2023-01-19T02:12:35Z

components/sdmmc/sdmmc_cmd.c

        err = sdmmc_send_cmd_send_status(card, &status);
        if (err != ESP_OK) {
+            ESP_LOGE(TAG, "read: status cmd failed");


suggest

Suggested change

-            ESP_LOGE(TAG, "read: status cmd failed");
+            ESP_LOGE(TAG, "%s: sdmmc_send_cmd_send_status returned 0x%x", __func__, err);

(similar to other error messages in this file)

igrr · 2023-01-19T02:15:47Z

components/driver/sdmmc_host.c

+
+            if (esp_timer_get_time() - t0 > SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {
+                return ESP_ERR_TIMEOUT;
+            }


Can be combined with the loop condition,

while (esp_timer_get_time() - t0 < SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {

True, but I think it's more readable to keep all the checks identical. Combining them adds unnecessary complexity, and it makes the return value less immediately clear, IMO.

Fair enough, I just thought someone might ask me to change this during code review. Let's leave it as is, for now.
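For comparison, the two loop shapes discussed in this thread, sketched side by side with hypothetical host-side stand-ins (a fake clock and a poll counter; none of these names come from the driver). The second form needs an extra flag to tell success from timeout after the loop, which is the readability trade-off raised above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum { EX_OK = 0, EX_ERR_TIMEOUT = 1 };

/* Hypothetical stand-ins: a clock that advances 100 us per read, and a
 * poll that reports completion once *polls_left reaches zero. */
static int64_t s_fake_us;
static int64_t now_us(void) { return s_fake_us += 100; }
static bool poll_done(int *polls_left) { return --(*polls_left) <= 0; }

/* Form kept in the PR: the timeout check sits inside the loop body. */
static int wait_check_in_body(int polls_needed, int64_t timeout_us)
{
    assert(timeout_us > 0);
    const int64_t t0 = now_us();
    while (!poll_done(&polls_needed)) {
        if (now_us() - t0 > timeout_us) {
            return EX_ERR_TIMEOUT;
        }
    }
    return EX_OK;
}

/* Alternative from the review: the timeout is part of the loop condition,
 * so the result has to be derived after the loop ends. */
static int wait_check_in_condition(int polls_needed, int64_t timeout_us)
{
    assert(timeout_us > 0);
    const int64_t t0 = now_us();
    bool done = false;
    while (!done && now_us() - t0 < timeout_us) {
        done = poll_done(&polls_needed);
    }
    return done ? EX_OK : EX_ERR_TIMEOUT;
}
```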

igrr · 2023-01-19T02:17:11Z

components/driver/include/driver/sdmmc_host.h

+#define SDMMC_HOST_RESET_TIMEOUT_MS            5000
+#define SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS   5000
+#define SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS 1000
+#define SDMMC_HOST_START_CMD_TIMEOUT_MS        1000


I think it's not necessary to define these in a public header file, since these are an implementation detail.
There is no use for these macros in the application code.

(Once something is defined in a public header file, we need to take care to keep it backwards compatible.)

chipweinberger · 2023-01-19T02:21:07Z

Thanks for the review, Ivan!

igrr · 2023-01-19T05:27:39Z

sha=d7324be0348dd9f4990f4762b193427b4f8e05f2

chipweinberger · 2023-03-10T22:24:15Z

@igrr any updates? this seemed like a good simple change

chipweinberger · 2023-03-31T15:54:36Z

bump. Any blockers?

adokitkat · 2023-03-31T16:35:56Z

Hi @chipweinberger. The MR is currently in review stage needing just a few last accepts. Sorry for the wait. Hopefully it will be merged soon.

AxelLin · 2023-04-07T06:21:27Z

sha=d7324be0348dd9f4990f4762b193427b4f8e05f2

Just curious, what does this comment mean?
People usually think this means the PR has been merged internally and will be available publicly soon.
But in fact it may take several more months for the fix to reach GitHub (and such a long delay is not an isolated case).
Several PRs have been in the "Sync-merge" state for months.

igrr · 2023-04-07T07:45:45Z

This type of a comment is a trigger for creating an internal merge request. This change is in review now, it was left unattended for a while but was recently picked up again.

adokitkat · 2023-04-13T08:54:28Z

The pull request was merged to master branch. Thank you for your contribution :) I will close this on GH as canceled but your commits are merged and will be shown in commit history after the GH master branch syncs with our internal master.

Closes: #10532

chipweinberger mentioned this pull request Jan 12, 2023

[v4.4.3][SD Card] sdmmc mount sometimes hits infinite loop in sdmmc_host_clock_update_command() (IDFGH-9131) #10531

Closed

3 tasks

chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from b7dc443 to 9cc3666 Compare January 12, 2023 11:13

espressif-bot added the Status: Opened Issue is new label Jan 12, 2023

github-actions bot changed the title ~~[SDMMC Mount] fix infinite loop when SD card is not responsive~~ [SDMMC Mount] fix infinite loop when SD card is not responsive (IDFGH-9132) Jan 12, 2023

atanisoft reviewed Jan 12, 2023

igrr requested changes Jan 12, 2023

[SDMMC Mount] fix infinite loop when SD card is not responsive

516b951

chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from 9cc3666 to 516b951 Compare January 12, 2023 21:50

chipweinberger closed this Jan 12, 2023

chipweinberger reopened this Jan 12, 2023

chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from ebb4071 to 8eda7cc Compare January 18, 2023 23:26

igrr requested changes Jan 19, 2023

chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch 6 times, most recently from aeaecff to 4450fd0 Compare January 19, 2023 02:59

[SDMMC] add reasonable timeouts to all while loops

d7324be

chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from 4450fd0 to d7324be Compare January 19, 2023 04:22

igrr approved these changes Jan 19, 2023

igrr added the PR-Sync-Merge Pull request sync as merge commit label Jan 19, 2023

espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: Opened Issue is new labels Mar 8, 2023

espressif-bot assigned pacucha42 Mar 28, 2023

espressif-bot assigned adokitkat and unassigned pacucha42 Mar 28, 2023

espressif-bot added Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: Reviewing Issue is being reviewed Resolution: NA Issue resolution is unavailable labels Apr 12, 2023

adokitkat closed this Apr 13, 2023

adokitkat mentioned this pull request Apr 13, 2023

Lock up in sdmmc_host_start_command causes Task WDT kick and abort (IDFGH-582) #2986

Closed

espressif-bot pushed a commit that referenced this pull request Apr 17, 2023

[SDMMC Mount] fix infinite loop when SD card is not responsive

a2aa9e3

Closes: #10532

espressif-bot pushed a commit that referenced this pull request Apr 17, 2023

[SDMMC] add reasonable timeouts to all while loops

74d6215

Closes: #10532

espressif-bot pushed a commit that referenced this pull request May 20, 2023

[SDMMC Mount] fix infinite loop when SD card is not responsive

6ff1059

Closes: #10532

espressif-bot pushed a commit that referenced this pull request May 20, 2023

[SDMMC] add reasonable timeouts to all while loops

c7ca30e

Closes: #10532

chipweinberger deleted the user/chip/sdmmc-set-clk-infinite-loop branch May 20, 2023 10:49
