Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SDMMC Mount] fix infinite loop when SD card is not responsive (IDFGH-9132) #10532

Conversation

chipweinberger
Copy link
Contributor

Related Issue: #10531

When the sd card is not responsive, we need to have reasonable timeouts to prevent infinite loops.

@chipweinberger chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from b7dc443 to 9cc3666 Compare January 12, 2023 11:13
@espressif-bot espressif-bot added the Status: Opened Issue is new label Jan 12, 2023
@github-actions github-actions bot changed the title [SDMMC Mount] fix infinite loop when SD card is not responsive [SDMMC Mount] fix infinite loop when SD card is not responsive (IDFGH-9132) Jan 12, 2023
while (SDMMC.ctrl.controller_reset || SDMMC.ctrl.fifo_reset || SDMMC.ctrl.dma_reset) {
;
if (esp_log_timestamp() - t0 > 1000) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps use ESP_RETURN_ON_FALSE (from esp_check.h) for this sort of check?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I wouldn't rely on esp_log_timestamp, it is not guaranteed to be monotonic. I would suggest esp_timer_get_time, instead.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at ESP_RETURN_ON_FALSE, but I noticed it logs, which we don't need here.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@chipweinberger Fair point on logging, I'm on the fence if the timeout should be logged or not. It would be really nice if ESP_RETURN_ON_FALSE had a way to pass in a log level or a flag to say "don't log this"

@@ -152,8 +158,20 @@ static void sdmmc_host_clock_update_command(int slot)
};
bool repeat = true;
while(repeat) {
sdmmc_host_start_command(slot, cmd_val, 0);

esp_err_t err = sdmmc_host_start_command(slot, cmd_val, 0);

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps use ESP_RETURN_ON_ERROR for this and similar usages.

components/driver/sdmmc_host.c Outdated Show resolved Hide resolved
while (SDMMC.ctrl.controller_reset || SDMMC.ctrl.fifo_reset || SDMMC.ctrl.dma_reset) {
;
if (esp_log_timestamp() - t0 > 1000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additionally, I wouldn't rely on esp_log_timestamp, it is not guaranteed to be monotonic. I would suggest esp_timer_get_time, instead.

components/driver/sdmmc_host.c Show resolved Hide resolved
sdmmc_host_clock_update_command(slot);
err = sdmmc_host_clock_update_command(slot);
if (err != ESP_OK) {
ESP_LOGE(TAG, "set clk div failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not necessary to log here, since it's not something that can happen in practice. (Clock update commands don't depend on the state of the Card Interface Unit, and as such shouldn't be blocked by the bus state.)

(Same comment for other sdmmc_host_clock_update_command error checks.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You must be mistaken? In my issue #10531, we get stuck in

I (5109) sdmmc_periph: sdmmc_host_clock_update_command() - SDMMC.clkena.cclk_enable &= ~BIT(slot);

And this other commenter hit sdmmc_host_clock_update_command() as well : #2986 (comment)

We should probably just leave it in.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, have overlooked that. In that case the fix might be a bit different than ignoring the fact that the clock update didn't happen. We might be missing to reset some part of the peripheral. I'll check it and come back with more details.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes would be great to fix the underlying issue, in addition to having reasonable timeouts.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lmk how things go, Ivan. Thanks.

@chipweinberger chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from 9cc3666 to 516b951 Compare January 12, 2023 21:50
@chipweinberger
Copy link
Contributor Author

@igrr , updated the PR with the given feedback.

@chipweinberger
Copy link
Contributor Author

@igrr , I audited all SDMMC code for potential infinite loops & added reasonable timeouts.

Please re-review.

@chipweinberger chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from ebb4071 to 8eda7cc Compare January 18, 2023 23:26
Copy link
Member

@igrr igrr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The timeout constants are defined with _MS suffix (e.g. SDMMC_INIT_WAIT_DATA_READY_TIMEOUT_MS) but you are comparing them against the return value of esp_timer_get_time(), which is in microseconds.

@@ -153,7 +154,11 @@ esp_err_t sdmmc_host_do_transaction(int slot, sdmmc_command_t* cmdinfo)
cmdinfo->error = ESP_OK;
sdmmc_req_state_t state = SDMMC_SENDING_CMD;
sdmmc_event_t unhandled_events = { 0 };
int t0 = esp_timer_get_time();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type of esp_timer_get_time is int64_t, suggest using that for the variable type.

(applies to a bunch of other assignments, as well)

while (state != SDMMC_IDLE) {
if (esp_timer_get_time() - t0 > SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this is confusing (to me, at least), since it's not really the transaction timeout being implemented here. Transaction timeout is defined by cmdinfo->timeout_ms. This is more like a state machine timeout? What kind of situation does it protect against?

Also, you can't return at this point, since the request mutex is still held — see out bail-out path below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct me if I'm wrong, but it is not just a state machine. It seems like it does active communication with the SD card, which is why I added a timeout.

For example:

static esp_err_t process_events(sdmmc_event_t evt, sdmmc_command_t* cmd,
        sdmmc_req_state_t* pstate, sdmmc_event_t* unhandled_events)

{
    while (next_state != state) {
            state = next_state; // note: this might still be the same state as before!
            switch (state) {

         case SDMMC_SENDING_DATA:
                  .....
                 // it looks like this code will loop until some 
                 // stop condition is set in the peripheral
                break;
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The waiting-for-the-peripheral should happen outside of process_events, in the handle_event -> sdmmc_host_wait_for_event, which does know how to timeout.

process_events is only updating the sdmmc_req_state_t* pstate based on events recorded in sdmmc_event_t* unhandled_events. The loop in process_events has a condition while (next_state != state) {, which basically means that we are only looping while the state is changing. If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event until the next event arrives.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event

So, could a broken SD card prevent us from exiting? i.e. by returning new events infinitely?

lmk if I should remove this check. From my understanding it seems like it could be hit with a broken SD card.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, it's more like if the SDMMC peripheral suddenly stops following its documented state machine. I mean, it's okay to not trust the hardware, but if we sprinkle all the drivers for checks that the chip peripheral is not following its spec, it will be a mess to read and the code size will be impacted. So I do think this one is unnecessary.

(I agree with the other ones which are card behavior dependent, though.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. removed!

@@ -5,7 +5,7 @@ idf_component_register(SRCS "sdmmc_cmd.c"
"sdmmc_mmc.c"
"sdmmc_sd.c"
INCLUDE_DIRS include
REQUIRES driver
REQUIRES driver esp_timer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

esp_timer could be in PRIV_REQUIRES

/* SD mode: wait for the card to become idle based on R1 status */
while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
// TODO: add some timeout here
if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {
if (esp_timer_get_time() - t0 > SDMMC_READY_FOR_DATA_TIMEOUT_MS) {

(the data has already been written at this point, we are waiting for the card to become idle)

while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
// TODO: add some timeout here
if (esp_timer_get_time() - t0 > SDMMC_READ_SECTORS_DMA_TIMEOUT_MS) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

likewise, suggest SDMMC_READY_FOR_DATA_TIMEOUT_MS

err = sdmmc_send_cmd_send_status(card, &status);
if (err != ESP_OK) {
ESP_LOGE(TAG, "read: status cmd failed");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggest

Suggested change
ESP_LOGE(TAG, "read: status cmd failed");
ESP_LOGE(TAG, "%s: sdmmc_send_cmd_send_status returned 0x%x", __func__, err);

(similar to other error messages in this file)


if (esp_timer_get_time() - t0 > SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {
return ESP_ERR_TIMEOUT;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can be combined with the loop condition,

while (esp_timer_get_time() - t0 < SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but I think it's more readable to keep all the checks identical. Unnecessary complexity imo. Plus, readability wise it makes the return value less immediately clear, IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair enough, I just thought someone might ask me to change this during code review. Let's leave it as is, for now.

#define SDMMC_HOST_RESET_TIMEOUT_MS 5000
#define SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS 5000
#define SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS 1000
#define SDMMC_HOST_START_CMD_TIMEOUT_MS 1000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's not necessary to define these in a public header file, since these are an implementation detail.
There is no use for these macros in the application code.

(Once something is defined in a public header file, we need to take care to keep it backwards compatible.)

@chipweinberger
Copy link
Contributor Author

Thanks for the review, Ivan!

@chipweinberger chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch 6 times, most recently from aeaecff to 4450fd0 Compare January 19, 2023 02:59
@chipweinberger chipweinberger force-pushed the user/chip/sdmmc-set-clk-infinite-loop branch from 4450fd0 to d7324be Compare January 19, 2023 04:22
@igrr igrr added the PR-Sync-Merge Pull request sync as merge commit label Jan 19, 2023
@igrr
Copy link
Member

igrr commented Jan 19, 2023

sha=d7324be0348dd9f4990f4762b193427b4f8e05f2

@espressif-bot espressif-bot added Status: Reviewing Issue is being reviewed and removed Status: Opened Issue is new labels Mar 8, 2023
@chipweinberger
Copy link
Contributor Author

@igrr any updates? this seemed like a good simple change

@espressif-bot espressif-bot assigned adokitkat and unassigned pacucha42 Mar 28, 2023
@chipweinberger
Copy link
Contributor Author

bump. Any blockers?

@adokitkat
Copy link
Collaborator

Hi @chipweinberger. The MR is currently in review stage needing just a few last accepts. Sorry for the wait. Hopefully it will be merged soon.

@AxelLin
Copy link
Contributor

AxelLin commented Apr 7, 2023

sha=d7324be0348dd9f4990f4762b193427b4f8e05f2

Just curious what does this comment mean?
People usually think this means the PR is merged internally and will be available publicly soon.
But the fact is it may take yet another months to get the fix in github. (And such long time is not a single case).
Several PRs are in "Sync-merge" state for months.

@igrr
Copy link
Member

igrr commented Apr 7, 2023

This type of a comment is a trigger for creating an internal merge request. This change is in review now, it was left unattended for a while but was recently picked up again.

@espressif-bot espressif-bot added Resolution: NA Issue resolution is unavailable Status: Done Issue is done internally Resolution: Done Issue is done internally and removed Status: Reviewing Issue is being reviewed Resolution: NA Issue resolution is unavailable labels Apr 12, 2023
@adokitkat
Copy link
Collaborator

The pull request was merged to master branch. Thank you for your contribution :) I will close this on GH as canceled but your commits are merged and will be shown in commit history after the GH master branch syncs with our internal master.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR-Sync-Merge Pull request sync as merge commit Resolution: Done Issue is done internally Status: Done Issue is done internally
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants