[SDMMC Mount] fix infinite loop when SD card is not responsive (IDFGH-9132) #10532
Force-pushed from b7dc443 to 9cc3666.
components/driver/sdmmc_host.c
Outdated
while (SDMMC.ctrl.controller_reset || SDMMC.ctrl.fifo_reset || SDMMC.ctrl.dma_reset) {
    ;
    if (esp_log_timestamp() - t0 > 1000) {
Perhaps use ESP_RETURN_ON_FALSE (from esp_check.h) for this sort of check?
Additionally, I wouldn't rely on esp_log_timestamp, it is not guaranteed to be monotonic. I would suggest esp_timer_get_time, instead.
I looked at ESP_RETURN_ON_FALSE, but I noticed it logs, which we don't need here.
@chipweinberger Fair point on logging, I'm on the fence about whether the timeout should be logged or not. It would be really nice if ESP_RETURN_ON_FALSE had a way to pass in a log level or a flag to say "don't log this".
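For illustration only, a non-logging check along the lines discussed above could look like the sketch below. RETURN_ON_FALSE_QUIET, err_t, the ERR_* values and check_deadline are hypothetical names made up for this example; they are not part of esp_check.h.

```c
#include <stdint.h>

/* Stand-ins for esp_err_t and its values so the sketch compiles off-target. */
typedef int err_t;
#define ERR_OK      0
#define ERR_TIMEOUT 1

/* Hypothetical macro: return `err` when `cond` is false, without logging. */
#define RETURN_ON_FALSE_QUIET(cond, err) \
    do {                                 \
        if (!(cond)) {                   \
            return (err);                \
        }                                \
    } while (0)

/* Hypothetical helper: fails once elapsed_us exceeds limit_us. */
static err_t check_deadline(int64_t elapsed_us, int64_t limit_us)
{
    RETURN_ON_FALSE_QUIET(elapsed_us <= limit_us, ERR_TIMEOUT);
    return ERR_OK;
}
```

The do/while(0) wrapper keeps the macro usable as a single statement, for example inside an unbraced if.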
@@ -152,8 +158,20 @@ static void sdmmc_host_clock_update_command(int slot)
     };
     bool repeat = true;
     while(repeat) {
-        sdmmc_host_start_command(slot, cmd_val, 0);
+        esp_err_t err = sdmmc_host_start_command(slot, cmd_val, 0);
Perhaps use ESP_RETURN_ON_ERROR for this and similar usages.
-    sdmmc_host_clock_update_command(slot);
+    err = sdmmc_host_clock_update_command(slot);
+    if (err != ESP_OK) {
+        ESP_LOGE(TAG, "set clk div failed");
Probably not necessary to log here, since it's not something that can happen in practice. (Clock update commands don't depend on the state of the Card Interface Unit, and as such shouldn't be blocked by the bus state.)
(Same comment for other sdmmc_host_clock_update_command error checks.)
You must be mistaken? In my issue #10531, we get stuck in
I (5109) sdmmc_periph: sdmmc_host_clock_update_command() - SDMMC.clkena.cclk_enable &= ~BIT(slot);
And this other commenter hit sdmmc_host_clock_update_command() as well: #2986 (comment)
We should probably just leave it in.
Sorry, I overlooked that. In that case the fix might be a bit different from ignoring the fact that the clock update didn't happen. We might be failing to reset some part of the peripheral. I'll check it and come back with more details.
Yes, it would be great to fix the underlying issue, in addition to having reasonable timeouts.
lmk how things go, Ivan. Thanks.
Force-pushed from 9cc3666 to 516b951.
@igrr, updated the PR with the given feedback.
@igrr, I audited all SDMMC code for potential infinite loops & added reasonable timeouts. Please re-review.
Force-pushed from ebb4071 to 8eda7cc.
The timeout constants are defined with an _MS suffix (e.g. SDMMC_INIT_WAIT_DATA_READY_TIMEOUT_MS), but you are comparing them against the return value of esp_timer_get_time(), which is in microseconds.
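A minimal sketch of the unit mismatch and one way to avoid it, assuming the timestamps are microsecond values such as those returned by esp_timer_get_time(). EXAMPLE_TIMEOUT_MS and timeout_expired are hypothetical names used only here.

```c
#include <stdbool.h>
#include <stdint.h>

#define EXAMPLE_TIMEOUT_MS 1000  /* hypothetical constant, in milliseconds */

/* Returns true once more than timeout_ms milliseconds have elapsed.
 * start_us and now_us are microsecond timestamps. */
static bool timeout_expired(int64_t start_us, int64_t now_us, int64_t timeout_ms)
{
    /* Convert the limit to microseconds before comparing. */
    return (now_us - start_us) > timeout_ms * 1000;
}
```

Without the multiplication, a limit meant to be 1000 ms would already trip after 1000 us, i.e. one millisecond.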
@@ -153,7 +154,11 @@ esp_err_t sdmmc_host_do_transaction(int slot, sdmmc_command_t* cmdinfo)
     cmdinfo->error = ESP_OK;
     sdmmc_req_state_t state = SDMMC_SENDING_CMD;
     sdmmc_event_t unhandled_events = { 0 };
+    int t0 = esp_timer_get_time();
The return type of esp_timer_get_time is int64_t, suggest using that for the variable type.
(applies to a bunch of other assignments, as well)
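To see why the variable type matters, here is a small sketch assuming a 32-bit int (as on the ESP32 targets). The function name is hypothetical; it only computes how many whole seconds of microsecond timestamps fit in a signed 32-bit value.

```c
#include <stdint.h>

/* Whole seconds of uptime that a signed 32-bit microsecond counter can hold. */
static int64_t int32_us_range_seconds(void)
{
    return (int64_t)INT32_MAX / 1000000;
}
```

The result is 2147 seconds, a little under 36 minutes, after which a timestamp stored in a 32-bit int no longer matches the int64_t value it was copied from.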
while (state != SDMMC_IDLE) {
    if (esp_timer_get_time() - t0 > SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS) {
The name of this is confusing (to me, at least), since it's not really the transaction timeout being implemented here. Transaction timeout is defined by cmdinfo->timeout_ms. This is more like a state machine timeout? What kind of situation does it protect against?
Also, you can't return at this point, since the request mutex is still held (see the out bail-out path below).
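The point about the held mutex can be sketched as follows. This is not the driver's code: the lock is a plain counter, and err_t, lock_take, lock_give and do_transaction are hypothetical stand-ins chosen so the example runs off-target.

```c
#include <stdbool.h>

typedef int err_t;
#define ERR_OK      0
#define ERR_TIMEOUT 1

/* Hypothetical stand-in for the request mutex: counts outstanding takes. */
static int s_lock_depth = 0;
static void lock_take(void) { s_lock_depth++; }
static void lock_give(void) { s_lock_depth--; }

static err_t do_transaction(bool simulate_timeout)
{
    err_t ret = ERR_OK;
    lock_take();
    if (simulate_timeout) {
        ret = ERR_TIMEOUT;
        goto out;   /* not `return`: the lock must be released first */
    }
    /* ... normal transaction work would go here ... */
out:
    lock_give();
    return ret;
}
```

Routing every exit through a single label keeps the release in one place, so an early error cannot leave the lock held.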
Correct me if I'm wrong, but it is not just a state machine. It seems like it does active communication with the SD card, which is why I added a timeout.
For example:
static esp_err_t process_events(sdmmc_event_t evt, sdmmc_command_t* cmd,
        sdmmc_req_state_t* pstate, sdmmc_event_t* unhandled_events)
{
    while (next_state != state) {
        state = next_state; // note: this might still be the same state as before!
        switch (state) {
            case SDMMC_SENDING_DATA:
                .....
                // it looks like this code will loop until some
                // stop condition is set in the peripheral
                break;
        }
    }
}
The waiting-for-the-peripheral should happen outside of process_events, in the handle_event -> sdmmc_host_wait_for_event, which does know how to time out.
process_events is only updating the sdmmc_req_state_t* pstate based on events recorded in sdmmc_event_t* unhandled_events. The loop in process_events has a condition while (next_state != state) {, which basically means that we are only looping while the state is changing. If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event until the next event arrives.
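The loop shape described here can be illustrated with a simplified sketch. It is not the driver's process_events: the states, event flags and the advance function are hypothetical, and the passes counter exists only to make the number of iterations observable.

```c
/* Hypothetical states and event flags, loosely modelled on the discussion. */
typedef enum { ST_SENDING_CMD, ST_SENDING_DATA, ST_IDLE } state_t;
#define EVT_CMD_DONE  (1u << 0)
#define EVT_DATA_DONE (1u << 1)

/* Advance the state using only already-recorded events in *pending.
 * The loop repeats only while a pass changed the state, so it never
 * waits on hardware; waiting belongs to the caller. */
static state_t advance(state_t state, unsigned *pending, int *passes)
{
    state_t next = state;
    do {
        state = next;
        (*passes)++;
        switch (state) {
        case ST_SENDING_CMD:
            if (*pending & EVT_CMD_DONE) {
                *pending &= ~EVT_CMD_DONE;   /* consume the event */
                next = ST_SENDING_DATA;
            }
            break;
        case ST_SENDING_DATA:
            if (*pending & EVT_DATA_DONE) {
                *pending &= ~EVT_DATA_DONE;
                next = ST_IDLE;
            }
            break;
        case ST_IDLE:
            break;
        }
    } while (next != state);
    return state;
}
```

Each pass either consumes a recorded event or ends the loop, so in this sketch the number of passes is bounded by the number of pending events plus one.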
"If we stay in the same state, then the function exits and we get back to sdmmc_host_wait_for_event"
So, could a broken SD card prevent us from exiting? i.e. by returning new events infinitely?
lmk if I should remove this check. From my understanding it seems like it could be hit with a broken SD card.
Not really, it's more like if the SDMMC peripheral suddenly stops following its documented state machine. I mean, it's okay to not trust the hardware, but if we sprinkle checks throughout all the drivers for the chip peripheral not following its spec, the code will be a mess to read and the code size will be impacted. So I do think this one is unnecessary.
(I agree with the other ones which are card behavior dependent, though.)
Okay. removed!
components/sdmmc/CMakeLists.txt
Outdated
@@ -5,7 +5,7 @@ idf_component_register(SRCS "sdmmc_cmd.c"
                        "sdmmc_mmc.c"
                        "sdmmc_sd.c"
                        INCLUDE_DIRS include
-                       REQUIRES driver
+                       REQUIRES driver esp_timer
esp_timer could be in PRIV_REQUIRES
components/sdmmc/sdmmc_cmd.c
Outdated
/* SD mode: wait for the card to become idle based on R1 status */
while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
    // TODO: add some timeout here
    if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {
-    if (esp_timer_get_time() - t0 > SDMMC_WRITE_SECTORS_DMA_TIMEOUT_MS) {
+    if (esp_timer_get_time() - t0 > SDMMC_READY_FOR_DATA_TIMEOUT_MS) {
(the data has already been written at this point, we are waiting for the card to become idle)
components/sdmmc/sdmmc_cmd.c
Outdated
while (!host_is_spi(card) && !(status & MMC_R1_READY_FOR_DATA)) {
    // TODO: add some timeout here
    if (esp_timer_get_time() - t0 > SDMMC_READ_SECTORS_DMA_TIMEOUT_MS) {
likewise, suggest SDMMC_READY_FOR_DATA_TIMEOUT_MS
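As a rough sketch of the wait-for-ready pattern under review, the loop below polls a readiness callback and gives up after a timeout. Everything here is hypothetical scaffolding (err_t, the function-pointer types, the fake clock and card); on ESP-IDF the clock argument could be esp_timer_get_time, which returns microseconds as int64_t.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int err_t;
#define ERR_OK      0
#define ERR_TIMEOUT 1

typedef int64_t (*clock_us_fn)(void);  /* returns a microsecond timestamp */
typedef bool (*ready_fn)(void);        /* true once the card reports ready */

/* Poll is_ready() until it succeeds or timeout_ms milliseconds pass. */
static err_t wait_until_ready(ready_fn is_ready, clock_us_fn now_us, int64_t timeout_ms)
{
    const int64_t t0 = now_us();
    while (!is_ready()) {
        if (now_us() - t0 > timeout_ms * 1000) {
            return ERR_TIMEOUT;
        }
    }
    return ERR_OK;
}

/* Fakes for off-target experiments: every clock read advances time by 100 us. */
static int64_t s_fake_now_us = 0;
static int64_t fake_clock_us(void) { s_fake_now_us += 100; return s_fake_now_us; }
static bool never_ready(void) { return false; }
static bool always_ready(void) { return true; }
```

Passing the clock in as a parameter is only a convenience for testing the loop without hardware; a driver would normally call its time source directly.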
components/sdmmc/sdmmc_cmd.c
Outdated
err = sdmmc_send_cmd_send_status(card, &status);
if (err != ESP_OK) {
    ESP_LOGE(TAG, "read: status cmd failed");
suggest
-    ESP_LOGE(TAG, "read: status cmd failed");
+    ESP_LOGE(TAG, "%s: sdmmc_send_cmd_send_status returned 0x%x", __func__, err);
(similar to other error messages in this file)
if (esp_timer_get_time() - t0 > SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {
    return ESP_ERR_TIMEOUT;
}
Can be combined with the loop condition,
while (esp_timer_get_time() - t0 < SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS) {
True, but I think it's more readable to keep all the checks identical; combining them adds unnecessary complexity, IMO. Plus, readability-wise, it makes the return value less immediately clear.
Fair enough, I just thought someone might ask me to change this during code review. Let's leave it as is, for now.
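For comparison, a sketch of the combined-condition variant being discussed might look like this. The names (err_t, wait_combined, the fake clock and callbacks) are hypothetical and exist only so the trade-off can be tried off-target.

```c
#include <stdbool.h>
#include <stdint.h>

typedef int err_t;
#define ERR_OK      0
#define ERR_TIMEOUT 1

typedef int64_t (*clock_us_fn)(void);  /* returns a microsecond timestamp */
typedef bool (*done_fn)(void);         /* true once the operation finished */

/* Timeout folded into the loop condition instead of an inner check. */
static err_t wait_combined(done_fn done, clock_us_fn now_us, int64_t timeout_us)
{
    const int64_t t0 = now_us();
    bool finished = false;
    while (!finished && now_us() - t0 < timeout_us) {
        finished = done();
    }
    /* The exit reason has to be worked out after the loop. */
    return finished ? ERR_OK : ERR_TIMEOUT;
}

/* Fakes: every clock read advances time by 50 us. */
static int64_t s_ticks_us = 0;
static int64_t ticking_clock_us(void) { s_ticks_us += 50; return s_ticks_us; }
static bool never_done(void) { return false; }
static bool done_now(void) { return true; }
```

The loop body is shorter, but the timeout result is no longer returned at the point where it is detected, which is the readability cost raised in the reply above.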
#define SDMMC_HOST_RESET_TIMEOUT_MS 5000
#define SDMMC_HOST_DO_TRANSACTION_TIMEOUT_MS 5000
#define SDMMC_HOST_CLOCK_UPDATE_CMD_TIMEOUT_MS 1000
#define SDMMC_HOST_START_CMD_TIMEOUT_MS 1000
I think it's not necessary to define these in a public header file, since these are an implementation detail.
There is no use for these macros in the application code.
(Once something is defined in a public header file, we need to take care to keep it backwards compatible.)
Thanks for the review, Ivan!
Force-pushed from aeaecff to 4450fd0.
Force-pushed from 4450fd0 to d7324be.
sha=d7324be0348dd9f4990f4762b193427b4f8e05f2
@igrr any updates? This seemed like a good, simple change.
Bump. Any blockers?
Hi @chipweinberger. The MR is currently in the review stage, needing just a few last accepts. Sorry for the wait. Hopefully it will be merged soon.
Just curious, what does this comment mean?
This type of comment is a trigger for creating an internal merge request. This change is in review now; it was left unattended for a while but was recently picked up again.
The pull request was merged to the master branch. Thank you for your contribution :) I will close this on GH as canceled, but your commits are merged and will be shown in the commit history after the GH master branch syncs with our internal master.
Related Issue: #10531
When the SD card is not responsive, we need to have reasonable timeouts to prevent infinite loops.