[NAND] Handling of bad blocks and ECC errors #43

Open
stefano-zanotti opened this issue Apr 12, 2024 · 0 comments

I can't figure out if and how LevelX handles these two error conditions:
A) blocks that go bad (not factory-marked as bad)
B) correctable ECC errors

For bad blocks: when a block must be erased, the driver function LX_NAND_FLASH.lx_nand_flash_driver_block_erase is called.
This function will trigger an erase in the NAND chip, which can fail if the block has gone bad. In this scenario, the block wasn't bad at first, so it is not yet marked as bad, and LevelX still thinks it is good.
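
The failure mode described above can be sketched as follows. This is a minimal, self-contained illustration and not LevelX code: the chip is simulated, and every name starting with `sketch_`, `sim_` or `SKETCH_` is hypothetical. The only hardware fact it relies on is that the ONFI status register reports a failed erase in bit 0 (FAIL). A real driver would return LevelX status values instead of the placeholder codes used here.

```c
#include <assert.h>
#include <stdint.h>

/* Placeholder status codes for this sketch (hypothetical, not LevelX values). */
#define SKETCH_SUCCESS 0u
#define SKETCH_ERROR   1u

/* ONFI status register bits: FAIL (bit 0) is set when the last
   program or erase operation failed; RDY (bit 6) means ready. */
#define NAND_STATUS_FAIL  0x01u
#define NAND_STATUS_READY 0x40u

#define SKETCH_BLOCK_COUNT 8u

/* Simulated chip state: a nonzero entry means the block has worn out
   at runtime and will no longer erase successfully. */
static uint8_t sim_worn_out[SKETCH_BLOCK_COUNT];

/* Stands in for "issue the erase command, wait, read the status register". */
static uint8_t sim_nand_erase_and_read_status(uint32_t block)
{
    assert(block < SKETCH_BLOCK_COUNT);
    uint8_t status = NAND_STATUS_READY;
    if (sim_worn_out[block]) {
        status |= NAND_STATUS_FAIL;
    }
    return status;
}

/* Hypothetical driver erase routine: reports an error when the chip
   signals a failed erase, which is how a block that has gone bad
   first becomes visible to the caller. */
unsigned sketch_driver_block_erase(uint32_t block)
{
    if (block >= SKETCH_BLOCK_COUNT) {
        return SKETCH_ERROR;
    }
    uint8_t status = sim_nand_erase_and_read_status(block);
    return (status & NAND_STATUS_FAIL) ? SKETCH_ERROR : SKETCH_SUCCESS;
}
```

At this point the block is not marked bad anywhere; the only trace of the problem is the error code returned to the caller.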
The function can return an error, and all the LevelX functions that call it seem to check its result. These functions are:
lx_nand_flash_block_data_move
lx_nand_flash_driver_block_erase
lx_nand_flash_driver_block_erased_verify
lx_nand_flash_format
lx_nand_flash_metadata_allocate
lx_nand_flash_sector_release
lx_nand_flash_sector_write
Upon error, they all call _lx_nand_flash_system_error, which in turn calls the driver function LX_NAND_FLASH.lx_nand_flash_driver_system_error.
However, after this, as the error propagates up the call stack, one of two things seems to happen:

  1. the whole operation fails because of the error
  2. the error is ignored (e.g. when _lx_nand_flash_metadata_write calls _lx_nand_flash_metadata_allocate)
    When the error is ignored, this also seems to corrupt the LevelX internal data: the operation itself is reported as successful, but then LevelX starts misbehaving (e.g. in later operations, it asks the driver to access the nonexistent block 0xFFFF).

I tried calling _lx_nand_flash_block_status_set from within the driver function lx_nand_flash_driver_system_error, to let LevelX know that the block is bad, but it didn't work; the call seemed to succeed, but LevelX misbehaved anyway.
Also, I don't think I can just mark the block as bad in hardware, because LevelX would not know about it.
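
One possible stopgap, sketched below, is to have the driver keep its own record of blocks that failed at runtime, so that the application can at least see them and decide what to do outside the error callback. This does not tell LevelX anything and is not a confirmed LevelX mechanism. All names are hypothetical, and the callback signature `(error_code, block, page)` is an assumption made for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAX_RUNTIME_BAD  16u
#define SKETCH_ERR_ERASE_FAILED 1u   /* hypothetical error code */

/* Driver-side record of blocks that failed after the factory scan. */
typedef struct {
    uint32_t blocks[SKETCH_MAX_RUNTIME_BAD];
    size_t   count;
    unsigned overflowed;   /* set when a failure could not be recorded */
} sketch_bad_block_log;

static sketch_bad_block_log g_bad_log;

/* Returns 1 if the block is already recorded, 0 otherwise. */
int sketch_bad_log_contains(const sketch_bad_block_log *log, uint32_t block)
{
    assert(log != NULL);
    for (size_t i = 0; i < log->count; i++) {
        if (log->blocks[i] == block) {
            return 1;
        }
    }
    return 0;
}

/* Returns 1 if newly recorded, 0 if already present, -1 if the log is full. */
int sketch_bad_log_record(sketch_bad_block_log *log, uint32_t block)
{
    assert(log != NULL);
    if (sketch_bad_log_contains(log, block)) {
        return 0;
    }
    if (log->count == SKETCH_MAX_RUNTIME_BAD) {
        log->overflowed = 1;
        return -1;
    }
    log->blocks[log->count++] = block;
    return 1;
}

/* Hypothetical system-error callback: it only records the failing block
   and returns, leaving any recovery to code that runs later. */
unsigned sketch_system_error(unsigned error_code, uint32_t block, uint32_t page)
{
    (void)page;
    if (error_code == SKETCH_ERR_ERASE_FAILED) {
        (void)sketch_bad_log_record(&g_bad_log, block);
    }
    return 0;
}
```

Keeping the callback this small avoids re-entering LevelX while it is in the middle of an operation, which may be related to the misbehavior described above, although that is only a guess.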

How can I handle these blocks that go bad? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error), and LevelX will automatically take care of it during the next operation?

For correctable ECC errors: when a page is read and an ECC error is detected and corrected, the data can be used normally. This works. However, to prevent the data from accumulating more errors and becoming uncorrectable (or even undetectable) in the future, the affected data should be moved to another page, so that the data and the ECC code are rewritten and restored to a zero-error condition.
LevelX doesn't seem to do that.
The page will eventually be moved elsewhere as a side effect of other operations, which removes the error. However, this can happen arbitrarily far in the future, especially if the data is mostly read and rarely written, so there is no guarantee about when the error will disappear, and in the meantime it might get worse. So we cannot simply leave the error in place and wait.
In case of a corrected ECC error, I return the error LX_NAND_ERROR_CORRECTED from these driver functions:
LX_NAND_FLASH.lx_nand_flash_driver_pages_read
LX_NAND_FLASH.lx_nand_flash_driver_pages_copy
The error is then handled by these LevelX functions:
lx_nand_flash_metadata_allocate
lx_nand_flash_open
lx_nand_flash_sector_read
lx_nand_flash_sector_release
lx_nand_flash_sector_write
They handle it by calling lx_nand_flash_driver_system_error and then continuing with the operation, without reporting a failure.
This is ok, but it seems that LevelX never tries to move the page elsewhere to remove the error.
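
To make the kind of "repair" meant here concrete, the sketch below shows a small scrub queue: the driver records each page whose read needed ECC correction, and a later task drains the queue and refreshes those pages. This is only an illustration under stated assumptions. Every name is hypothetical, the refresh step is a stub, and how to map a physical block/page back to a logical sector that could be rewritten through LevelX is left open.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_SCRUB_CAPACITY 8u

typedef struct {
    uint32_t block;
    uint32_t page;
} sketch_page_ref;

/* Fixed-size ring buffer of pages that had a corrected ECC error. */
typedef struct {
    sketch_page_ref items[SKETCH_SCRUB_CAPACITY];
    size_t head;    /* index of the oldest entry */
    size_t count;   /* number of valid entries */
} sketch_scrub_queue;

/* Returns 1 if queued, 0 if already queued, -1 if the queue is full. */
int sketch_scrub_push(sketch_scrub_queue *q, uint32_t block, uint32_t page)
{
    assert(q != NULL);
    for (size_t i = 0; i < q->count; i++) {
        const sketch_page_ref *r =
            &q->items[(q->head + i) % SKETCH_SCRUB_CAPACITY];
        if (r->block == block && r->page == page) {
            return 0;
        }
    }
    if (q->count == SKETCH_SCRUB_CAPACITY) {
        return -1;
    }
    sketch_page_ref *slot =
        &q->items[(q->head + q->count) % SKETCH_SCRUB_CAPACITY];
    slot->block = block;
    slot->page = page;
    q->count++;
    return 1;
}

/* Returns 1 and fills *out if an entry was removed, 0 if the queue is empty. */
int sketch_scrub_pop(sketch_scrub_queue *q, sketch_page_ref *out)
{
    assert(q != NULL && out != NULL);
    if (q->count == 0) {
        return 0;
    }
    *out = q->items[q->head];
    q->head = (q->head + 1) % SKETCH_SCRUB_CAPACITY;
    q->count--;
    return 1;
}

/* Refresh callback type: returns 0 when the page was rewritten. */
typedef int (*sketch_refresh_fn)(uint32_t block, uint32_t page);

/* Stub refresh used for illustration; a real one would rewrite the data. */
int sketch_refresh_stub(uint32_t block, uint32_t page)
{
    (void)block;
    (void)page;
    return 0;
}

/* Drains the queue and returns how many pages were refreshed successfully.
   Entries whose refresh fails are dropped, not retried. */
size_t sketch_scrub_drain(sketch_scrub_queue *q, sketch_refresh_fn refresh)
{
    size_t refreshed = 0;
    sketch_page_ref ref;
    while (sketch_scrub_pop(q, &ref)) {
        if (refresh(ref.block, ref.page) == 0) {
            refreshed++;
        }
    }
    return refreshed;
}
```

In this arrangement the driver read routine would call `sketch_scrub_push` whenever it is about to return the corrected-error status, and the drain would run from application context, outside any LevelX call.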

How can I handle this "repair" of the errored page? Should I call some LevelX utility, inside the system error function or elsewhere?
Should I just ignore it (just report the error)?
