Skip to content

Fix write error on board reset#88

Closed
kaetemi wants to merge 1 commit into
adafruit:masterfrom
kaetemi:feature/fix-reset-write-error
Closed

Fix write error on board reset#88
kaetemi wants to merge 1 commit into
adafruit:masterfrom
kaetemi:feature/fix-reset-write-error

Conversation

@kaetemi
Copy link
Copy Markdown
Contributor

@kaetemi kaetemi commented Mar 12, 2021

Currently (at least, for me) the reset of the board is always causing Windows to report a write error (which is not very user friendly) when the file is finished uploading.

Deferring the reset, until after tud_task has completed, appears to solve this.

(It does not fix the write error case when multiple firmware families are concatenated in a single file, though. Is there any mechanism to request a safe eject from the OS?)

@hathach
Copy link
Copy Markdown
Member

hathach commented Mar 12, 2021

hmm that is weird, the while(1) {} I added in recent commit is to solve this issue. Look like it indeed cause the issue. Without the while(1) at tud_msc_write10_complete_cb(). Will the issue occur with your setup ?

My reason is that tud_msc_write10_complete_cb() is invoked after WRITE10 is complete, returned from that function will allow tinyusb stack to queue and receive more SCSI command, and board reset could occur before device response and cause the error. Maybe OS expect device to also response with TEST UNIT READY after WRITE10 or something. I will do more in-depth test.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 12, 2021

Ah. I had the issue before you added the while(1). I just merged that into my workaround. Let's try.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 12, 2021

With while(1) in tud_msc_write10_complete_cb (as currently on master): Error when file is uploaded.

With while(1) removed there (as it was before): Error when file is uploaded.

With this workaround (including the while(1): No error.

With this workaround (without the while(1)): No error.

Tested with tinyusb stack that comes with the Pico SDK, with Windows 10.

@kaetemi kaetemi marked this pull request as draft March 12, 2021 18:59
@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 12, 2021

My reason is that tud_msc_write10_complete_cb() is invoked after WRITE10 is complete, returned from that function will allow tinyusb stack to queue and receive more SCSI command, and board reset could occur before device response and cause the error. Maybe OS expect device to also response with TEST UNIT READY after WRITE10 or something. I will do more in-depth test.

Hmmm. Perhaps even deferring the reset until the main loop has cycled at least once without any commands received, could improve behaviour more? Not familiar with how the USB stack queues up these commands, but at least whatever is queued should get responded to then...

@henrygab
Copy link
Copy Markdown
Collaborator

henrygab commented Mar 12, 2021

Do you have a USB bus trace (or even storage command block level trace), as viewed from the host PC?
(e.g., http://www.bustrace.com/, or other form that can be readily viewed)

There may be interactions caused by deeply rooted architectural decisions in both ghostfat and windows. (note: my windows knowledge is from >10 years ago, so it might be dated).

Random thoughts on potential examples


GhostFAT sets the BPB to indicate it has a unique volume serial number, volume label, and file system label:
image

According to Microsoft's FatGen103 (see section for BPB "Starting at Offset 36", around page 10), the volume ID enables "volume tracking on removable media".


Generally, the file system and memory manager in Windows used to work hard to optimize the user experience. In the past, this included caching writes to the media, for a period of time. This is why it used to be required to "safely eject" USB flash drives, as writes would be lost when users disconnected the USB device before Windows was done flushing the cached data.

I think Windows 10 is a bit "safer", and no longer caches the data for USB flash drives? Not sure?


Another item I've noticed on ghostfat volumes... Windows believes that it is able to create files on the media. When I previously used command.exe, and listed the file system of a ghostfat volume, I think I sometimes saw additional system-generated directories... maybe something was generating per-volume metadata? This is an ongoing risk for ghostfat ... any OS might send writes to ghostfat, get a success message, but then fail in unexpected ways if the OS needs to read that data back from the device.

Most times, the device won't have to provide that data, as the OS cached it. But ... it's still a risk.


Finally, there's no guarantee that the file data will actually be sent to the device in LBA order. The storage stack was notorious for this in the past, and I believe the USB storage stack would often switch the order of two IRPs / URBs in some cases.

Does ghostfat (or another part of this stack) ensure that the entirety of the UF2 file has been sent to the device, before it considers the write as being complete?

@henrygab
Copy link
Copy Markdown
Collaborator

Is there any mechanism to request a safe eject from the OS?

@kaetemi -- that's an excellent question. you would think that some USB storage device somewhere exposed an "eject button", but my search skills couldn't find any suggestion that this was the case.

It would be fairly useful for storage devices. Expose an "eject button" (e.g., via HID), and then get notified by the OS in some way that the OS is ready for the device's departure. Of course, the device could disappear at any time anyways, inline with USB, but ...

Unfortunately, I'm not aware of this capability being built-in to the common OSs. Maybe @hathach knows of something ... he's much deeper into the USB side than I would be.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 13, 2021

Does ghostfat (or another part of this stack) ensure that the entirety of the UF2 file has been sent to the device, before it considers the write as being complete?

UF2 contains an ID in each block and the total number of blocks in the "file". When the device got all the blocks it's interested in, it'll happily reset. From the device POV, everything's OK.

But, that block numbering does not actually apply to the whole file either, it only applies to each portion of the UF2 file respectively. It's allowed by the UF2 spec to throw multiple firmwares (or other content) into a single UF2 file. Given that they'll have different family IDs set, the device will ignore everything except the part it's interested in, and reset as soon as it got what it needed. The block number in the UF2 file only applies to a single firmware contained in it, not the whole file. This case doesn't seem trivially fixable in a clean way. From the device POV, everything's still OK, but the OS will report a write failure, since the files are never fully written.

Finally, there's no guarantee that the file data will actually be sent to the device in LBA order.

Yep. At least the uf2conv.py tool appears to result in the firmware being written in some sort of reverse order on my system.

you would think that some USB storage device somewhere exposed an "eject button", but my search skills couldn't find any suggestion that this was the case.

Looking through some documentation, it seems there's a command called PREVENT ALLOW MEDIUM REMOVAL, but I'm going to assume here that some OS will just always flag this, and not just when it's busy. I do vaguely recall CD-ROM drives back in the day refusing to eject while they were busy, but that might've just been internal behaviour...

USB bus trace

I'll try that out.


So, there's two problem cases..

  • OS reporting a write error when the file is completely written. (Due to some other pending commands.)
  • OS reporting a write error when the file is partially written, but the useful section was fully written. (Due to the partial file.)

A guess... Maybe it's just writing to the FAT afterwards.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 13, 2021

In the past, this included caching writes to the media, for a period of time.

Then it would not give write errors when the user is copying the file, though. Just the badly ejected media notification (which in this situation is more acceptable than "write failure" with the giant red X error icon). (Although I guess the caching might probably delay things for too long.)

@henrygab
Copy link
Copy Markdown
Collaborator

One option to consider:
After all the needed writes are received, wait until no writes occur for a time (e.g., 1 second?).

We're well into "hack" territory here.

As to prevent/allow medium removal ... that command only applies to devices with removable media. CDROMs, Iomega ZIP / JAZZ drives, etc. Here, the UF2 device is removable, not the media in the device, so it doesn't apply.

@hathach
Copy link
Copy Markdown
Member

hathach commented Mar 14, 2021

@kaetemi please try to use the tinyusb that is included with this repo, that's way we are testing the same software. There are lots of improvement for rp2040 in latest release https://github.com/hathach/tinyusb/releases/tag/0.9.0 . I will also bump up tinyusb version for this repo soon as well.

To be honest, this walk-around isn't look good to me. It delay the reset a bit (probably enough to response with TEST_UNIT ready) and could have timing/race condition. We should reset at the right time (after TEST_UINIT_READY) if it is what windows is looking for. I am mostly on Linux which handle this differently, I will try to test with windows. However, since your setup is unique I am not sure if I could reproduce it on other platform.

Meanwhile, can you re-compile the project with CFG_TUSB_DEBUG=2, it will output the usb stack activities via rp2040 uart similar to LOG=2 here https://github.com/hathach/tinyusb/blob/master/docs/getting_started.md#log and post your log here if the issue still occur (it could cause some timing delay). It will give a hint at which scsi it failed to response.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 14, 2021

Here's the logs I'm getting, two variations. (With this workaround enabled, and some extra prints added.)

tud_msc_write10_cb 0, 66181, 0, 2000f7d4, 4096
Write firmware addr 0x0
Write firmware addr 0x100
Write firmware addr 0x200
Write firmware addr 0x300
Write firmware addr 0x400
Write firmware addr 0x500
Write firmware addr 0x600
Write firmware addr 0x700
  Queue EP 01 with 4096 bytes ... OK
USBD Xfer Complete on EP 01 with 4096 bytes 
  MSC xfer callback
  SCSI Data
tud_msc_write10_cb 0, 66189, 0, 2000f7d4, 4096
Write firmware addr 0x800
Write firmware addr 0x900
Write firmware addr 0xa00
Write firmware addr 0xb00
Write firmware addr 0xc00
Write firmware addr 0xd00
Write firmware addr 0xe00
Write firmware addr 0xf00
board_flash_flush
tud_msc_write10_complete_cb 0 <----------------------------------
  Queue EP 81 with 13 bytes ... OK
USBD Xfer Complete on EP 81 with 13 bytes
  MSC xfer callback
  SCSI Status: 0
  Queue EP 01 with 31 bytes ... OK
USBD Xfer Complete on EP 01 with 31 bytes
  MSC xfer callback
  SCSI Command: Write10
  Queue EP 01 with 2048 bytes ... OK
Write firmware addr 0x0
Write firmware addr 0x100
Write firmware addr 0x200
Write firmware addr 0x300
Write firmware addr 0x400
Write firmware addr 0x500
Write firmware addr 0x600
Write firmware addr 0x700
  Queue EP 01 with 4096 bytes ... OK
USBD Xfer Complete on EP 01 with 4096 bytes
  MSC xfer callback
  SCSI Data
tud_msc_write10_cb 0, 66189, 0, 2000f7d4, 4096
Write firmware addr 0x800
Write firmware addr 0x900
Write firmware addr 0xa00
Write firmware addr 0xb00
Write firmware addr 0xc00
Write firmware addr 0xd00
Write firmware addr 0xe00
Write firmware addr 0xf00
board_flash_flush
tud_msc_write10_complete_cb 0 <----------------------------------
  Queue EP 81 with 13 bytes ... OK
USBD Xfer Complete on EP 81 with 13 bytes
  MSC xfer callback
  SCSI Status: 0
  Queue EP 01 with 31 bytes ... OK
USBD Xfer Complete on EP 01 with 31 bytes
  MSC xfer callback
  SCSI Command: Test Unit Ready
tud_msc_test_unit_ready_cb 0
  Queue EP 81 with 13 bytes ... OK
USBD Xfer Complete on EP 81 with 13 bytes
  MSC xfer callback
  SCSI Status: 0
  Queue EP 01 with 31 bytes ... OK
USBD Xfer Complete on EP 01 with 31 bytes
  MSC xfer callback
  SCSI Command: Write10
  Queue EP 01 with 2048 bytes ... OK

The status is sent after the callback. That might be the cause?
https://github.com/raspberrypi/tinyusb/blob/pico/src/class/msc/msc_device.c#L628
https://github.com/hathach/tinyusb/blob/master/src/class/msc/msc_device.c#L426

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 14, 2021

Swapping the callback to happen after the status is sent appears to fix the issue.

Log at the end becomes as follows.

board_flash_flush
  Queue EP 81 with 13 bytes ... OK
tud_msc_write10_complete_cb 0

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 14, 2021

Oh. You're right. There's some different behaviour there.

This is under MSC_STAGE_STATUS in Pico SDK tinyusb, but under MSC_STAGE_STATUS_SENT in 0.9.0.

https://github.com/raspberrypi/tinyusb/blob/pico/src/class/msc/msc_device.c#L628
https://github.com/hathach/tinyusb/blob/master/src/class/msc/msc_device.c#L426

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 14, 2021

Log with 0.9.0.
Issue does not occur with the latest tinyusb.

board_flash_flush
  Queue EP 81 with 13 bytes ... OK
USBD Xfer Complete on EP 81 with 13 bytes
  MSC xfer callback
  SCSI Status: 0
tud_msc_write10_complete_cb 0

@kaetemi kaetemi closed this Mar 14, 2021
@hathach
Copy link
Copy Markdown
Member

hathach commented Mar 15, 2021

thank for testing it out.

@kaetemi
Copy link
Copy Markdown
Contributor Author

kaetemi commented Mar 15, 2021

Np. Thanks for the help. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants