Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Onboard flash has gaps in log file (F7 board) #7448

Closed
TheIsotopes opened this issue Jan 22, 2019 · 42 comments
Closed

Onboard flash has gaps in log file (F7 board) #7448

TheIsotopes opened this issue Jan 22, 2019 · 42 comments

Comments

@TheIsotopes
Copy link
Contributor

TheIsotopes commented Jan 22, 2019

Describe the bug
Onboard flash has gaps in log file, also the saving takes a long time
seems to concern only f7 boards ... on my f4 board with same flash chip everything is ok

edit: tested also on BF3.5.5 and same prob with f7 board

edit2: the problem has already crept in before build #1306 ... this was the first build where the target MATEKF722SE got flash support

To Reproduce
found on latest build #1379

BTFL_BLACKBOX_LOG_Mark1_20190122_100617.TXT

image

Expected behavior
i know that only from sdcard but not with onboard flash :((

Flight controller configuration

# Betaflight / MATEKF722SE (MF7S) 4.0.0 Jan 22 2019 / 06:13:21 (c39ea81b6) MSP API: 1.41

# name
name Mark1

# feature
feature -TELEMETRY
feature -OSD
feature LED_STRIP
feature AIRMODE

# beeper
beeper -BAT_CRIT_LOW
beeper -BAT_LOW

# map
map TAER1234

# led
led 0 9,2::AW:0
led 1 9,3::AW:0
led 2 9,4::AW:0
led 3 9,5::AW:0
led 4 6,5::AW:0
led 5 6,4::AW:0
led 6 6,3::AW:0
led 7 6,2::AW:0

# mode_color
mode_color 6 0 3
mode_color 6 1 6

# aux
aux 0 0 0 1600 2000 0 0
aux 1 13 1 1000 1200 0 0
aux 2 26 0 1600 1800 0 0
aux 3 31 0 1200 1400 0 0
aux 4 36 1 1800 2000 0 0

# master
set acc_hardware = NONE
set mag_hardware = NONE
set baro_hardware = NONE
set max_aux_channels = 2
set serialrx_provider = SUMD
set blackbox_device = SPIFLASH
set motor_pwm_protocol = DSHOT600
set pid_process_denom = 1

# profile
set feedforward_transition = 50
set iterm_relax_type = GYRO

# rateprofile
set roll_rc_rate = 135
set pitch_rc_rate = 135
set yaw_rc_rate = 120
set roll_expo = 15
set pitch_expo = 15
set roll_srate = 75
set pitch_srate = 75

Setup / Versions
Flight controller is an MATEK F722-mini

@TheIsotopes TheIsotopes changed the title BF4.0.0 - Onboard flash has gaps in log file (F7 board) Onboard flash has gaps in log file (F7 board) Jan 22, 2019
@TheIsotopes
Copy link
Contributor Author

new update ... tested latest inav firmware and configurator ... blackbox file has no gaps and bbf saving with configurator runs also normal

@mikeller any thoughts?

@mikeller
Copy link
Member

@TheIsotopes: This is the first time I am hearing of this happening with on board flash as well - also, I cannot verify as I don't have the hardware.

Maybe @MATEKSYS can help here?

@TheIsotopes
Copy link
Contributor Author

TheIsotopes commented Jan 23, 2019

Hhmmm ... @mikeller Could this be due to the bf configurator?
After a flash on the Inav firmware without format the flash i was able to read the protocol of the previous Betaflight firmware without gaps with the Inav configurator.

@TheIsotopes
Copy link
Contributor Author

@mikeller here my last straw
flashing latest betaflight build #1382 ... short test flight to fill blackbox

First test: back to configurator -> Blackbox Tab -> Save flash to file -> read takes damn long (as if he would read the entire flash, although only 2.4MB are occupied) and the blackbox log is fragmented

image

Second test: started MSC-Mode -> go to data explorer -> saving bbl log -> read takes only seconds and the log file is not fragmented

image

imho seems the problem to lie with bf configurator and not the firmware.

@McGiverGim can you contribute something to this problem?

@McGiverGim
Copy link
Member

No, all my FC have SD Card, so I can help too much with this. It seems clear that something is wrong in the MSP reading of the log. There is one PR open related but I don't know if it will fix the problem: betaflight/betaflight-configurator#1246

@mikeller
Copy link
Member

@TheIsotopes:

First test: back to configurator -> Blackbox Tab -> Save flash to file -> read takes damn long (as if he would read the entire flash, although only 2.4MB are occupied)

Reading flash over MSP is very slow (a couple of 10 kB/s), so this is expected.

The resulting file should not be corrupted though. Can you run configurator from a local checkout (npm start), and check in the if there are any errors in the debug console?

@TheIsotopes
Copy link
Contributor Author

@mikeller here you can see the end from debug console

MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1028654 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
msp.js:117 code: 71 - crc failed
MSPHelper.js:1821 CRC error for address 1039406 - retrying
onboard_logging.js:406 Received 1044526 bytes in 151.36s (6.74kB / s) with block size 4096.
onboard_logging.js:409 Compressed into 487662 bytes with mean compression factor of 2.1419056641690353

@mikeller
Copy link
Member

Ok, this shows a massive error rate - no surprise it's even slower than usual, and no surprise there are still errors in the resulting file - we should probably throw an error and discard the file if we exceed a certain retry threshold.

What version of the configurator are you using? And have you tried using an older version of the configurator?

@TheIsotopes
Copy link
Contributor Author

The debug console output is from the current master, the rest up there from BFC 10.4.0
I have not tested with an older one.

@MJ666
Copy link
Contributor

MJ666 commented Jan 24, 2019

On testing with a NOR flash I can seen the same behavior on the AFNG F7 dev board. It behaves the same on the the released version of the configurator and the actual master. It takes It is working with no problems on a few F4 base controllers i have.

@mikeller
Copy link
Member

Interesting that we would have a bug that is so specific to reading the flash with MSP (but not MSC) for a specific MCU type and flash type. But that means I won't be able to reproduce with the hardware that I've got here - can you have a look @MJ666?

@TheIsotopes
Copy link
Contributor Author

TheIsotopes commented Jan 24, 2019

@mikeller @MJ666 imo I think it has something to do with the huffman compression.
Inav does not use these, so I did not have any trouble reading the flash.

@jflyper
Copy link
Contributor

jflyper commented Jan 25, 2019

I think it's time to fix this: #6914.

The issue ticket claims that the FC wedges, but since the underlying issue is a racing condition, anything can happen; wedging, block drops, duplicate outputs, etc.

@mikeller
Copy link
Member

@jflyper: Looks like a viable candidate for the problem described above.

@MJ666
Copy link
Contributor

MJ666 commented Jan 26, 2019

Looking into this since a while. There is some newer code from STM for the VCP drivers. Since the BF code is customized in some way its not easy to see the differences. I have also not found the not customized STM code where the actual code is based of. This would allow more easy application of the customization to the new code. From first review it looks to be not much change in the specific areas.

@MJ666
Copy link
Contributor

MJ666 commented Jan 26, 2019

I have tested with an modified branch with some fixed applied based on #6914 and the garbage i can see when switching to CLI mode is gone. I can also read the flash content finally without errors into the file but I still have many CRC errors during read with lots of retries and the very long reading time. Without the modification i have gaps in the flash reading. My flash during test is filled with some pattern wich allows to identify the address where it is coming from. Looks to be there is still another problem remaining.

@mikeller
Copy link
Member

mikeller commented Feb 2, 2019

@MJ666: I got my hands on a MATEKF722-mini and tested with this. I can reproduce the problem, and unfortunately, #6914 #7472 does not fix it. I get approximately the same amount of CRC errors with #6914 #7472 than I get without it.

@TheIsotopes: It also does not look like it's compression related: I tested with compression turned off, and I still get CRC errors, and there is still gaps in the log.

Looks like this is related to the flash chip...

Update: Corrected pull request number,

Proposed fix for race condition of issue #6914 #7472

is too confusing for me. 😅

@TheIsotopes
Copy link
Contributor Author

@mikeller nice to read that
are there any considerations to fix this? ... at inav there is not this problem

@MJ666
Copy link
Contributor

MJ666 commented Feb 2, 2019

@mikeller with #7472 i still get the CRC errors but an error free flash output. This was not the case without #7472. Anyhow we need to find the reason for the errors. I did not get to far with this part yet.

@TheIsotopes
Copy link
Contributor Author

@MJ666 tested now #7472 with latest master
same like @mikeller ... massive crc errors and also still gaps in log

@mikeller
Copy link
Member

mikeller commented Feb 2, 2019

@TheIsotopes: Of course we want to fix this - evaluating possible approaches to a fix is part of this process.

The low level drivers in iNav and Betaflight are reasonably different, so the fact that it does work there does not necessarily mean that it's easy to fix here, or that we can graft what they are doing.

@MJ666: What board are you using? Subjectively from looking at the configurator log, I still had 100s of CRC errors with #7472, and there still were about the same number of errors in the log (again subjectively) when testing with the MATEKF722-mini. I really suspect this is dependent on the flash chip used.

@MJ666
Copy link
Contributor

MJ666 commented Feb 6, 2019

@mikeller im testing with the AFNGF7 dev environment (NuckleoF722 with sensors and other components attached). Using a W25Q128 chip as flash. Redone the tests with latest master and the file is also corrupted. @hydra found reducing block size in "onboard_logging.js" to 2048 fixes CRC errors. actual BLOCK_SIZE: 4096. With this change in the configurator the file looks to be ok.

@mikeller
Copy link
Member

mikeller commented Feb 6, 2019

@MJ666, @hydra: Good findings! Looks like this flash chip does not like blocks > 2048 bytes to be read. I'll try to confirm later with the MATEKF722-mini.

If this turns out to be the solution, then we should look into restricting the max block size for flash reads in the firmware, otherwise users using older versions of the configurator will stay broken.

@hydra
Copy link
Contributor

hydra commented Feb 6, 2019

the issue isnt in reading the flash chip. it's in the data transfer i think as flash verification within the firmware works fine.

you can read all day long at large block sizes in the firmware, its only wheb you transfer it the issue occurs.

i have not yet tried large blocksizes and connecting via UART instead of USB. I have not experimented with disabling data-caches to see if its a caching-related issue.

I think the issue is VCP related.

note, i had the issue with a different flash chip.

smaller block size is only a workaround until the underlying issue is found as small block sizes increases transfer overhead.

@mikeller
Copy link
Member

mikeller commented Feb 7, 2019

@MJ666, @TheIsotopes: I was able to replicate the fix on the MATEKF722-mini, so it looks like this flash chip does not work for MSP flash reading with block sizes > 2048. Will open a pull request to limit the maximum transfer block size to 2048.

@hydra: I tested with a number of other F4 / F7 boards with different flash chips, but I was not able to replicate the problem. If you let me know what board / chip you were able to replicate with I can enable the block size limitation for this too.

@TheIsotopes
Copy link
Contributor Author

TheIsotopes commented Feb 7, 2019

@mikeller ok, i tested it with reduced block size to 2048 and it works 👍
Now I have only the question why this issues does not occur on the MatekF405-Mini?
Both, the F405-Mini and the F722-Mini, have the same W25Q256FV flash chip ... I do not understand?!

@MJ666
Copy link
Contributor

MJ666 commented Feb 7, 2019

Looks to be an problem specific to the F7. Also have an F4 board with the W25Q128 which is working but the F7 with the same chip has the CRC problems. So the issue look not to be specific to the flash chip itself.

@mikeller
Copy link
Member

mikeller commented Feb 7, 2019

@TheIsotopes: Interesting point. I was just testing with a sampling of F4 and F7 boards with on board flash (there aren't many F7s with it), and the MATEKF722-mini was the only one showing the CRC errors. Not sure why it would come down to a specific MCU type / flash chip combination.

In this case, I think the right solution is to add a parameter that allows setting of an upper limit for the transfer block size, and set it to 2048 by default on this board.

@TheIsotopes
Copy link
Contributor Author

In this case, I think the right solution is to add a parameter that allows setting of an upper limit for the transfer block size, and set it to 2048 by default on this board.

At the moment this is definitely the best solution.

@etracer65
Copy link
Member

etracer65 commented Feb 9, 2019

I didn't see it in all the comments, but what computer are you using to download the log using the Configurator? I have an older MacBook Pro that will intermittently have lots of CRC errors when downloading the logs where my other Macs are fine. Not specific to a single flight controller or flash chip.

Also, why aren't you using mass storage mode? The logs can be transferred much faster (like an order of magnitude!).

@mikeller
Copy link
Member

mikeller commented Feb 9, 2019

@etracer65: I have reproduced the CRC errors with the MATEKF722-mini on linux, so it's not likely to be problem with the USB host.

@MJ666
Copy link
Contributor

MJ666 commented Feb 9, 2019

I'm testing on Windows 10. Using mass storage mode is always working and definitely faster for large flash chips. Anyhow if we keep this download option in the confugurator it should work. In my view reducing the buffer size (performance) is an valid workaround if we can not locate the exact source of the problem.

@hydra
Copy link
Contributor

hydra commented Feb 9, 2019

The issue is not related to just one board, I have another board under development on which this also occurs, it does it every time, not random. I have not had time to see if it's VCP related.

@TheIsotopes can you try using a serial->uart adapter and try transferring a log using serial rather than via USB? This will help identify the problem area.

I don't think we should look at limiting the block size on a per-chip/fc basis with configuration in the configurator. Perhaps a better solution is for the configurator to ask the firmware what the block size should be?

@jflyper
Copy link
Contributor

jflyper commented Feb 9, 2019

I've also been looking into this for a couple of days now.

Although I haven't found the root cause of the issue, what I've found out is that buffer size at usbd_cdc_interface.c (determined by APP_TX_DATA_SIZE) has certain effects on how errors are generated.

For example, at the current size of 2048, transfer block size (VCP_BLOCK_SIZE in BFC onboard_logging.js) of 2048 would give only a couple of errors for 10MB transfer. Increasing the transfer block size starts to generate more errors, and at 2500~3000, it reaches at a level which is not usable.

If we increase the buffer size to 4096, then transfer size of 4096 would only give a couple of errors again. Same thing happens for smaller buffer size.

So, a pragmatic solution is to increase the buffer size to 4096 and forget the problem, but I really want to know why this is happening...

BTW, the CDC_Send_DATA was modified to use ATOMIC_BLOCK when manipulating buffer pointers, and USE_HUFFMAN was dropped from a target def for this test.

@TheIsotopes
Copy link
Contributor Author

So, a pragmatic solution is to increase the buffer size to 4096 and forget the problem

@digitalentity has also taken this step in inav -> iNavFlight/inav@d83573f

@mikeller @MJ666 @jflyper what speaks against this insert in betaflight?

can you try using a serial->uart adapter and try transferring a log using serial rather than via USB? This will help identify the problem area.

@hydra I have no possibilities to test this.

@etracer65
Copy link
Member

I've come to the conclusion that this is not related to any particular flash chip, but rather a VCP buffer overflow. I can replicate the problem with both a MatekF7-mini which uses the 32MB flash, as well as the DALRCF7DUAL which has an older 16MB flash. The commonality is that both are F7 and using the HAL VCP.

Testing by increasing the VCP buffer to 4096 helps and replicates what @jflyper reported. Reduces to just a few CRC/retries.

But I suspect that there is still output buffer overflows happening accounting for the occasional CRC errors. Using an artificially slow computer I can get large numbers of crc/retries to happen even with the buffer set to 4096.

Also see the same occasional handful of crc/retry errors with a F4 using it's default VCP buffer size of 2048.

So I suspect that there's some inadequate output buffer overflow checking/prevention going on and with the extra speed of F7 we're more able to cause the overflow. This may be exacerbated by the possibly faster 32MB flash chip.

@mikeller
Copy link
Member

mikeller commented Feb 9, 2019

@TheIsotopes: Increasing the buffer size to match the maximum transfer size seems to be a reasonable approach - albeit 4096 is wrong, it will have to be slightly larger to accommodate the MSP frame header.

@etracer65: Agreed - I retested with AG3XF7, and I got the same CRC errors.

@jflyper
Copy link
Contributor

jflyper commented Feb 10, 2019

@TheIsotopes

what speaks against this insert in betaflight?

Because I want to know why it is happening and fix it.

If, in the end, it happened to be something hardware related, then I would admit the mask approach. (This was the case for premature termination of UART DMA transmission on F4.)

@MJ666
Copy link
Contributor

MJ666 commented Feb 10, 2019

I tested with an serial port at the F7 and it works with no issue. Not even a single CRC error. Also change the VCP buffer sizes to 4096 will solve the problem for me. I see no CRC errors at all. There is really a big difference in speed with VCP or Serial.

@etracer65
Copy link
Member

@TheIsotopes @MJ666 Can you try #7563 ?

It addresses some bugs in the HAL VCP transmit buffer logic. Seems to fix the problem without increasing the buffer size. Also the actual flash chip doesn't have anything to do with the problem and it just appears to be bugs in the HAL VCP version used by F7. The faster 32M flash seemed to exacerbate the problem though.

@MJ666
Copy link
Contributor

MJ666 commented Feb 10, 2019

@etracer65 for me #7563 also solves the CRC error issue with the AFNGF7. No CRC errors with a W25Q128.

@TheIsotopes
Copy link
Contributor Author

@etracer65 I can also confirm this. #7563 fixes crc errors and there are no gaps in the logfile.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants