Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Crash when USB connection is terminated #393

Closed
krichardsson opened this issue Feb 18, 2019 · 21 comments

Comments

Projects
None yet
5 participants
@krichardsson
Copy link
Member

commented Feb 18, 2019

From issue bitcraze/crazyflie-lib-python#110

How to reproduce:

  1. Start a Roadrunner or Crazyflie
  2. Connect via USB to a computer
  3. Make sure no Crazradio is connected to the computer
  4. Run the https://github.com/bitcraze/crazyflie-lib-python/blob/0.1.7/examples/basiclog.py script and terminate it with CTRL+C. Make sure the script connects via the USB cable (some thing like Connecting to usb://0 in terminal)

Symptoms:

  • The Roadrunner/Crazyflie crashes and restarts
@AlexisTM

This comment has been minimized.

Copy link

commented Feb 19, 2019

This is what happens (wireshark) when the device is disconnected.

  • The roadrunner has the ID 1.14.1
  • basiclog.py is stopped (Ctrl+Z)
  • The roadrunner has the ID 1.15.0

pcap file: https://pcap.honeynet.org.my/v1/submission/view.php?id=1111.php

image

@AlexisTM

This comment has been minimized.

Copy link

commented Feb 19, 2019

The "crash" comes from the RCC_FLAG_IWDGRST watchdog.

By commenting on the following line, we can experience that the roadrunner suspends when the USB connection suspends; Yet the roadrunner recovers once we start the connection again.

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented Feb 19, 2019

Yes, the restart is caused by the watchdog. You can turn it off by compiling with DEBUG=1 if you like.

I did a quick test this morning with a DEBUG compiled firmware and it seemed to hand instead of restarting. I did not have time to dig any more into this though.

Interesting that it seems to recover for you.

@whoenig

This comment has been minimized.

Copy link
Contributor

commented Feb 19, 2019

I have seen the same issue with crazyflie_ros/crazyflie_cpp. I did debug the problem and found that it is caused by the logging task keep filling up the message queue with new data even though USB is disconnected (I think one can also reproduce by pulling the cable). Unfortunately, I was not able to figure out why this causes a crash (the queue has a finite size), and how to fix it.

For a CF 2.0, I have not seen recovery after the issue (a reboot is required).

@AlexisTM

This comment has been minimized.

Copy link

commented Feb 20, 2019

@whoenig To have a recovery of the CF2.0, I disabled the IWDGRST watchdog (see above). This could also be done using DEBUG=1.

Are you sure it's due to overflow? I experience the CF2.0 to "wait" for the USB connection to come back. I feel it is more like a feature of STM32 for the USB driver.

@whoenig

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

I debugged this more today and found a much simpler reproducible. It's essentially enough to just query for the platformVersion once. There will be a response, but afterwards the heart beat LED is not blinking anymore (DEBUG build, I guess in release the watchdog would kick in). The CF will still respond to other messages, but all other tasks seem to be suspended until messages over USB are exchanged. A debugger just shows that time is spent in various USB related interrupts.

So far, I could not find the underlying issue (and I don't know much about the STM32 USB OTG driver that is being used). However, this shows that this is not necessarily a logging related issue.

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented Mar 7, 2019

Great! It is good to have a quick way to trigger the problem.

We suspect that it is related to task levels or interrupts taking too long but we have not investigated it further yet. The watchdog is reset on idle and if there is no idle time it will not be reset.

@whoenig

This comment has been minimized.

Copy link
Contributor

commented Mar 7, 2019

Is there an easy way/documentation on how to debug such issues? The task priority for the CRTP tasks is lower compared to some of the other tasks. I am not sure how to profile time spent in interrupts.

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented Mar 8, 2019

@whoenig unfortunately not. There is some support in FreeRTOS to monitor tasks but nothing for interrupts. Instrumenting the code is maybe the way forward?

@AlexisTM

This comment has been minimized.

Copy link

commented Mar 8, 2019

Related issue?: https://www.freertos.org/FreeRTOS_Support_Forum_Archive/November_2015/freertos_After_Running_USB_OTG_Code_FreeRTOS_hangs_WDT_Reboot_306ca43bj.html

Following the proposed solution:

All I do is call this: _getPRIMASK(); before the initialization of the USB interface. Can anyone can explain how that fixes the problem?

@tobbeanton

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

I think this is a ST Library issue. Found this in st forum. I will try the fix and get back.

@tobbeanton

This comment has been minimized.

Copy link
Member

commented Mar 14, 2019

Tried the fix and now the it is not able to connect USB at all = FAIL

@whoenig

This comment has been minimized.

Copy link
Contributor

commented Mar 15, 2019

The forum post and the USB device lib used in the crazyflie firmware are both pretty old. Do you think it would help to upgrade to the latest STM libraries? I assume this is a bigger change, since there might be a bunch of dependencies that need to be resolved.

@tobbeanton

This comment has been minimized.

Copy link
Member

commented Mar 18, 2019

@whoenig Yes I think updating the libraries is the next thing to try. @krichardsson did a try last Friday but apparently there are quite some things to be resolved so didn't get all the way. We will try again as soon as we have time.

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented Apr 25, 2019

It seems as the problem occurs when the Crazyflie tries to send data over USB when the USB connection is terminated. The USB library continues to request data (from an interrupt) after the USB connection is terminated and this seems to cause some form of infinite loop in the interrupt code (implemented in the USB lib). I think the CF code is OK and that the problem lies in the USB lib.

@whoenig, I have tried to upgrade to V2.2.1 / 17-March-2018 of the USB library (https://www.st.com/content/st_com/en/products/embedded-software/mcu-mpu-embedded-software/stm32-embedded-software/stm32-standard-peripheral-library-expansion/stsw-stm32046.html) but have not yet managed to get it to work.

@whoenig

This comment has been minimized.

Copy link
Contributor

commented Apr 26, 2019

Thanks for the update Kristoffer! What is the issue when you try to upgrade the USB library? Does it compile and simply not run, or are there issues with compiling because they changed their API in the meantime?

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented Apr 26, 2019

It did compile, the API/lib is fairly similar, the problem was that I not could get normal operation to work correctly. It worked as expected for a few packets (around 5 if I remember correctly) but after that some packets were "stuck" and interleaved with garbage. I got the feeling that it might be related to some form of buffer overrun in a circular buffer or similar, maybe cased by miss-configuration?

@evoggy spent some more time on this yesterday, not sure of the outcome yet.

@evoggy

This comment has been minimized.

Copy link
Member

commented May 2, 2019

I think I've found a fix for this, but there's still some debugging to be done. I'll write more details once it's cleaned up and pushed.

@evoggy

This comment has been minimized.

Copy link
Member

commented May 3, 2019

This should be fixed now, but please test it out a bit and re-open the issue if there's still problems.

The main issue was the fifo empty interrupt flag not being cleared correctly by the ST lib. Previously the flag would be cleared once the transfer is completed. So you would send data tough the USB, the data would be moved to the USB peripheral and the ISR would constantly run until the host had fetched the data. Since this is host dependent the time spent here may vary and we've seen some long latency propagate though the system when running logging on USB.

As a side-effect of this if the cable is pulled or the host program terminated while we're waiting to the transfer to complete the ISR would continue to run which would have nasty side-effects like queues filing up and the SYSLINK assert mentioned above being hit and the watchdog triggering.

The fix for this is detailed here (https://community.st.com/s/question/0D50X00009XkbWBSAZ/bug-in-stm32usbotgdriver-causes-100-cpu-in-fifo-empty-interrupts). Instead of waiting until the transfer is done the fifo empty interrupt is switched off after the data is sent to the USB peripheral.

But even with fixed the above there's some congestion in the system since logging continues to run, lots of data has been moved to the USB peripheral that will not be sent and so on. So together with clearing out the queues for sending data, flushing the TX in the USB peripheral and stopping the logging by fixing #428 the connection performance has been improved and is now very stable on connect/disconnect.

@evoggy evoggy closed this May 3, 2019

@krichardsson

This comment has been minimized.

Copy link
Member Author

commented May 16, 2019

USB connection in the python client does not seem to work on OSX after this change.

Not 100% sure but it seems likely that it is related to this issue

@krichardsson krichardsson reopened this May 16, 2019

@evoggy

This comment has been minimized.

Copy link
Member

commented Jun 5, 2019

The fix for this worked well on Linux, but not at all on Mac OSX or Windows. Looking into this again to see if we can fix the fix...

@evoggy evoggy closed this in 1becef7 Jun 5, 2019

ataffanel pushed a commit that referenced this issue Jun 5, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.