-
-
Notifications
You must be signed in to change notification settings - Fork 3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix issues impacting RX_STATE_MODES state duration and add DEBUG_RX_STATE_TIME #11274
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested by @AndreaCCIE in #11271
Even though I realize it will fall on deaf ears, I'm compelled to reiterate: #11228 (comment)
We can't keep continuing to react and try and carve out exceptions as discovered rather than fixing the underlying problem of tasks having the possibility of being locked out. And as predicted, here's a case that causes a failsafe and the quad falling out of the sky due to the RX task getting prevented from running. It should be easy to recognize that this tactic is not viable as there will always be unknown edge cases that will eventually trigger this problem over and over as can be seen with the multiple "fixes" made already. Further it opens the risk that any future enhancements unknowingly open the door to triggering additional exceptions. The scheduler needs to be fixed to prevent the possibility that a task can be blocked from running. The firmware is not stable and is dangerous in its current state. This obsession with having all the task executions be "deterministic" is simply not viable and unnecessary. As in the previous scheduler, the focus should be on the gyro/filter/pid tasks running on time as much as possible and ensuring that no tasks are blocked from running. Having a non critical task run a little late sometimes simply doesn't matter. Further, having the gyro/filter/pid tasks be occasionally delayed to ensure that all tasks do in fact run is necessary even if undesirable. Overall there seems to be problem seeing the bigger picture of overall firmware stability for now and the long term. Ideally the scheduler concept should have been left alone. The other work to improve the execution time consistency of tasks (Like OSD, baro, etc.) was good and just by itself would have improved the on-time task management without any additional changes in the scheduler. |
Alas we have discovered a few instances of particularly errant behaviour with respect to task execution time, which have been addressed, but as you state the consequence of such cases is a failSAFE. And determinism is viable, and desirable in a real-time system such as this. |
And this makes the firmware inherently unsafe. For example, you cannot state with any degree of certainty that there's not another "errant behavior" that has yet to be discovered that will cause the same problem or worse. It's not like this is an isolated case and no other similar problems have been discovered. And there's no mitigation to prevent future changes from exposing additional exceptions. The scheduler needs to be fixed to prevent the possibility that a task can be blocked from running. |
Hello, my issue detail: Removing my OSD adjustment will appear to allow it to update at a reasonably normally rate if it displays at all. However, even without my OSD adjustment the OSD can occasionally transition from boot screen to become stuck not showing any of my OSD elements. |
c486c9d
to
bb0a526
Compare
@etracer65 If you look at betaflight/src/main/scheduler/scheduler.c Line 652 in 3b93e0b
The RX task maintains it's own estimates of task duration by state and would previously continuously reset the anticipated execution time back to the peak it had recorded. Now, in this PR it picks up the scheduler's copy, and if it has been reduced it will reset the expectation it maintains. I've tested this on a I absolutely maintain my position that deterministic behaviour is advantageous to flight performance, and all tasks should be well behaved in this regard, but now, should any RX protocol driver fail to meet this expectation the scheduler will rapidly forgive it's transgression. |
3b93e0b
to
4e47a79
Compare
…tion time is ignored
d2499d0
to
13b500f
Compare
Tested on Edit: And now tested on a |
13b500f
to
db904b4
Compare
Tested on Matek F411 with CRSF. Avgload is between 59 and 67% After consecutive executing
|
I'm happy to see discussion between @SteveCEvans and @etracer65 on this. Keep up the good work guys. Given my experience with the H7RF target (PR #11110), which has SPI RX and CPU based PixelOSD I've noticed that recent code in master has been unusable when I port my changes to it, with the code prior to the scheduler changes the system was usable - i.e. no failsafes, no rx-loss, and 60fps osd. With the H7RF changes ported to current master, I've been getting extremely sluggish ui performance, rx loss, <10fps osd - but yes, the gyro task was on time. Others have has similar experience with the CPU usage and inability to use the FC, e.g. #11221 I've mentioned it before, it doesn't matter if the gyro task is on time if you can't use the rest of the system. There is merit in getting all the tasks to behave in a more deterministic way such that the rc input and gyro inputs to motor output latency is improved, and related math/code is right more of the time. There is a cost to doing this, in terms of increased complexity in every task and task-to-scheduler time reporting and resulting code size and ram usage. |
@hydra It is indeed good to talk. All the tasks could and should be well behaved. I've made big changes to the OSD in particular towards this goal. There have, of course been some corner cases of CPU hogging discovered along the way, such as FLASH memory writes from the RX task in response to stick commands, but now most tasks are very well behaved and this PR ensured that even when they transgress the timing contract they're not punished enough to negatively impact their function. I've only see OSD issues at all on the F411 but these now seem to be addressed in this PR. I've just been test flying a BetaFPV75X with MatekF411 compatible FC with bit banged DSHOT disabled (so I can use LED_STRIP) to enable gyro DMA at the expense of no OSD DMA, but I'm still getting the correct OSD task rate, albeit with 10% of the OSD tasks being late, typically by about 10us which is way better than we've had before.
With bit banged DSHOT enabled (at the expense of LED_STRIP) I get a very small number of late tasks. The scheduler remains locked to the gyro albeit with a bit more jitter than when using gyro DMA, but still giving us all the flight performance that makes 4.3 such a leap forward. With OSD DMA enabled by this change the OSD task is rather better behaved with regard to lateness.
On an F411 one should avoid running very high RX rates for example so as not to overload the processor, but it should be understood that with the code as it was previously the RX task was stomping all over the gyro/filter/pid code causing gyro samples to be missed, filters mis-timed, DSHOT errors etc. Tuning is all about compromise and balance, so, for example running high PID loop, with high RX frame rate and a busy OSD might have looked OK in the past, but the reality was that the samples weren't being read from the RX at a very regular rate, defeating input filtering, the gyro sampling was all over the place, so again another poor input, and with the OSD being given inappropriately high priority to run for as long as it wanted there may have been no visible issues, but the processor was struggling under the hood, having to process compromised inputs and that will have compromised the outputs. On the H7 I can leave it running for hours on end and see no late tasks. When using a processor of that performance it really should be possible to play nicely in a cooperative multitasking environment. You have chosen to focus on very deterministic 16kHz interrupt driven sampling of video data as a priority over the other flight code. And yes, there is some increased complexity such as having state machines to break long running tasks such as the RX and OSD tasks into a set of schedule-able steps, but most tasks which always completed in under 30us or so see no change as the scheduler does all the work of handling the timing. Basically the only tasks which have seen increased complexity are those which completely ignored the fact that our gyro loop takes 125us, and they couldn't hog all that time without a negative impact on performance. |
Correcting some misinformation: There were no problems with gyro filters being "mis-timed". Gyro samples are captured by the external MPU hardware every 125us completely outside the control of Betaflight. When Betaflight actually reads the samples from the gyro is completely irrelevant to filtering. For example, we could choose to use the FIFO and wait for 100 samples before reading a single one. Then running these previously captured samples through filtering with a 8khz sample rate would be perfect and not introduce any filtering artifacts. The key point is that the samples are captured in the context of an external 8khz sample rate regardless of when Betaflight retrieves the sample. There were no problems with DSHOT. The timing of the DSHOT packets was very stable. In fact using gyro EXTI to control the gyro loop timing actually introduces some instability and degrades the RPM filtering performance. This is because the EXTI timing is not as stable as previous and may not be exactly 125us - particularly with MPU6000 and its every 8th sample glitch. The reason why this introduces filtering problems for the RPM samples is that unlike the gyro hardware, the ESC does not capture samples on its own based on a external stable timebase. Instead it relies on the timing of the DSHOT packet to initiate the RPM measurement. So having a stable DSHOT packet timing is critical and a major reason for the scheduler improvements in 4.2. There were no problems with high-rate RX samples being read on time or being missed. All receiver drivers (excepting deprecated PWM and PPM) are interrupt-based. For serial receivers there is a data-ready IRQ the receives the data. For SPI there's a data-ready EXTI. Data is received in interrupt context - on time and completely. Samples are never lost. Filtering of samples is based on the measured rate of packets received from the receiver. This timing is captured in the interrupt and the timing is extremely low jitter. This is due to major improvements already made in 4.2. As previously stated, the goal of trying to maximize the ability of the gyro/filter/pid loops to run on time is beneficial, but this goal of trying to apply the same constraints to other tasks is unnecessary. This obsession of making sure that no task is ever late is a case of being unable to see the forest for the trees. Why does it matter if the OSD task is slightly late? Or if the BEEPER task is occasionally not on time? None of this can justify the costs being unnecessarily forced on the firmware and the impact of making slower MCU's pay the penalty. The only tasks that we really need to run on time are the gyro, filter, and pid. This is what the 4.2 scheduler did. The best solution would have been to take some of the improvements for long-running tasks along with DMA-based gyro reads and layer those onto the previous scheduler. |
Correcting a mis-correction. See #10525 (comment) for evidence of how one sample in eight is missed from an MPU6000 gyro if not sampled at the correct time. The timing of a DSHOT packet is very stable once it's kicked off as it's interrupt driven, but if (sorry, when) an overrunning RX or OSD task over-ran by many 10s of us (or more!) then the DSHOT communication would be delayed by that amount. Glad we agree on "the goal of trying to maximize the ability of the gyro/filter/pid loops to run on time is beneficial" but given that they all run in the same context the only way to have them called on time is to have other tasks finish on time so that this triplet of tasks is called on time. That's all the scheduler changes do. You're totally missing the point as to what late means. A late task is one that finishes later than the time allowed for by the scheduler. If it finishes late then the task that immediately follows it (for example the gyro task) will be late starting which as you have stated is undesirable. A cooperative multi-tasking scheme as we have relies on the task cooperating and the key part of that is not hogging the processor. When we have a preemptive scheduler (remember #7556 ?) then a task can "run" for as long at it likes but it will be constantly preempted by higher priority tasks. At present we only have a single task context, so unless code is interrupt driven, such as the FLASH writing code I introduced in #10525 we must guarantee that tasks cooperate. And no, the 4.2 scheduler did not make the gyro/filter/pid tasks run on time, hence all this work. For evidence, see #10573 (comment). |
Flown today without any issues at all on goggles or otherwise. Nox F411 FC. OSD quick and responsive.. |
Did some packs with a F411 / Frsky-D SPI, no issues. |
@ctzsnooze & @Zuldan Thank you. Also flown today by myself using an JHEF7DUAL with DJI Vista and OSD updates were continuous and uninterrupted. |
Fixes: #11271
RC link issues were observed on SPI RX boards when OSD profile is switched using adjustment thus:
This was caused by a call to
osdAnalyzeActiveElements()
which took some time and upset the scheduler's estimation of theRX_STATE_MODES
state execution time.Whilst debugging it was also observed that the initial call to
calcActiveAdjustmentRanges()
on startup also resulted in the state execution estimation being too high.To aid debug of RX timing, particularly with SPI RX a
DEBUG_RX_STATE_TIME
debug mode has been added.Tested on a CRAZYBEEF4R.