Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix issues impacting RX_STATE_MODES state duration and add DEBUG_RX_STATE_TIME #11274

Merged
merged 4 commits into from
Jan 16, 2022

Conversation

SteveCEvans
Copy link
Member

@SteveCEvans SteveCEvans commented Jan 11, 2022

Fixes: #11271

RC link issues were observed on SPI RX boards when OSD profile is switched using adjustment thus:

image

# adjrange
adjrange 0 0 2 900 1200 29 2 0 0
adjrange 1 0 2 1200 1800 29 2 0 0
adjrange 2 0 2 1800 2100 29 2 0 0

This was caused by a call to osdAnalyzeActiveElements() which took some time and upset the scheduler's estimation of the RX_STATE_MODES state execution time.

Whilst debugging it was also observed that the initial call to calcActiveAdjustmentRanges() on startup also resulted in the state execution estimation being too high.

To aid debug of RX timing, particularly with SPI RX a DEBUG_RX_STATE_TIME debug mode has been added.

// DEBUG_RX_STATE_TIME, RX task state duration in us
// 0 - RX_STATE_CHECK
// 1 - RX_STATE_PROCESS
// 2 - RX_STATE_MODES
// 3 - RX_STATE_UPDATE

Tested on a CRAZYBEEF4R.

haslinghuis
haslinghuis previously approved these changes Jan 11, 2022
Copy link
Member

@haslinghuis haslinghuis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested by @AndreaCCIE in #11271

@etracer65
Copy link
Member

Even though I realize it will fall on deaf ears, I'm compelled to reiterate: #11228 (comment)

A design fundamental of the previous scheduler is that a task could never be "starved" of it's execution and locked out. That seems to no longer be the case as multiple fixes have had to address this. This is dangerous - particularly when it can affect critical tasks like RX and impact disarming, failsafes, etc.

We can't keep continuing to react and try and carve out exceptions as discovered rather than fixing the underlying problem of tasks having the possibility of being locked out. And as predicted, here's a case that causes a failsafe and the quad falling out of the sky due to the RX task getting prevented from running. It should be easy to recognize that this tactic is not viable as there will always be unknown edge cases that will eventually trigger this problem over and over as can be seen with the multiple "fixes" made already. Further it opens the risk that any future enhancements unknowingly open the door to triggering additional exceptions.

The scheduler needs to be fixed to prevent the possibility that a task can be blocked from running.

The firmware is not stable and is dangerous in its current state.

This obsession with having all the task executions be "deterministic" is simply not viable and unnecessary. As in the previous scheduler, the focus should be on the gyro/filter/pid tasks running on time as much as possible and ensuring that no tasks are blocked from running. Having a non critical task run a little late sometimes simply doesn't matter. Further, having the gyro/filter/pid tasks be occasionally delayed to ensure that all tasks do in fact run is necessary even if undesirable. Overall there seems to be problem seeing the bigger picture of overall firmware stability for now and the long term.

Ideally the scheduler concept should have been left alone. The other work to improve the execution time consistency of tasks (Like OSD, baro, etc.) was good and just by itself would have improved the on-time task management without any additional changes in the scheduler.

@SteveCEvans
Copy link
Member Author

Alas we have discovered a few instances of particularly errant behaviour with respect to task execution time, which have been addressed, but as you state the consequence of such cases is a failSAFE.

And determinism is viable, and desirable in a real-time system such as this.

@etracer65
Copy link
Member

Alas we have discovered a few instances of particularly errant behaviour with respect to task execution time, which have been addressed, but as you state the consequence of such cases is a failSAFE.

And this makes the firmware inherently unsafe. For example, you cannot state with any degree of certainty that there's not another "errant behavior" that has yet to be discovered that will cause the same problem or worse. It's not like this is an isolated case and no other similar problems have been discovered. And there's no mitigation to prevent future changes from exposing additional exceptions.

The scheduler needs to be fixed to prevent the possibility that a task can be blocked from running.

phobos-
phobos- previously approved these changes Jan 12, 2022
@pitts-mo
Copy link

Hello,
I think I am affected by this and will test when merged. My travels across issue reports makes it feel like there could still be an underlying issue... but maybe we'll just need find and rework all the individual task events that seem to be causing problem.

my issue detail:
I have found unpredictable delay/lag/failure in OSD updates for my EMAX Tinyhawk II w/MATEKF411RX when using 4.3.0x builds.

Removing my OSD adjustment will appear to allow it to update at a reasonably normally rate if it displays at all.
adjrange 0 0 1 900 2100 29 1 0 0

However, even without my OSD adjustment the OSD can occasionally transition from boot screen to become stuck not showing any of my OSD elements.

@SteveCEvans
Copy link
Member Author

SteveCEvans commented Jan 12, 2022

@etracer65 If you look at

} else if ((selectedTask->taskAgePeriods > TASK_AGE_EXPEDITE_COUNT) ||
you'll see that if a task ages the scheduler will rapidly reduce the anticipated execution time and make it schedule-able again. In general the age is considered in multiples of the task period, but for the RX task I instead consider the number of times there has been an unsuccessful attempt to call the task.

The RX task maintains it's own estimates of task duration by state and would previously continuously reset the anticipated execution time back to the peak it had recorded. Now, in this PR it picks up the scheduler's copy, and if it has been reduced it will reset the expectation it maintains.

I've tested this on a CRAZYBEEF4R (as it has an SPI RX) by backing out the change I'd made to calling schedulerIgnoreTaskExecTime() whilst changing profiles via stick commands. The trace below is obtained using the debug_mode=RX_STATE_TIME where debug[2] shows the time taken in the RX task state RX_STATE_MODES. You'll see that without the call to schedulerIgnoreTaskExecTime() there is a peak to over 100us which is then rapidly reduced to 26us and the RX task continues without stalling. Monitoring the Setup tab in the configurator this does not provoke an RX_FAILSAFE as it did previously.

image

I absolutely maintain my position that deterministic behaviour is advantageous to flight performance, and all tasks should be well behaved in this regard, but now, should any RX protocol driver fail to meet this expectation the scheduler will rapidly forgive it's transgression.

ctzsnooze
ctzsnooze previously approved these changes Jan 13, 2022
haslinghuis
haslinghuis previously approved these changes Jan 13, 2022
@SteveCEvans
Copy link
Member Author

SteveCEvans commented Jan 14, 2022

Tested on CRAZYBEEF4R, MATEKF4STD, MATEKF722SE, MATEKH743.

Edit: And now tested on a MATEKF411 with 150Hz CRSF, bidirectional DSHOT (both bitbanged and not), and LED_STRIP. With cpu_overclock=108MHz the average load is about 60% and with 120MHz it's about 54%.

@haslinghuis
Copy link
Member

haslinghuis commented Jan 14, 2022

Tested on Matek F411 with CRSF.

Avgload is between 59 and 67%

After consecutive executing tasks we see OSD has some lateness:

# tasks
Task list             rate/hz  max/us  avg/us maxload avgload  total/ms   late    run reqd/us
00 - (         SYSTEM)     10       2       0    0.0%    0.0%         2      0    641       1
01 - (         SYSTEM)    985       4       2    0.3%    0.1%       226      0  63055       2
02 - (           GYRO)   7886      18      12   14.1%    9.4%     19804      0 505082       0
03 - (         FILTER)   3942      50      43   19.7%   16.9%     36242      0 252541       0
04 - (            PID)   3942      92      81   36.2%   31.9%     71030      0 252541       0
05 - (            ACC)    983      13      12    1.2%    1.1%      2425      0  62930      12
06 - (       ATTITUDE)    492      13      11    0.6%    0.5%      1238      0  31525      11
07 - (             RX)     15      32       2    0.0%    0.0%       244      0   3840      16
08 - (         SERIAL)     99    6119       3   60.5%    0.0%      1600      0   6351       3
09 - (       DISPATCH)    984       3       1    0.2%    0.0%       181      0  63056       0
10 - (BATTERY_VOLTAGE)     50       4       3    0.0%    0.0%        23      0   3197       3
11 - (BATTERY_CURRENT)     50       3       1    0.0%    0.0%        10      0   3197       0
12 - ( BATTERY_ALERTS)      5       2       1    0.0%    0.0%         1      0    320       2
13 - (         BEEPER)     99       3       2    0.0%    0.0%        29      0   6351       2
18 - (      TELEMETRY)    493       3       0    0.1%    0.0%       101      0  31521       1
20 - (            OSD)     12      35      54    0.0%    0.0%       405    108   8458      54
22 - (            CMS)     20       2       1    0.0%    0.0%         7      0   1280       2
23 - (        VTXCTRL)      5       1       1    0.0%    0.0%         1      0    320       0
24 - (        CAMCTRL)      5       2       0    0.0%    0.0%         0      0    320       0
26 - (    ADCINTERNAL)      1       3       2    0.0%    0.0%         0      0     64       2
28 - (SPEED_NEGOTIATION)     99       3       0    0.0%    0.0%        21      0   6336       1
RX Check Function                  10       6                       249
Total (excluding SERIAL)                                59.9%

Debug set to RX_STATE_TIME:
Screenshot from 2022-01-14 16-35-17

@hydra
Copy link
Contributor

hydra commented Jan 14, 2022

I'm happy to see discussion between @SteveCEvans and @etracer65 on this. Keep up the good work guys.

Given my experience with the H7RF target (PR #11110), which has SPI RX and CPU based PixelOSD I've noticed that recent code in master has been unusable when I port my changes to it, with the code prior to the scheduler changes the system was usable - i.e. no failsafes, no rx-loss, and 60fps osd. With the H7RF changes ported to current master, I've been getting extremely sluggish ui performance, rx loss, <10fps osd - but yes, the gyro task was on time.

Others have has similar experience with the CPU usage and inability to use the FC, e.g. #11221

I've mentioned it before, it doesn't matter if the gyro task is on time if you can't use the rest of the system.

There is merit in getting all the tasks to behave in a more deterministic way such that the rc input and gyro inputs to motor output latency is improved, and related math/code is right more of the time.

There is a cost to doing this, in terms of increased complexity in every task and task-to-scheduler time reporting and resulting code size and ram usage.

@SteveCEvans
Copy link
Member Author

@hydra It is indeed good to talk.

All the tasks could and should be well behaved. I've made big changes to the OSD in particular towards this goal. There have, of course been some corner cases of CPU hogging discovered along the way, such as FLASH memory writes from the RX task in response to stick commands, but now most tasks are very well behaved and this PR ensured that even when they transgress the timing contract they're not punished enough to negatively impact their function.

I've only see OSD issues at all on the F411 but these now seem to be addressed in this PR. I've just been test flying a BetaFPV75X with MatekF411 compatible FC with bit banged DSHOT disabled (so I can use LED_STRIP) to enable gyro DMA at the expense of no OSD DMA, but I'm still getting the correct OSD task rate, albeit with 10% of the OSD tasks being late, typically by about 10us which is way better than we've had before.

# tasks
Task list             rate/hz  max/us  avg/us maxload avgload  total/ms   late    run reqd/us
00 - (         SYSTEM)     10       1       1    0.0%    0.0%         6      0    146       0
01 - (         SYSTEM)    929      14       1    1.3%    0.0%       786     25  13423       1
02 - (           GYRO)   7968      17       5   13.5%    3.9%     24272      0 116501       0
03 - (         FILTER)   3984      83      74   33.0%   29.4%    178732      0  58250       0
04 - (            PID)   3984     122     114   48.6%   45.4%    246474      0  58251       0
05 - (            ACC)    900      20       3    1.8%    0.2%      2786      7  13235       3
06 - (       ATTITUDE)     99      29      21    0.2%    0.2%       878      1   1454      21
07 - (             RX)    150      44      12    0.6%    0.1%      1206      7   8584      14
08 - (         SERIAL)     99    9803      15   97.0%    0.1%      3925      0   1453      15
09 - (       DISPATCH)    931      18      11    1.6%    1.0%       627     25  13569      11
10 - (BATTERY_VOLTAGE)     50      19      14    0.0%    0.0%        81      1    728      14
11 - (BATTERY_CURRENT)     50       7       3    0.0%    0.0%        40      3    728       3
12 - ( BATTERY_ALERTS)      5       2       5    0.0%    0.0%         6      0     73       5
13 - (         BEEPER)     99       3       1    0.0%    0.0%        94      1   1453       1
18 - (      TELEMETRY)    449       5       4    0.2%    0.1%      1750     13   6613       4
19 - (       LEDSTRIP)     99      39      38    0.3%    0.3%       395      3   1452      38
20 - (            OSD)     12      46      10    0.0%    0.0%     10074    216   2502      45
22 - (            CMS)     20       2       1    0.0%    0.0%        19      0    292       2
23 - (        VTXCTRL)      5       5       4    0.0%    0.0%        30      0     73       4
24 - (        CAMCTRL)      5       1       1    0.0%    0.0%         2      0     73       0
26 - (    ADCINTERNAL)      1       3       3    0.0%    0.0%         1      0     14       3
28 - (SPEED_NEGOTIATION)     96       5       2    0.0%    0.0%        48      1   1403       2

With bit banged DSHOT enabled (at the expense of LED_STRIP) I get a very small number of late tasks. The scheduler remains locked to the gyro albeit with a bit more jitter than when using gyro DMA, but still giving us all the flight performance that makes 4.3 such a leap forward. With OSD DMA enabled by this change the OSD task is rather better behaved with regard to lateness.

# tasks
Task list             rate/hz  max/us  avg/us maxload avgload  total/ms   late    run reqd/us
00 - (         SYSTEM)     10       1       0    0.0%    0.0%         0      0     80       1
01 - (         SYSTEM)    965       4       2    0.3%    0.1%        39      1   7621       2
02 - (           GYRO)   7968      26      15   20.7%   11.9%      3471      0  63316       0
03 - (         FILTER)   3982      62      50   24.6%   19.9%      6407      0  31658       0
04 - (            PID)   3984     121     116   48.2%   46.2%     15455      0  31658       0
05 - (            ACC)    922      25      23    2.3%    2.1%       421     10   7491      23
06 - (       ATTITUDE)    100      15      14    0.1%    0.1%        49      0    791      14
07 - (             RX)    150      36       3    0.5%    0.0%       416      2   4752      18
08 - (         SERIAL)     99   23894      28  236.5%    0.2%      1197      0    791      28
09 - (       DISPATCH)    965      14      12    1.3%    1.1%        31      6   7656      12
10 - (BATTERY_VOLTAGE)     50       4       2    0.0%    0.0%         4      1    396       2
11 - (BATTERY_CURRENT)     50       5       3    0.0%    0.0%         4      0    396       3
12 - ( BATTERY_ALERTS)      5       3       2    0.0%    0.0%         0      0     40       2
13 - (         BEEPER)     99       4       2    0.0%    0.0%         5      0    791       2
18 - (      TELEMETRY)    467       3       1    0.1%    0.0%        18      4   3782       1
20 - (            OSD)     12      36       5    0.0%    0.0%        98      7   1248      61
22 - (            CMS)     20       2       1    0.0%    0.0%         1      0    159       1
23 - (        VTXCTRL)      5       1      15    0.0%    0.0%         0      0     40      15
24 - (        CAMCTRL)      5       1       0    0.0%    0.0%         0      0     40       0
26 - (    ADCINTERNAL)      2       3       3    0.0%    0.0%         0      0      8       3
28 - (SPEED_NEGOTIATION)     99       4       2    0.0%    0.0%         3      0    784       2

On an F411 one should avoid running very high RX rates for example so as not to overload the processor, but it should be understood that with the code as it was previously the RX task was stomping all over the gyro/filter/pid code causing gyro samples to be missed, filters mis-timed, DSHOT errors etc. Tuning is all about compromise and balance, so, for example running high PID loop, with high RX frame rate and a busy OSD might have looked OK in the past, but the reality was that the samples weren't being read from the RX at a very regular rate, defeating input filtering, the gyro sampling was all over the place, so again another poor input, and with the OSD being given inappropriately high priority to run for as long as it wanted there may have been no visible issues, but the processor was struggling under the hood, having to process compromised inputs and that will have compromised the outputs.

On the H7 I can leave it running for hours on end and see no late tasks. When using a processor of that performance it really should be possible to play nicely in a cooperative multitasking environment. You have chosen to focus on very deterministic 16kHz interrupt driven sampling of video data as a priority over the other flight code.

And yes, there is some increased complexity such as having state machines to break long running tasks such as the RX and OSD tasks into a set of schedule-able steps, but most tasks which always completed in under 30us or so see no change as the scheduler does all the work of handling the timing. Basically the only tasks which have seen increased complexity are those which completely ignored the fact that our gyro loop takes 125us, and they couldn't hog all that time without a negative impact on performance.

@etracer65
Copy link
Member

Correcting some misinformation:

There were no problems with gyro filters being "mis-timed". Gyro samples are captured by the external MPU hardware every 125us completely outside the control of Betaflight. When Betaflight actually reads the samples from the gyro is completely irrelevant to filtering. For example, we could choose to use the FIFO and wait for 100 samples before reading a single one. Then running these previously captured samples through filtering with a 8khz sample rate would be perfect and not introduce any filtering artifacts. The key point is that the samples are captured in the context of an external 8khz sample rate regardless of when Betaflight retrieves the sample.

There were no problems with DSHOT. The timing of the DSHOT packets was very stable. In fact using gyro EXTI to control the gyro loop timing actually introduces some instability and degrades the RPM filtering performance. This is because the EXTI timing is not as stable as previous and may not be exactly 125us - particularly with MPU6000 and its every 8th sample glitch. The reason why this introduces filtering problems for the RPM samples is that unlike the gyro hardware, the ESC does not capture samples on its own based on a external stable timebase. Instead it relies on the timing of the DSHOT packet to initiate the RPM measurement. So having a stable DSHOT packet timing is critical and a major reason for the scheduler improvements in 4.2.

There were no problems with high-rate RX samples being read on time or being missed. All receiver drivers (excepting deprecated PWM and PPM) are interrupt-based. For serial receivers there is a data-ready IRQ the receives the data. For SPI there's a data-ready EXTI. Data is received in interrupt context - on time and completely. Samples are never lost. Filtering of samples is based on the measured rate of packets received from the receiver. This timing is captured in the interrupt and the timing is extremely low jitter. This is due to major improvements already made in 4.2.

As previously stated, the goal of trying to maximize the ability of the gyro/filter/pid loops to run on time is beneficial, but this goal of trying to apply the same constraints to other tasks is unnecessary. This obsession of making sure that no task is ever late is a case of being unable to see the forest for the trees. Why does it matter if the OSD task is slightly late? Or if the BEEPER task is occasionally not on time? None of this can justify the costs being unnecessarily forced on the firmware and the impact of making slower MCU's pay the penalty. The only tasks that we really need to run on time are the gyro, filter, and pid. This is what the 4.2 scheduler did.

The best solution would have been to take some of the improvements for long-running tasks along with DMA-based gyro reads and layer those onto the previous scheduler.

@SteveCEvans
Copy link
Member Author

Correcting a mis-correction. See #10525 (comment) for evidence of how one sample in eight is missed from an MPU6000 gyro if not sampled at the correct time.

The timing of a DSHOT packet is very stable once it's kicked off as it's interrupt driven, but if (sorry, when) an overrunning RX or OSD task over-ran by many 10s of us (or more!) then the DSHOT communication would be delayed by that amount.

Glad we agree on "the goal of trying to maximize the ability of the gyro/filter/pid loops to run on time is beneficial" but given that they all run in the same context the only way to have them called on time is to have other tasks finish on time so that this triplet of tasks is called on time. That's all the scheduler changes do. You're totally missing the point as to what late means. A late task is one that finishes later than the time allowed for by the scheduler. If it finishes late then the task that immediately follows it (for example the gyro task) will be late starting which as you have stated is undesirable. A cooperative multi-tasking scheme as we have relies on the task cooperating and the key part of that is not hogging the processor. When we have a preemptive scheduler (remember #7556 ?) then a task can "run" for as long at it likes but it will be constantly preempted by higher priority tasks. At present we only have a single task context, so unless code is interrupt driven, such as the FLASH writing code I introduced in #10525 we must guarantee that tasks cooperate.

And no, the 4.2 scheduler did not make the gyro/filter/pid tasks run on time, hence all this work. For evidence, see #10573 (comment).

@ctzsnooze
Copy link
Member

ctzsnooze commented Jan 15, 2022

Flown today without any issues at all on goggles or otherwise. Nox F411 FC. OSD quick and responsive..

@Zuldan
Copy link

Zuldan commented Jan 15, 2022

Did some packs with a F411 / Frsky-D SPI, no issues.

@SteveCEvans
Copy link
Member Author

@ctzsnooze & @Zuldan Thank you.

Also flown today by myself using an JHEF7DUAL with DJI Vista and OSD updates were continuous and uninterrupted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Failsafe when switching OSD profile on CRAZYBEEF4SX1280 target
10 participants