Move failsafe processing out of the RX task into the scheduler as a real-time task #11362

SteveCEvans · 2022-01-28T23:06:36Z

Concerns have been expressed that the combination of task behaviour and the 4.3 scheduler (which applies scheduling rigidly, as is its job) could affect failsafe behaviour. Given the importance of failsafe as a safety feature, this PR moves it entirely into the domain of the scheduler. Unless a task runs for the entire failsafe timeout period or more (which would be distinctly badly behaved), we can guarantee failsafes will happen as they should. If nuisance failsafes occur as a result of this, it will only be as a consequence of very poorly behaved tasks, and we'll all have enough fingers left to count how many times we've worried about such things, with leftover fingers to point the blame.

To explain further, if we are at risk of any RX-related issue causing a failure to disarm or enter failsafe, then this solution is RX agnostic. It checks every 200ms whether a valid RX packet has been received within the failsafe timeout and, if not, triggers failsafe.
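As a rough illustration of the idea described above (not the code in this PR), a receiver-agnostic supervisory check could look like the sketch below. All names and the 200ms constant are hypothetical stand-ins, and clearing the link-down flag on the first valid packet is a simplification.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical names and values for illustration; not the PR's actual code.
#define CHECK_INTERVAL_MS 200   // how often the scheduler runs the check

typedef struct {
    uint32_t lastValidPacketMs; // stamped by whichever RX driver decoded a good packet
    uint32_t timeoutMs;         // failsafe timeout
    uint32_t lastCheckMs;       // last time the supervisory check ran
    bool linkDown;
} rxSupervisor_t;

// Called by any receiver driver when a valid packet arrives.
// Simplification: the link is treated as up again immediately.
void rxSupervisorOnValidPacket(rxSupervisor_t *s, uint32_t nowMs)
{
    s->lastValidPacketMs = nowMs;
    s->linkDown = false;
}

// Called on every scheduler pass; does real work only every CHECK_INTERVAL_MS.
// Signed differences keep the comparisons correct across uint32_t wraparound.
bool rxSupervisorPoll(rxSupervisor_t *s, uint32_t nowMs)
{
    if ((int32_t)(nowMs - s->lastCheckMs) < CHECK_INTERVAL_MS) {
        return s->linkDown;
    }
    s->lastCheckMs = nowMs;
    if ((int32_t)(nowMs - s->lastValidPacketMs) > (int32_t)s->timeoutMs) {
        s->linkDown = true;
    }
    return s->linkDown;
}
```

The point of the sketch is that the check depends only on a timestamp and the current time, so it keeps working whether or not the RX task itself gets scheduled.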

haslinghuis · 2022-01-28T23:28:02Z

Memory overrun 😞

etracer65 · 2022-01-29T00:00:49Z

Not really the type of refactoring that should happen during release candidate stage.

SteveCEvans · 2022-01-29T00:42:50Z

Not really the type of refactoring that should happen during release candidate stage.

I refer the gentleman to his previous comment.

etracer65 · 2022-01-29T01:06:23Z

I'm quite sure I never suggested refactoring how failsafe processing works. In fact, I did comment that the scheduler needs to be fixed so that tasks are not prevented from running. Refactoring failsafe processing out of the RX task is effectively admitting that it can never be made reliable. One could argue that handling failsafe in another task adds even more points of failure and risk, due to the challenges of ensuring that all tasks run reliably in the new scheduler. If the RX task ran like it's supposed to, this would be unnecessary, as has been the case for Betaflight's entire existence.

Whether this refactoring is beneficial or not in the long term is separate from the simple fact that enhancements and refactoring should not happen during the release candidate phase. Particularly for a critical function like failsafe handling. If a change like this is accepted at this point then the release candidate phase should be terminated and the release process and timing should be reevaluated.

KarateBrot · 2022-01-29T11:10:47Z

@etracer65 One could argue that handling failsafe in another task adds even more points of failure and risk due to the challenges ensuring the all tasks run reliably in the new scheduler.

In this PR the failsafe doesn't run as a normal task. It is hardcoded into the scheduler to make sure it gets executed with 100% certainty, no matter what the other tasks are doing.

ledvinap · 2022-01-29T14:19:02Z

src/main/scheduler/scheduler.c

@@ -104,6 +106,8 @@ static int16_t taskCount = 0;
 static uint32_t nextTimingCycles;
 #endif

+static timeUs_t lastFailsafeCheck = 0;


either use micros() or some other type (int32_t, uint32_t)

Should have been timeMs_t. I started off measuring in us, but then, realising the failsafe timings are all expressed in ms, switched to that.

ledvinap · 2022-01-29T14:20:19Z

src/main/scheduler/scheduler.c

@@ -490,6 +494,14 @@ FAST_CODE void scheduler(void)
                taskExecutionTimeUs += schedulerExecuteTask(getTask(TASK_PID), currentTimeUs);
            }

+            // Check for failsafe conditions without reliance on the RX task being well behaved
+            if (millis() - lastFailsafeCheck > PERIOD_RXDATA_FAILURE) {


cmp32_t or cmpTimeUs with micros(). IMO micros is preferable, reusing schedulerStartTimeUs

See above. Also millis() is very low cost.
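For reference, a minimal sketch of the signed-difference comparison being discussed. The helper is defined locally and is only assumed to behave like the codebase's cmp32(); the other names are illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

// Signed difference of two wrapping 32-bit timestamps. Assumed to match what
// the codebase's cmp32() does; defined locally so the sketch stands alone.
static inline int32_t timeDiff32(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b);
}

// True once more than periodMs has elapsed since lastMs, even if the
// millisecond counter has wrapped in between.
bool periodElapsed(uint32_t nowMs, uint32_t lastMs, uint32_t periodMs)
{
    return timeDiff32(nowMs, lastMs) > (int32_t)periodMs;
}
```

Plain unsigned subtraction also survives a single wraparound. The signed form additionally treats a timestamp slightly ahead of "now" as not yet elapsed, instead of as a huge elapsed time.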

ledvinap · 2022-01-29T14:22:59Z

src/test/unit/scheduler_unittest.cc

@@ -70,6 +70,7 @@ extern "C" {
    // set up micros() to simulate time
    uint32_t simulatedTime = 0;
    uint32_t micros(void) { return simulatedTime; }
+    uint32_t millis(void) { return simulatedTime/1000; }


Please add a comment that this implementation is good for short unit tests only (it will behave terribly after 2^32 us; someone may reuse it in flight code)
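To make the concern concrete, here is a small sketch built around the same stubs. elapsedMs() is a hypothetical helper added for illustration only.

```c
#include <stdint.h>

// Test-only stand-ins, mirroring the stubs in the diff above.
uint32_t simulatedTime = 0;                       // microseconds, wraps with uint32_t
uint32_t micros(void) { return simulatedTime; }
uint32_t millis(void) { return simulatedTime / 1000; }

// Hypothetical helper: elapsed milliseconds as calling code would compute
// them from millis().
uint32_t elapsedMs(uint32_t sinceMs) { return millis() - sinceMs; }
```

With simulatedTime at 0xFFFFFFFF the stub reports 4294967 ms. One more millisecond of simulated time wraps the microsecond counter, the stub drops back to 0, and the apparent elapsed time becomes 4290672329 ms. A real millisecond counter would be expected to keep counting instead.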

ledvinap · 2022-01-29T14:26:07Z

src/main/flight/failsafe.c

-void failsafeUpdateState(void)
+void failsafeCheckDataFailurePeriod(void)
+{
+    if ((millis() - failsafeState.validRxDataReceivedAt) > failsafeState.rxDataFailurePeriod) {


Zuldan · 2022-01-30T02:28:14Z

Tested with F7, F411 and F405 (and put the F4 CPUs under load) with a combination of Frsky/Tracer/Crossfire at low TX output power, and did some real-world testing.

Flew quads low to the ground around hillside until failsafe occurred. No issues found. Happy to test code again if it changes.

hydra · 2022-01-30T23:45:52Z

IMHO - this is fundamentally the wrong approach. the scheduler should care about tasks and tasks only, it should not have a dependency on the gyro task, the osd task, the rx code or the failsafe code. the other subsystems should tell the scheduler about themselves and it should behave accordingly. the dependencies are currently inverted from the way it should be.

I agree with @etracer in that this should not happen as part of an RC.

I think perhaps the wisest course of action, for a timely release that is, is to back out the scheduler changes whilst keeping as much of the task cleanups as possible, restore the old scheduler behavior until either it's fixed, reworked, re-designed or replaced.

SteveCEvans · 2022-01-31T00:42:39Z

This PR is recognising that there are some activities which need to be executed with a higher priority than normal tasks. Another example is deferred interrupt processing which ELRS requires to be executed promptly and not necessarily as part of the rx task. These examples are few in number and simplest to implement as I have done. It may be appropriate to create a new category of tasks, or extend the current task definitions to abstract this, but I am mindful of our space restrictions.

Other task specific code such as that relating to OSD or RX should not be in the scheduler as you highlight and will be removed in due course once these tasks are well behaved.

hydra · 2022-01-31T10:26:22Z

I think perhaps a different approach should be taken for the scheduling; there seem to be different phases of the tasks, some of which are more important than others.

This is food for thought:

With ELRS there is a very specific timing window that must be honored for the link not to be dropped. Once a packet is received and its EXTI flag is set, there is a time period in which the RX code MUST respond to the EXTI flag so that the RF receiver can be switched to the correct frequency for the next packet; otherwise that packet /will/ be lost, and if not enough packets are received the system drifts out of phase with the transmitter and the link is lost.

As @etracer noted, it's not as critical to process the gyro signal immediately, as long as you know the timestamp for which the current signal is valid, since the data and its timestamp remain valid until the next signal. That is, the gyro can record the signal in its buffer, and you then have to read it before it becomes invalid; until then it remains valid for processing. If the timestamp of the gyro EXTI is given to the filter math, rather than the 'current time', then the filters will still act appropriately.

yes, reducing the latency between gyro-signal to motor-output is still good though. However, we know that motors, due to their physical properties, don't respond instantly.

In my head, the priorities for the scheduling are as follows:

...L->| RX exti |<-H->| RF process |<-H->| RF change channel |<-L...
...L->| Gyro exti |<-M->| Gyro start read |<-M->| Gyro read complete |<-H->| Gyro process |<-M->| Update motors |<-L...

^ H = high priority, M = medium priority, L = low.

For great flight performance yes, we want to respond to gyro signals quickly, but there's not much point responding to gyro signals if the user is trying to disarm or the link is lost.

SteveCEvans · 2022-01-31T12:23:12Z

On F4 processors a typical configuration is to use bitbang DSHOT so that DMA streams are available for OSD/FLASH/SPI RX. In that case we can’t use DMA for SPI 1 and thus the gyro. In those cases the gyro will be read in the gyro task and this must happen at precisely the right time because, as you’ve noted, timing is important for filtering and also because you risk missing 1/8 gyro updates otherwise (on an MPU6000). And, as you state gyro to motor jitter should be reduced.

hydra · 2022-02-01T10:21:54Z

As I understand things, it appears this doesn't work:

failsafeUpdateState uses failsafeIsReceivingRxData to check the RX link state, stored in receivingRxData, however failsafeState.rxLinkState doesn't get set to FAILSAFE_RXLINK_DOWN, which happens only via the call to failsafeOnValidDataFailed which is only called by the RX task, in detectAndApplySignalLossBehaviour which is called from calculateRxChannelsAndUpdateFailsafe which is called from processRx which is called from taskUpdateRxMain.

If the RX task isn't scheduled correctly then this can occur too late and failsafe will still not work.

If I missed something please let me know.

Suggest closing this PR on the grounds that:

it doesn't work as intended.
the design is (arguably) wrong.

SteveCEvans · 2022-02-01T18:02:55Z

@hydra Firstly for ELRS, I am recoding the time critical bits to be interrupt driven. The incoming interrupt will trigger SPI accesses to read/clear the status (already working) and then the DMA completion kicks off either the read from the FIFO or start receiving. In either case the busy flag must be checked, and this again will be interrupt driven - currently polling using SPI, but that's wasting CPU cycles. All time critical stuff will thus happen completely outside the scheduler with only minimal ISR time required to keep it progressing, leaving the checker with less to do.

Thanks for your observation regarding the failsafe. This is a safety supervisory function, not dissimilar to a watchdog, and so it is entirely appropriate, indeed desirable, that it not be handled by the RX task. failsafeCheckDataFailurePeriod() is called to check whether valid RX data has been received within the phase 1 failsafe period, failsafeState.rxDataFailurePeriod. It does this by comparing a timestamp of the last valid data, failsafeState.validRxDataReceivedAt, against the current time. Should a timeout be detected, the appropriate failsafe actions will be taken by the call to failsafeUpdateState(). Note that none of this requires the RX task to be running.

You'll see below that failsafeCheckDataFailurePeriod() does set failsafeState.rxLinkState to FAILSAFE_RXLINK_DOWN if needed.

void failsafeCheckDataFailurePeriod(void)
 {
     if (cmp32(millis(), failsafeState.validRxDataReceivedAt) > (int32_t)failsafeState.rxDataFailurePeriod) {
         setArmingDisabled(ARMING_DISABLED_RX_FAILSAFE); // To prevent arming with no RX link
         failsafeState.rxLinkState = FAILSAFE_RXLINK_DOWN;
     }
 }

hydra · 2022-02-01T20:35:25Z

@SteveCEvans Ok, I'll have another read of what you just said and the code paths, but I still feel the approach here is fundamentally wrong with regard to the dependency inversion. Your point that it is indeed a supervisory function is valid though. It feels like the scheduler should be /told/ about a supervisor task, rather than the scheduler /knowing/ about one, and thus depending on a specific implementation of one, and all its dependencies.

Getting the dependencies right is what allows dependencies to be mocked/stubbed in test code and leads to more maintainable tests and isolated production code. Isolated code allows multiple developers to work on different aspects of the system at the same time without 'treading on each other's toes', as it were. This was one of the fundamental design factors that led to Cleanflight, on which Betaflight is based. There are other benefits, including making PRs easier to review and limiting the scope of change only to the code being changed.

SteveCEvans · 2022-02-02T02:47:31Z

@hydra totally understood, but to be able to tell the scheduler in a dynamic way would require implementation of an API to register, say, a watchdog function of which the failsafe would be an example. The scheduler would then have to traverse the list of such functions (or know there could only be one) at the expense of code space.

I think you’d like this book. Captures all the concepts you talk about, but pragmatism is required where code space is so precious.

hydra · 2022-02-02T12:45:51Z

@hydra totally understood, but to be able to tell the scheduler in a dynamic way would require implementation of an API to register, say, a watchdog function of which the failsafe would be an example. The scheduler would then have to traverse the list of such functions (or know there could only be one) at the expense of code space.

Depends, it could have ONE or N. In this case I was thinking of ONE and then the dependencies are fixed.

I think you’d like this book.

yeah, there's lots of good stuff in that book for sure! 😄

hydra · 2022-02-02T12:48:02Z

Also it could be done at compile time via data passed to the scheduler's init function from main, which would also be ok. Then if the compiler sees that there's only ever one value for the argument, it can potentially optimize it out and/or inline the code.
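A minimal sketch of what such an arrangement might look like, assuming a single supervisor handed to the scheduler at init. Every name here is hypothetical; nothing like this API is confirmed to exist in the codebase.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical API: none of these names are taken from the codebase.
typedef void (*supervisorFn)(uint32_t nowMs);

static supervisorFn supervisor = NULL;   // at most ONE supervisor, fixed at init
static uint32_t supervisorPeriodMs = 0;
static uint32_t supervisorLastRunMs = 0;

// main() tells the scheduler which supervisor to run; the scheduler itself
// never needs to include failsafe headers.
void schedulerInitSketch(supervisorFn fn, uint32_t periodMs)
{
    supervisor = fn;
    supervisorPeriodMs = periodMs;
    supervisorLastRunMs = 0;
}

// Called once per scheduler pass, before ordinary tasks are considered.
void schedulerRunSupervisor(uint32_t nowMs)
{
    if (supervisor && (int32_t)(nowMs - supervisorLastRunMs) >= (int32_t)supervisorPeriodMs) {
        supervisorLastRunMs = nowMs;
        supervisor(nowMs);
    }
}

// Example supervisor for the sketch: only counts how often it was invoked.
static int failsafeCheckCalls = 0;
static void failsafeCheckSketch(uint32_t nowMs)
{
    (void)nowMs;
    failsafeCheckCalls++;
}
```

Here main would call schedulerInitSketch(failsafeCheckSketch, 200). Whether the compiler can actually inline through the function pointer depends on the build setup, so the code-space cost would need measuring.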

SteveCEvans · 2022-02-03T01:11:33Z

@hydra I realise there is an easily plugged hole here. We need to make sure not only that data has been received, but also that the data has been interpreted. This extends the idea of a failsafe to cover not only the reception of RC data, but also, like any watchdog, correct operation thereafter. No doubt an interesting topic for discussion.
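One way to picture the extended check is to track two timestamps, one for reception and one for interpretation, and require both to be fresh. This is only a sketch under that assumption; the structure and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical structure; field and function names are illustrative only.
typedef struct {
    uint32_t lastReceivedMs;     // stamped when a valid frame arrives
    uint32_t lastInterpretedMs;  // stamped when channel data was actually processed
    uint32_t timeoutMs;
} rxWatchdog_t;

// Wrap-safe staleness test using a signed difference.
static bool stale(uint32_t nowMs, uint32_t stampMs, uint32_t timeoutMs)
{
    return (int32_t)(nowMs - stampMs) > (int32_t)timeoutMs;
}

// The link counts as healthy only if BOTH stages are fresh.
bool rxWatchdogHealthy(const rxWatchdog_t *w, uint32_t nowMs)
{
    return !stale(nowMs, w->lastReceivedMs, w->timeoutMs)
        && !stale(nowMs, w->lastInterpretedMs, w->timeoutMs);
}
```

With this shape, frames that keep arriving but are never processed would still trip the check.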

ctzsnooze · 2022-03-01T22:33:40Z

@SteveCEvans if this is now in #11380 can this PR be closed?

SteveCEvans · 2022-03-02T00:10:01Z

Yes

SteveCEvans added this to the 4.3 milestone Jan 28, 2022

SteveCEvans added the RN: SAFETY IMPROVEMENT label Jan 28, 2022

haslinghuis added this to For discussion in Finalizing Firmware 4.3 Release via automation Jan 28, 2022

haslinghuis assigned SteveCEvans Jan 28, 2022

haslinghuis moved this from For discussion to Firmware in Finalizing Firmware 4.3 Release Jan 28, 2022

Move failsafe processing out of the RX task into the scheduler as a real-time task

d6231f0

SteveCEvans force-pushed the infallible_failsafe branch from 929e123 to d6231f0 Compare January 28, 2022 23:11

SteveCEvans requested review from klutvott123, blckmn, ctzsnooze, haslinghuis and KarateBrot January 28, 2022 23:12

SteveCEvans mentioned this pull request Jan 29, 2022

Steve's mashup of proposed PRs for RC3 #11355

Closed

ledvinap reviewed Jan 29, 2022

SteveCEvans force-pushed the infallible_failsafe branch from cef1816 to 7c0558a Compare January 30, 2022 22:37

KarateBrot previously approved these changes Jan 30, 2022

Don't inline failsafeUpdateState() as it's only called every 200ms and otherwise wastes ITCM

66a5ff1

SteveCEvans dismissed KarateBrot’s stale review via 66a5ff1 January 31, 2022 00:48

SteveCEvans force-pushed the infallible_failsafe branch from 7c0558a to 66a5ff1 Compare January 31, 2022 00:48

mard89 mentioned this pull request Feb 8, 2022

intermittent failsafes while using expresslrs SPI #11318

Closed

SteveCEvans closed this Mar 2, 2022

Finalizing Firmware 4.3 Release automation moved this from Firmware to Finished (Merged) Mar 2, 2022

hydra mentioned this pull request Mar 2, 2022

Random CRSF Failsafes and intermittent RX Signal "Loss" #11428

Closed
