Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move failsafe processing out of the RX task into the scheduler as a real-time task #11362

Closed

Conversation

SteveCEvans
Copy link
Member

@SteveCEvans SteveCEvans commented Jan 28, 2022

Concerns have been expressed about the impact of the combination of task behaviour and the 4.3 scheduler (due to its rigid application of scheduling (it's job)) possibly impacting failsafe behaviour. Given the importance of failsafe as a safety feature this PR moves it entirely into the domain of the scheduler and unless a task runs for the entire failsafe timeout period or more (which would be distinctly badly behaved) we can guarantee failsafes will happen as they should. If nuisance failsafes should occur as result of this, it will only be as a consequence of very poorly behaved tasks and we'll all have enough fingers left to count how times we've worried about such things with leftover fingers to point the blame.

To explain further, if we are at risk of any RX related issues causing a failure to disarm/enter failsafe then this is solution is RX agnostic. It checks every 200ms if there has been a valid RX packet received within the failsafe timeout, and if not, failsafes.

@SteveCEvans SteveCEvans added this to the 4.3 milestone Jan 28, 2022
@haslinghuis haslinghuis added this to For discussion in Finalizing Firmware 4.3 Release via automation Jan 28, 2022
@haslinghuis haslinghuis moved this from For discussion to Firmware in Finalizing Firmware 4.3 Release Jan 28, 2022
@haslinghuis
Copy link
Member

Memory overrun 😞

@etracer65
Copy link
Member

Not really the type of refactoring that should happen during release candidate stage.

@SteveCEvans
Copy link
Member Author

Not really the type of refactoring that should happen during release candidate stage.

I refer the gentleman to his previous comment.

@etracer65
Copy link
Member

I'm quite sure I never suggested refactoring how failsafe processing works. In fact I did comment that the scheduler needs to be fixed so that tasks are not prevented from running. Refactoring failsafe processing out of the RX task is effectively admitting that it can never be made reliable. One could argue that handling failsafe in another task adds even more points of failure and risk due to the challenges ensuring the all tasks run reliably in the new scheduler. If the RX task would run like it's supposed to then this would be unnecessary as has been the case for Betaflight's entire existence.

Whether this refactoring is beneficial or not in the long term is separate from the simple fact that enhancements and refactoring should not happen during the release candidate phase. Particularly for a critical function like failsafe handling. If a change like this is accepted at this point then the release candidate phase should be terminated and the release process and timing should be reevaluated.

@KarateBrot
Copy link
Member

KarateBrot commented Jan 29, 2022

@etracer65 One could argue that handling failsafe in another task adds even more points of failure and risk due to the challenges ensuring the all tasks run reliably in the new scheduler.

In this PR the failsafe doesn't run as a normal task. It is hardcoded into the scheduler to make sure it gets executed with 100% certainty, no matter what the other tasks are doing.

@@ -104,6 +106,8 @@ static int16_t taskCount = 0;
static uint32_t nextTimingCycles;
#endif

static timeUs_t lastFailsafeCheck = 0;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

either use micros() or some other type (int32_t, uint32_t)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have been timeMs_t. I started off measuring in us, but then, realising the failsafe timings are all expressed in ms, switched to that.

@@ -490,6 +494,14 @@ FAST_CODE void scheduler(void)
taskExecutionTimeUs += schedulerExecuteTask(getTask(TASK_PID), currentTimeUs);
}

// Check for failsafe conditions without reliance on the RX task being well behaved
if (millis() - lastFailsafeCheck > PERIOD_RXDATA_FAILURE) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmp32_t or cmpTimeUs with micros(). IMO micros is preferable, reusing schedulerStartTimeUs

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See above. Also mills() is very low cost.

@@ -70,6 +70,7 @@ extern "C" {
// set up micros() to simulate time
uint32_t simulatedTime = 0;
uint32_t micros(void) { return simulatedTime; }
uint32_t millis(void) { return simulatedTime/1000; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comment that this implementation is good for short unittest only (it will behave terribly after 2^23us, someone may reuse it in flight code)

void failsafeUpdateState(void)
void failsafeCheckDataFailurePeriod(void)
{
if ((millis() - failsafeState.validRxDataReceivedAt) > failsafeState.rxDataFailurePeriod) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmp32

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@Zuldan
Copy link

Zuldan commented Jan 30, 2022

Tested with with F7, F411 and F405 (and put F4's CPU's under load) with a combination of Frsky/Tracer/Crossfire at low TX output power and did some real world testing.

Flew quads low to the ground around hillside until failsafe occurred. No issues found. Happy to test code again if it changes.

KarateBrot
KarateBrot previously approved these changes Jan 30, 2022
@hydra
Copy link
Contributor

hydra commented Jan 30, 2022

IMHO - this is fundamentally the wrong approach. the scheduler should care about tasks and tasks only, it should not have a dependency on the gyro task, the osd task, the rx code or the failsafe code. the other subsystems should tell the scheduler about themselves and it should behave accordingly. the dependencies are currently inverted from the way it should be.

I agree with @etracer in that this should not happen as part of an RC.

I think perhaps the wisest course of action, for a timely release that is, is to back out the scheduler changes whilst keeping as much of the task cleanups as possible, restore the old scheduler behavior until either it's fixed, reworked, re-designed or replaced.

@SteveCEvans
Copy link
Member Author

This PR is recognising that there are some activities which need to be executed with a higher priority than normal tasks. Another example is deferred interrupt processing which ELRS requires to be executed promptly and not necessarily as part of the rx task. These examples are few in number and simplest to implement as I have done. It may be appropriate to create a new category of tasks, or extend the current task definitions to abstract this, but I am mindful of our space restrictions.

Other task specific code such as that relating to OSD or RX should not be in the scheduler as you highlight and will be removed in due course once these tasks are well behaved.

@hydra
Copy link
Contributor

hydra commented Jan 31, 2022

I think perhaps a different approach should be taken for the scheduling, there seems to be different phases of the tasks which are more important than others.

This is food for thought:

With ELRS there is a very specific timing window that must be honored in order for the link not to be dropped. Once a packet is received, and it's EXTI flag set, there is a time period in which the RX code MUST respond to the exti flag so that the RF receiver can be switched to the correct frequency for the next packet otherwise that packet /will/ be lost, and if not enough packets are received the system drifts out of phase with the transmitter and then the link is lost.

As @etracer noted, it's not as critical to actually process the gyro signal immediately as long as you know the timestamp for which the current signal is valid, since the data and it's timestamp remains valid until the next signal. that is, the gyro can record the signal in its buffer, then you have to read it before it becomes invalid, and it still remains valid for processing. If the timestamp of the gyro exti is given to the filter math, and not the 'current time' then the filters will still be acting appropriately.

yes, reducing the latency between gyro-signal to motor-output is still good though. However, we know that motors, due to their physical properties, don't respond instantly.

In my head, the priorities for the scheduling is as follows:

...L->| RX exti |<-H->| RF process |<-H->| RF change channel | <-L...
...L->| Gyro  exti |<-M->| Gyro start read |<-M->| Gyro read complete |<-H->| Gyro process | <-M->Update motors | <-L...

^ H = high priority, M = medium priority, L = low.

For great flight performance yes, we want to respond to gyro signals quickly, but there's not much point responding to gyro signals if the user is trying to disarm or the link is lost.

@SteveCEvans
Copy link
Member Author

On F4 processors a typical configuration is to use bitbang DSHOT so that DMA streams are available for OSD/FLASH/SPI RX. In that case we can’t use DMA for SPI 1 and thus the gyro. In those cases the gyro will be read in the gyro task and this must happen at precisely the right time because, as you’ve noted, timing is important for filtering and also because you risk missing 1/8 gyro updates otherwise (on an MPU6000). And, as you state gyro to motor jitter should be reduced.

@hydra
Copy link
Contributor

hydra commented Feb 1, 2022

As I understand things, it appears this doesn't work:

failsafeUpdateState uses failsafeIsReceivingRxData to check the RX link state, stored in receivingRxData, however failsafeState.rxLinkState doesn't get set to FAILSAFE_RXLINK_DOWN, which happens only via the call to failsafeOnValidDataFailed which is only called by the RX task, in detectAndApplySignalLossBehaviour which is called from calculateRxChannelsAndUpdateFailsafe which is called from processRx which is called from taskUpdateRxMain.

If the RX task isn't scheduled correctly then this can occur too late and failsafe will still not work.

If I missed something please let me know.

Suggest closing this PR on the grounds that:

  1. it doesn't work as intended.
  2. the design is (arguably) wrong.

@SteveCEvans
Copy link
Member Author

@hydra Firstly for ELRS, I am recoding the time critical bits to be interrupt driven. The incoming interrupt will trigger SPI accesses to read/clear the status (already working) and then the DMA completion kicks off either the read from the FIFO or start receiving. In either case the busy flag must be checked, and this again will be interrupt driven - currently polling using SPI, but that's wasting CPU cycles. All time critical stuff will thus happen completely outside the scheduler with only minimal ISR time required to keep it progressing, leaving the checker with less to do.

Thanks for your observation regarding the failsafe. This is a safety supervisory function, not dissimilar to a watchdog and so it is entirely appropriate, indeed desirable that it not be handled by the RX task. failsafeCheckDataFailurePeriod() is called to check is valid RX data has been received within the phase 1 failsafe period, failsafeState.rxDataFailurePeriod. It does this by comparing a timestamp of the last valid data, failsafeState.validRxDataReceivedAt against current time. Should a timeout be detected then the appropriate failsafe actions will be taken by the call to failsafeUpdateState(). Note that none of this requires the RX task to be running.

You'll see below that failsafeCheckDataFailurePeriod() does set failsafeState.rxLinkState to FAILSAFE_RXLINK_DOWN if needed.

void failsafeCheckDataFailurePeriod(void)
 {
     if (cmp32(millis(), failsafeState.validRxDataReceivedAt) > (int32_t)failsafeState.rxDataFailurePeriod) {
         setArmingDisabled(ARMING_DISABLED_RX_FAILSAFE); // To prevent arming with no RX link
         failsafeState.rxLinkState = FAILSAFE_RXLINK_DOWN;
     }
 }

@hydra
Copy link
Contributor

hydra commented Feb 1, 2022

@SteveCEvans Ok, I'll have another read of what you just said and the code paths, but I still feel the approach here is fundamentally wrong with regard to the dependency inversion. Your point regarding that it is indeed a supervisory function is valid though. It feels like the scheduler should be /told/ about a supervisor task, rather than the scheduler /knowing/ about one, and thus depending on a specific implementation of one, and all it's dependencies.

Getting the dependencies right is what allows dependencies to be mocked/stubbed in test code and leads to more maintainable tests and isolated production code. Isolated code allows multiple developers to work on different aspects of the system at the same time without 'treading on each others toes' as it were. This was one of the fundamental design factors that lead to Cleanflight, on which Betaflight is based. There are other benefits, including making PR's easier to review and limiting the scope of change only to the code being changed.

@SteveCEvans
Copy link
Member Author

@hydra totally understood, but to be able to tell the scheduler in a dynamic way would require implementation of an API to register, say, a watchdog function of which the failsafe would be an example. The scheduler would then have to traverse the list of such functions (or know there could only be one) at the expense of code space.

I think you’d like this book. Captures all the concepts you talk about, but pragmatism is required where code space is so precious.
66DE1C72-DA61-4D65-9FA5-708624E74676

@hydra
Copy link
Contributor

hydra commented Feb 2, 2022

@hydra totally understood, but to be able to tell the scheduler in a dynamic way would require implementation of an API to register, say, a watchdog function of which the failsafe would be an example. The scheduler would then have to traverse the list of such functions (or know there could only be one) at the expense of code space.

Depends, it could have ONE or N. In this case I was thinking of ONE and then the dependencies are fixed.

I think you’d like this book.

yeah, there's lots of good stuff in that book for sure! 😄

@hydra
Copy link
Contributor

hydra commented Feb 2, 2022

Also it could be done at compile time via a data passed to the schedulers init function from main, which would also be ok. Then if the compiler sees that there's only ever one value for the argument it can potentially optimize it out and/or inline the code.

@SteveCEvans
Copy link
Member Author

@hydra I realise there is an easily plugged hole here. We need to make sure not only that data has been received, but also that the data has been interpreted. This extends the idea of a failsafe to cover not only the reception of RC data, but also, like any watchdog, correct operation thereafter. No doubt an interesting topic for discussion.

@ctzsnooze
Copy link
Member

@SteveCEvans if this is now in #11380 can this PR be closed?

@SteveCEvans
Copy link
Member Author

Yes

@SteveCEvans SteveCEvans closed this Mar 2, 2022
Finalizing Firmware 4.3 Release automation moved this from Firmware to Finished (Merged) Mar 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

None yet

8 participants