
[FL-3783] Asynchronous Infrared remote manipulation #3503

Merged
Merged 13 commits into dev from gsurkov/3783_ir_freeze_fix on Mar 11, 2024

Conversation

gsurkov
Member

@gsurkov gsurkov commented Mar 7, 2024

What's new

  • Load, delete, rename, and move whole remotes and individual buttons in a separate thread so that the UI does not stall.

Verification

  1. Install both the firmware and the Infrared application.
  2. Go to Infrared -> Universal Remotes -> TVs.
  3. Mash the buttons while the busy animation is playing. The app should ignore the input and load the file successfully.
  4. Perform all other potentially lengthy operations: rename and move signals. Mash the buttons during the busy animation here as well (this works best on large remotes with hundreds of signals). The result should be the same as in step 3.
  5. Test all other features to make sure that no regressions are present.

Checklist (For Reviewer)

  • PR has description of feature/bug or link to Confluence/Jira task
  • Description contains actions to verify feature/bugfix
  • I've built this code, uploaded it to the device and verified feature/bugfix

github-actions bot commented Mar 7, 2024

Compiled f7 firmware for commit bbcc3917.

@CookiePLMonster
Contributor

What happens to the concurrent jobs if the user leaves the scenes that triggered them very quickly? It seems impossible to block on the runner task and wait before freeing the context data gracefully, which could cause issues if the app gets closed before the task has a chance to finish its work - maybe not in the current use cases, but it would likely happen if future users of this toolbox lib put truly long operations in there.

Ideally, the runner spawn could return a handle the user could join/block on, but that would also require freeing it manually - unless thread semantics were applied and the user could perform a detach operation at their own risk.

@gsurkov
Member Author

gsurkov commented Mar 8, 2024

What happens to the concurrent jobs if the user leaves the scenes triggering them very fast?

This should not be possible: before a concurrent job starts, a busy animation view is pushed onto the view stack, which blocks all input events. In the improbable case that it happens anyway, no memory leak would occur, as the runner frees its resources automatically and the event is delivered in some other view.

Your remark is perfectly valid, though. I will make it clearer in the docs that the application is responsible for keeping all relevant contexts valid until the runner completes (completion is signaled by the finished_callback).

Adding a return handle would defeat the purpose of this helper, as it would be no different from using a FuriThread directly in terms of code verbosity.

@gsurkov gsurkov marked this pull request as ready for review March 8, 2024 11:29
@CookiePLMonster
Contributor

CookiePLMonster commented Mar 8, 2024

Your remark is perfectly valid though, I will make it more clear in the docs that the application is responsible for keeping all relevant contexts valid until the runner completes (the completion is signaled by the finished_callback).

I'm just somewhat worried that it might be technically impossible to safeguard 100% against the code getting ripped out from under this thread. It's an edge case, but possibly a valid one nonetheless. Fair warning: this is nitpicky, but probably technically valid unless threads keep a strong reference to the application code, which I don't know whether they do.

My concern is a scenario covered by Raymond Chen on The Old New Thing - while it concerns Windows, the problem is not platform-specific. Consider the following chain of actions:

  1. App A starts.
  2. App A wants to spawn a Concurrent Runner (I will call it Job from now on, to make it shorter), but it also wants to make sure that the app cannot terminate before the job is done. For this, an Event X gets created.
  3. App A spawns Job 1, and passes Event X in the context, then it proceeds with other work.
  4. Job 1 starts doing its work. In the meantime, App A wants to terminate, but in order to avoid use-after-free of the context, it waits for Event X to become signaled.
  5. Job 1 finishes and executes the finish callback. In that callback, the event gets signaled with furi_event_flag_set.
  6. Now consider what happens if the context switch back to the main thread of App A happens right after furi_event_flag_set is called, before control returns to the app code to execute the function epilogue: the thread wakes up because the event is set, frees resources, and terminates. The Flipper Application Loader proceeds to unload the FAP code from memory.
  7. In the meantime, the context switches back to the finish callback and furi_event_flag_set returns. However, App A's code has already been unloaded, which may cause a crash.

Now, I am actually not sure how to fix this. On Windows, this class of issues is resolved with FreeLibraryAndExitThread, which is a noreturn function: it exits the thread without returning to the application code. However, a theoretical furi_event_flag_set_and_exit_thread would not help (even if FreeRTOS had the means to implement it, which I don't know whether it does), because the finish callback is executed on the timer thread! Unless... "exiting" is implemented by returning "two functions up" somehow, maybe via setjmp? The goal would be to jump from inside furi_event_flag_set_and_exit_thread straight to concurrent_runner_timer_callback - a firmware-function-to-firmware-function jump - without executing even a single instruction of application code in that window.

One way to solve this would be to let the concurrent runner accept an optional event flag parameter, and have that flag set from inside concurrent_runner_timer_callback itself. This way if proper synchronization is used, the job code would not ever race against the app teardown. However, that adds complexity to setting up a concurrent runner just to close an extreme edge case, so I wouldn't necessarily be a fan of it either, unless maybe as an _ex call.

EDIT:
setjmp and longjmp would technically be fit for this purpose (longjmp is marked with noreturn!), but they require passing the context parameter around, so it's not good enough :/

EDIT2:
Alternatively, since the "code disappears from under the running thread" problem only affects process exit and everything else can be guarded against well, maybe the concurrent runner should keep a count of active runners and provide a function that waits for this count to drop to 0? It could then be called once on exit (maybe even by the Flipper Application Loader rather than the user), and the code would only be freed once all runners have finished.

EDIT3:
Yet another idea, possibly the easiest one: disabling the scheduler (furi_kernel_lock) for the duration of the finish callback would close this potential race entirely. It would only need an additional note in the docs that you may not wait on events/mutexes from the finish callback. This should be fine to enforce; if the application really needs to do that, it can do so at the end of the job callback rather than in the finish callback. Besides, even without this change, stalling the timer thread is a bad idea.

@DrZlo13
Member

DrZlo13 commented Mar 11, 2024

@CookiePLMonster The concurrent_runner_timer_callback function resides in the MCU flash, so it is safe to jump there

@CookiePLMonster
Contributor

CookiePLMonster commented Mar 11, 2024

@CookiePLMonster The concurrent_runner_timer_callback function resides in the MCU flash, so it is safe to jump there

@DrZlo13 Yes, but the finish_callback can still race with another thread that may wake up immediately upon an event being signaled in that callback. Consider a theoretical scenario like:

void finish_callback_in_my_app(void* context) {
    // Do stuff
    furi_event_flag_set(this_event_will_let_the_app_quit, 1);

    // The function epilogue (and RAII destructors in C++) can still execute
    // here, even though the exiting app may already have released this code
}

To fix this, the function signaling the event must either be a noreturn function (not viable here, because the teardown after the finish_callback still needs to execute), or the scheduler must not be allowed to context switch until it has returned from the user callback - hence the idea to employ furi_kernel_lock before calling into finish_callback.

So, a function like this would be safe from the race, with an additional note in the docs that the finish callback must not perform any blocking operations (sleeping, locking mutexes, waiting on queues). Those shouldn't be performed there even now, because the callback executes on the timer thread, but with the scheduler disabled this becomes a matter of utmost importance.

static void concurrent_runner_timer_callback(void* context, uint32_t arg) {
    UNUSED(arg);

    ConcurrentRunner* instance = context;
    furi_thread_join(instance->thread);

    if(instance->finished_callback) {
        furi_kernel_lock();
        instance->finished_callback(instance->context);
        furi_kernel_unlock();
    }

    furi_thread_free(instance->thread);
    free(instance);
}

EDIT:
I apologize for the wrong ping, I need more coffee.

@gsurkov
Member Author

gsurkov commented Mar 11, 2024

@CookiePLMonster Thanks for the clear example!

The solution you have proposed won't work, though, because we HAVE to wait on a queue in view_dispatcher_send_custom_event(), which is currently called in the finished_callback every time.

After some discussion, we have decided that this implementation leaves a lot of room for misuse, and we'll come up with a better one in the future (such an API would be nice to have anyway). For now, it will be removed from the firmware API and become part of the Infrared application, where it serves a very particular purpose and will not suffer from any of the above-mentioned concerns due to its limited scope of usage.

P.S. Maybe I'll get rid of it altogether, because there's a simpler way to do just that with plain threads.

@CookiePLMonster
Contributor

CookiePLMonster commented Mar 11, 2024

The solution you have proposed won't work though, because we HAVE to wait on a queue in view_dispatcher_send_custom_event(), which is currently called in the finished_callback every time.

Could you instead call it at the very end of your work callback rather than in the finished callback? From the app's point of view this would most likely be equivalent, but then you'd need to implement the finished callback via something like an event if you want to prevent the app from terminating too early.

EDIT:

P.S maybe I'll get rid of it altogether, because there's a simpler way to do just that with plain threads.

Yeah, since you spin up a new thread every time, you may be right. It might be easier to spawn a single thread with a queue that services those jobs, and avoid all of those issues altogether.

@gsurkov gsurkov marked this pull request as draft March 11, 2024 13:07
@CookiePLMonster
Contributor

CookiePLMonster commented Mar 11, 2024

Looking at the revised approach - what happens if furi_thread_start and furi_thread_set_callback are called on a thread that hasn't finished yet? Do they wait for it to finish first, or will they crash/throw a furi_check*?

* I realize it is impossible to run two jobs at once with the current design, since these jobs block input, but it's a curiosity that popped into my head when I saw this pattern.

@gsurkov
Copy link
Member Author

gsurkov commented Mar 11, 2024

@CookiePLMonster Such code will crash if compiled with DEBUG=1 passed to fbt, due to the furi_assert checks. However, unlike furi_check, these checks are omitted when the debug flags are off, which leads to unpredictable behaviour. This distinction was chosen way back as a size-saving measure.

Right now, a separate effort is being made to replace furi_assert with furi_check in all user-facing methods in order to secure them further.

Running into this problem would indeed be improbable in this particular situation.

@gsurkov gsurkov marked this pull request as ready for review March 11, 2024 16:12
skotopes
skotopes previously approved these changes Mar 11, 2024
@skotopes skotopes self-requested a review March 11, 2024 16:55
@skotopes skotopes dismissed their stale review March 11, 2024 16:55

button delete is not working

@skotopes skotopes merged commit 022fccf into dev Mar 11, 2024
9 checks passed
@skotopes skotopes deleted the gsurkov/3783_ir_freeze_fix branch March 11, 2024 17:35