Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the possibility to register a custom panic handler callback to intercept app crashes (IDFGH-3146) #5163

Closed
leofabri opened this issue Apr 21, 2020 · 31 comments
Labels
Resolution: Won't Do This will not be worked on Status: Done Issue is done internally Type: Feature Request Feature request for IDF

Comments

@leofabri
Copy link

Feature request:

I wish I could set a custom panic handler callback from my main application in order to get some additional info and intercept crashes from my main component.

What I'd like to obtain

I'm not sure if there is another and better way to do what I'm trying to achieve.
I'm currently working on esp-idf v4.0 and it looks like that this feature is still missing even in the future releases currently under development.

Applying these changes to each update is time-consuming, and maybe other users could benefit from this. I would love to have it out of the box.

My current solution. Tested and working

I have edited the esp32 and freertos components in order to get the described feature, however, I wish you could consider adding something like this as part of the framework.


Changes (additions) made to esp-idf v4.0 to get the feature in question:

Notice: the links pointing to the files I've changed are an indicative reference of where the change was applied.

esp-idf/components/esp32/Kconfig:

config ESP32_PANIC_CALLBACK
        bool "Support registration of a user defined callback for the panic handler"
        default y
        help
            Use xt_set_error_handler_callback() to register a custom callback.
            The callback is called by the common error handler so catches exceptions,
            panics and abort() calls.

esp-idf/components/esp32/panic.c:

#if CONFIG_ESP32_PANIC_CALLBACK
/*
* Custom error handler callback registration.
*/
xt_error_handler_callback customErrorHandler = NULL;
xt_error_handler_callback xt_set_error_handler_callback(xt_error_handler_callback f)
{
  xt_error_handler_callback old = customErrorHandler;
  customErrorHandler = f;
  return old;
}
#endif //CONFIG_ESP32_PANIC_CALLBACK

esp-idf/components/esp32/panic.c commonErrorHandler():

    #if CONFIG_ESP32_PANIC_CALLBACK
    if (customErrorHandler) {
        disableAllWdts();
        customErrorHandler(frame, core_id, abort_called);
        reconfigureAllWdts();
    }
    #endif

components/freertos/include/freertos/xtensa_api.h:

/* Typedef for C-callable error handler callback function */
typedef void (*xt_error_handler_callback)(XtExcFrame *, int core_id, bool is_abort);

/*
-------------------------------------------------------------------------------
Call this function to set a callback for the standard error handler.
The callback will be called by the commonErrorHandler on all errors.

    f        - Callback function address, NULL to uninstall callback.

The callback will be passed a pointer to the exception frame, which is created
on the stack of the thread that caused the exception, the core id and
a bool to signal if abort() has been called.

The callback is called with watchdogs disabled.
-------------------------------------------------------------------------------
*/
extern xt_error_handler_callback xt_set_error_handler_callback(xt_error_handler_callback f);

Thank you in advance.

@leofabri leofabri added the Type: Feature Request Feature request for IDF label Apr 21, 2020
@github-actions github-actions bot changed the title Add the possibility to register a custom panic handler callback to intercept app crashes Add the possibility to register a custom panic handler callback to intercept app crashes (IDFGH-3146) Apr 21, 2020
@Alvin1Zhang
Copy link
Collaborator

@leofabri Thanks for raising the feature request, we will evaluate.

@igrr
Copy link
Member

igrr commented Apr 22, 2020

@leofabri Is it important to you at which stage in the panic handler your callback is executed? For example, if we add the custom handler at the same stage as GDB Stub / Core dump, would that work for you?

The other thing, I wouldn't suggest running the callback with watchdogs disabled, since at this point the watchdog remains the only thing that prevents us from getting stuck inside the panic handler. We also need to add some logic for not re-entering the custom callback if it crashes and we re-enter the panic handler.

Finally, if we implement this change, we will only do this on the current master branch, where the panic handler code is quite different from v4.0. We usually don't backport such features to older releases. If you do need the feature specifically on v4.0, I'm afraid you will have to keep using your local patch.

@leofabri
Copy link
Author

Hello @igrr first of all, thank you for getting back to me.
I'm currently using this patch but I wasn't the one that first thought of it; therefore, the code I showed above is basically the original one and may not be perfect.

Regarding the things you pointed out:

  1. I think that adding the custom error handler between GDB Stub and Core dump would be totally fine because I don't need to do anything with a higher priority.

My intention is to use a callback like this:

void Boot::MyErrorCallback(XtExcFrame *frame, int core_id, bool is_abort)
{
  boot_data.crash_data.core_id = core_id;
  boot_data.crash_data.is_abort = is_abort;

  // Save registers:
  for (int i = 0; i < 24; i++)
    boot_data.crash_data.reg[i] = ((uint32_t *)frame)[i + 1];

// Save backtrace:
// (see panic.c::doBacktrace() for code template)
#define _adjusted_pc(pc) (((pc)&0x80000000) ? (((pc)&0x3fffffff) | 0x40000000) : (pc))
  uint32_t i = 0, pc = frame->pc, sp = frame->a1;
  boot_data.crash_data.bt[0].pc = _adjusted_pc(pc);
  pc = frame->a0;
  while (++i < APP_BT_LEVELS)
  {
    uint32_t psp = sp;
    if (!esp_stack_ptr_is_sane(sp))
      break;
    sp = *((uint32_t *)(sp - 0x10 + 4));
    boot_data.crash_data.bt[i].pc = _adjusted_pc(pc - 3);
    pc = *((uint32_t *)(psp - 0x10));
    if (pc < 0x40000000)
      break;
  }
}
  • I would like to get that boot_data.crash_data saved somewhere and I'm still thinking about transmitting it (but the esp has already crashed and this might not be possible) or save it in order to send a backtrace upon the next reboot.
    Unfortunately, I won't have direct access to the esp board and I'm basically trying to fetch as much data as possible in order to fix bugs remotely.

  1. I do agree that disabling the watchdogs could potentially break the entire execution and I think that having a system that is reliable has the highest priority here.
    Maybe disabling them could be added as a "dangerous" option but at the current state of the art, I doubt I would want to do that.
    Adding some more logic to avoid re-entering the callback was something I didn't think of, maybe we should flag it as already executed so that it doesn't get called again.

  1. Regarding the fact of not adding this feature to the version of ESP-IDF I'm working with, I think it's obviously right. I referenced this version because it's the one I'm currently working with and I wasn't sure if other development versions have something different in them.
    After all, this is a new feature, so I'll be probably updating it to a development branch if I want to have it. It would be great to have in future releases.

@igrr
Copy link
Member

igrr commented Apr 22, 2020

Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.

@leofabri
Copy link
Author

Overall, if you had a way to add custom data to Core Dumps, would that satisfy your use case? Because Core Dumps can actually be uploaded to remote server after restart, and used to diagnose crashes.

Core Dumps are definitely the solution to this, and adding some more data to it could actually satisfy my current use case.
I think though that it would also be interesting to have that custom callback to handle crashes with a custom logic.

@igrr
Copy link
Member

igrr commented Apr 22, 2020

Yes, will look into adding custom callback as an option. I would still prefer to understand what requirements this satisfies, because if we understand the requirements we would rather prefer to implement the missing features in Core dump / GDB stub. Regarding adding custom logic, this is currently possible without modifications to IDF code, using linker wrapping feature. For instance, Memfault SDK uses it to override IDF Core dump implementation with their custom core dumps: https://github.com/memfault/memfault-firmware-sdk/blob/70f2d4ed6884d907d4b39da99434b762209ba568/ports/esp_idf/memfault/CMakeLists.txt#L79.

@leofabri
Copy link
Author

leofabri commented Apr 22, 2020

Yes, I understand. I'm probably asking for a feature that I'm not going to be using immediately, therefore I'm unable to give specific requirements.
Thank you so much for your availability and support. I'm going to use the linker feature to get what I need.

@dexterbg
Copy link

dexterbg commented May 3, 2020

@igrr This feature was actually invented by me for the OVMS project:
openvehicles@daef4b5

I think it was included in a pull request of ours at that time, but can't find that now. Sorry if we missed that.

It's meant to provide crash reasons and backtraces from user devices in the field. Uploading core dumps is not an option as most users are on mobile plans. My code copies just the essential crash info into RTC RAM. After reboot the info is sent as a crash debug record to the OVMS server. We keep the .elf files for all user builds and can feed the backtrace to gdb/addr2line to get an idea what happened.

See the whole OVMS crash debug code here:
https://github.com/openvehicles/Open-Vehicle-Monitoring-System-3/blob/master/vehicle/OVMS.V3/main/ovms_boot.cpp

I agree the current implementation may miss handling some edge cases, but it's been working very well for us and continues to provide crash backtraces (we're still on our idf 3 fork). We're now beginning migration to v4, so if you could add this feature officially to v4, that would be very appreciated.

Thanks & regards,
Michael

@igrr
Copy link
Member

igrr commented May 3, 2020

Hi Michael, thanks for the extra background! I think for us it is much easier to support a concrete story of saving minimal information about the crash than to support arbitrary user code to be run from the panic handler. We already have an internal feature request for saving this additional minimal description of the crash, for the same reason as you mention (limiting data traffic to the cloud). It will likely be implemented in IDF 4.3, since 4.2 is now in feature freeze.
Arbitrary user callbacks are still possible through the linker wrap feature, and I think IDF 4.2 makes overriding the panic handling logic somewhat easier by separating the target-specific code from the common one.

@RoderickJDunn
Copy link

@igrr Hi, I'm also interested in this feature (saving a minimal description of a crash, so it can be uploaded after reset). Is there a public issue where I can track its progress?
If it's still only being tracked internally, do you have an update on when it will be available?
Thanks

@igrr
Copy link
Member

igrr commented Sep 18, 2020

@RoderickJDunn it is indeed only tracked internally at the moment. I see some progress being made, and very likely the implementation will be available publicly before the end of the year. If you create a new issue here on Github for this topic, I will then link it internally so that you get notified once the feature is published.

@crisreimberg
Copy link

@igrr Is the feature of "save the crash info to be send after the reset" available? I am trying to implement this function using an ESP32 but I couldn't do it yet. Thanks.

@igrr
Copy link
Member

igrr commented May 31, 2021

@crisreimberg The feature to obtain the summary from the core dump is implemented, you need to call esp_core_dump_image_get on startup to check if the core dump image exists (optionally after checking the reset reason), and then call esp_core_dump_get_summary to obtain the summary information structure.

/**
* @brief Get the summary of a core dump. This function works only with ELF format core dumps.
*
* @param summary Summary of the core dump
*
* @return ESP_OK on success, otherwise \see esp_err_t
*/
esp_err_t esp_core_dump_get_summary(esp_core_dump_summary_t *summary);

@espressif-bot espressif-bot added Resolution: Won't Do This will not be worked on Status: Done Issue is done internally labels Jul 20, 2021
@chipweinberger
Copy link
Contributor

chipweinberger commented Sep 27, 2022

This is the first google result for "esp32 custom panic handler".

After all the other panic handlers run, I want to:

  • set the color of an LED
  • buzz a piezo
  • (ideally) reboot after short delay

It seems this is still not easily possible? The memfault trick might cover the LED and piezo, but it's pretty involved to have to create a new component with special linking.

edit (March 2024): I just run all my "panic" code after reboot. Simpler. You can still access the panic info

@dizcza
Copy link
Contributor

dizcza commented Dec 26, 2022

This is the first google result for "esp32 custom panic handler".

+1

I'd like to turn the heater off if something wrong happens. As simple as GPIO set 0.

@ftab
Copy link

ftab commented Aug 14, 2023

Similarly, I have a custom board which needs to be shut off in a certain sequence. Namely, an I2C command, a 50ms wait, and a few GPIOs enabled/disabled. Otherwise, there is a very loud pop when the panic handler resets. A way to add code to the panic handler would be great for this.

@chipweinberger
Copy link
Contributor

I ended up doing my "panic" code at boot time, depending on the esp_reset_reason.

might be a viable option for some.

@boarchuz
Copy link
Contributor

boarchuz commented Aug 14, 2023

It only takes a couple lines to do this with linker wrapping. IDF even includes a test case that does this exact thing, for reference:

Custom handler:

void __wrap_esp_panic_handler(panic_info_t *info)

And this line in CMakeLists:
target_link_libraries(${project_elf} PRIVATE "-Wl,--wrap=esp_panic_handler")

@nicklasb
Copy link

@igrr I am not sure what happened to this issues, it seems like a bot made a decision to not to it? :-)

To add a possibility to intervene after a panic is essential to smoothly handling error states, like reporting problems and dynamically resolve them.

Another example, instead of having a reset, going through a short deep sleep does the same thing and allows you easily store information using RTC_DATA_ATTR to orderly avoid ending up in the same situation again.

In many situations it is way better to get stuck in a panic handler than in a constant reset loop.

@igrr
Copy link
Member

igrr commented Sep 11, 2023

it seems like a bot made a decision to not to it? :-)

The bot simply syncs some properties of the issue from our internal issue tracker. In this case I've marked the issue as "won't do" there and the bot has applied the label here on Github.

I didn't mean it as: "we will never support custom panic handler callbacks", though. I meant that for the requirements of the author of the issue, I think the custom panic handler callback is not the right solution: it would be more complex, and the use case is common enough anyway to not require a custom callback. We have implemented a different solution: as part of another ticket we have added an API to extract information from core dumps, and closed this one as "won't do".

If you can describe your use case in more detail, I can give feedback whether we would likely support it via custom panic handler callbacks or not. Just a note, entering deep sleep, despite the seemingly simple behavior, ends up using a fair number of system-level features. Because of this, I don't think that calling esp_deep_sleep_start from the panic handler is safe, in general.

In the meantime, if you feel confident about it, the comment just above shows a very simple way of overriding the panic handler.

@nicklasb
Copy link

nicklasb commented Sep 11, 2023

@igrr I suspected the bot wasn't that smart. :-)

Basically it comes down to me using the ESP32 more as a "real" computer or server than an MCU.
I am developing a framework on top of ESP-IDF (and Arduino) that provides a lot of abstraction for communication, error reporting and other things, so if something goes wrong in all that code, which it does, a reset is a very unproductive and uniforming way to handle the situation. Imagine if your Linux machine only did that.

I have code that can recover from many of these states, and if a process (ok, task) crashes, there is no reason to reset the whole thing down as a first measure.
Sure, at some point a reset might be necessary, but then in a controlled manner.

To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly. The reason for the problem is likely to be in my code rather than the system level features anyway, :-)

WRT the comment above, it is not a very portable solution.

And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system? It is such an enabling feature?

@igrr
Copy link
Member

igrr commented Sep 11, 2023

Thanks for explaining your use case!

To me, resetting using deep sleep actually is pretty sufficient, and especially since I get many of the nice things of the reset but gets to keep information that enables the unit to get going again quickly.

Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.

WRT the comment above, it is not a very portable solution.

Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?

And to be quite honest i am not sure why one would not want to have the possibility of a custom error handler in a computer system?

Panic handling is quite a delicate task. Many aspects of the system can not be used or relied upon: heap allocation, C library functions, RTOS functions. Therefore documenting the exact rules for panic handler callbacks — which IDF APIs are allowed to be called and which aren't — is not an easy task. Troubleshooting panic handler issues is often quite tricky, and potential bugs where the panic handler constantly re-enters and never resets the chip are one of the worst classes of bugs to have in a consumer product. Basically, adding this feature is simple but providing support to customers who decide to use it could be quite resource intensive for us.

For users who "know what they are doing" and are okay with taking responsibility for any possible issues there's a way to wrap the panic handler, mentioned above. For example, here's an example of a similar approach in MemFault SDK.

@nicklasb
Copy link

nicklasb commented Sep 11, 2023

Assuming you keep the state in RTC memory, this is possible with a reset (not just deep sleep) as well, if you use RTC_NOINIT_ATTR.

Argh! I mixed those up, and used the RTC_DATA_ATTR instead!
I thought it was so weird that it didn't work for me any longer as I had it working before.

A note about that is that the documentation around RTC_DATA_ATTR is a bit confusing, would be better if it was more specific and not so much about stubs and so on. It doesn't say that this data is reset where that information is expected.
In myworld, that would be the most usual use case for the RTC memory. And when googling on RTC memory I ended up there.

Could you please expand on this a bit? By portable do you mean "portable to another RTOS"? Seems like the panic handler callback wouldn't be portable, either?

If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling. Basically the suggested solution makes 3rd party libraries using the callback much more difficult to use and spread.

Panic handling is quite a delicate task. ....Therefore documenting the exact rules for panic handler callbacks — which IDF > APIs are allowed to be called and which aren't — is not an easy task.

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works.
No one will sue you if you have disclaimed properly there. And if you find out that something works reliably you can document that later. Using hardware the way it is not supposed to be used has lead to a ton of great stuff, the entire 64 demo scene and almost all its games, for example.

@norventa
Copy link

My use case is thus: a gpio pin turns a relay on/off. The relay turns a kiln on/off. If the kiln is on and a panic happens I would like to insert code to turn the kiln off for safety reasons.

@chipweinberger
Copy link
Contributor

chipweinberger commented Mar 22, 2024

simple solution: turn the gpio off after the device reboots.

I run all my panic handler code after reboot.

@norventa
Copy link

@chipweinberger Thanks for that. The problem I have is when I run in debug and there is no reboot. My app runs for upto 12 hours and I am not in attendance all that time. I have a very intermittent bug that I want to catch and I can't use debug for this reason.

@nicklasb
Copy link

simple solution
Adding an effing callback is a pretty simple solution as well. :-)

It creates a lot of other opportunities, like creating higher-level frameworks that automatically reports issues, which is what I am doing.

TBH, I have not seen much in the way of strong arguments against it here, rather alternative solutions to specific problems, and that is another discussion, helpful as it may be.

@igrr
Copy link
Member

igrr commented Mar 23, 2024

@nicklasb I am sorry, I just realized I haven't replied to your previous comment.

If ESP-IDF had a simple callback, the framework would just need code for that functionality and wouldn't have to require users to create specialized project/linker level settings to get its error handling.

Link-time wrapping can be completely encapsulated in your custom component/library, I don't think anything would have to be done by the application developer. Basically, you can specify the additional linker arguments using a target_link_options call in your component CMakeLists file. All the user will have to do is to add your component to their project.

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.

How useful would such a restrictive callback be for your use case, and would it enable anything that Core Dump currently doesn't? As I mentioned above to the OP, we would very much rather add the missing data to the Core Dump, and then you can handle the crash after a restart in the sane execution environment of app_main.

(I understand that shutting down GPIOs or peripherals on reset is indeed something we don't support. We'll look into adding an option to do that.)

Then don't. Just state that one needs to be very careful and assume that basically no FreeRTOS functionality works. No one will sue you if you have disclaimed properly there.

I understand your point. Perhaps not to the point of sueing, but we did face issues with customers' use of dangerous features in the past, hence my hesitation.

@norventa
Copy link

@igrr Being sued isn't top of my avoidance list it is being responsible for someone getting hurt.
What is the arguement against making the function panic_abort's linkage weak?

@igrr
Copy link
Member

igrr commented Mar 23, 2024

Lately we have been trying to avoid weak linkage as the method for overriding functions, where possible. The argument is that when two components try to override the same function using weak linkage, the result is unpredictable. When using callback registration, we can at least return an error (indicating that a callback is already registered) or return the previous callback, allowing the components to "chain" the calls.

I would still love to know what functionality/information you are missing in core dumps to implement your error reporting features based on that...

@nicklasb
Copy link

nicklasb commented Mar 23, 2024

Both an error and a especially a chain solves the problem elegantly (I edited for some reason I didn't see the entire answer before, so I suggested just that).

I understand that you feel you lose control a bit there, but and link-time wrapping is less flexible and more difficult, we the users also want some control. Also, it is not as easy to see that there is a linkage like that for the casual developer. Please with sugar on top. :-)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Resolution: Won't Do This will not be worked on Status: Done Issue is done internally Type: Feature Request Feature request for IDF
Projects
None yet
Development

No branches or pull requests