Guru Meditation (LoadProhibited) in _Unwind_RaiseException (IDFGH-3388) #5360
Comments
@mcilloni Thanks for the detailed report, we will look into it. |
@mcilloni Thanks for the detailed report! We've been working on this issue and recently developed a workaround which mitigates this error in all our tests. This workaround should come up soon on master. The problem lies in fact inside the libgcc unwinding code and occurs during catching an exception. Here's the place where the cpu exception happens: |
Thank you @0xjakob, do you have an ETA for the workaround? |
Interesting. Do you have a reproducer that exhibits this behavior? |
@mcilloni Currently we haven't. But the next time we deploy to github, a new toolchain is released containing the deactivated fix. Then it's a matter of a simple function call at runtime to activate it. I need to have a look, maybe I can also post a toolchain package + patch here... |
@jcmvbkbc It's a bit sporadic, but a very reliable way to reproduce is to run very frequent interrupts while catching exceptions, something like this:
There's also a way to reproduce it in gdb, if you're interested. |
Thanks. Even the patch would be ok, I've no issue building my own toolchains with crosstool-ng. |
@0xjakob I'm interested in this issue from the linux toolchain perspective, as the unwinder is a fairly generic piece of code. It's a bit different with linux though, because interrupt handlers completely preserve the state of the interrupted task. Thanks for the reproducer, I'll have a look. |
@jcmvbkbc we also found a way to reproduce it while stepping over the last bit of unwinding code that restores the context. You need an ESP32 with a JTAG debugger connected through openocd, though. |
@mcilloni You can download the new toolchain from here: https://dl.espressif.com/dl/xtensa-esp32-elf-gcc8_2_0-esp-2020r2-linux-amd64.tar.gz. You can use it to replace the relevant package in your esp tools folder (usually $HOME/.espressif/....). It's the newest release, however, so I'm not sure whether it works with v4.0. The current github master should work though. You can check out and build the esp crosstools from here: https://github.com/espressif/crosstool-NG.
Dependencies
Build
./bootstrap
./configure --enable-local
# optional: make menuconfig -> paths -> disable render toolchain directory readonly
make all-recursive
./ct-ng xtensa-esp32-elf
./ct-ng build
Find the build in the builds folder.
GCC build with local out-of-tree repository
Install
Install them with (overrides old install):
rsync $HOME/workspace/esp-tc/builds/xtensa-esp32-elf/ $HOME/.espressif/tools/xtensa-esp32-elf/esp-2020r1-8.2.0/xtensa-esp32-elf/ -ravu
Please let me know how far you get. |
@mcilloni We've just merged the fix into our internal master which will be deployed to github soon. We'll work on the backports during the next weeks. |
I have been suffering with, I believe, this problem for quite some time. Unfortunately (perhaps), I am not using the latest ESP-IDF. Instead, I am using v3.3.2. I just downloaded the latest toolchain (I could find) for that. |
Opened a new issue: |
@mcilloni |
Seems like our particular fix was already backported to v4.0 in July. Sorry for the confusion. |
Hi @0xjakob, |
@mcilloni I was wrong again. The new toolchain release was merged into 4.0 just a week ago: 6093407. For 4.0, there is another exception-related fix which hasn't been integrated yet. It reduces heap memory usage quite a bit, but only when an exception is actually thrown. If no exception is thrown at all, these heap allocations don't happen, and this won't crash your app either! |
And to answer your question: No, nothing has to be done with toolchain version 2020r3. It's actually an upstream fix from the xtensa gcc. Feel free to run the test above to confirm this, if you don't believe it :) |
@0xjakob: I've been encountering a very weird issue with exceptions after updating my toolchain to 2020r3 (I'm still on the latest commits of release/v4.0).
Our code has some parts that invoke an external library (nlohmann::json); we've been using it on the esp32 for almost a year now, without any relevant issues. This library provides an STL-like basic_json class that supports allocators (we use it to parse JSON in external SPIRAM memory on our WROVER chips) and STL-like access. In more detail, it provides an at() method that behaves much like its std::map counterpart, i.e. it either returns a deserialized value of a given type or throws a json::out_of_range exception.
We have code that checks the existence of a key in a JSON object by invoking
template<typename RetType, JSON_TEMPLATE_PARAMS>
inline std::optional<RetType> json_read_opt(const GENERIC_JSON &j, const char *const key) {
    try {
        return j.at(key).template get<RetType>();
        // attempt workaround for lousy bug
    } catch (const typename GENERIC_JSON::out_of_range &e) {
        return std::nullopt;
    }
}
where JSON_TEMPLATE_PARAMS and GENERIC_JSON are #defines that make the extremely long declaration of a generic basic_json more bearable.
const_reference at(const typename object_t::key_type& key) const
{
    // at only works for objects
    if (JSON_HEDLEY_LIKELY(is_object()))
    {
        JSON_TRY
        {
            return m_value.object->at(key); // <- invocation of std::map::at() and source of the std::out_of_range exception
        }
        JSON_CATCH (std::out_of_range&)
        {
            // create better exception explanation
            JSON_THROW(out_of_range::create(403, "key '" + std::string(key) + "' not found"));
        }
    }
    else
    {
        JSON_THROW(type_error::create(304, "cannot use at() with " + std::string(type_name())));
    }
}
The issue I'm reporting here is that sometimes the std::out_of_range coming from inside
I've also observed this behaviour to consistently happen in a precise point of the code (i.e. when an exception is thrown inside of these 3 nested try-catch, with the outer one catching for std::exception), and more weirdly that changing unrelated parts of code (i.e. adding function calls or adding anything even in different components) can prevent this behaviour from happening at all; the code then functions correctly (as it always did before updating IDF and toolchain) until something more is added, which may cause the issue to reappear.
Given that I'm running this firmware on a custom board, I do not have functioning JTAG access, so I had to get creative in order to get a backtrace pinpointing to the exact spot the exception is thrown. Luckily,
This, together with the weird fact that shuffling unrelated code around makes the "bad" catch disappear, makes me suspect there could be something wrong with either my board, the exception handling as generated by the compiler, or maybe my code. Thanks a lot for the patience. |
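For readers following along, the layered pattern described in this comment (std::map::at() throws std::out_of_range, the library catches it and rethrows its own exception type, and the caller catches that and returns std::nullopt) can be sketched in portable C++17. This is only an illustrative model, not the code from the affected firmware: lookup_error, checked_at, read_opt_throwing and read_opt_find are hypothetical names, and a plain std::map<std::string, int> stands in for basic_json. The last function shows a lookup that does not throw for a missing key; nlohmann::json offers find() and contains() for the same purpose, though whether avoiding the throw sidesteps the crash discussed here is an assumption, not something confirmed in this thread.

```cpp
#include <map>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the library's own out_of_range exception type.
struct lookup_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for basic_json::at(): translates the std::out_of_range
// thrown by std::map::at() into a library-specific exception.
inline const int &checked_at(const std::map<std::string, int> &obj,
                             const std::string &key) {
    try {
        return obj.at(key);  // throws std::out_of_range if the key is missing
    } catch (const std::out_of_range &) {
        throw lookup_error("key '" + key + "' not found");
    }
}

// Mirrors the shape of json_read_opt: a missing key is reported through an
// exception and converted to std::nullopt in the catch handler.
inline std::optional<int> read_opt_throwing(const std::map<std::string, int> &obj,
                                            const std::string &key) {
    try {
        return checked_at(obj, key);
    } catch (const lookup_error &) {
        return std::nullopt;
    }
}

// Alternative: look the key up with find(), so a missing key never throws
// and the stack unwinder is not involved at all.
inline std::optional<int> read_opt_find(const std::map<std::string, int> &obj,
                                        const std::string &key) {
    const auto it = obj.find(key);
    if (it == obj.end()) {
        return std::nullopt;
    }
    return it->second;
}
```

Both read_opt_throwing and read_opt_find return the stored value for a present key and std::nullopt for a missing one; they differ only in whether a missing key goes through two throw/catch rounds or none.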
Hi @mcilloni, could you please open a new issue and post the description there? It is easier for us to track then.
That being said, is it possible that while one exception is being thrown, another exception gets thrown? Like e.g. std::out_of_memory while
When you say "down the stack", do you mean the direction of the caller side or the callee side?
Regarding JTAG: if you happen to have the correct pins available, then you might still be able to connect a JTAG adapter. |
Sure, I'll do it as soon as I finish typing this reply.
I can't exclude this, so it might well be, but the exception caught by the
$ rg -F 'map::at' .espressif/
.espressif/tools/xtensa-esp32-elf/esp-2020r3-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/bits/stl_map.h
539: __throw_out_of_range(__N("map::at"));
548: __throw_out_of_range(__N("map::at"));
Sorry, I wasn't clear enough. When I say "down the stack" I mean in the direction of the caller. I guess I should have said up, not down :) The code looks somewhat like this:
The library then calls
The thing that puzzles me the most is that this happens randomly, and it tends to either stick or disappear based on the build. If I add code and then flash a new build, no matter where this code resides (it may even be just as little as adding a small print), it simply disappears or reappears even if the affected code is untouched. The json that's being processed is always processed at boot time and it never changes because it's directly embedded in the binary itself, using objcopy.
Due to how this board has been designed, it's hard for me to pull out the right pins. I've asked my colleagues to provide me with one that has the right pins exposed. |
Environment
IDF version (run git describe --tags to find it): v4.0-386-gb0f053d82
Toolchain version (run xtensa-esp32-elf-gcc --version to find it): 8.2.0, both 2019r2 and 2020r1
Problem Description
Sometimes, our ESP32 code randomly panics with a LoadProhibited error on our custom ESP32 board:
addr2line matches the backtrace addresses we gathered from the logs above with the following call trace:
As you may see above, the crash site reported is deep inside GCC's stack unwinding code, and in particular it seems like C++ exceptions are somehow involved in this. We are almost sure this is not caused by an exception escaping, since there is no mention of std::terminate()/std::abort() being invoked.
We have already stumbled upon issues similar to this one several times before, and we've never been able to pinpoint the exact reason why it happens. We've had a hard time reproducing this issue reliably and we saw it popping up in several parts of our code; we also noticed that shuffling the code around a bit helped reduce (but not eliminate) the issue (i.e. trying to change the order in which functions are invoked, where exceptions are catch()ed, etc).
In particular, we noticed that when this issue occurs, the situation is often very similar to the following:
The project is composed of a lot of components written in C++17, and several of them rely on C++ exceptions.
Expected Behavior
The system does not crash, or the crash can be clearly tracked to an underlying cause in our application code.
Actual Behavior
The system crashes, and the generated backtrace is not helpful at resolving the issue.