Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guru Meditation (LoadProhibited) in _Unwind_RaiseException (IDFGH-3388) #5360

Closed
mcilloni opened this issue May 28, 2020 · 21 comments
Closed
Assignees

Comments

@mcilloni
Copy link

mcilloni commented May 28, 2020

Environment

  • Development Kit: none
  • Module or chip used: ESP32-WROVER
  • IDF version (run git describe --tags to find it): v4.0-386-gb0f053d82
  • Build System: CMake, idf.py
  • Compiler version (run xtensa-esp32-elf-gcc --version to find it): 8.2.0, both 2019r2 and 2020r1
  • Operating System: Linux
  • Using an IDE?: No
  • Power Supply: external 3.3V

Problem Description

Sometimes, our ESP32 code randomly panics with a LoadProhibited error on our custom ESP32 board:

Guru Meditation Error: Core  0 panic'ed (LoadProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x402d05f0  PS      : 0x00060b30  A0      : 0x80296b66  A1      : 0x3ffebcd0  
A2      : 0x00000007  A3      : 0x3fff7470  A4      : 0x3fff7490  A5      : 0x3ffdb490  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x802d05f0  A9      : 0x3ffebcb0  
A10     : 0x00000000  A11     : 0x3ffebcd0  A12     : 0x00000001  A13     : 0x00000004  
A14     : 0x00000003  A15     : 0x00000004  SAR     : 0x00000020  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x0006003b  LBEG    : 0x40098a2c  LEND    : 0x40098a48  LCOUNT  : 0x00000000  

ELF file SHA256: fbfbb5f5b950a2d7

Backtrace: 0x402d05ed:0x3ffebcd0 0x40296b63:0x3ffebdd0 0x4010ec05:0x3ffebfd0 0x40101bb6:0x3ffec050 0x402d65cd:0x3ffec0f0 0x4023cacd:0x3ffec120 0x4023cb89:0x3ffec160 0x40093ddd:0x3ffec180

Rebooting...

addr2line matches the backtrace addresses we gathered from the logs above with the following call trace:

0x402d05ed: _Unwind_RaiseException at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libgcc/unwind.inc:140
0x40296b63: __cxa_throw at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:90
0x4010ec05: hapbridge::set_response_ble_change(unsigned short, unsigned int, blemesh_val_t, std::optional<blemesh_val_t>) at /home/marco/Workspace/our_project/build/../components/proj_homekit/hapbridge_callbacks_notify_ble.cpp:438
0x40101bb6: blemesh_set_report_handler at /home/marco/Workspace/our_project/build/../components/proj_gweventhandlers/blemesh_event_handler.cpp:102 (discriminator 3)
0x402d65cd: handler_execute at /home/marco/Workspace/esp-homekit-sdk/esp-idf/components/esp_event/esp_event.c:147
0x4023cacd: esp_event_loop_run at /home/marco/Workspace/esp-homekit-sdk/esp-idf/components/esp_event/esp_event.c:553 (discriminator 3)
0x4023cb89: esp_event_loop_run_task at /home/marco/Workspace/esp-homekit-sdk/esp-idf/components/esp_event/esp_event.c:115
0x40093ddd: vPortTaskWrapper at /home/marco/Workspace/esp-homekit-sdk/esp-idf/components/freertos/port.c:143

As you may see above, the crash site reported is deep inside GCC's stack unwinding code, and in particular it seems like C++ exceptions are somewhat involved in this. We are almost sure this could not caused by an exception escaping due to no mention of std::terminate()/ std::abort() being invoked.

We already stumbled in issues similar to this one several times before, and we've never been able to pinpoint the exact reason why it happens. We've had a hard time reproducing this issue reliably and we saw it popping out in several parts of our code; we also noticed that shuffling the code around a bit helped reducing (but not mitigating) the issue (i.e. trying to change the order functions are invoked, where exceptions are catch()ed, etc).

In particular, we noticed that when this issue occours, the situation is often very similar to the following:

  • all stack traces have a frame 1 (_Unwind_RaiseException in unwind.inc) and 2 (__cxa_throw in eh_throw.cc) identical or very similar to the one I posted above:
0x402d05ed: _Unwind_RaiseException at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libgcc/unwind.inc:140
0x40296b63: __cxa_throw at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_throw.cc:90
  • the third frame refers to the last/only statement of a C++ function:
void set_response_ble_change([...]) {
   some_other_function(...); // this is hapbridge_callbacks_notify_ble.cpp:438
}
  • the aforementioned function is being called inside of a try {} catch() with a sigle statement in frame 4:
void blemesh_set_report_handler([...]) {
        // ...
	try {
		hapbridge::set_response_ble_change( // this is blemesh_event_handler.cpp:102
			[...]
		);
	} catch (hapbridge::exception &e) {
		ESP_LOGE(...);
	}

The project is composed of a lot of components written in C++17, and several of them rely on C++ exceptions.

Expected Behavior

The system does not crash, or the crash can be clearly tracked to an underlying cause in our application code.

Actual Behavior

The system crashes, and the generated backtrace is not helpful at resolving the issue.

@github-actions github-actions bot changed the title Guru Meditation (LoadProhibited) in _Unwind_RaiseException Guru Meditation (LoadProhibited) in _Unwind_RaiseException (IDFGH-3388) May 28, 2020
@Alvin1Zhang
Copy link
Collaborator

@mcilloni Thanks for the details report, we will look into.

@0xjakob
Copy link
Contributor

0xjakob commented May 29, 2020

@mcilloni Thanks for the detailed report! We've been working on this issue and recently developed a workaround which mitigates this error in all our tests. This workaround should come up soon on master.

The problem lies in fact inside the libgcc unwinding code and occurs during catching an exception.
In the code restoring the context of the catch to resume execution from there, it may happen that the registers A4-A7 get mixed up due to a window-underflow cpu exception. At the point of the exception, the code there will try to load from the address stored in register a7, which is 0x60023 in your case, which is not valid memory. Hence the LoadProhibited cpu exception.

Here's the place where the cpu exception happens:
https://github.com/espressif/gcc/blob/esp-develop/libgcc/unwind.inc#L140
This is actually this macro: https://github.com/espressif/gcc/blob/esp-develop/libgcc/config/xtensa/unwind-dw2-xtensa.c#L486, which calls the function uw_install_context_1, which installs the context on the stack.

@mcilloni
Copy link
Author

mcilloni commented Jun 4, 2020

Thank you @0xjakob, do you have an ETA for the workaround?

@jcmvbkbc
Copy link

jcmvbkbc commented Jun 4, 2020

In the code restoring the context of the catch to resume execution from there, it may happen that the registers A4-A7 get mixed up due to a window-underflow cpu exception.

Interesting. Do you have a reproducer that exhibits this behavior?

@0xjakob
Copy link
Contributor

0xjakob commented Jun 4, 2020

@mcilloni Currently we haven't. But the next time we deploy to github, a new toolchain is released containing the deactivated fix. Then it's a matter of a simple function call at runtime to activate it. I need to have a look, maybe I can also post a toolchain package + patch here...

@0xjakob
Copy link
Contributor

0xjakob commented Jun 4, 2020

@jcmvbkbc It's a bit sporadic, but a very reliable way to reproduce is to run very frequent interrupts while catching exceptions, something like this:

#define TIMEOUT 19
#define RECURSION 19
static esp_timer_handle_t crash_timer;
static uint32_t result = 0;

uint32_t calc_fac(uint32_t n) {
    if (n == 1 || n == 0) {
        return 1;
    } else {
        return n * calc_fac(n - 1);
    }
}

static void timer_cb(void *arg) {
    result = calc_fac(RECURSION);
}

void app_main(void)
{
    const esp_timer_create_args_t timer_args {
        timer_cb,
        NULL,
        ESP_TIMER_TASK,
        "crash_timer"
    };

    TEST_ESP_OK(esp_timer_create(&timer_args, &crash_timer));
    TEST_ESP_OK(esp_timer_start_periodic(crash_timer, TIMEOUT));

    for (int i = 0; i < 500; i++) {
        bool thrown_value = false;
        try {
            throw true;
        } catch (bool e) {
            thrown_value = e;
        }

        if (thrown_value) {
            printf("ex thrown %d\n", i);
        } else {
            printf("ex not thrown\n");
            TEST_ASSERT(false);
        }
    }

    TEST_ESP_OK(esp_timer_stop(crash_timer));
    TEST_ESP_OK(esp_timer_delete(crash_timer));
}

There's also a way to reproduce it in gdb, if you're interested.

@mcilloni
Copy link
Author

mcilloni commented Jun 4, 2020

@mcilloni Currently we haven't. But the next time we deploy to github, a new toolchain is released containing the deactivated fix. Then it's a matter of a simple function call at runtime to activate it. I need to have a look, maybe I can also post a toolchain package + patch here...

Thanks. Even the patch would be ok, I've no issue building my own toolchains with crosstool-ng.

@jcmvbkbc
Copy link

jcmvbkbc commented Jun 4, 2020

@0xjakob I'm interested in this issue from the linux toolchain perspective, as the unwinder is a fairly generic piece of code. It's a bit different with linux though because interrupt handlers completely preserve the state of interrupted task. Thanks for the reproducer, I'll have a look.

@0xjakob
Copy link
Contributor

0xjakob commented Jun 4, 2020

@jcmvbkbc we also found a way to reproduce it while stepping over the last bit of unwinding code restoring the context. You need an ESP32 with openocd connected JTAG debugger though.

@0xjakob
Copy link
Contributor

0xjakob commented Jun 4, 2020

@mcilloni You can download the new toolchain from here: https://dl.espressif.com/dl/xtensa-esp32-elf-gcc8_2_0-esp-2020r2-linux-amd64.tar.gz. You can replace it with the relevant package in your esp tools folder (usually $HOME/.espressif/....). It's the newest release however, I'm not sure whether it works with v4.0. The current github master should work though.
Once you use the toolchain, you only need to activate the workaround by applying this patch.

You can checkout and build the esp crosstools from here: https://github.com/espressif/crosstool-NG.
You also need to checkout the esp-gcc: https://github.com/espressif/gcc.
But the process is a bit complicated, I assembled a few comments for my own usage mostly. Be sure to start with the toolchain version for 4.0, 2019-r2, iirc...

Dependencies

  • libtool
  • gawk
  • help2man
  • autoconf
  • texinfo
  • libtool-bin

Build

./bootstrap
./configure --enable-local
# optional: make menuconfig -> paths -> disable render toolchain directory readonly
make all-recursive
./ct-ng xtensa-esp32-elf
./ct-ng build

Find the build in the builds folder.

GCC build with local out-of-tree repository

$CT_NG is assumed to contain the path to the crosstools-ng repository.

  • Do all build steps from above.
  • Change to gcc repo and do the changes (espressif/gcc@7740b85).
  • replace $CT_NG/.build/xtensa-esp32-elf/src/gcc with a symlink to the local gcc repo
    (CT_NG is the crosstools-ng dir)
  • Change to .build/xtensa-esp32-elf/build/build-cc-gcc-final
  • Do export PATH=$CT_NG/.build/xtensa-esp32-elf/buildtools/bin:$PATH
  • $ make all && make install-strip

Install

Install them with (overrides old install):

rsync $HOME/workspace/esp-tc/builds/xtensa-esp32-elf/ $HOME/.espressif/tools/xtensa-esp32-elf/esp-2020r1-8.2.0/xtensa-esp32-elf/ -ravu

Please let me know how far you get.

@0xjakob
Copy link
Contributor

0xjakob commented Jul 2, 2020

@mcilloni We've just merged the fix into our internal master which will be deployed to github soon. We'll work on the backports during the next weeks.

@rtyle
Copy link

rtyle commented Aug 20, 2020

I have been suffering with, I believe, this problem for quite some time. Unfortunately (perhaps), I am not using the latest ESP-IDF. Instead, I am using v3.3.2. I just downloaded the latest toolchain (I could find) for that.
https://dl.espressif.com/dl/xtensa-esp32-elf-linux64-1.22.0-96-g2852398-5.2.0.tar.gz
Unfortunately, that didn't fix it.
Can this fix be back ported?

@rtyle
Copy link

rtyle commented Aug 20, 2020

Opened a new issue:
#5766

@0xjakob
Copy link
Contributor

0xjakob commented Aug 21, 2020

@mcilloni Sorry for the delay in backporting! You're still using v4.0, right? Our toolchain fix got integrated and improved in the upstream xtensa gcc repository. Our own current fix only fixes the bug in a particular instance. Other functions using the old context-restore mechanism (like _Unwind_Resume()) are still affected. So we will integrate the upstream fix which fixes the issue for all circumstances and back port that one. We'll assign it a higher priority internally.

@0xjakob
Copy link
Contributor

0xjakob commented Aug 24, 2020

Seems like our particular fix did got backported already to v4.0 in July. Sorry for the confusion.

@mcilloni
Copy link
Author

Hi @0xjakob,
It's nice to know that the fix has been merged in 4.0 and 4.1.
Does the workaround still need to be activated or is it already enabled by default? I can't find the correct functions anywhere in IDF.

@0xjakob
Copy link
Contributor

0xjakob commented Sep 17, 2020

@mcilloni I was not correct again. The new toolchain release has been merged into 4.0 just a week ago: 6093407.
For 4.1 and 4.2, the code is still on the way.

For 4.0, there is another exception related fix which hasn't been integrated yet. It reduces the amount of heap memory if and only if an exception is thrown quite a bit. If no exception is thrown at all, then these heap allocations don't happen and this won't crash your app either!

@0xjakob
Copy link
Contributor

0xjakob commented Sep 17, 2020

And to answer your question: No, nothing has to be done with toolchain version 2020r3. It's actually an upstream fix from the xtensa gcc. Feel free to run the test above to confirm this, if you don't believe it :)

@mcilloni
Copy link
Author

@0xjakob: I've been encountering a very weird issue with exceptions after updating my toolchain to 2020r3 (I'm still on the latest commits of release/v4.0).

Our code has some parts that invoke an external library (nlohmann::json); we've been using it on the esp32 for almost a year now, without no relevant issues. This library provides an STL-like basic_json class that supports allocators (we use it to parse JSON on external SPIRAM memory on our WROVER chips) and STL-like access.

More in detail, it provides an at() method that behaves quite much like the std::map counterpart, i.e. it either returns a deserialized value of a given type or it throws a json::out_of_range exception. We have code that checks the existence of a key in a JSON object by invoking json::at() inside a block and then catching any json::out_of_range it throws; see

        template<typename RetType, JSON_TEMPLATE_PARAMS> 
        inline std::optional<RetType> json_read_opt(const GENERIC_JSON &j, const char *const key) {
            try {
                return j.at(key).template get<RetType>();
            // attempt workaround for lousy bug
            } catch (const typename GENERIC_JSON::out_of_range &e) {
                return std::nullopt;
            }
        }

where JSON_TEMPLATE_PARAMS and GENERIC_JSON are #defines that make the extremely long declaration of a generic basic_json more bearable.

json::at() internally calls std::map::at() in order to fetch from an internal instance of std::map the item associated with the given key, rethrowing std::out_of_range as json::out_of_range if the search fails. The at() function somewhat resembles the following code:

    const_reference at(const typename object_t::key_type& key) const
    {
        // at only works for objects
        if (JSON_HEDLEY_LIKELY(is_object()))
        {
            JSON_TRY
            {
                return m_value.object->at(key); // <- invocation of std::map::at() and source of the std::out_of_range exception
            }
            JSON_CATCH (std::out_of_range&)
            {
                // create better exception explanation
                JSON_THROW(out_of_range::create(403, "key '" + std::string(key) + "' not found"));
            }
        }
        else
        {
            JSON_THROW(type_error::create(304, "cannot use at() with " + std::string(type_name())));
        }
    }

The issue I'm reporting here is that sometimes the std::out_of_range coming from inside json::at() seemingly ends up being caught in a catch (const std::exception&) several frames down in the call stack, even though it should not be possible at all due to it being wrapped in a handling try-catch block.

I've also observed this behaviour to consistently happen in a precise point of the code (i.e. when an exception is thrown inside of these 3 nested try-catch, with the outer one catching for std::exception), and more weirdly that changing unrelated parts of code (i.e. adding function calls or adding anything even in different components) can prevent this behaviour from happening at all; the code then functions correctly (as it always did before updating IDF and toolchain) until something more is added, which may cause the issue to reappear.
I've also tried changing the exception catch()ed inside of json_read_opt<> to std::exception, to no avail; the exception still ends up in the one several frames down the stack.

Given that I'm running this firmware on a custom board, I do not have functioning JTAG access, so I had to get creative in order to get a backtrace pinpointing to the exact spot the exception is thrown. Luckily, std::map::at() only throws std::out_of_range in two specific points, so I could simply modify the STL header to add an invocation of terminate() gated by a global boolean flag; this allowed me to get more or less determine where the exception is thrown, which is the point I marked in the aforementioned json::at() method (i.e. as soon as map::at throws due to the value not being found in the map).

0x402a6022: std::terminate() at /builds/idf/crosstool-NG/.build/xtensa-esp32-elf/src/gcc/libstdc++-v3/libsupc++/eh_terminate.cc:57

0x400f2ab0: nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, ext::allocator<char> >, bool, long long, unsigned long long, double, ext::allocator, nlohmann::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >::at(std::__cxx11::basic_string<char, std::char_traits<char>, ext::allocator<char> > const&) const at <redacted>/.espressif/tools/xtensa-esp32-elf/esp-2020r3-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/bits/stl_map.h:541
 (inlined by) nlohmann::basic_json<std::map, std::vector, std::__cxx11::basic_string<char, std::char_traits<char>, ext::allocator<char> >, bool, long long, unsigned long long, double, ext::allocator, nlohmann::adl_serializer, std::vector<unsigned char, std::allocator<unsigned char> > >::at(std::__cxx11::basic_string<char, std::char_traits<char>, ext::allocator<char> > const&) const at <redacted>/build/../components/nlohmann_json/upstream/include/nlohmann/json.hpp:3461

This, together with the weird fact that shuffling unrelated code around makes the "bad" catch disappear, makes me suspect there could be something wrong with either my board, the exception handling as generated by the compiler, or maybe my code.

Thanks a lot for the patience.

@0xjakob
Copy link
Contributor

0xjakob commented Oct 27, 2020

Hi @mcilloni, could you please open a new issue and post the description there? It is easier for us to track then.

That being said, is it possible that during throwing an exception, another exception throwing occurs? Like e.g. std::out_of_memory while type_error::create()?

When you say "down the stack", you mean the direction of caller side or callee side?

Regarding JTAG: if you happen to have the correct pins available, then you might still be able to connect a JTAG adapter.

@mcilloni
Copy link
Author

Hi @mcilloni, could you please open a new issue and post the description there? It is easier for us to track then.

Sure, I'll do it as soon as I finish typing this reply.

That being said, is it possible that during throwing an exception, another exception throwing occurs? Like e.g. std::out_of_memory while type_error::create()?

I can't exclude this, so it might well be, but the exception caught by the catch(const std::exception&) handler has .what() == "map::at" - which libstdc++ only does when a std::map throws a std::out_of_range:

$  rg -F 'map::at' .espressif/
.espressif/tools/xtensa-esp32-elf/esp-2020r3-8.4.0/xtensa-esp32-elf/xtensa-esp32-elf/include/c++/8.4.0/bits/stl_map.h
539:      __throw_out_of_range(__N("map::at"));
548:      __throw_out_of_range(__N("map::at"));

When you say "down the stack", you mean the direction of caller side or callee side?

Sorry, I wasn't clear enough. When I say "down the stack" I mean in the direction of the caller. I guess I should have said up, not down :)

The code looks somewhat like this:

esp_err_t load_defs(const ext::json &input_json)  { 
		try {
			for (const auto &[name, j_val] : j.at("items").items()) {
				const auto d_t { dev_type::from_str(name) };
				
                                // here j_val is converted to the correct value by nlohmann::json
				if (!some_map.try_emplace(d_t, j_val).second) {
					// prints a log
				}
			}
		} catch (const our_exception_type &e) {
			// print what()
			return ESP_ERR_INVALID_ARG;
		} catch (const ext::json::exception &e) {
			// print what()

			return ESP_ERR_INVALID_ARG;
		} catch (const std::exception &e) { / / <- the catch() that's being hit
			// print what()

			return ESP_ERR_INVALID_ARG;
		}

		return ESP_OK;
}

The library then calls

JSON_TEMPLATE
    void from_json(const GENERIC_JSON &j, some_type &t) {
        t.val1 = check_value(j, "R");
        t.val2 = check_value(j, "W");
        t.val3 = check_value(j, "N");
    }

check_value then calls json_read_opt<> ( see my previous comment), which has a try-catch inside, and then it itself calls json::at(), which has another nested try-catch() for std::out_of_range. Trying to catch std::exception inside of check_value has no effect; the exception ends up in "load_defs" anyway.

The thing that puzzles me the most is that this happens randomly, and it tends to either stick or disappear based on the build. If I add code and then flash a new build, no matter where this code resides (it may even be just as little as adding a small print), it simply disappears or reappears even if the affected code is untouched. The json that's being processed is always processed at boot time and it never changes because it's directly embedded in the binary itself, using objcopy.

Regarding JTAG: if you happen to have the correct pins available, then you might still be able to connect a JTAG adapter.

Due to how this board has been designed it's hard for me to pull out the right pins - I've asked my colleagues to provide me one with right pins exposed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants