Add support for listing tasks to gdb in gdbstub (IDFGH-729) #2828

Closed
wants to merge 6 commits into from

X-Ryl669
Contributor

This is supposed to fix #2818

This adds all the required commands in the gdbstub (via UART) to:

  1. List all tasks/threads (including their name and CPU)
  2. Get a backtrace for them
  3. Switch to them to analyze them more carefully

To tease the current support, you can now do this:

Guru Meditation Error: Core  0 panic'ed (StoreProhibited). Exception was unhandled.
Core 0 register dump:
PC      : 0x400d8281  PS      : 0x00060030  A0      : 0x800e1a6a  A1      : 0x3ffda250
0x400d8281: devXXX(HttpdConnData*) at /esp/app/main/HTTPServer.cpp:712

A2      : 0x00000000  A3      : 0x3ffe43b3  A4      : 0x00000007  A5      : 0x0000ff00
A6      : 0x00ff0000  A7      : 0xff000000  A8      : 0x00000038  A9      : 0x00000038
A10     : 0x00000001  A11     : 0x00000001  A12     : 0x3ffc2618  A13     : 0x3ffc2610
A14     : 0x3ffe43a9  A15     : 0x000000c0  SAR     : 0x00000019  EXCCAUSE: 0x0000001d
EXCVADDR: 0x00000000  LBEG    : 0x400012c5  LEND    : 0x400012d5  LCOUNT  : 0xfffffff5

Backtrace: 0x400d8281:0x3ffda250 0x400e1a67:0x3ffda270 0x400e1e67:0x3ffda2a0 0x400e1205:0x3ffda2e0
0x400d8281: devXXX(HttpdConnData*) at /esp/app/main/HTTPServer.cpp:712

0x400e1a67: httpdProcessRequest at /esp/app/components/libesphttpd/core/httpd.c:833

0x400e1e67: httpdRecvCb at /esp/app/components/libesphttpd/core/httpd.c:920

0x400e1205: platHttpServerTask at /esp/app/components/libesphttpd/core/httpd-freertos.c:772


Entering gdb stub now.
$T04#b8GNU gdb (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-build_apple-darwin16.3.0 --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /esp/app/build/nvse.elf...done.
Remote debugging using /dev/cu.SLAB_USBtoUART
0x400d8281 in devXXX (connData=<optimized out>) at /esp/app/main/HTTPServer.cpp:712
712	    *(int*)0 = 0; // Crash here
(gdb) bt
#0  0x400d8281 in devXXX (connData=<optimized out>) at /esp/app/main/HTTPServer.cpp:712
#1  0x400e1a6a in httpdProcessRequest (pInstance=0x3ffd93f8, conn=0x3ffe4388) at /esp/app/components/libesphttpd/core/httpd.c:649
#2  0x400e1e6a in httpdRecvCb (pInstance=0x3ffd93f8, conn=0x3ffe4388,
    data=0x3ffd8bf1 "POST /devXXX HTTP/1.1\r\nHost: 192.168.0.97\r\nUser-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.11; rv:63.0) Gecko/20100101 Firefox/63.0\r\nAccept: */*\r\nAccept-Language: fr,fr-FR;q=0.8,en-US;q=0.5,en"...,
    len=452) at /esp/app/components/libesphttpd/core/httpd.c:920
#3  0x400e1208 in platHttpServerTask (pvParameters=0x3ffd8bd4) at /esp/app/components/libesphttpd/core/httpd-freertos.c:505
(gdb) i th 1
  Id   Target Id         Frame
* 1    Remote target     0x400d8281 in devXXX (connData=<optimized out>) at /esp/app/main/HTTPServer.cpp:712
(gdb) i th 2
  Id   Target Id         Frame
  2    Thread 1 (IDLE1 CPUx) 0x40144846 in esp_pm_impl_waiti () at /esp/esp-idf/components/esp32/pm_esp32.c:487
(gdb) i th 3
  Id   Target Id         Frame
  3    Thread 2 (IDLE0 CPUx) 0x40144846 in esp_pm_impl_waiti () at /esp/esp-idf/components/esp32/pm_esp32.c:487
(gdb) i th 4
  Id   Target Id         Frame
  4    Thread 3 (uart CPUx) xEventGroupWaitBits (xEventGroup=0x3ffb6e88, uxBitsToWaitFor=1, xClearOnExit=1, xWaitForAllBits=1, xTicksToWait=<optimized out>) at /esp/esp-idf/components/freertos/event_groups.c:445
(gdb) thr apply 3 bt

Thread 3 (Thread 2):
#0  0x401447da in esp_pm_impl_waiti () at /esp/esp-idf/components/esp32/pm_esp32.c:487
#1  0x400d2c76 in esp_vApplicationIdleHook () at /esp/esp-idf/components/esp32/freertos_hooks.c:63
#2  0x4008f13c in prvIdleTask (pvParameters=<optimized out>) at /esp/esp-idf/components/freertos/tasks.c:3412
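
The `$T04#b8` that shows up right after "Entering gdb stub now." in the log above is the stub's stop reply in GDB remote serial protocol framing: `$`, the payload, `#`, then a two-digit hex checksum equal to the sum of the payload bytes modulo 256. A minimal sketch of that framing follows; `rsp_checksum` and `rsp_frame` are illustrative names, not functions from this PR.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

// Sum of all payload bytes, modulo 256, as used by GDB remote protocol packets.
static unsigned char rsp_checksum(const char *payload) {
    unsigned char sum = 0;
    for (const char *p = payload; *p != '\0'; p++) {
        sum = (unsigned char)(sum + (unsigned char)*p);
    }
    return sum;
}

// Writes "$<payload>#<cc>" into out. Returns 0 on success, -1 if out is too small.
static int rsp_frame(const char *payload, char *out, size_t outLen) {
    assert(payload != NULL && out != NULL);
    int n = snprintf(out, outLen, "$%s#%02x", payload,
                     (unsigned)rsp_checksum(payload));
    return (n < 0 || (size_t)n >= outLen) ? -1 : 0;
}
```

For the payload `T04` the bytes are 0x54, 0x30 and 0x34, which add up to 0xb8, matching the packet in the log.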


This is a large change with a good deal of redundant code, since I could not access the other components' statics / internal state.
Mainly, I've duplicated the dumpHwToRegfile function to support Xtensa's XtSolFrame, made a backup of the current panic handler (since as soon as gdb detects there are threads, it switches them, so the gdbregfile is lost), and dealt with FreeRTOS's TCB structure.

I didn't want to copy the whole TCB structure declaration (I only needed the top of the stack, which is guaranteed to be the first element in the structure). So I declared a dummy structure with just a single element, and I'm using this. I'm not using uxTaskSnapshotAll here since it does not give the task name and CPU id. Instead, I'm using uxTaskGetSystemState, which gives everything (name, TCB/handle, CPU id).
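
The single-element structure trick can be sketched on a host as follows. `FakeFullTCB` is a hypothetical stand-in for FreeRTOS's private TCB, and `tcb_top_of_stack` is an illustrative helper, not code from the PR; the only property the sketch relies on is that the saved stack pointer is the first member.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the private FreeRTOS TCB. Only the position of
// the first member matters for this illustration.
typedef struct {
    volatile uint32_t *pxTopOfStack;  // saved stack pointer, first member
    char pcTaskName[16];              // later members sit at config-dependent offsets
    int xCoreID;
} FakeFullTCB;

// One-member view, in the spirit of the dummy structure described above.
typedef struct {
    volatile uint32_t *pxTopOfStack;
} DumpTCB;

_Static_assert(offsetof(FakeFullTCB, pxTopOfStack) == 0,
               "top of stack must be the first member");

// Reads the saved stack pointer through the one-member view.
static volatile uint32_t *tcb_top_of_stack(const void *tcb) {
    assert(tcb != NULL);
    return ((const DumpTCB *)tcb)->pxTopOfStack;
}
```

Anything beyond the first member (task name, core id) cannot be reached this way, because its offset depends on the FreeRTOS configuration.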

I've added all protections for address validity checking (copied from espcoredump.py and coredump.c)

There is still a bug within gdb when it's receiving a dump with a code that's built with -Os (with inlining) but it has already been reported in your forum, and it's unrelated to this work (it also happens when doing coredump).

Please let me know if you need more changes.

if (!taskCount) {
unsigned runTime = 0;
taskCount = uxTaskGetNumberOfTasks();
taskCount = uxTaskGetSystemState(tasks, 32, &runTime);
Member

If any of the task lists are corrupted, is it possible that this causes an exception? It would be desirable for GDB stub to not crash, even if FreeRTOS state is invalid.

Contributor Author

It's almost impossible to validate the state of FreeRTOS's linked list of tasks (short of validating each pointer in FreeRTOS, which would be a huge cost for a very rarely used feature).

I wonder if it would be better to let it crash instead and:

  1. Set a boolean to true the first time we are walking the FreeRTOS task list,
  2. If we re-enter the exception handler, check the boolean. If it's true, simply return the previous exception since it's saved from the first entry. It would then appear in gdb as a single (the running) task and the previous/initial exception frame.

What do you think ?

Contributor Author

I'm testing this idea right now, and I'll update the PR when I'm done with this. I've also added a conditional compilation of the FreeRTOS part of the gdbstub (based on the presence of uxTaskGetSystemState).

Collaborator

@gerekon gerekon Jan 21, 2019

@igrr, @projectgus
uxTaskGetSystemState tries to take xTaskQueueMutex. If it is busy at the moment of the crash (taken by the other CPU or corrupted in some way), calling that API can freeze. It may be better to use uxTaskGetSnapshotAll. It relies on TaskSnapshot_t, which provides a pointer to the TCB, but that pointer has type void *, so gdbstub has to know the TCB structure to get the task name and core ID (X-Ryl669's DumpTCB is a candidate for this). To get the task handle, the TCB address can be converted to TaskHandle_t.

@X-Ryl669
Contributor Author

Here it is.
First, if configUSE_TRACE_FACILITY is not selected in menuconfig, the FreeRTOS part is ignored (effectively working like initial code).
Then, the code tries to use the FreeRTOS uxTaskGetSystemState function (only once). If it crashes here, the reentrant calling of the exception is detected and reenteredHandler is set to -1, preventing listing/switching task from GDB.

This should keep it working as a best effort forensic analysis.

@X-Ryl669
Contributor Author

X-Ryl669 commented Jan 4, 2019

@igrr or @Spritetm, any news ?

@X-Ryl669
Contributor Author

X-Ryl669 commented Jan 18, 2019

A bit of description of the workflow, to help understanding the changes

  1. At first, reenteredHandler is 0 (compiled value stored in BSS)
  2. An exception occurs, esp_gdbstub_panic_handler is called, and since reenteredHandler is not 1, the exception frame is copied to a static backup (in BSS)
  3. We then convert the panic'd frame to the format gdb expects.
  4. When gdb connects and starts talking, it can open the communication with either Hg0 or qfThreadInfo, so we set reenteredHandler to 1 in those calls.
  5. Then we query FreeRTOS task information (such as the task name, TCB ptr, and, if available, CPU id). This is where the ESP can crash again if FreeRTOS memory is corrupted. If it does, it'll enter the exception handler again, but since we've marked reenteredHandler, this time we discard the current exception frame and reuse the backup frame from the initial step. We also set reenteredHandler to -1, to signal the GDB command parser not to mention any thread/task anymore. The current command will produce garbage output, and GDB will assume the crashing command is not implemented. If it tries any other thread-related command, since reenteredHandler is -1, they will all answer as if they were unimplemented, so GDB will act exactly as it does currently, that is, see only one task (the initial one that crashed)
  6. If FreeRTOS is not corrupted (99% of the case, I'd say), then this PR implements all the GDB commands required to switch task/thread, enumerate them, get callstack.
    It looks like there is some duplication of code for converting the TCB to the format GDB is expecting, but that's because XtExcFrame and XtSolFrame have different layouts: FreeRTOS stores only the latter in its TCB structures, while the exception frame uses the former. In a future PR, and provided you are never going to fill the missing fields, we could write a common filler for the untouched values.
  7. I've modified the conversion to support the same restriction/tests/fixes as those found in the coredump python code so it's less likely to crash GDB backtracer. I don't know if it's still relevant in GDB8, let me know.

So to sum up, if you have completely corrupted your memory, this code will act like the previous code, that is giving you the current exception frame and that's it.
However, if you've only crashed/asserted/debug breakpointed, this will give you the complete backtrace for all tasks, and their name.
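
The workflow above boils down to a three-state flag. The following is a host-side sketch of that logic only: `Frame`, `panic_entry`, `begin_task_walk` and `threads_available` are hypothetical names, and treating a crash in the -1 state like a crash in state 1 is an assumption, since the description does not cover that case.

```c
#include <assert.h>
#include <stddef.h>

// Hypothetical, trimmed-down stand-in for the saved exception frame.
typedef struct {
    unsigned pc;
    unsigned sp;
} Frame;

static int reenteredHandler = 0;  // 0: first entry, 1: walking FreeRTOS state, -1: walk crashed
static Frame backupFrame;         // copy of the frame from the first panic
static Frame activeFrame;         // frame that would be converted for GDB

// Called on every panic. A crash while the task walk is in progress means the
// walk itself faulted, so the new frame is dropped and the backup is restored.
static void panic_entry(const Frame *frame) {
    assert(frame != NULL);
    if (reenteredHandler == 0) {
        backupFrame = *frame;
        activeFrame = *frame;
    } else {
        activeFrame = backupFrame;
        reenteredHandler = -1;
    }
}

// Called when GDB sends its first thread-related packet (Hg0 or qfThreadInfo).
static void begin_task_walk(void) {
    if (reenteredHandler == 0) {
        reenteredHandler = 1;
    }
}

// Thread packets are answered only while the task walk has not crashed.
static int threads_available(void) {
    return reenteredHandler != -1;
}
```

After a second crash the stub still holds the original frame, so GDB keeps showing the task that panicked first, just without the thread list.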

@X-Ryl669 X-Ryl669 closed this Jan 18, 2019
@X-Ryl669 X-Ryl669 reopened this Jan 18, 2019
@@ -285,12 +388,84 @@ static int gdbHandleCommand(unsigned char *cmd, int len) {
gdbPacketEnd();
} else if (cmd[0]=='?') { //Reply with stop reason
sendReason();
} else {
#if configUSE_TRACE_FACILITY == 1
} else if (cmd[0]=='H' && reenteredHandler != -1) { //Continue with task
Collaborator

reenteredHandler != -1 is used in several conditions checks. Re-writing this a bit could improve code readability.
For example:

Suggested change
} else if (cmd[0]=='H' && reenteredHandler != -1) { //Continue with task
} else if (reenteredHandler != -1) {
if (cmd[0]=='H') { //Continue with task
} else if (cmd[0]=='T') { //Task alive check
} else if (cmd[0]=='q') { //Extended query
}

Contributor Author

Ok. That makes sense.
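
For illustration, here is a host-side sketch of a handler with the `reenteredHandler` check hoisted out, as suggested. The thread ids are made up and `handle_thread_packet` is a hypothetical name; the reply shapes (`m<id>,<id>...` for qfThreadInfo, `l` to end the list, `OK` for `H` and `T`, and an empty payload for unsupported packets) follow the GDB remote protocol.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

static int reenteredHandler = 0;              // -1 once the task walk has crashed
static const unsigned taskIds[] = {1, 2, 3};  // hypothetical thread ids
#define TASK_COUNT (sizeof taskIds / sizeof taskIds[0])

// Fills out with the reply payload (no "$...#cc" framing). An empty payload
// is how a stub tells GDB that a packet is not supported.
static void handle_thread_packet(const char *cmd, char *out, size_t outLen) {
    assert(cmd != NULL && out != NULL && outLen > 0);
    out[0] = '\0';
    if (reenteredHandler == -1) {
        return;  // behave as if no thread packet were implemented
    }
    if (strcmp(cmd, "qfThreadInfo") == 0) {
        size_t used = (size_t)snprintf(out, outLen, "m");
        for (size_t i = 0; i < TASK_COUNT && used < outLen; i++) {
            used += (size_t)snprintf(out + used, outLen - used, "%s%x",
                                     i ? "," : "", taskIds[i]);
        }
    } else if (strcmp(cmd, "qsThreadInfo") == 0) {
        snprintf(out, outLen, "l");   // end of the thread list
    } else if (cmd[0] == 'H' || cmd[0] == 'T') {
        snprintf(out, outLen, "OK");  // thread selected / thread alive
    }
}
```

With the check in one place, every thread-related packet degrades to "unsupported" in the same way once the flag is -1.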

@@ -211,13 +213,22 @@ STRUCT_FIELD (long, 4, XT_STK_OVLY, ovly)
STRUCT_END(XtExcFrame)
*/

static inline bool isValidStack(long sp)
Collaborator

Can we use esp_stack_ptr_is_sane instead of this function?

Contributor Author

I don't know. The isValidStack is extracted from coredump code esp_task_stack_start_is_sane and it's not checking that the stack pointer is aligned on 16 bytes boundary. Is this guaranteed (stack pointer & 0xF == 0) under all normal conditions ?
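
The difference between the two checks can be sketched like this. The bounds are assumed values for stack-capable internal RAM on the ESP32 and should be checked against the ESP-IDF sources; `stack_in_range` and `stack_in_range_and_aligned` are illustrative names, not the functions under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Assumed bounds of stack-capable internal RAM on the ESP32 (illustrative).
#define STACK_LOW  0x3ffae010u
#define STACK_HIGH 0x3ffffff0u

// Range check only.
static bool stack_in_range(uint32_t sp) {
    return sp >= STACK_LOW && sp <= STACK_HIGH;
}

// Range check plus 16-byte alignment. Note the parentheses: in C,
// `sp & 0xF == 0` parses as `sp & (0xF == 0)`, which is always 0.
static bool stack_in_range_and_aligned(uint32_t sp) {
    return stack_in_range(sp) && (sp & 0xFu) == 0;
}
```

The A1 value from the register dump at the top of this PR, 0x3ffda250, passes both checks; 0x3ffda254 would pass only the first.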

@@ -228,7 +239,7 @@ static void dumpHwToRegfile(XtExcFrame *frame) {
gdbRegFile.windowstart=0x1; //1
gdbRegFile.configid0=0xdeadbeef; //ToDo
gdbRegFile.configid1=0xdeadbeef; //ToDo
gdbRegFile.ps=frame->ps-PS_EXCM_MASK;
gdbRegFile.ps=(frame->ps & (1U<<5)) ? (frame->ps & ~(1U<<4)) : frame->ps; //Replicate correction from espcoredump.py:546
Collaborator

Could you replace (1U<<5) and (1U<<4) with PS_UM and PS_EXCM respectively?

Contributor Author

Ok
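
A host-side sketch of the PS adjustment with named bits. The macros are defined locally (prefixed `SKETCH_`) so the block builds without the Xtensa headers; their values mirror the `(1U<<5)` and `(1U<<4)` literals in the quoted line, and `ps_for_gdb` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

// Local copies of the two PS bits, so the sketch builds on a host.
#define SKETCH_PS_UM   (1u << 5)  // user vector mode
#define SKETCH_PS_EXCM (1u << 4)  // exception mode

// Same expression as the quoted line, with the literals named:
// when UM is set, report PS with EXCM cleared; otherwise leave PS untouched.
static uint32_t ps_for_gdb(uint32_t ps) {
    return (ps & SKETCH_PS_UM) ? (ps & ~SKETCH_PS_EXCM) : ps;
}
```

Applied to the PS value from the register dump at the top of this PR, 0x00060030, this gives 0x00060020.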

for (i=0; i<16; i++) gdbRegFile.a[i]=frameAregs[i];
for (i=16; i<64; i++) gdbRegFile.a[i]=0xDEADBEEF;
if (gdbRegFile.a[0] & 0x8000000U) gdbRegFile.a[0] = (gdbRegFile.a[0] & 0x3fffffffU) | 0x40000000U; //Replicate correction from espcoredump.py:560
Collaborator

I think we need to remove comments like //Replicate correction from espcoredump.py:560 in final version, since they seem to be useful for review only.

Contributor Author

Ok
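
A sketch of the a0 correction itself, for readers unfamiliar with it: with the Xtensa windowed ABI, the top two bits of a return address held in a0 encode the call window increment rather than address bits. Note one assumption: the quoted line tests `0x8000000U`, which is bit 27, whereas this sketch tests bit 31 (`0x80000000`), on the guess that the top bit is what was intended; the A0 value in the register dump at the top of this PR, 0x800e1a6a, has bit 31 set and bit 27 clear. `a0_to_code_address` is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

// Maps a windowed-ABI return address back into the 0x4xxxxxxx code region
// by replacing its top two bits.
static uint32_t a0_to_code_address(uint32_t a0) {
    if (a0 & 0x80000000u) {  // assumption: bit 31 is the bit to test
        return (a0 & 0x3fffffffu) | 0x40000000u;
    }
    return a0;
}
```

For 0x800e1a6a this yields 0x400e1a6a, which is the address gdb shows for frame #1 in the backtrace above.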

//Fetch the task status
static unsigned getAllTasksHandle(unsigned index, unsigned * handle, const char ** name, unsigned * coreId) {
static unsigned taskCount = 0;
static TaskStatus_t tasks[32];
Collaborator

Suggest to use macro for max tasks number supported by GDB stub. The same in the code below which calls uxTaskGetSystemState.

Contributor Author

Should I use CONFIG_ESP32_CORE_DUMP_MAX_TASKS_NUM or simply make my own ?

@X-Ryl669
Contributor Author

X-Ryl669 commented Jan 21, 2019

I think the task name is really useful to locate a task. Losing this is really a big drawback.

TaskSnapshot_t does not provide the task name (unlike uxTaskGetSystemState). Although it's enough to get the TCB pointer, I can't easily convert it to a tskTaskControlBlock to find the task name (without including all of the FreeRTOS declarations), which are private in tasks.c. The DumpTCB hack worked because the first element in the tskTaskControlBlock is the stack pointer. However, the task name is located at a varying position (because of the macro portUSING_MPU_WRAPPERS) in the structure (and I can't use offsetof here because I don't have the tskTaskControlBlock declaration). I can hardcode it, but I think it's messy.

Another option would be to modify FreeRTOS declaration of TaskSnapshot_t to add the task name and core ID in it, or create a uxUnsafeTaskGetSystemState that does not lock the mutex .

What do you prefer ?

EDIT: Bah... Forget about that, I can already use pcTaskGetTaskName and xTaskGetAffinity once I cast the TCB to a TaskHandle_t, so yes, I'll do like you said.
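
The approach settled on here (snapshot the tasks, then treat each TCB pointer as a task handle to look up name and core) can be sketched on a host with stand-ins. Everything prefixed `mock_` or `Mock` is hypothetical and only plays the role of the FreeRTOS call named in its comment; `GDBSTUB_MAX_TASKS` is likewise an invented macro name for the task limit discussed in review.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define GDBSTUB_MAX_TASKS 32  // hypothetical name for the stub's task limit

// ---- Host-side stand-ins so the sketch builds without ESP-IDF ----
typedef struct { const char *name; int core; } MockTCB;
typedef struct { void *pxTCB; } MockSnapshot;  // plays the role of TaskSnapshot_t
static MockTCB mockTasks[] = { {"IDLE0", 0}, {"IDLE1", 1}, {"uart", 0} };

// Plays the role of uxTaskGetSnapshotAll: fills out[], returns the task count.
static unsigned mock_snapshot_all(MockSnapshot *out, unsigned max) {
    unsigned n = sizeof mockTasks / sizeof mockTasks[0];
    if (n > max) n = max;
    for (unsigned i = 0; i < n; i++) out[i].pxTCB = &mockTasks[i];
    return n;
}
// Play the roles of pcTaskGetTaskName and xTaskGetAffinity.
static const char *mock_task_name(void *handle) { return ((MockTCB *)handle)->name; }
static int mock_task_core(void *handle) { return ((MockTCB *)handle)->core; }
// ---- End of stand-ins ----

static MockSnapshot snapshots[GDBSTUB_MAX_TASKS];
static unsigned taskCount = 0;

// Looks up task `index`, taking the snapshot once on first use.
// Returns 1 and fills the outputs on success, 0 if index is out of range.
static int get_task_info(unsigned index, void **handle, const char **name, int *core) {
    assert(handle != NULL && name != NULL && core != NULL);
    if (taskCount == 0) {
        taskCount = mock_snapshot_all(snapshots, GDBSTUB_MAX_TASKS);
    }
    if (index >= taskCount) return 0;
    *handle = snapshots[index].pxTCB;  // the TCB pointer doubles as the task handle
    *name = mock_task_name(*handle);
    *core = mock_task_core(*handle);
    return 1;
}
```

The point of the design is that the snapshot step takes no mutex, while the name and core lookups only read fields of a TCB that is already known.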

@X-Ryl669
Contributor Author

X-Ryl669 commented Feb 9, 2019

Summary of the changes:

  • I've applied all changes asked
  • I've merged the common code in the Xt frame to GDB reg file (so it's smaller)
  • Removed the dependency on config TRACE_FACILITY since I'm using uxTaskGetSnapshotAll instead of uxTaskGetSystemState, so now, it's enabled by default whenever gdbstub is selected
  • Fixed a probable bug when re-entering the handler to avoid incoherent GDB behavior if it received a correct packet just before the ESP32 crashed again (unlikely)
  • Tested with a simple design, and it's working perfectly well. Couldn't get FreeRTOS to crash, so if you have example code, I'm all for it.

@gerekon
Collaborator

gerekon commented Feb 11, 2019

@igrr @projectgus @X-Ryl669 Changes seem to be OK.

One more thing... It should not be implemented in this PR. It is planned as one of ongoing core dump improvements/fixes. Registers for the task currently running on other CPU are not saved onto the stack, so GDB will show invalid values. Special mechanism needs to be implemented to retrieve them from other CPU upon entering panic handler.

@X-Ryl669
Contributor Author

X-Ryl669 commented Feb 11, 2019

The only way to do that would be to trigger an interrupt on the other CPU.
If I've understood the Xtensa ISA, the only interrupt that could be triggered is a software-level interrupt via the XTHAL_SET_INTSET macro (wsr.intset assembly instruction).
However, it's not clear to me whether interrupts are shared between CPUs.
If that's the case, then one CPU should register interrupt handler 1 and the other CPU should register interrupt handler 2.
Otherwise, I don't see a way for the CPUs to communicate (feel free to correct me if I'm wrong)

The handler's role would be to save the exception frame to memory and set a flag that the other CPU can busy-loop on.

I'm still thinking about how to allow live debugging with this and using software interrupt might be a solution, except that it's not clearly explained how it's implemented underneath.

@gerekon
Collaborator

gerekon commented Feb 12, 2019

@X-Ryl669

The only way to do that would be to trigger an interrupt on the other CPU.

Yes. ESP32 supports interrupts between Xtensa cores. The idea is to use CPU_INTR_FROM_CPU_x to force the other CPU to take an interrupt. As I wrote above, this problem also affects core dump functionality, and a fix is planned for one of the nearest upcoming releases.

@X-Ryl669
Contributor Author

Would it sum up to adding a new reason in cross_int.c (something like REASON_DUMP), or is it too high level ?
I'm wondering if the objective is to stop the other CPU ASAP, or to get a good task dump.
If it's the latter, then it's very easy to request a yield from the other CPU via crossint, that'll save the current thread's register in FreeRTOS.
If it's the former, I wonder whether the panic handler code will have to be modified to also capture the exception/registers in the CPU_INTR/cross-core code; that's a lot of additional instructions for cross-core interrupts in general if they do not need it.
Are we allowed to add more interruption clause like DPORT_CPU_INTR_FROM_CPU_4_REG for this instead so we can only capture the exception frame upon that specific interruption ?
Typically, IIUC, we'd like to set up a dport interrupt with a ETS_DPORT_INUM set to 2 for example (since 1 is used in the dport_panic_highint_hdl for usual dport access) and test for it to trigger another function than panicHandler (line 116). That function would save the CPU's frame on the IRAM memory and set a bit to tell the other CPU it's done dumping and then return, right ?

In that case, please consider allowing more action in this (new) handler, like:

  1. Dumping the current registers/frame (what's described above)
  2. Setting a debug breakpoint/watchpoint
  3. Clearing a debug b/w point
  4. Replacing the current registers/frame with one provided

That would permit one CPU to debug the other CPU in live within GDB directly from UART or WIFI!

@igrr
Member

igrr commented Feb 13, 2019

This will likely be implemented by extending the existing level 4 interrupt handler to allow doing other "stuff", such as saving the interrupt frame somewhere. This has to be at least level 4 interrupt, so you can interrupt the other CPU even if it is inside a FreeRTOS critical section.

That would permit one CPU to debug the other CPU in live within GDB directly from UART or WIFI!

Not to discourage you from trying this, but there are other issues to solve. For example, with this approach the "debuggee" CPU cannot be allowed to halt inside a critical section. If it does, then the "debugger" CPU can immediately deadlock if it tries to enter a critical section with the same spinlock. Also, in the current FreeRTOS, CPU0 is responsible for advancing the tick count, so the debuggee CPU can only be CPU1.

@X-Ryl669
Contributor Author

I guess such debugging stub shouldn't be using FreeRTOS. It should be only interrupt based.
Typically, in user code (probably in a FreeRTOS task), the user would trigger an exception (debug break for example).
The exception handler will capture the registers value and store in a static area (for both CPU, using the scheme discussed above).
GDB can already set the memory and retrieve it with the current stub and the "only" thing missing is to be able to set debug/watchpoint and resume.
When listing tasks the current CPU might hold a spinlock, but the code above does not lock, so it's manipulating the task list / snapshotting that might get corrupted.
The task list is only missing what the other CPU is currently doing, and I guess it's possible to have it dump its register list without deadlocking (via a high-level exception).

Then GDB can ask to set up a debug break/watchpoint (we should be able to deal with this as interrupt too when they happen)

Then GDB can ask to continue/resume. In that case, we should get out of the ISR and let the program resume. Provided the user has set up a debug break/watchpoint, an interrupt will happen when it's hit, and we're back to the GDB stub code.

We could also have a FreeRTOS task that's listening to GDB command on the serial link to catch "Ctrl+C" (not sure how it's done) and in that case, it would trigger the interrupt to re-enter GDB stub.

I don't see in this scheme when FreeRTOS is in the way or where it could deadlock.
But for this to work, one CPU should be able to set the debug break/watchpoint of the other CPU (this is not possible without crossint/dport code I guess) or we'll miss some break.

Also, at first, the GDB stub handler should talk using serial port, but in the future it could be using TCP/IP too since GDB supports this.
However, this might prove a lot more challenging to implement without FreeRTOS/Lwip in the way, so a kind of ISR/task communication will be required for this to happen (typically, the ISR should post the next message to send to GDB in a queue and resume to the GDB's task that'll pump the message and send it to GDB).

igrr pushed a commit that referenced this pull request Mar 8, 2019
…and changing the active task and fetching each task's stack

Merges #2828
@X-Ryl669
Contributor Author

Is the current change in master finished and OK ?
In that case, we should close this PR.

@github-actions github-actions bot changed the title Add support for listing tasks to gdb in gdbstub Add support for listing tasks to gdb in gdbstub (IDFGH-729) Mar 13, 2019
@projectgus
Contributor

@X-Ryl669 yes, changes were cherry-picked in the commit above. Thanks for the contribution, and please let us know if you have other suggestions or contributions for debugging improvements.

Development

Successfully merging this pull request may close these issues.

gdbstub should also report other task status (IDFGH-498)
4 participants