Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

IPv6 packet causes exception #133

Closed
rojer opened this issue Dec 2, 2016 · 18 comments
Closed

IPv6 packet causes exception #133

rojer opened this issue Dec 2, 2016 · 18 comments

Comments

@rojer
Copy link
Contributor

rojer commented Dec 2, 2016

I'm getting the following exception shortly (a few seconds) after connecting to WiFi:

ets Jun  8 2016 00:22:57

rst:0x10 (RTCWDT_RTC_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0x00
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3ffc0008,len:0
load:0x3ffc0008,len:1964
load:0x40078000,len:3668
load:0x40080000,len:260
entry 0x40080034
I (550) heap_alloc_caps: Initializing heap allocator:
I (550) heap_alloc_caps: Region 19: 3FFC0E9C len 0001F164 tag 0
I (552) heap_alloc_caps: Region 25: 3FFE8000 len 00018000 tag 1
I (561) cpu_start: Pro cpu up.
I (567) cpu_start: Single core mode
I (573) cpu_start: Pro cpu start user code
I (598) rtc: rtc v160 Nov 22 2016 19:00:05
I (609) rtc: XTAL 40M
W (696) cpu_start: failed to load RF calibration data, falling back to full calibration
I (2041) phy: phy_version: 258, Nov 29 2016, 15:51:07, 0, 2
I (2106) cpu_start: Starting scheduler on PRO CPU.
tcpip_task_hdlxxx : 3ffc4824, prio:18,stack:2048
I (2119) wifi: frc2_timer_task_hdl:3ffc635c, prio:22, stack:2048
I (2131) wifi: pp_task_hdl : 3ffc8bf4, prio:23, stack:8192
I (2132) wifi: mode : sta (24:0a:c4:00:2f:9a)
I (2133) foo: event: 2
I (3008) wifi: n:6 0, o:1 0, ap:255 255, sta:6 0, prof:1
I (3657) wifi: state: init -> auth (b0)
I (3658) foo: ping
I (3670) wifi: state: auth -> assoc (0)
I (3675) wifi: state: assoc -> run (10)
I (3686) wifi: connected with Cesanta, channel 6
I (3687) foo: event: 4
I (4658) foo: pong
I (5658) foo: ping
I (6658) foo: pong
I (6759) event: ip: 192.168.1.40, mask: 255.255.255.0, gw: 192.168.1.1
I (6759) foo: event: 7
I (7658) foo: ping
I (8658) foo: pong
I (9658) foo: ping
Guru Meditation Error of type InstrFetchProhibited occurred on core   0. Exception was unhandled.
Register dump:
PC      :  00000001  PS      :  00060e30  A0      :  800e9d8e  A1      :  3ffc4700  
A2      :  00000001  A3      :  3ffc9cc0  A4      :  00000001  A5      :  00000008  
A6      :  00000000  A7      :  3ffbd594  A8      :  800e5a50  A9      :  00000002  
A10     :  3ffc9cc0  A11     :  00000001  A12     :  3ffbf2a4  A13     :  00000000  
A14     :  3ffc4008  A15     :  3ffbf07c  SAR     :  00000012  EXCCAUSE:  00000014  
EXCVADDR:  00000000  LBEG    :  4000c2e0  LEND    :  4000c2f6  LCOUNT  :  ffffffff  
Rebooting...

this is latest master esp-idf + esp-idf-template. i only changed ssid pass and later added a couple debug statements to log system event type and provide heartbeat to give an idea when the crash happens.

this only happens on successful wifi connection - if i specify nonexistent ssid or invalid pass, there's no crash. our office AP is nothing special, ESP8266 works with it just fine.
tried on 2 different devices, so it's not a hardware fault.

let me know if you need any info to help debug this.

@nkolban
Copy link
Contributor

nkolban commented Dec 2, 2016

When an exception like this is encountered, we can use gdb to determine where the exception occurred. I believe you are with Cesanta ... I am extremely interested in getting Mongoose supported on ESP32. If it would be useful, I will be delighted to partner with you ... you with your Cesanta code skills and me with some useful ESP32 skills. You can reach me on ESP32 IRC (kolban) or Skype at "neil.kolban".

@rojer
Copy link
Contributor Author

rojer commented Dec 2, 2016

yes, this is my first foray into ESP32 programming. i'm already setting up JTAG debug probe, a bit sooner than i expected :)

@nkolban
Copy link
Contributor

nkolban commented Dec 2, 2016

If you have trouble with JTAG, you can actually use gdb on your PC to examine the details of what failed through the serial port. Again, if you can use me, please do. I find that when approaching any new environment, I like to find folks who may have walked some of the path in the past that can offer tips on how to proceed.

@rojer
Copy link
Contributor Author

rojer commented Dec 2, 2016

in this case the problem is that we don't have useful PC to work with.

@nkolban
Copy link
Contributor

nkolban commented Dec 2, 2016

If needed, I can load your code on my PC ... via a screen share. Where are you compiling source?

@kevinboulain
Copy link

kevinboulain commented Dec 3, 2016

On the manufacturer's forum for my board, I stumbled upon this.

If my laptop or my phone does a DHCP request, the board panics every time, which explains the regularity of the crashes and why it only happens when the board is associated with the access point.

@igrr
Copy link
Member

igrr commented Dec 5, 2016

Thanks for the reports. I have tried both with the latest master and the one which was available on the day this was originally reported. So far I wasn't able to reproduce this. Just to be sure, could any of you please upload the elf file which demonstrates the issue?

From the linked forum issue I got an impression that this may be related to some specific DHCP packet. I wonder if I may ask for a wireshark trace with DHCP packets logged?

/cc @abiliojr, I would like to keep discussion of this issue in one thread.

@rojer
Copy link
Contributor Author

rojer commented Dec 5, 2016

this must be triggered by some traffic, as for example it only repros on the office network for me and not at home. i'll try to correlate it with network activity today.

@rojer
Copy link
Contributor Author

rojer commented Dec 5, 2016

i ran tcpdump on my wlan0 interface at the same time as device is running and crashing, with an idea that it might be triggered by some broadcast traffic, which my laptop should then see as well.
unfortunately, i don't see any smoking gun in the logs, i could not identify a packet that precedes device crash.

@rojer
Copy link
Contributor Author

rojer commented Dec 5, 2016

here are some more findings. i went back in time to see if it is a recent regression - it's not.
the bug is present as far back as 9a3f9af, and that is as far as i can go because at previous commit (72422b3) wifi won't even connect.
so it goes from being unable to connect at all to connect and crash, which is an improvement, of sorts :)

now, there is a minor change i noticed: before 9423390 PC had a different invalid value:

register dump at f394632:

PC      :  002c002c  PS      :  00060e30  A0      :  800e9d8e  A1      :  3ffc47a0  
A2      :  00000001  A3      :  3ffc9c98  A4      :  002c002c  A5      :  00000008  
A6      :  00000000  A7      :  3ffbd588  A8      :  800e5a50  A9      :  00000002  
A10     :  3ffc9c98  A11     :  00000001  A12     :  3ffbf294  A13     :  00000000  
A14     :  3ffc2480  A15     :  3ffbf06c  SAR     :  0000001e  EXCCAUSE:  00000014  
EXCVADDR:  002c002c  LBEG    :  00000000  LEND    :  00000000  LCOUNT  :  00000000  

and at 9423390:

PC      :  00000001  PS      :  00060e30  A0      :  800e9d8e  A1      :  3ffc47a0  
A2      :  00000001  A3      :  3ffc9cc8  A4      :  00000001  A5      :  00000008  
A6      :  00000000  A7      :  3ffbd594  A8      :  800e5a50  A9      :  00000002  
A10     :  3ffc9cc8  A11     :  00000001  A12     :  3ffbf2a4  A13     :  00000000  
A14     :  3ffc40b0  A15     :  3ffbf07c  SAR     :  0000001e  EXCCAUSE:  00000014  
EXCVADDR:  00000000  LBEG    :  4000c2e0  LEND    :  4000c2f6  LCOUNT  :  ffffffff  

i can offer remote access to a machine with ESP32 module connected so you can try to debug it.
ping me on IRC if interested.

@igrr
Copy link
Member

igrr commented Dec 5, 2016

@rojer Thanks for the observation. I suppose I need to try a bunch of different routers.
The fact that PC contains garbage, and always the same kind of garbage (002c002c is also seen in the issue report on the Pycom forum) makes me think that this may be a stack overflow. Since it is triggered by some DHCP packet, it may be a stack overflow in LwIP task. Could you please do a test again, after increasing ESP_TASK_TCPIP_STACK from 2048 to some larger value in components/esp32/include/esp_task.h?
Also please enable "Run FreeRTOS on the first core only" option in menuconfig, Component config, FreeRTOS. LwIP task was moved to the second core in one of the recent changes, and from our internal testing it seems to be causing issues (although different from the one you have reported).

@rojer
Copy link
Contributor Author

rojer commented Dec 5, 2016

no, increasing ESP_TASK_TCPIP_STACK to 4096 didn't help. CONFIG_FREERTOS_UNICORE is set to y.
let me note that the device actually manages to acquire an IP address just fine, crash happens after that (usually a few seconds):

I (3615) wifi: state: auth -> assoc (0)
I (3625) wifi: state: assoc -> run (10)
I (3635) wifi: connected with Cesanta, channel 6
 [0;32mI (3645) foo: event: 4 [0m
 [0;32mI (4615) foo: pong [0m
 [0;32mI (5615) foo: ping [0m
 [0;32mI (6615) foo: pong [0m
 [0;32mI (6715) event: ip: 192.168.1.34, mask: 255.255.255.0, gw: 192.168.1.1 [0m
 [0;32mI (6715) foo: event: 7 [0m
^A^X [0;32mI (7615) foo: ping [0m
 [0;32mI (8615) foo: pong [0m
 [0;32mI (9615) foo: ping [0m
Guru Meditation Error of type InstrFetchProhibited occurred on core   0. Exception was unhandled.
Register dump:
PC      :  00000001  PS      :  00060e30  A0      :  800e9d8e  A1      :  3ffc4fb0  
A2      :  00000001  A3      :  3ffca4bc  A4      :  00000001  A5      :  00000008  
A6      :  00000000  A7      :  3ffbd594  A8      :  800e5a50  A9      :  00000002  
A10     :  3ffca4bc  A11     :  00000001  A12     :  3ffbf2a4  A13     :  00000000  
A14     :  3ffc2490  A15     :  3ffbf07c  SAR     :  00000012  EXCCAUSE:  00000014  
EXCVADDR:  00000000  LBEG    :  4000c2e0  LEND    :  4000c2f6  LCOUNT  :  ffffffff  
Rebooting...

fwiw my router model is TP-Link TD-W9980

@kevinboulain
Copy link

I did some tcpdump (during my previous attempt, before finding the thread talking of DHCP requests triggering the problem, I only tried to capture the 802.11 frames and saw some corrupt ones but replaying them didn't produce anything) and by elimination, I was left with only one packet: an ICMPv6 Router Solicitation.
Each tcpreplay of this packet provoked a panic.

I'm not sure it is useful I post the pcap here, as I can simply reproduce the issue with the rdisc6 command (could not compile it under OS X so I installed it on a laptop with Debian as it is available in the main APT repository via the ndisc6 package).

For the record, here is how to use it:

rdisc6 "$router_ip6" "$interface"

Where $router_ip6 may be found via:

ip -6 route

And $interface is simply your Wi-Fi interface (that you should also see in the ip's command output).

Unfortunately, #define LWIP_IPV6 0 in lwipopts.h will break the build and I'm not sure I can disable IPv6 support on the ESP32 (I'm new with this board & SDK, sorry) to fully assert there is a problem with the IPv6 stack.

Hope this helps.

@rojer
Copy link
Contributor Author

rojer commented Dec 5, 2016

fwiw, this is something i have in common with @ether42 - i also have ipv6 in my network.

@igrr igrr changed the title Exception on WiFi assoc IPv6 packet causes exception Dec 6, 2016
@igrr
Copy link
Member

igrr commented Dec 6, 2016

Thanks @ether42, I have passed this way of reproducing the issue to the team, we should have a fix shortly.

@williamvm
Copy link

i have similar issue on my beloved LoPy with ESP32. Changed IPv6 settings on my wifirouter from 6to4-tunnel to Disable and the crash has not happened again. Router Netgear R7000 i do not run any ÎPv6 software. Not a fix but a workaround.

@macfreek
Copy link

macfreek commented Dec 7, 2016

Kudus to @ether42 for the thorough report. Thanks!

I can confirm that running rdisc6 triggers the crash.

FYI, @ether42, I just had a look at rdisc on OS X, and made two fixes to rdisc. I mailed it's author just a minute ago.

rojer pushed a commit to cesanta/esp-idf that referenced this issue Dec 7, 2016
@rojer
Copy link
Contributor Author

rojer commented Dec 7, 2016

struct netif's ipv6_addr_cb field was non-NULL but the pointer was invalid.
turns out, struct netif was not zeroed on allocation. #148 fixes that and possible other bugs due to uninitialized struct fields.

rojer pushed a commit to cesanta/esp-idf that referenced this issue Dec 7, 2016
On Xtensa windowed ABI backtrace is easy to do, and it's immensely
useful for debugging, so - do it by default.
We try to be careful and not deref bogus pointers while walking the
frames.

Example output (from debugging espressif#133):

Backtrace: 0x1:0x3ffc51e0 0x400e9dfa:0x3ffc5210 0x400e9ebc:0x3ffc5230 0x400ec487:0x3ffc5260

With just addr2line utility from binutils, this already pinpoints the
location of the crash.
rojer pushed a commit to cesanta/esp-idf that referenced this issue Dec 7, 2016
On Xtensa windowed ABI backtrace is easy to do, and it's immensely
useful for debugging, so - do it by default.
We try to be careful and not deref bogus pointers while walking the
frames.

Example output (from debugging espressif#133):

Backtrace: 0x1:0x3ffc51e0 0x400e9dfa:0x3ffc5210 0x400e9ebc:0x3ffc5230 0x400ec487:0x3ffc5260

With just addr2line utility from binutils, this already pinpoints the
location of the crash.
rojer pushed a commit to cesanta/esp-idf that referenced this issue Dec 7, 2016
On Xtensa windowed ABI backtrace is easy to do, and it's immensely
useful for debugging, so - do it by default.
We try to be careful and not deref bogus pointers while walking the
frames.

Example output (from debugging espressif#133):

Backtrace: 0x1:0x3ffc51e0 0x400e9dfa:0x3ffc5210 0x400e9ebc:0x3ffc5230 0x400ec487:0x3ffc5260

With just addr2line utility from binutils, this already pinpoints the
location of the crash.
@igrr igrr closed this as completed in #148 Dec 8, 2016
igrr added a commit that referenced this issue Dec 8, 2016
Zero-fill netif on alloc

LwIP's `struct netif` was not zero-initialized after allocation, so `ipv6_addr_cb ` function pointer contained garbage. This caused a crash if the device was on an IPv6 capable network.

Fixes #133, #147, #142.

Source: #148

See merge request !289
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants