Core Panic (InstrFetchProhibited) occasionally causes reboot #314
From https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/fatal-errors.html :
I did a new compile based on the code:
Maybe this? espressif/arduino-esp32#3659
Currently I very commonly see errors with Madavi, including timeouts, and e.g. no BLE notifications during these times. Most of the time the status is 2, on both boards here. Maybe that is also not helping...
Madavi seems to have problems at the moment.
Yeah, that looks exactly like my problem. Yesterday the ESP32 was running from about 9:00 to 23:00, hence 14 hours. According to the log file, there were 8 reboots.
This leads me to think the issue occurs mostly with slow servers. In the above post, KenthJ has the issue only when accessing his Raspberry Pi. Sensor.community is also EXTREMELY slow. I know this because I am developing a script to download my measurement data from it and tracing the traffic with Wireshark. To get the page "http://archive.sensor.community/2020-05-17/" my script had to wait 27 s (!) for an answer to its GET command. The server then often does not use the MTU of 1506 bytes, but sends the answer in small chunks of around 80 bytes. Granted, the mentioned file is 3.3 MB large, but the server behaviour is still somewhat unusual. There is also a comment in the MultiGeiger code that HTTPS cannot be used because of performance issues. My guess is that in some cases the ESP32 might ask for packet retransmission while the requested packet is still on the way. Closing the HTTP connection too fast might cause late packets to come in while the resources have already been freed or reused. Here is the log file of yesterday.
@jwoelk there was also the suspicion that the issue occurs when a server finishes by resetting the connection (RST) - maybe you could check this with Wireshark.
Yes, the servers reset the connections, but this is only half the truth. File iad-if-wlan_21.05.20_1302.eth is a Wireshark capture from my FritzBox 7490: iad-if-wlan_21.05.20_1302.eth.zip

The problem for me is that I cannot sit at my laptop all day, so I have some very long Wireshark captures where I cannot really identify ESP32 reboots somewhere in the middle. Some other files don't contain any reboot at all. When a reboot happens, a DHCP grant should show up in the capture, but I am not sure whether this always occurs. It does in this capture, in sequence 160, but that is probably after an intended reset.

But I have detected something interesting. Yes, the servers send RST - but because they are expected to do so! It is the ESP32 that requests the reset. Admittedly, I am new to the details of the TCP handshake, so please correct me if you think the behaviour is normal. Open the FritzBox file with Wireshark and set the filter to (ip.src == 192.168.178.29 || ip.dst == 192.168.178.29) && http to get an overview, or to (ip.src == 192.168.178.29 || ip.dst == 192.168.178.29) to also see the handshakes. 192.168.178.29 is my MultiGeiger ("ESP"), 81.169.1880.11 is the sensor.community server ("SC"), and 85.214.240.94 is the Madavi server ("MA"). With the latter filter I see the following: iad-if-wlan_21.05.20_1302_excerpt.txt

So in my opinion the ESP requests a TCP connection, requests to disconnect it, then changes its mind and transmits the HTTP packets. At the end it forgets to disconnect. I haven't yet looked into the source code to determine how this bad behaviour can occur. I will be away from tomorrow until Sunday evening, so maybe I can do something next week.
It's a bug, but maybe not in our code. |
Is there some way to catch this error, display it in the status but otherwise ignore it? |
I am sorry, but this observation is wrong. Instead, everything is OK. My fault was that I did not also look at the port numbers. The requests to close the connection (FIN, ACK) that the client sends belong to the respective previous connections. Why the client does this only after establishing a new connection, i.e. after keeping the previous connection open for 150 seconds without using it, is strange, but should be OK. Altogether, the HTTP transfers that I looked at fully conform to the TCP standard, so there is no need to search for bugs in the HTTPClient library at the moment. My apologies again for causing panic.

Yesterday evening I got a huge Wireshark log that finally, after running for about 4 hours, includes a reboot. I will look into it in the next days.
Today I can present a reboot event, documented with a serial log and a Wireshark file, which is 36 MB large even when zipped. It seems to be too large to upload, so please get it here: https://www.magentacloud.de/share/1.85c23wyt

The core panic occurred after less than 3 hours of running. It is logged in the serial log file as follows:

GEIGER: Sending to Madavi ...
Backtrace: 0x600a1000:0x3ffb1990 0x400e5a8f:0x3ffb19b0 0x400e5b01:0x3ffb19d0 0x400e6ce5:0x3ffb19f0 0x400d3e59:0x3ffb1a30 0x400d403f:0x3ffb1a70 0x400d4659:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0
Rebooting...

After finally finding the ELF file I could also run the exception decoder: Decoding 10 results

The Wireshark file shows some out-of-order packets and some strange behaviour. Some seconds before the core panic the ESP32 requests a TCP connection to Madavi, using port 51375 (seq 129655). Difficult to say what has happened in HTTPClient::disconnect(), though this function is not large. At the moment I don't intend to dig more deeply into the problem, because there is too much code to study. As the servers are so slow that measurement data are lost anyway from time to time, some sporadic reboots do not really worsen the problem. I hope the reboots will go away when the servers become faster. By the way: why do we use TCP when no lost HTTP packets are ever resent?
Additional observation from my side: With send2sensor.community and send2madavi on, the device had a maximum uptime of a few hours, max 7-9h observed over several days. |
Because some people don't have reboot events, I ran my ESP32 yesterday without the THP module (BME280). I expected to get no reboot event either. But after running for more than 9 hours the ESP32 had a reboot. This time it was a "LoadProhibited" alarm. Still release 1.14.0.

... Backtrace: 0x401811a0:0x3ffb1990 0x400e5a8f:0x3ffb19b0 0x400e5b01:0x3ffb19d0 0x400e6ce5:0x3ffb19f0 0x400d3e59:0x3ffb1a30 0x400d403f:0x3ffb1a70 0x400d4659:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0 Rebooting...

Here is the full PuTTY log file: putty-ESF32_20200528(1)_ohne_thp.log

The exception decoder says: Decoding 11 results

What I also see in the serial log file are two losses of WLAN connection. Probably this is not relevant; I think I had my mobile lying between the FritzBox and the ESP32 without putting it into flight mode, as I did during the previous tests. In the Wireshark file I did not see anything against the rules. In seq 180200 the ESP32 (192.168.178.29) was terminating the connection to Madavi. In seq 180678, about 90 s later, it got the DHCP lease. The Wireshark file is iad-if-wlan_28.05.20_0725_ohne_thp.eth.zip in https://www.magentacloud.de/share/1.85c23wyt#$/PR314/

Summary: I still don't know why some people don't see reboot events. Running the ESP32 without THP seems to reduce the frequency of reboot events.
Hmm, maybe these people just use the MG locally, without WiFi or initial config, only displaying the values?
Another thought - the http related code was not changed over the last releases. |
...and already 2 reboots with 1.14 + BLE in 3h vs. no reboots with 1.13 + BLE in 24h. Can we switch off the ticks with http, as you did with iotwebconf, TW? |
Guess we should first try fixing the HTTPClient code, see the issue linked there. |
Again: Was any http-related code changed between releases 1.13 and 1.14? I'm currently at packet count ~16400 (~2d uptime) with v1.13 - no reboot since power-up. |
I ran my ESP32 over the last days to get some more logs (which I have not yet looked into in detail). While this is not very reliable statistics, and the holidays might have influenced the internet traffic load, I think that the repetition rate of the HV recharge pulses, i.e. of the interrupts, does not affect the reboot rate. The interrupt-driven charge pulses were added in 1.14.0, according to issue #192. Acoustic counter ticks are disabled on my board, but the LED flash is on.
Hi, in HTTPClient.cpp I have commented out the _client->stop(); call in the destructor HTTPClient::~HTTPClient(), and I added a new method. In the calling code (the HTTP client), the call to http.end() in dotherequest() has been removed.
@mockmock1 Interesting. But can you describe the root cause of the crash and why your changes fix it? |
I am currently running "http stress test" code on the esp32 against a local python http server (so I can influence the server behaviour and speed easily). Does anybody have an idea which timing parameter triggers the issue? It would be a big win if we could reproduce the issue without having to wait for hours... |
I suspect that when the HTTPClient destructor runs, the connection is already closed, and the call to _client->stop(); raises an exception (when I decode the stack, it is this line which causes the problem). If I change my code back to the original version, I see the issue after 15-20 calls (10 s between each call).
We never call the HTTPClient destructor in our code. The objects are created once, kept in global data structures and are never disposed. |
I patched the suggested changes into my local NimBLE branch, so far it is running for ~one hour on one device and ~ 30 min on a second device and sending to sensor.community as well as madavi. I'll keep you posted about the resets :) |
Our crash is in the HTTPClient code, how is the stuff you point to related? |
Moved this issue to next milestone. I'll do V1.15.0 now so everybody has an up-to-date binary with all updates in our code and also in all libraries / frameworks we use. Not sure if it helps with this issue here, but at least we might get rid of some other already fixed issues. |
New release: https://github.com/ecocurious2/MultiGeiger/releases/tag/V1.15.0 Please try if this issue still happens (precisely this issue, for other crashes, please open separate tickets). |
I cleaned up this ticket and removed unrelated posts to make it easier to digest. |
So far no reboot at all with the new release. 👍 |
Unfortunately, I had the first reboot just now, after approx. 10h. Maybe related to #401 in a way that there could be some blocking code that kills the HTTPClient?
For me, it is interesting that sending to Madavi almost always works before something crashes more or less hard (cf. #401) while sending the data to sensor.community.
New arduino-esp32 code, including changes to HTTPClient was released, see #408. |
espressif/arduino-esp32#3659 (comment) this sounds promising. |
Current master branch 1.16.0-dev runs very stable (latest check: 19d straight), issue seems to be fixed with change to NimBLE. |
My ESP32 with Si22G and BME280, no LoRaWAN, SW release 1.14.0, reboots several times per day. The serial log says:
GEIGER: Sending to Madavi ...
Guru Meditation Error: Core 1 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0xe00a1000 PS : 0x00060c30 A0 : 0x800e5a92 A1 : 0x3ffb1980
A2 : 0x3ffb2ad8 A3 : 0xe00a1000 A4 : 0x3ffc38a0 A5 : 0x2fd7f6a6
A6 : 0x3ffb1d94 A7 : 0x3ffb1d94 A8 : 0x801811a6 A9 : 0x3ffb1970
A10 : 0x3ffd3060 A11 : 0x00000000 A12 : 0x0000002c A13 : 0x3ffd3090
A14 : 0x3ffb1d74 A15 : 0x00000002 SAR : 0x0000000c EXCCAUSE: 0x00000014
EXCVADDR: 0xe00a1000 LBEG : 0x4000c46c LEND : 0x4000c477 LCOUNT : 0x00000000
Backtrace: 0x600a1000:0x3ffb1980 0x400e5a8f:0x3ffb19a0 0x400e5b01:0x3ffb19c0 0x400e6ce5:0x3ffb19e0 0x400d3e59:0x3ffb1a20 0x400d42b3:0x3ffb1a60 0x400d4596:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0
Rebooting...
The libraries installed are at the following release numbers:
Adafruit BME280 2.0.2
Adafruit BME680 1.0.7
Adafruit Unified Sensor 1.1.2
Adafruit ADXL343 1.2.0 ?
IoTWebConf 2.3.1
MCCI LoRaWAN LMIC library 3.1.0 (Update to 3.2.0 not installed)
U8g2 2.27.6
I append the full log file from yesterday evening, where this issue (InstrFetchProhibited) occurred three times. Some registers in the dump are different, some are the same, so it might be interesting to look at all three occurrences. It looks like this issue occurs when sending to Madavi; in contrast, when testing with SW 1.13.1-dev downloaded on May 14, the issues occurred when sending to sensor.community.
The issue occurs also when running off a mobile phone USB supply (of course there is no log then), so I think it is not related to my laptop.
The issue doesn't seem to be related to #39 or #82.
putty-ESF32_20200519.log