Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Core Panic (InstrFetchProhibited) occasionally causes reboot #314

Closed
jwoelk opened this issue May 20, 2020 · 40 comments
Closed

Core Panic (InstrFetchProhibited) occasionally causes reboot #314

jwoelk opened this issue May 20, 2020 · 40 comments
Labels
bug Something isn't working
Milestone

Comments

@jwoelk
Copy link

jwoelk commented May 20, 2020

My ESP32 with Si22G and BME280, no LoRaWan, SW Release 1.14.0, reboots several times per day. The serial log says

GEIGER: Sending to Madavi ...
Guru Meditation Error: Core 1 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0xe00a1000 PS : 0x00060c30 A0 : 0x800e5a92 A1 : 0x3ffb1980
A2 : 0x3ffb2ad8 A3 : 0xe00a1000 A4 : 0x3ffc38a0 A5 : 0x2fd7f6a6
A6 : 0x3ffb1d94 A7 : 0x3ffb1d94 A8 : 0x801811a6 A9 : 0x3ffb1970
A10 : 0x3ffd3060 A11 : 0x00000000 A12 : 0x0000002c A13 : 0x3ffd3090
A14 : 0x3ffb1d74 A15 : 0x00000002 SAR : 0x0000000c EXCCAUSE: 0x00000014
EXCVADDR: 0xe00a1000 LBEG : 0x4000c46c LEND : 0x4000c477 LCOUNT : 0x00000000

Backtrace: 0x600a1000:0x3ffb1980 0x400e5a8f:0x3ffb19a0 0x400e5b01:0x3ffb19c0 0x400e6ce5:0x3ffb19e0 0x400d3e59:0x3ffb1a20 0x400d42b3:0x3ffb1a60 0x400d4596:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0

Rebooting...

The libraries installed are at the following release numbers:
Adafruit BME280 2.0.2
Adafruit BME680 1.0.7
Adafruit Unified Sensor 1.1.2
Adafruit ADXL343 1.2.0 ?
IoTWebConf 2.3.1
MCCI LoRaWAN LMIC library 3.1.0 (Update to 3.2.0 not installed)
U8g2 2.27.6

I append the full log file of yesterday evening, where this issue (InstrFetchProhibited) occured three times. Some registers in the dump are different, some are the same, so it might be interesting to look at all three occasions. It looks that this issue likely occurs when sending to Madavi. Otherwise, when testing with SW 1.13.1-dev downloaded on May 14, the issues occured when sending to sensor.community.

The issue occurs also when running off a mobile phone USB supply (of course there is no log then), so I think it is not related to my laptop.

The issue doesn't seem to be related to #39 or #82.
putty-ESF32_20200519.log

@ThomasWaldmann
Copy link
Contributor

ThomasWaldmann commented May 20, 2020

From https://docs.espressif.com/projects/esp-idf/en/latest/esp32/api-guides/fatal-errors.html :

InstrFetchProhibited

This CPU exception indicates that CPU could not load an instruction
because the the address of the instruction did not belong to a valid
region in instruction RAM or ROM.

Usually this means an attempt to call a function pointer, which does
not point to valid code. PC (Program Counter) register can be used
as an indicator: it will be zero or will contain garbage value (not 0x4xxxxxxx).

@ThomasWaldmann
Copy link
Contributor

ThomasWaldmann commented May 20, 2020

I did a new compile based on the V1.14.0 tagged source (did not keep the .elf file of that version) and the exception decoder showed this:

PC: 0xe00a1000
EXCVADDR: 0xe00a1000

Decoding stack results
0x400e5a8f: HTTPClient::disconnect(bool) at /home/user/.arduino15/packages/esp32/hardware/esp32/1.0.4/libraries/HTTPClient/src/HTTPClient.cpp line 359
0x400e5b01: HTTPClient::end() at /home/user/.arduino15/packages/esp32/hardware/esp32/1.0.4/libraries/HTTPClient/src/HTTPClient.cpp line 347
0x400e6ce5: HTTPClient::begin(String) at /home/user/.arduino15/packages/esp32/hardware/esp32/1.0.4/libraries/HTTPClient/src/HTTPClient.cpp line 216
0x400d3e59: prepare_http(https_client*, char const*) at /tmp/arduino_build_830038/sketch/transmission.cpp line 81
0x400d42b3: send_http_geiger_2_madavi(https_client*, String, unsigned int, unsigned int, unsigned int, unsigned int) at /tmp/arduino_build_830038/sketch/transmission.cpp line 160
0x400d4596: transmit_data(String, int, unsigned int, unsigned int, unsigned int, unsigned int, int, float, float, float, int) at /tmp/arduino_build_830038/sketch/transmission.cpp line 255
0x400d3742: transmit(unsigned long, unsigned long, unsigned long, unsigned long, bool, float, float, float, int) at /home/user/w/MultiGeiger/multigeiger/multigeiger.ino line 224
0x400d3829: loop() at /home/user/w/MultiGeiger/multigeiger/multigeiger.ino line 272
0x400ea38d: loopTask(void*) at /home/user/.arduino15/packages/esp32/hardware/esp32/1.0.4/cores/esp32/main.cpp line 19
0x4008980d: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

Code:
https://github.com/espressif/arduino-esp32/blob/1.0.4/libraries/HTTPClient/src/HTTPClient.cpp#L359

@ThomasWaldmann
Copy link
Contributor

Maybe this? espressif/arduino-esp32#3659

@t-pi
Copy link
Contributor

t-pi commented May 20, 2020

I have currently very commonly an error with Madavi, including timeouts - and e.g. no BLE notifications during these times. Most of the time status is 2 - on both boards here. Maybe that is also not helping...

@rexfue
Copy link
Collaborator

rexfue commented May 20, 2020

Madavi seem to have problems at the moment.

@jwoelk
Copy link
Author

jwoelk commented May 21, 2020

Maybe this? espressif/arduino-esp32#3659

Yeah, looks exactly as my problem.

Yesterday the ESP32 was running from about 9:00 to 23:00, hence 14 hours. According to the log file, there were 8 reboots.

  • 7 occurences of "(InstrFetchProhibited)".
  • 1 occurence of "(LoadProhibited)".
  • 6 occurences when sending to sensor.community.
  • 2 occurences when sending to Madavi.

This leads me to think the issue occurs mostly with slow servers. In the above post, KenthJ has the issue only when accessing his Raspberry Pi. Sensor.community is also EXTREMELY slow. I know this since I am developing a script to download my measurement data from it, and tracing the traffic with Wireshark. To get a page "http://archive.sensor.community/2020-05-17/" my script had to wait for 27 s (!) to get an answer to its GET command. The server is then often not using the MTU of 1506 bytes, but sends the answer in small chunks of around 80 bytes. Well, the mentionned file is 3.3 MB large, but anyway the server behaviour is somewhat unusual. Also there is a comment in the Multigeiger code that HTTPS cannot be used because of performance issues.

My guess is that in some cases the ESP32 might ask for package retransmission, while the reminded package is on the way. Closing the HTTP connection too fast might cause late packeges coming in while the ressources are already deleted or reused.

Here is the log file of yesterday.

putty-ESF32_20200520.log

@ThomasWaldmann
Copy link
Contributor

@jwoelk there was also the suspicion that the issue occurs when a server finishes by resetting the connection (RST) - maybe you could check this with wireshark.

@jwoelk
Copy link
Author

jwoelk commented May 21, 2020

@ThomasWaldmann

Yes, the servers reset the connections, but this is only half the truth.

File iad-if-wlan_21.05.20_1302.eth is a Wireshark file from my FritzBox 7490.

iad-if-wlan_21.05.20_1302.eth.zip

The problem with me is that I cannot sit at my laptop during all the day, so I got some very long Wireshark protocols where I cannot really identify ESP32 reboots somewhere in the middle. Some other files don't have any reboot. When a reboot happens, a DHCP grant should show up in the protocol, but I am not sure whether this occurs always. This is the case in this protocol in sequence 160, but this is probably after an intended reset.

But I have detected something interesting. Yes, the servers send RST, but because they are expected to do so! It is the ESP32 that requests the reset.

Well, in fact I am new to the TCP handshake details, so please correct my vision if you think the behaviour is normal.

If you open the FritzBox file with Wireshark, set the filter to

(ip.src == 192.168.178.29 || ip.dst == 192.168.178.29) && http

to get an overview, and to

(ip.src == 192.168.178.29 || ip.dst == 192.168.178.29)

to also see the handshakes. 192.168.178.29 is my Multigeiger ("ESP"), 81.169.1880.11 is the sensor.community server ("SC"), and 85.214.240.94 is the Madavi server ("MA").

With the latter filter I see the following:

iad-if-wlan_21.05.20_1302_excerpt.txt

So in my opinion the ESP requests a TCP connection, requests to disconnect it, then changes its mind and transmits the HTTP packages. At the end it forgets to disconnect.

I didn't yet look into the source code to determine how this bad behaviour can occur. I will be out from tomorrow till sunday evening, so maybe can do something next week.

@jwoelk jwoelk closed this as completed May 21, 2020
@jwoelk jwoelk reopened this May 21, 2020
@ThomasWaldmann ThomasWaldmann added this to the V1.15.0 milestone May 22, 2020
@ThomasWaldmann ThomasWaldmann added the bug Something isn't working label May 22, 2020
@ThomasWaldmann
Copy link
Contributor

It's a bug, but maybe not in our code.

@t-pi
Copy link
Contributor

t-pi commented May 22, 2020

Is there some way to catch this error, display it in the status but otherwise ignore it?

@jwoelk
Copy link
Author

jwoelk commented May 25, 2020

So in my opinion the ESP requests a TCP connection, requests to disconnect it, then changes its mind and transmits the HTTP packages. At the end it forgets to disconnect.

I am sorry, but this observation is wrong. Instead, everything is ok. My fault is that I did not look at the port numbers, too. The requests to close the connection (FIN,ACK) that the client sends are for the respective previous connections. Why the client does this only after establishing a new connection, hence after keeping the previous connections up for 150 seconds without using it, is strange, but should be ok.

Alltogether the HTTP transfers that I looked at are fully conform to the TCP standard. So there is no need to search for bugs in the HTTPClient library at the moment. My apologies again for causing panic.

Yesterday evening I got a huge Wireshark log that finally, after running for about 4 hours, includes a reboot. I will look into it in the next days.

@jwoelk
Copy link
Author

jwoelk commented May 27, 2020

Today I can present a reboot event, documented with a serial log

putty-ESF32_20200527(1).log

and a Wireshark file, which is 36 MB large, even when zipped. It seems to be too large to be uploaded, so please get it here:

https://www.magentacloud.de/share/1.85c23wyt

The core panic occured after less than 3 hours running. It is logged in the serial log file as follows:

GEIGER: Sending to Madavi ...
GEIGER: Error on sending POST: -11
GEIGER: Error on sending POST: -11
GEIGER: Sent to Madavi, status: error, http: -11 -11
GEIGER: Sending to sensor.community ...
Guru Meditation Error: Core 1 panic'ed (InstrFetchProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0xe00a1000 PS : 0x00060c30 A0 : 0x800e5a92 A1 : 0x3ffb1990
A2 : 0x3ffb32dc A3 : 0xe00a1000 A4 : 0x3ffc38a0 A5 : 0x12a09683
A6 : 0x00000000 A7 : 0x00000000 A8 : 0x801811a6 A9 : 0x3ffb1980
A10 : 0x3ffd13dc A11 : 0x00000000 A12 : 0x0000003c A13 : 0x3ffd141c
A14 : 0x3ffb1d74 A15 : 0x00000003 SAR : 0x00000004 EXCCAUSE: 0x00000014
EXCVADDR: 0xe00a1000 LBEG : 0x4000c46c LEND : 0x4000c477 LCOUNT : 0x00000000

Backtrace: 0x600a1000:0x3ffb1990 0x400e5a8f:0x3ffb19b0 0x400e5b01:0x3ffb19d0 0x400e6ce5:0x3ffb19f0 0x400d3e59:0x3ffb1a30 0x400d403f:0x3ffb1a70 0x400d4659:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0

Rebooting...
ets Jun 8 2016 00:22:57

After finally finding the ELF file I can also run the exception decoder:

Decoding 10 results
0x400e5a8f: HTTPClient::disconnect(bool) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400e5b01: HTTPClient::end() at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400e6ce5: HTTPClient::begin(String) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400d3e59: prepare_http(https_client*, char const*) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_215807\sketch/transmission.cpp line 81
0x400d403f: send_http_geiger(https_client*, char const*, unsigned int, unsigned int, unsigned int, unsigned int, int) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_215807\sketch/transmission.cpp line 108
0x400d4659: transmit_data(String, int, unsigned int, unsigned int, unsigned int, unsigned int, int, float, float, float, int) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_215807\sketch/transmission.cpp line 269
0x400d3742: transmit(unsigned long, unsigned long, unsigned long, unsigned long, bool, float, float, float, int) at D:\GIT\MultiGeiger-1.14.0\multigeiger/multigeiger.ino line 224 (discriminator 4)
0x400d3829: loop() at D:\GIT\MultiGeiger-1.14.0\multigeiger/multigeiger.ino line 272
0x400ea38d: loopTask(void*) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\cores\esp32/main.cpp line 19
0x4008980d: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 355 (discriminator 1)

The Wireshark file shows some out-of-order packets and some strange behaviour. Some seconds before the core panic the ESP32 requests a TCP connection to Madavi, using port 51375 (seq 129655),
which needs 15 s to send the ACK (seq 129709). A PSH packet is sent, then the HTTP POST in seq 129718. Instead of waiting for the ACK and HTTP OK, the ESP32 requests a second connection to Madavi on port 52259, which is granted after 12 s (seq 129769). After sending the PSH packet, which is ACK'ed, the ESP32 sends the second HTTP POST, which is also ACK'ed, then sends a FIN packet (seq 129793) without waiting for the HTTP OK, that comes in seq 129798 only. After that, the ESP32 doesn't seem to react anymore, though the HTTP OK is repeated 6 times (seq 129803 ... 129852). In seq 130130 a DHCP packet is logged, showing that the ESP32 has rebooted meanwhile.

Difficult to say what has happened in HTTPClient:disconnect(), though this function is not large.

At the moment, I don't intend to dig more deeply into the problem, because there is too much code to study. As the servers are so slow that measurement data are lost anyway from time to time, some sporadic reboots are not really worsening the problem. I hope the reboots will go away when the servers will become faster.

By the way: why do we use TCP when no lost HTTP packets are ever resent?

@t-pi
Copy link
Contributor

t-pi commented May 27, 2020

Additional observation from my side: With send2sensor.community and send2madavi on, the device had a maximum uptime of a few hours, max 7-9h observed over several days.
Switching off madavi only did not help.
Now, with both Wifi transmissions switched off in WLAN config, the device has already an uptime of >12h.

@jwoelk
Copy link
Author

jwoelk commented May 29, 2020

Because some people don't have reboot events, I ran my ESP32 yesterday without the THP module (BME280). I expected to get no reboot event as well. But after running for more than 9 hours the ESP32 had a reboot. This time it was a "LoadProhibited" alarm. Still release 1.14.0.

...
GEIGER: Sending to Madavi ...
GEIGER: Error on sending POST: -11
GEIGER: Sent to Madavi, status: error, http: -11 200
GEIGER: Sending to sensor.community ...
Guru Meditation Error: Core 1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC : 0x401811a0 PS : 0x00060c30 A0 : 0x800e5a92 A1 : 0x3ffb1990
A2 : 0x3ffb32dc A3 : 0xdeadbeef A4 : 0x3ffc38a0 A5 : 0xf0232ecd
A6 : 0x4000c2e0 A7 : 0x4000c2e0 A8 : 0x4000c2e0 A9 : 0x3ffb1980
A10 : 0x3ffd13f4 A11 : 0x00000000 A12 : 0x0000003c A13 : 0x3ffd3c3c
A14 : 0x3ffb1d74 A15 : 0x00000003 SAR : 0x00000004 EXCCAUSE: 0x0000001c
EXCVADDR: 0xdeadbeff LBEG : 0x4000c46c LEND : 0x4000c477 LCOUNT : 0x00000000

Backtrace: 0x401811a0:0x3ffb1990 0x400e5a8f:0x3ffb19b0 0x400e5b01:0x3ffb19d0 0x400e6ce5:0x3ffb19f0 0x400d3e59:0x3ffb1a30 0x400d403f:0x3ffb1a70 0x400d4659:0x3ffb1eb0 0x400d3742:0x3ffb1f00 0x400d3829:0x3ffb1f60 0x400ea38d:0x3ffb1fb0 0x4008980d:0x3ffb1fd0

Rebooting...
ets Jun 8 2016 00:22:57
...

Here is the full PuTTY log file:

putty-ESF32_20200528(1)_ohne_thp.log

The exception decoder says:

Decoding 11 results
0x401811a0: HTTPClient::connected() at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 395
0x400e5a8f: HTTPClient::disconnect(bool) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400e5b01: HTTPClient::end() at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400e6ce5: HTTPClient::begin(String) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\libraries\HTTPClient\src/HTTPClient.cpp line 338
0x400d3e59: prepare_http(https_client*, char const*) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_631902\sketch/transmission.cpp line 81
0x400d403f: send_http_geiger(https_client*, char const*, unsigned int, unsigned int, unsigned int, unsigned int, int) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_631902\sketch/transmission.cpp line 108
0x400d4659: transmit_data(String, int, unsigned int, unsigned int, unsigned int, unsigned int, int, float, float, float, int) at C:\Users\Joachim\AppData\Local\Temp\arduino_build_631902\sketch/transmission.cpp line 269
0x400d3742: transmit(unsigned long, unsigned long, unsigned long, unsigned long, bool, float, float, float, int) at D:\GIT\MultiGeiger-1.14.0\multigeiger/multigeiger.ino line 224 (discriminator 4)
0x400d3829: loop() at D:\GIT\MultiGeiger-1.14.0\multigeiger/multigeiger.ino line 272
0x400ea38d: loopTask(void*) at C:\Users\Joachim\AppData\Local\Arduino15\packages\esp32\hardware\esp32\1.0.4\cores\esp32/main.cpp line 19
0x4008980d: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 355 (discriminator 1)

What I also see in the serial log file are two losses of WLAN connection. Probably this is not relevant. I think I had my mobile lying between the Fritzbox and the ESP32, without putting it into flight mode as during the previous tests.

In the Wireshark file I did not see anything against the rules. In seq 180200 the ESP32 (192.168.178.29) was terminating the connection to Madavi. In seq 180678, about 90 s later, it got the DHCP lease. The Wireshark file is iad-if-wlan_28.05.20_0725_ohne_thp.eth.zip in

https://www.magentacloud.de/share/1.85c23wyt#$/PR314/

Summary: I still don't know why some people don't see reboot events. Running the ESP32 without THP seems to reduce the frequency of reboot events.

@t-pi
Copy link
Contributor

t-pi commented May 29, 2020

Hmm, maybe these people just use the MG locally, without Wifi or initial config, displaying the values?
My BLE Multigeiger (without BME) ran for 2d straight without reboots, with both sensor.community and madavi switched off. Today around noon I switched only sensor.community back on - and it rebooted around 20h and now, 4 min ago...

@t-pi
Copy link
Contributor

t-pi commented Jun 2, 2020

Another thought - the http related code was not changed over the last releases.
@ThomasWaldmann, could the interrupt based timer for PWM sound generation disturb the timings of the network stack and provoque the Core Panic? Could some timer or memory issue there accumulate over time?
I have a release 1.13 (+BLE module for uptime insight on lorastick) running for about a day now, while 1.15-dev reboots every few hours (~6-9h). I just set up a 1-14 (+BLE) for verification.

@t-pi
Copy link
Contributor

t-pi commented Jun 2, 2020

...and already 2 reboots with 1.14 + BLE in 3h vs. no reboots with 1.13 + BLE in 24h. Can we switch off the ticks with http, as you did with iotwebconf, TW?

@ThomasWaldmann
Copy link
Contributor

#314 (comment)

Guess we should first try fixing the HTTPClient code, see the issue linked there.

@t-pi
Copy link
Contributor

t-pi commented Jun 3, 2020

Again: Was any http-related code changed between releases 1.13 and 1.14?
If not, the HTTPClient code is basically working, just finicky with timing / resp. interference to its own timings and we need to reduce the interrupt load.
Could there be some core pinning or suchlike to avoid interference between PWM interrupt management and network stack?

I'm currently at packet count ~16400 (~2d uptime) with v1.13 - no reboot since power-up.

@jwoelk
Copy link
Author

jwoelk commented Jun 3, 2020

I ran my ESP32 in the last days to get some more logs (which I didn't look into in detail).
On Sunday 5/31 I had 4 reboots within 4 h 50 min runtime.
On Monday 6/1 I had 3 reboots within 4 h 11 min runtime.
Today I connected the HV (+pole of the tube) via the 1 GOhm resistor to my multimeter. This causes a much shorter repetition rate of HV recharge pulses (0.3 s instead of 3.8 s, see #192). Within 15 h 10 min runtime I saw 9 reboots (0.59 reboots/h vs. 0.77 reboot/h for the normal case).

While this is not a very reliable statistics, and the holidays might have influenced the internet traffic load, I think that the repetition rate of HV recharge pulses resp. interrupts does not affect the reboot rate.

The interrupt-driven charge pulses were added to 1.14.0, according to issue #192.

Acoustical counter ticks are disabled on my board, but the LED flash is on.

@mockmock1
Copy link

Hi
I have probably the same issue on my project, decoding the stack, i've made some changes (probably not the best way but this solved the reboot problem)

In HTTPclient.cpp in the destructor, i've commented the _client->stop();

HTTPClient::~HTTPClient()
{
if(_client){
// _client->stop();
}
...

I added a new method:
void HTTPClient::disconnect_(){
if(connected()) {
_client->stop();
}
}

In the calling code (http client)

void dotherequest(){
HTTPClient http;
...
http.addHeader("Content-Type", "application/json");
int httpResponseCode = http.POST(requestBody);
http.disconnect_();// <---- new method just after the POST request, do not put delay() between both lines
if (httpResponseCode==200){
...
}

// call to http.end(); has been removed.
}

@ThomasWaldmann
Copy link
Contributor

@mockmock1 Interesting. But can you describe the root cause of the crash and why your changes fix it?

@ThomasWaldmann
Copy link
Contributor

I am currently running "http stress test" code on the esp32 against a local python http server (so I can influence the server behaviour and speed easily).

Does anybody have an idea which timing parameter triggers the issue?

It would be a big win if we could reproduce the issue without having to wait for hours...

@mockmock1
Copy link

I suspect when the httpclient destructor runs that the connection is already closed and a call to _client->stop(); raises an exception (when I decode the stack it's this line which causes the problem).

If I change my code back to the original version I have the issue after 15/20 calls (10 secs between each call).
If I add a delay(100); just before the end of the function where I'm doing the http POST, then the httpclient destructor is called after this small delay and I got an exception after each call.
The HTTP server (AsyncWebServer) is running on another esp32, it seems OK - I cannot see any exception/errors in the logs.

@ThomasWaldmann
Copy link
Contributor

We never call the HTTPClient destructor in our code. The objects are created once, kept in global data structures and are never disposed.

@t-pi
Copy link
Contributor

t-pi commented Jun 28, 2020

I patched the suggested changes into my local NimBLE branch, so far it is running for ~one hour on one device and ~ 30 min on a second device and sending to sensor.community as well as madavi. I'll keep you posted about the resets :)

@ThomasWaldmann
Copy link
Contributor

Our crash is in the HTTPClient code, how is the stuff you point to related?

@ThomasWaldmann
Copy link
Contributor

Moved this issue to next milestone. I'll do V1.15.0 now so everybody has an up-to-date binary with all updates in our code and also in all libraries / frameworks we use. Not sure if it helps with this issue here, but at least we might get rid of some other already fixed issues.

@ThomasWaldmann
Copy link
Contributor

New release: https://github.com/ecocurious2/MultiGeiger/releases/tag/V1.15.0

Please try if this issue still happens (precisely this issue, for other crashes, please open separate tickets).

@ThomasWaldmann
Copy link
Contributor

ThomasWaldmann commented Mar 21, 2021

I cleaned up this ticket and removed unrelated posts to make it easier to digest.

@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from rexfue Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from rexfue Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from rexfue Mar 21, 2021
@ecocurious2 ecocurious2 deleted a comment from M-a-r-i-a-n-n-e Mar 21, 2021
@M-a-r-i-a-n-n-e
Copy link
Contributor

Please try if this issue still happens (precisely this issue, for other crashes, please open separate tickets).

So far no reboot at all with the new release. 👍

@M-a-r-i-a-n-n-e
Copy link
Contributor

M-a-r-i-a-n-n-e commented Mar 22, 2021

Unfortunately, I had the first reboot just now, after approx. 10h. Maybe related to #401 in a way that there could be some blocking code that kills the HTTPClient?

GEIGER: Sending to Madavi ...
GEIGER: Sent to Madavi, status: ok, http: 200 200
GEIGER: Sending to sensor.community ...
Guru Meditation Error: Core  1 panic'ed (LoadProhibited). Exception was unhandled.
Core 1 register dump:
PC      : 0x40207890  PS      : 0x00060230  A0      : 0x800e9c6e  A1      : 0x3ffcfb80
A2      : 0x3ffd157c  A3      : 0x00000000  A4      : 0x3ffc7804  A5      : 0xde12a7c1  
A6      : 0x4000c2e0  A7      : 0x4000c2e0  A8      : 0x4000c2e0  A9      : 0x3ffcfb70
A10     : 0x3ffe06d4  A11     : 0x00000000  A12     : 0x0000003c  A13     : 0x3ffdfaa8
A14     : 0x3ffcff64  A15     : 0x00000003  SAR     : 0x00000004  EXCCAUSE: 0x0000001c  
EXCVADDR: 0x00000014  LBEG    : 0x4000c46c  LEND    : 0x4000c477  LCOUNT  : 0x00000000

ELF file SHA256: 0000000000000000

Backtrace: 0x40207890:0x3ffcfb80 0x400e9c6b:0x3ffcfba0 0x400e9cdd:0x3ffcfbc0 0x400eb3b1:0x3ffcfbe0 0x400d468d:0x3ffcfc20 0x400d4873:0x3ffcfc60 0x400d4e95:0x3ffd00a0 0x400d3f52:0x3ffd00f0 0x400d4045:0x3ffd0150 0x400eea4c:0x3ffd01a0 0x40091afe:0x3ffd01c0

Rebooting...

PC: 0x40207890: HTTPClient::connected() at ~\.platformio\packages\framework-arduinoespressif32\libraries\HTTPClient\src\HTTPClient.cpp line 406
EXCVADDR: 0x00000014

Decoding stack results
0x40207890: HTTPClient::connected() at ~\.platformio\packages\framework-arduinoespressif32\libraries\HTTPClient\src\HTTPClient.cpp line 406
0x400e9c6b: HTTPClient::disconnect(bool) at ~\.platformio\packages\framework-arduinoespressif32\libraries\HTTPClient\src\HTTPClient.cpp line 370
0x400e9cdd: HTTPClient::end() at ~\.platformio\packages\framework-arduinoespressif32\libraries\HTTPClient\src\HTTPClient.cpp line 358
0x400eb3b1: HTTPClient::begin(String) at ~\.platformio\packages\framework-arduinoespressif32\libraries\HTTPClient\src\HTTPClient.cpp line 220
0x400d468d: prepare_http(https_client*, char const*) at multigeiger/transmission.cpp line 91
0x400d4873: send_http_geiger(https_client*, char const*, unsigned int, unsigned int, unsigned int, unsigned int, int) at multigeiger/transmission.cpp line 118
0x400d4e95: transmit_data(String, int, unsigned int, unsigned int, unsigned int, unsigned int, int, float, float, float, int) at multigeiger/transmission.cpp line 279
0x400d3f52: transmit(unsigned long, unsigned long, unsigned long, unsigned long, bool, float, float, float, int) at #/Multigeiger/multigeiger/multigeiger.ino line 228
0x400d4045: loop() at #/Multigeiger/multigeiger/multigeiger.ino line 282
0x400eea4c: loopTask(void*) at ~\.platformio\packages\framework-arduinoespressif32\cores\esp32\main.cpp line 37
0x40091afe: vPortTaskWrapper at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/freertos/port.c line 143

@M-a-r-i-a-n-n-e
Copy link
Contributor

M-a-r-i-a-n-n-e commented Mar 23, 2021

For me, it is interesting that sending to Madavi almost always works, before something crashes more ore less hard (cf. #401) while sending the data to sensor.community.

@ThomasWaldmann ThomasWaldmann modified the milestones: V1.16.0, V1.15.1 Mar 25, 2021
@ThomasWaldmann
Copy link
Contributor

New arduino-esp32 code, including changes to HTTPClient was released, see #408.

@ThomasWaldmann ThomasWaldmann modified the milestones: V1.15.1, V1.16.0 Apr 9, 2021
@ThomasWaldmann
Copy link
Contributor

espressif/arduino-esp32#3659 (comment) this sounds promising.

@t-pi
Copy link
Contributor

t-pi commented Aug 12, 2021

Current master branch 1.16.0-dev runs very stable (latest check: 19d straight), issue seems to be fixed with change to NimBLE.
Closing for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

6 participants