httpd server stops working after a while (IDFGH-1594) #3851
Comments
@fshamshirdar Thanks for reporting this. We'll try reproducing this on our side. In the meanwhile, it will be helpful if you could provide the monitor log for this. |
Thanks for following up, there is actually not much log when it happens. Here is all I have got: httpd_txrx: httpd_sock_err: error in send : 11 |
@fshamshirdar From the looks of it I think this is happening due to Wi-Fi disconnection.
If your ESP32 is running in station mode, could you check if it is visible on the Wi-Fi network right after the error occurs? Also are there any Wi-Fi related messages in the monitor log? |
No, the WiFi is running in softAP mode. Error 104 happens because the way I test the system is to refresh the webpage multiple times (a sort of stress test), so it is expected. What does error 11 mean? That's the last error I get from HTTP right before it goes down. I should mention that the Bluetooth module is running in BLE mode with WiFi coexistence set to "balance". |
That could be happening if the send buffer is full and it times out waiting for space. Could you try increasing
Even in case of error, the server should keep functioning. How do you test if it goes down or not? Also, please enable debug log level on the esp_http_server component. That may provide some more insight. |
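The send-buffer explanation above can be reproduced on a desktop machine. The following is a minimal host-side POSIX sketch, not ESP-IDF code: it sets a send timeout on a socket whose peer never reads, keeps sending until `send()` gives up, and returns the resulting `errno`. The function name `errno_after_send_timeout` is made up for this illustration. As far as I can tell, `EAGAIN` has the value 11 both on Linux and in the newlib headers used by ESP-IDF, which would match the `error in send : 11` log line, but treat that mapping as an assumption to verify against your toolchain.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Fill a socket whose peer never reads, with SO_SNDTIMEO set, and return
 * the errno of the send() call that finally gives up.
 * Returns 0 if the socket setup itself fails. */
int errno_after_send_timeout(long timeout_ms)
{
    assert(timeout_ms > 0);

    int fds[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, fds) != 0) {
        return 0;
    }

    struct timeval tv;
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;
    if (setsockopt(fds[0], SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv)) != 0) {
        close(fds[0]);
        close(fds[1]);
        return 0;
    }

    char chunk[4096];
    memset(chunk, 'x', sizeof(chunk));

    int err = 0;
    for (;;) {
        /* Nobody reads fds[1], so the kernel buffer eventually fills up
         * and send() blocks until the timeout expires. */
        ssize_t n = send(fds[0], chunk, sizeof(chunk), 0);
        if (n < 0) {
            err = errno;
            break;
        }
    }

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

Calling `errno_after_send_timeout(100)` on a POSIX host should return `EAGAIN` (or `EWOULDBLOCK`, which is the same value on most systems). On the ESP32 the equivalent situation is a client that stops acknowledging data while the server is still trying to send a response.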
@anurag-kar,
It seems to me that these errors have a common source. Description: Here are the full logs: beacon.zip - 1st, 2nd case. http-server settings: Environment:
|
@no1seman Thanks for reporting. The third case may have to do with parsing only, I'll check that. The first case seems like the server is waiting to send an HTTP response. Since it runs on a single thread, it cannot start processing the next request until it finishes responding to the current request. Probably, the send buffer is full, so it keeps waiting till there is enough space for the response data. According to the log it is waiting for about 4 seconds :
The second case is different from what @fshamshirdar is experiencing, because he expects
That is expected, because |
@anurag-kar, "But in your case it seems the client is closing the connection unexpectedly, while receiving HTTP response" - yes, you are right, I got the following error in Chrome: Chrome couldn't load the first chunk and canceled the connection. But the question is: why did the ESP fail to send that chunk correctly? I'm 100% sure the problem is on the ESP side. As I mentioned in my previous post, I have 2 interfaces on the ESP: Ethernet and WiFi STA. On Ethernet such an error occurs rarely, but if I connect to the ESP32 http-server via the ESP32 WiFi STA interface I get these errors: W (35212) httpd_txrx: httpd_sock_err: error in send : 104 with a probability of 50%. Here is the log:
full log: So it turns out that an ESP connected via WiFi to a router, talking to my laptop connected to the same router via 1 Gb/s Ethernet (all of them, router, ESP and laptop, are on my desk), fails with high probability to send or receive one of the chunks of data via WiFi, and that results in an error. It's very strange. Turning on LwIP debug:
In the browser I got: http://192.168.3.130/schema.json net::ERR_INVALID_HTTP_RESPONSE error. Do you have any ideas about that error? |
@anurag-kar,
This error:
occurs after an ACK retransmission |
Finally caught the '104 error':
Wireshark:
Full pcap file is here: I hope this helps to solve the problem. |
@anurag-kar, Single thread on Ethernet IF (10 iterations): Single thread on WiFi IF (10 iterations): Dual thread test on Ethernet (2x10 iterations): Full log: Explanation of test :
The script simply fetches HTTP resources from the ESP32 with curl in a loop, compares the md5 hash from the HTTP response header with the md5 hash of the downloaded file, and aggregates statistics.
PS Now I'm 99% sure that all the problems mentioned in this issue thread have a single source: errors in data transmission between server and client. |
@ no1seman -> possibly 99% is a little bit too much :-) |
Well, I solved the problem by turning on sockets debug in lwip (by setting #define SOCKETS_DEBUG LWIP_DBG_ON in lwipopts.h), but it's not a solution for production. By enabling 'lwip sockets debug' I effectively added delays, and now everything works really well. If adding delays helps, that means there are some locking or logic errors in the tcp_transport or lwip components, and I hope they will be found and fixed. |
@no1seman @bigbrassbed I thank you for your efforts and insights. It is quite a finding. I'll try and reproduce this behavior on my side. In the meanwhile, @fshamshirdar could you please try introducing delays or use LWIP_DBG_ON to see if that fixes your issue as well? |
confirmed for my situation receiving data (...when I eventually found the right place for switching on debug-messages and those appeared :-) ) |
...unfortunately enabling debug messages does not always help (... possibly not enough total 'delay' added) |
@bigbrassbed and @no1seman Just to ensure that this is not due to some recent changes on master, could you please check your tests with IDF release v3.2.2? Otherwise this kind of issue would have shown up in other places as well, not just in the context of the HTTP server.
@no1seman, could you also post the difference between the expected and received files in case of checksum mismatch? I believe you already have local copies of the files that have been kept on the server. The comparison could help pinpoint if there is some pattern to the data loss (like only at the end of the file or only the beginning). If these mismatch patterns are not random then it could still help point to some issue in the |
|
@no1seman With the wireshark logs that you provided, I was able to recreate the TCP stream : What I notice is that there is no CRLF terminator to indicate the end of the headers section:
The transaction contains two exchanges, so you can compare it to the first response which does have correct termination. From the monitor log I can clearly make out that the server did send the CRLF to terminate the headers section:
From the Update: By comparing the expected and received files, the only difference I see is that two consecutive bytes are replaced with null characters. The packet size remains the same; this strange replacement just happens once in a while. Interesting! |
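For anyone who wants to look for the same corruption pattern in their own captures, here is a small host-side sketch in plain C. It scans two equal-length buffers and reports the first run of bytes that are null in the received copy but not in the expected one. The function name `find_nulled_run` and its signature are hypothetical, written only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the offset of the first byte that is 0x00 in `received` but
 * non-zero in `expected`, or -1 if there is no such byte.
 * The length of that run of nulled bytes is stored in *run_len. */
long find_nulled_run(const unsigned char *expected,
                     const unsigned char *received,
                     size_t len, size_t *run_len)
{
    assert(expected != NULL && received != NULL && run_len != NULL);

    *run_len = 0;
    for (size_t i = 0; i < len; i++) {
        if (received[i] == 0 && expected[i] != 0) {
            size_t j = i;
            /* Extend the run while the same corruption pattern holds. */
            while (j < len && received[j] == 0 && expected[j] != 0) {
                j++;
            }
            *run_len = j - i;
            return (long)i;
        }
    }
    return -1;
}
```

A run length of 2 at varying offsets would match the observation above. Running the check over many failed downloads would show whether the offsets follow a pattern (for example, alignment to a cache line) or are random.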
@anurag-kar, |
@no1seman Thanks for testing this! What I can make out is that the headers problem is occurring due to some glitch in transmission (like you said). At least it seems to happen during If possible, could you please provide a wireshark capture during |
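The missing header terminator described in this thread can also be checked mechanically on a reassembled TCP stream. The sketch below is plain C for a host machine and assumes the raw response bytes are already in memory; the name `http_header_end` is invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the offset just past the first CRLF CRLF sequence in `buf`,
 * i.e. where the body starts, or -1 if the headers are never terminated. */
long http_header_end(const char *buf, size_t len)
{
    assert(buf != NULL);

    static const char terminator[] = "\r\n\r\n";
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, terminator, 4) == 0) {
            return (long)(i + 4);
        }
    }
    return -1;
}
```

A response in which two bytes of the terminator were replaced by nulls returns -1 here, which is consistent with the browser reporting `net::ERR_INVALID_HTTP_RESPONSE`.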
@anurag-kar, Prerequisites:
How to use:
On the last test I got the following result: Full log is here: Could you please try the test on your side? |
@no1seman Thanks! I'll try your test program this weekend. |
@negativekelvin,
I'm a little bit confused, because I later had the idea that there is a problem with SPI PSRAM. However, mbedtls_md5_ret() (look into my code) works fine with a buffer located in SPI PSRAM (in later versions of my test script I checked the md5 not only as calculated by the http server but also compared it with one computed on my Linux host, and it was always correct), but when data is transmitted by tcp_transport from SPI PSRAM it is occasionally corrupted. So it seems that the source of the problem has been found; I'm sure you will provide a fix soon. PS When I switch on CONFIG_SPIRAM_TRY_ALLOCATE_WIFI_LWIP = y with igrr's patched lib, things go from bad to worse: the test gives many more errors than in the other cases. |
@anurag-kar; @no1seman made some more tests: lwip_recvfrom: netconn_recv err=-3, netbuf=0x0 => sockets.c around line 1038. Basically we are sitting in netconn_recv_data waiting for the timeout to occur. I fiddled around with timeout values, but 'wifi softAP' resets the connection (no messages come up for this event) long before the timeout is triggered. I haven't figured out so far what triggers this in wifi softAP. (To recap: transmission via wifi STA shows no problems.) |
Are you using SPI PSRAM during your tests (CONFIG_ESP32_SPIRAM_SUPPORT==y)? |
CONFIG_SPIRAM_SUPPORT=y, SPI RAM config: CONFIG_SPIRAM_BOOT_INIT=y |
no difference -> wifi softAP still stops working (after a while) |
Thank you for the answer. I don't know if it helps you, but I modified my test app to support SoftAP: https://github.com/no1seman/transmit_bug_ap I can't reproduce your bug ("W (206326) httpd_txrx: httpd_sock_err: error in send : 11"), but all the problems that I have with Ethernet/WiFi STA are the same: freezes, bad content and so on. If you change the sdkconfig, web-server settings, softAP settings and http headers in the app above to match yours, I can try to test on my side, which would also help find the bug faster. Instructions are the same: #3851 (comment) |
I got it somehow running :-) first connect to softAP so big question: why does softAP reset the connection? |
@no1seman could you please explain your solution a bit? How and where did you add a delay? |
@zahednejad203, long answer: my bug was related to the PSRAM cache issue. So if you are using external PSRAM on ESP32 rev. 1, just don't, because there is no final solution at the moment and, as Espressif said in #2892 (comment), it will be solved only in rev. 3 silicon. If you are not using external PSRAM, your problem is something else. |
Or work in unicore mode, then SPI-RAM works. |
@neoniousTR |
It seems that UniCore does not help. httpd hangs randomly. If the WDT is disabled, the system stays functional (i.e. other tasks keep running). Same behaviour in AP and STA modes.
`E (2969827) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
abort() was called at PC 0x400d4eef on core 0
Backtrace:0x400973b2:0x3ffb06a0 0x40097115:0x3ffb06c0 0x4009a476:0x3ffb06e0 0x400d4eef:0x3ffb0750 0x40083802:0x3ffb0770 0x4000bfed:0x3ffbe3f0 0x4009a861:0x3ffbe400 0x4009b54a:0x3ffbe420 0x4009111a:0x3ffbe460 0x40172219:0x3ffbe480 0x4009387b:0x3ffbe4a0 0x40172522:0x3ffbe4c0 0x40094002:0x3ffbe520 0x400d335b:0x3ffbe5c0 0x400fdf20:0x3ffbe630 0x400fe02e:0x3ffbe670 0x4009a6e1:0x3ffbe690
0x40097115: esp_system_abort at /Users/igor/esp/esp-idf/components/esp_system/system_api.c:68
0x4009a476: abort at /Users/igor/esp/esp-idf/components/newlib/abort.c:46
0x400d4eef: task_wdt_isr at /Users/igor/esp/esp-idf/components/esp_common/src/task_wdt.c:179 (discriminator 1)
0x40083802: _xt_lowint1 at /Users/igor/esp/esp-idf/components/freertos/xtensa/xtensa_vectors.S:1105
0x4009a861: vPortExitCritical at /Users/igor/esp/esp-idf/components/freertos/xtensa/port.c:419
0x4009b54a: xQueueGenericReceive at /Users/igor/esp/esp-idf/components/freertos/queue.c:1541
0x4009111a: sys_mutex_lock at /Users/igor/esp/esp-idf/components/lwip/port/esp32/freertos/sys_arch.c:82
0x40172219: sock_inc_used_locked at /Users/igor/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:378
0x4009387b: tryget_socket_unconn_locked at /Users/igor/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:489
0x40172522: lwip_selscan at /Users/igor/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:1921
0x40094002: lwip_select at /Users/igor/esp/esp-idf/components/lwip/lwip/src/api/sockets.c:2058
0x400d335b: esp_vfs_select at /Users/igor/esp/esp-idf/components/vfs/vfs.c:985
0x400fdf20: httpd_server at /Users/igor/esp/esp-idf/components/esp_http_server/src/httpd_main.c:174
0x400fe02e: httpd_thread at /Users/igor/esp/esp-idf/components/esp_http_server/src/httpd_main.c:227
0x4009a6e1: vPortTaskWrapper at /Users/igor/esp/esp-idf/components/freertos/xtensa/port.c:143` |
This loop consumes 100% of the CPU. If the WDT is off, the ESP does not accept new connections. |
I am having the same issue with my ESP32 httpd server randomly hanging. I am currently on IDF version 4.2. This is problematic when the ESP serves as a home automation node and suddenly stops responding. Have any other root causes or workarounds been identified? I am thinking of switching to the mongoose webserver to see if it works more reliably. |
@shaeberling Two things I tried that gave me some improvements. I am using the camera_web_server example of the esp-who kit. I changed the receive timeout to 30 seconds in esp-idf/components/esp_http_server/include/esp_http_server.h:
#define HTTPD_DEFAULT_CONFIG() { \
        .task_priority = tskIDLE_PRIORITY+5, \
        .stack_size = 4096, \
        .core_id = tskNO_AFFINITY, \
        .server_port = 80, \
        .ctrl_port = 32768, \
        .max_open_sockets = 7, \
        .max_uri_handlers = 8, \
        .max_resp_headers = 8, \
        .backlog_conn = 5, \
        .lru_purge_enable = false, \
        .recv_wait_timeout = 30, /* <---- This one here */ \
        .send_wait_timeout = 5, \
        .global_user_ctx = NULL, \
        .global_user_ctx_free_fn = NULL, \
        .global_transport_ctx = NULL, \
        .global_transport_ctx_free_fn = NULL, \
https://scientric.com/2019/11/07/esp32-cam-stream-capture/ |
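The same receive timeout can be set from application code instead of editing the SDK header, which keeps the change out of the ESP-IDF tree. This is a minimal sketch that builds only inside an ESP-IDF project; `start_webserver` is just an illustrative function name, and the value 30 is the one from the comment above, not a recommended setting.

```c
#include "esp_err.h"
#include "esp_http_server.h"

/* Start the HTTP server with a longer receive timeout,
 * overriding the default in the application rather than in the header. */
httpd_handle_t start_webserver(void)
{
    httpd_config_t config = HTTPD_DEFAULT_CONFIG();
    config.recv_wait_timeout = 30;  /* seconds; value taken from the comment above */

    httpd_handle_t server = NULL;
    if (httpd_start(&server, &config) != ESP_OK) {
        return NULL;  /* caller decides how to report the failure */
    }
    return server;
}
```

Overriding fields on the `httpd_config_t` instance also survives an ESP-IDF upgrade, whereas an edited header would be overwritten.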
Thank you @Louis-Riel. I am not sure this is the problem that I have. While I haven't reproduced it yet with the device connected so that I could read the logs, I think it might be due to the core going to 99% and then no longer handling http requests, since I can still ping it. Since I have no easy way to repro, I decided to give mongoose a try. It was pretty easy for me to add it to the code base and it has been working flawlessly so far. |
I finally root-caused the issue and have an easy repro. Browsers by default keep the connection to the server alive to speed up future requests. However, disconnecting the client device from the network before it is able to close the connection results in that socket remaining open. Eventually, we run out of the maximum configured socket count and the server stops responding. Here is an easy repro:
The default max open socket count for httpd is 10, so after about 10 requests you will see that it is no longer responding. When using mongoose, you'll instead get a My solution for now is to send Am I missing a timeout setting that force closes open sockets after they have been idle for a while? |
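To make the exhaustion mechanism concrete, here is a small host-side simulation in plain C. It is not the esp_http_server implementation, and every name in it (`session_table_t`, `table_accept`, `MAX_SESSIONS`) is hypothetical. It models a fixed-size session table: without purging, a table full of abandoned keep-alive connections rejects every new client; with least-recently-used purging, which is the idea behind the `lru_purge_enable` field visible in the config macro quoted earlier in this thread, the oldest session is evicted instead.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SESSIONS 4  /* hypothetical table size for the illustration */

typedef struct {
    int fd;                  /* -1 marks a free slot */
    unsigned long last_use;  /* logical timestamp of the last activity */
} session_t;

typedef struct {
    session_t slots[MAX_SESSIONS];
    unsigned long clock;
    int purge_lru;  /* non-zero: evict the least recently used session when full */
} session_table_t;

void table_init(session_table_t *t, int purge_lru)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_SESSIONS; i++) {
        t->slots[i].fd = -1;
        t->slots[i].last_use = 0;
    }
    t->clock = 0;
    t->purge_lru = purge_lru;
}

/* Returns the slot index used for the new connection, or -1 if rejected. */
int table_accept(session_table_t *t, int fd)
{
    assert(t != NULL && fd >= 0);

    int victim = -1;
    for (int i = 0; i < MAX_SESSIONS; i++) {
        if (t->slots[i].fd < 0) {
            victim = i;  /* free slot: use it directly */
            break;
        }
        if (victim < 0 || t->slots[i].last_use < t->slots[victim].last_use) {
            victim = i;  /* remember the least recently used session */
        }
    }

    if (t->slots[victim].fd >= 0 && !t->purge_lru) {
        return -1;  /* table full of idle sessions: the server looks dead */
    }

    /* A real server would close() the evicted descriptor here. */
    t->slots[victim].fd = fd;
    t->slots[victim].last_use = ++t->clock;
    return victim;
}
```

With `purge_lru` set to 0, the fifth call to `table_accept` returns -1, mirroring the point where the server stops responding. With `purge_lru` set to 1, the fifth connection takes over the slot of the oldest one.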
@shaeberling Maybe setting |
@no1seman @fshamshirdar PSRAM issue has been fixed now, please try with @shaeberling Did above suggestion help you? |
I reproduce the same issue on my project with lru_purge_enable set to true. |
@shaeberling @SamyRICHET I was able to reproduce the issue with steps mentioned here. Can you try the patch attached below and check if it fixes the issue (when lru_purge_enable is set to true). Please let us know if the patch works for you. |
@shubhamkulkarni97 For me, this patch works on my project. Thank you. |
Hi. |
Hi there,
I have developed some modules for my ESP32 device (WiFi softAP, HTTP server, WebSocket, spiffs for local storage and SPI to communicate with a SPI master). Now the issue with this system is that after a few minutes of running, the wifi or webserver stops working properly. By that I mean sometimes the webpage does not load with the following log message, or ping packet goes with "Destination Host Unreachable".
"W (206326) httpd_txrx: httpd_sock_err: error in send : 11"
This problem goes away completely when I disable the SPI module. Even though I applied a much higher priority to the httpd_config_t, this problem persists.
Any thoughts?