
LwIP socket: unwanted behaviour and memory leak after "pulling-the-plug" event (IDFGH-7051) #8668

Open
ChrisHul opened this issue Mar 27, 2022 · 11 comments
Labels
Awaiting Response awaiting a response from the author Status: In Progress Work is in progress

Comments

@ChrisHul

ChrisHul commented Mar 27, 2022

Environment

  • Module or chip used: ESP32-WROOM-32U
  • IDF version: v4.4
  • Build System: Cmake
  • Compiler version: xtensa-esp32-elf-gcc.exe (crosstool-NG esp-2021r2-patch2) 8.4.0
  • Operating System: Windows
  • (Windows only) environment type: PowerShell.
  • Using an IDE?: Eclipse Version: 2021-12 (4.22.0)
  • Power Supply: USB

Problem Description

  • I started having problems connecting an ESP32 client to an ESP32 running a Mongoose MQTT server. After resetting the client a few times, the server would refuse new connections with error 23 (too many connections). The server runs https://github.com/nopnop2002/esp-idf-mqtt-broker.
  • Connecting and disconnecting in software is no problem: closing the socket on one end signals an error/termination to the read on the other end, which triggers a closure there as well.
  • When I reset the client board, the connections stall on the server side and become both unreadable and unwritable. Mongoose does not detect this. Update: a stalled socket resolves on its own after 2 hours (!), allowing a read that returns n < 0. New update: 2 hours is the default TCP connection keep-alive time; the default value can be changed with a setsockopt() call.
  • Another work-around is modifying the software so that this condition is detected and the socket closed. This also eliminates the socket.
  • After this I discovered that closing the socket leaves allocated memory behind on the heap (randomly 0-36 bytes). The leak accumulates and does not go away, but remains steady if I stop breaking connections. I checked this because there are reports suggesting that lwIP may return memory only after a rather long period of time.

Expected Behavior

The socket should not stall, but rather raise an error that the caller can detect. In any case it should release its allocated memory when it is closed.

Actual Behavior

The socket connection stalls and leaves dead memory behind if it is closed. Update: if it is not closed, it resolves with a termination code after a very long delay (2 hours).

Steps to reproduce

Terminate the connection abruptly and monitor the heap size.
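
A minimal sketch of how the free heap can be logged while reproducing, using the standard ESP-IDF esp_get_free_heap_size() API (the monitoring task below is only an illustration, not part of the broker code):

    #include "esp_system.h"            // esp_get_free_heap_size()
    #include "esp_log.h"
    #include "freertos/FreeRTOS.h"
    #include "freertos/task.h"

    static const char *TAG = "heapmon";

    // Low-priority task: a steady downward trend in the reported value while
    // the client keeps resetting points at leaked per-connection memory.
    static void heap_monitor_task(void *arg) {
        for (;;) {
            ESP_LOGI(TAG, "free heap: %u bytes", (unsigned) esp_get_free_heap_size());
            vTaskDelay(pdMS_TO_TICKS(60 * 1000));
        }
    }
    // Started once at boot, e.g.:
    // xTaskCreate(heap_monitor_task, "heapmon", 2048, NULL, tskIDLE_PRIORITY + 1, NULL);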

Code to trigger zombie socket shutdown

    for (struct client *next, *client = s_clients; client != NULL; client = next) {
        next = client->next;
        if (cid.len == client->cid.len &&
            memcmp(cid.ptr, client->cid.ptr, cid.len) == 0) {
            client->c->is_closing = 1;  // will close connection in manager poll
        }
    }

Mongoose own shutdown trigger

    n = c->is_tls ? mg_tls_recv(c, buf, len) : mg_sock_recv(c, buf, len);
    ...
    if (n == 0) {
        // Do nothing
    } else if (n < 0) {
        c->is_closing = 1;  // Error, or normal termination
    } else if (n > 0) { ...

TCP keep-alive time (defaults to 7200 seconds, i.e. 2 hours) can be changed as follows; the keep-alive variables shown are example values:

        // Set TCP keep-alive options (example values; idle/interval are in seconds)
        int keepAlive = 1, keepIdle = 60, keepInterval = 10, keepCount = 5;
        setsockopt((int) c->fd, SOL_SOCKET,  SO_KEEPALIVE,  &keepAlive,    sizeof(int));
        setsockopt((int) c->fd, IPPROTO_TCP, TCP_KEEPIDLE,  &keepIdle,     sizeof(int));
        setsockopt((int) c->fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepInterval, sizeof(int));
        setsockopt((int) c->fd, IPPROTO_TCP, TCP_KEEPCNT,   &keepCount,    sizeof(int));


@github-actions github-actions bot changed the title LwIP socket: unwanted behaviour and memory leak after "pulling-the-plug" event LwIP socket: unwanted behaviour and memory leak after "pulling-the-plug" event (IDFGH-7051) Mar 27, 2022
@espressif-bot espressif-bot added the Status: Opened Issue is new label Mar 27, 2022
@negativekelvin
Contributor

"I modified the software"

Please show your modifications.

@ChrisHul
Author

Updated the issue description with the server code used to shut down unwanted (dead) socket connections.
Update: I left the server running, and it turns out two connections closed automatically (triggered by len < 0 from recv), one 1.5 hours after the connection was interrupted and the other one 2 hours after. The timing is so precise that it must have been programmed that way.

@sramrajkar

Hi,
I have a very similar setup, except my ESP32 module is on a custom board, and I see this issue as well. For me there is no recovery of any kind once the sockets max out. Pulling the Ethernet cable and reconnecting kills all zombie sockets and my MQTT client reconnects.
@ChrisHul I want to try your work-around; where should it be applied? In the loop that polls the MQTT manager?

@ChrisHul
Author

ChrisHul commented Apr 14, 2022

Hi,
I have updated the problem description regarding TCP keep-alive, which defaults to 2 hours. This value can be changed with the appropriate API as shown above.
My work-around consists of detecting whether the client name already exists when a new connection is opened. If it does, the older connection is redundant, and setting connection->is_closing resolves it inside the Mongoose poll loop, eliminating the dead connection.
Both solutions need to be implemented in the event handler when MQTT_CMD_CONNECT is handled.
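
A minimal sketch of where these hooks could sit, assuming the four-argument event-handler signature of Mongoose 7.x (broker_fn is a placeholder name; the duplicate-cid loop is the one shown in the issue description, and struct client / s_clients are the broker's own bookkeeping):

    #include "mongoose.h"
    #include "lwip/sockets.h"

    static void broker_fn(struct mg_connection *c, int ev, void *ev_data, void *fn_data) {
        if (ev == MG_EV_ACCEPT) {
            // Work-around 1: shorten the TCP keep-alive on the accepted socket
            // (same setsockopt() calls as in the snippet above, example values).
            int keepAlive = 1, keepIdle = 60, keepInterval = 10, keepCount = 5;
            setsockopt((int) (size_t) c->fd, SOL_SOCKET,  SO_KEEPALIVE,  &keepAlive,    sizeof(int));
            setsockopt((int) (size_t) c->fd, IPPROTO_TCP, TCP_KEEPIDLE,  &keepIdle,     sizeof(int));
            setsockopt((int) (size_t) c->fd, IPPROTO_TCP, TCP_KEEPINTVL, &keepInterval, sizeof(int));
            setsockopt((int) (size_t) c->fd, IPPROTO_TCP, TCP_KEEPCNT,   &keepCount,    sizeof(int));
        } else if (ev == MG_EV_MQTT_CMD) {
            struct mg_mqtt_message *mm = (struct mg_mqtt_message *) ev_data;
            if (mm->cmd == MQTT_CMD_CONNECT) {
                // Work-around 2: a client reconnecting with an already-known cid means
                // the old connection is dead -- run the duplicate-cid loop from the
                // issue description here and set is_closing on the stale connection.
            }
        }
    }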

@ESP-YJM
Collaborator

ESP-YJM commented Apr 15, 2022

@ChrisHul I think you should add a mechanism for detecting zombie clients on the server side; you can use MQTT PING/PONG packets or TCP keepalive. For the memory leak, can you give more details about how to reproduce it, and try it many times locally, e.g. check whether running it about 100 times occupies a large amount of memory.
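
A minimal sketch of what an application-level variant of that detection could look like, assuming the broker adds a (hypothetical) last_seen timestamp to its own struct client bookkeeping and refreshes it whenever a packet arrives from that client:

    #include <time.h>

    #define CLIENT_IDLE_TIMEOUT_S 90  // somewhat longer than the clients' keep-alive interval

    // Called periodically from the loop that runs mg_mgr_poll().
    static void close_idle_clients(void) {
        time_t now = time(NULL);
        for (struct client *next, *client = s_clients; client != NULL; client = next) {
            next = client->next;
            if (now - client->last_seen > CLIENT_IDLE_TIMEOUT_S) {  // last_seen: hypothetical field
                client->c->is_closing = 1;  // Mongoose closes it on the next poll
            }
        }
    }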

@espressif-bot espressif-bot added Status: In Progress Work is in progress and removed Status: Opened Issue is new labels Apr 15, 2022
@sramrajkar

@ChrisHul I may have found a solution myself.
The TCP-based solution you mentioned mostly works, but only with MQTT clients that are well behaved. I am working with some sensors that are rather buggy: they don't close old ports and keep the old TCP connections active, yet request new connections at the broker.
As for the client-ID-match-based solution, I have myself used a random number generator for the client ID in the past, so this method will not work in that case.
I am not keen on editing the Mongoose file itself.
My method involves counting the number of connections that the manager thinks are active, and once it reaches a threshold I kill all open connections, letting devices reconnect via the MQTT reconnection mechanism that most devices implement. This works for me because my product will support a fixed maximum number of MQTT clients.
Here is the code in the loop that calls the manager's poller:
    int connectionCtr = 0;  // declared here for completeness
    for (c = mgr.conns; c != NULL; c = tmp) {
        tmp = c->next;
        connectionCtr++;
    }
    if (connectionCtr > 6u) {
        for (c = mgr.conns; c != NULL; c = tmp) {
            tmp = c->next;
            if (c->id != 1) {  // skip connection with id 1 (presumably the listener)
                c->is_closing = 1;
            }
        }
    }

@ChrisHul
Author

Thank You both @sramrajkar and @ESP-YJM for sharing your thoughts,

In principle I fully agree with @ESP-YJM that the MQTT protocol should be primarily responsible for time-out handling. Unfortunately Mongoose does not support MQTT time-outs and will only respond to client pings. Full stop. No timing function is enabled in this software. If there is a better MQTT layer that allows a server implementation, please advise.

I have also updated the issue description with the server/broker software being used. This is where I applied the modifications mentioned above. Apologies for not stating that previously.

As for the memory leak, I have stress-tested the server with a client that resets twice a minute and reconnects. Both short- and long-term results indicate a memory leak. Leaving the two devices interacting overnight reduces the heap on the server device by 15-20 kB; in the short term I can observe the heap shrinking by an average of 20 bytes per cycle. I have checked the Mongoose memory handling and it seems clean, so I suspect the lwIP component is at fault. I'll be happy to provide further details for reproducing this issue.

I consider both suggested fixes work-arounds and far from ideal, but for the time being my priority is getting my project working. Sorry I couldn't help, @sramrajkar, but I wish you good luck with your project.

@ESP-YJM
Collaborator

ESP-YJM commented Apr 18, 2022

@ChrisHul
Author

ChrisHul commented Apr 18, 2022

@ESP-YJM I did previously check the esp-idf instructions for debugging memory leak issues, but the structure of the lwIP socket software itself is rather overwhelming. Debugging the closure of a socket involves code such as:

err = netconn_prepare_delete(sock->conn);

where one can assume that execution is not linear but deferred to a later occasion, perhaps triggered by an interrupt or a timer. This makes it very difficult to pinpoint any particular code as being susceptible to memory leaks. Memory is allocated and freed all the time and would rapidly fill up the debugging record if tracing were left active over time.
I'll try to "allocate" some of my own time in the coming weeks and try out a number of test setups to see their effects.
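
One way around the deferred execution is ESP-IDF heap tracing, which records allocations that are never freed regardless of which task or timer performed them. A minimal sketch, assuming CONFIG_HEAP_TRACING_STANDALONE is enabled in menuconfig, wrapped around a handful of connect/disconnect cycles:

    #include "esp_err.h"
    #include "esp_heap_trace.h"

    #define NUM_RECORDS 200
    static heap_trace_record_t trace_records[NUM_RECORDS];  // buffer kept in internal RAM

    void trace_leak_over_cycles(void) {
        ESP_ERROR_CHECK(heap_trace_init_standalone(trace_records, NUM_RECORDS));
        ESP_ERROR_CHECK(heap_trace_start(HEAP_TRACE_LEAKS));

        // ... break and re-establish the client connection a few times here ...

        ESP_ERROR_CHECK(heap_trace_stop());
        heap_trace_dump();  // prints allocations that are still outstanding, with caller addresses
    }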

@ESP-YJM
Collaborator

ESP-YJM commented Oct 18, 2023

@ChrisHul Any update on this issue?

@ChrisHul
Author

Hi @ESP-YJM! Not really. I have been working on the hardware side and on main-processor programming. I will check with updated ESP32 software when I return to the networking part again. Thanks for your interest.

@Alvin1Zhang Alvin1Zhang added the Awaiting Response awaiting a response from the author label Oct 22, 2023