Skip to content

esp32: fix socket locks across TCP connects#2332

Open
pguyot wants to merge 1 commit into
atomvm:release-0.7from
pguyot:w23/esp32-socket-locks
Open

esp32: fix socket locks across TCP connects#2332
pguyot wants to merge 1 commit into
atomvm:release-0.7from
pguyot:w23/esp32-socket-locks

Conversation

@pguyot
Copy link
Copy Markdown
Collaborator

@pguyot pguyot commented Jun 8, 2026

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

Signed-off-by: Paul Guyot <pguyot@kallisys.net>
@petermm
Copy link
Copy Markdown
Contributor

petermm commented Jun 8, 2026

AMP, pick and choose:

Code Review — esp32: fix socket locks across TCP connects

Commit: db8444df1ea43b3fb96e33e079a53983a05d5f05
File: src/platforms/esp32/components/avm_builtins/socket_driver.c

Summary

The commit correctly fixes the main use-after-free in socket_events_handler (copying
process_id out of the locked list before sending the message) and stops holding the
global socket-list lock across blocking netconn_connect / netconn_accept. The non-blocking
accept + re-register-accepter approach is a reasonable way to tolerate spurious/stale ready
events from lwIP.

However, the change introduces / leaves several real correctness holes:

# Severity Issue
1 High Failed do_connect leaves a live port with platform_data == NULL; gen-call handlers will NULL-deref
2 High do_tcp_server_netconn_event removes all accepters but only services/re-appends the last → leak + lost accepters
3 Medium do_receive_data clears conn = NULL after delete + list removal → recycled-netconn race window
4 Medium do_listen does not NULL-check netconn_new(...), and holds the global socket lock across bind/getaddr
5 Low purge_ready_connections matches on an already-freed netconn * (identity-token fragility under address reuse)
6 Low Unchecked malloc in socket_events_handler / do_accept
7 Low netbuf leaks on do_receive_data error paths (pre-existing, worth fixing while here)

The raw struct netconn * is used as the only event identity throughout; under heap/lwIP
address reuse this remains inherently fragile. Fixes below reduce the worst races but don't
make recycled pointers impossible to misclassify.


1. (High) Failed connect leaves a live port that will NULL-deref

do_connect sets ctx->platform_data = NULL on connect failure, but the only guard added is
in the netconn_event_internal branch of socket_consume_mailbox. Every gen-call handler
(do_close, do_send, do_recvfrom, do_sockname, …) dereferences ctx->platform_data
unconditionally. After a failed connect the port stays alive and unusable, so:

  • it leaks the Context / port, and
  • any subsequent close/send/recv/… on it crashes.

Fix — terminate the port when init fails, and guard non-init messages defensively:

         term cmd_name = term_get_tuple_element(gen_message.req, 0);

+        if (cmd_name != INIT_ATOM && IS_NULL_PTR(ctx->platform_data)) {
+            int32_t pid = term_to_local_process_id(gen_message.pid);
+            uint64_t ref_ticks = term_to_ref_ticks(gen_message.ref);
+
+            if (cmd_name == CLOSE_ATOM) {
+                if (UNLIKELY(memory_ensure_free(ctx, REPLY_SIZE) != MEMORY_GC_OK)) {
+                    AVM_ABORT();
+                }
+                do_send_reply(ctx, OK_ATOM, ref_ticks, pid);
+                return NativeTerminate;
+            }
+
+            do_send_error_reply(ctx, ERR_CONN, ref_ticks, pid);
+            mailbox_remove_message(&ctx->mailbox, &ctx->heap);
+            return NativeContinue;
+        }
+
         switch (cmd_name) {
             //TODO: remove this
             case INIT_ATOM:
                 TRACE("init\n");
                 do_init(ctx, &gen_message);
+                if (IS_NULL_PTR(ctx->platform_data)) {
+                    // Init failed (it already sent its error reply); don't leave a
+                    // live port whose handlers will NULL-deref platform_data.
+                    return NativeTerminate;
+                }
                 break;

The init error reply is sent synchronously by do_connect/do_listen before returning, so
the caller still receives the error before the port terminates (same pattern as do_close).


2. (High) do_tcp_server_netconn_event drops/leaks all but the last accepter

The loop removes every accepter from the list but only keeps the last in accepter; the
others are removed from the list and never freed/serviced → memory leak and lost accept
requests.

    MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
        accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
        list_remove(accepter_head);   // removes ALL, keeps only last
    }

Fix — pop a single accepter:

-    struct ListHead *accepter_head;
-    struct ListHead *tmp;
     struct TCPServerAccepter *accepter = NULL;
-    MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
+
+    if (!list_is_empty(&tcp_data->accepters_list_head)) {
+        struct ListHead *accepter_head = list_first(&tcp_data->accepters_list_head);
         accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
         list_remove(accepter_head);
     }

(list_first / list_is_empty are available in src/libAtomVM/list.h.)


3. (Medium) Recycled-netconn race in do_receive_data

do_receive_data deletes the netconn first, removes the socket from the list, and only then
sets socket_data->conn = NULL. During the window between netconn_delete and list_remove,
the freed (and possibly recycled) conn pointer is still published in platform->sockets, so
a concurrent socket_events_handler (different scheduler thread) can match a recycled pointer
to this dying socket. do_close is better (it sets conn = NULL first) but still writes it
outside the lock that the event handler reads under.

Fix — clear the published conn under the socket-list write lock before deleting, then
operate on a local copy:

         if (socket_data->type == TCPClientSocket) {
             struct ESP32PlatformData *platform = ctx->global->platform_data;
-            if (UNLIKELY(netconn_close(socket_data->conn) != ERR_OK)) {
+            struct netconn *closed_conn = socket_data->conn;
+
+            struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+            UNUSED(sockets_head);
+            socket_data->conn = NULL;
+            synclist_unlock(&platform->sockets);
+
+            if (UNLIKELY(netconn_close(closed_conn) != ERR_OK)) {
                 TRACE("do_receive_data: netconn_close failed\n");
             }
-            if (UNLIKELY(netconn_delete(socket_data->conn) != ERR_OK)) {
+            if (UNLIKELY(netconn_delete(closed_conn) != ERR_OK)) {
                 TRACE("do_receive_data: netconn_delete failed\n");
             }
-            struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+            sockets_head = synclist_wrlock(&platform->sockets);
             UNUSED(sockets_head);
             list_remove(&socket_data->sockets_head);
-            purge_ready_connections(platform, socket_data->conn);
+            purge_ready_connections(platform, closed_conn);
             synclist_unlock(&platform->sockets);
-            socket_data->conn = NULL;
         }

Apply the same "NULL under lock before delete" ordering in do_close and in the do_connect
failure path for consistency.


4. (Medium) do_listen: missing NULL-check + over-broad lock

do_listen calls netconn_bind(conn, …) without checking netconn_new_with_proto_and_callback
for NULL (unlike do_connect, which AVM_ABORT()s). It also holds the global socket lock
across bind / getaddr / listen, which is broader than necessary.

Keep the lock around listen → publish (an early accept event between a successful listen
and publishing the server socket would otherwise be misclassified as an unknown ready conn),
but move new/bind/getaddr out of it:

-    // Lock list of sockets before the event callback is called
-    struct ListHead *sockets = socket_data_preinit(platform);
     struct netconn *conn = netconn_new_with_proto_and_callback(NETCONN_TCP, 0, socket_callback);
+    if (IS_NULL_PTR(conn)) {
+        AVM_ABORT();
+    }

     err_t status = netconn_bind(conn, IP_ADDR_ANY, port);
     if (UNLIKELY(status != ERR_OK)) {
         netconn_delete(conn);
-        socket_data_postinit(platform);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }
@@
     status = netconn_getaddr(conn, &naddr, &nport, 1);
     if (UNLIKELY(status != ERR_OK)) {
         netconn_delete(conn);
-        socket_data_postinit(platform);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }

+    // Hold the socket-list lock from listen until the server socket is published,
+    // otherwise an early accept event could be misclassified as an unknown conn.
+    struct ListHead *sockets = socket_data_preinit(platform);
     status = netconn_listen_with_backlog(conn, backlog);
     if (UNLIKELY(status != ERR_OK)) {
-        netconn_delete(conn);
         socket_data_postinit(platform);
+        netconn_delete(conn);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }

5. (Low) purge_ready_connections matches a freed pointer

In all three callers (do_receive_data, do_connect failure, do_close),
purge_ready_connections runs after netconn_delete(conn) and compares
ready->netconn == conn. It never dereferences the pointer, so it's not a UAF — but using the
raw, freed pointer as an identity token is fragile: lwIP/heap can recycle the same address, so
a purge could theoretically remove a ready entry belonging to a brand-new connection at the
same address. The mitigation in #3 narrows the window. A complete solution needs a stronger
event identity (see "Deeper fix" below). Acceptable for now; worth a code comment.


6. (Low) Unchecked malloc

socket_events_handler (the ReadyConnection alloc) and do_accept (the TCPServerAccepter
alloc) dereference malloc results without checking. Match the codebase convention:

                 struct ReadyConnection *ready = (struct ReadyConnection *) malloc(sizeof (struct ReadyConnection));
+                if (IS_NULL_PTR(ready)) {
+                    AVM_ABORT();
+                }
                 ready->netconn = event.netconn;
         struct TCPServerAccepter *accepter = malloc(sizeof(struct TCPServerAccepter));
+        if (IS_NULL_PTR(accepter)) {
+            AVM_ABORT();
+        }
         accepter->accepting_process_pid = pid;

7. (Low) netbuf leaks on do_receive_data error paths (pre-existing)

The early-return error paths in do_receive_data leak buf:

     status = netbuf_data(buf, &data, &data_len);
     if (UNLIKELY(status != ERR_OK)) {
         TRACE("do_receive_data: netbuf_data error: %i\n", status);
+        netbuf_delete(buf);
         do_send_socket_error(ctx, status);
         return NativeContinue;
     }
@@
         if (addr_term == term_invalid_term()) {
             TRACE("do_receive_data: socket_addr_to_tuple error\n");
+            netbuf_delete(buf);
             do_send_socket_error(ctx, ERR_BUF);
             return NativeContinue;
         }
@@
     if (netbuf_next(buf) == 0) {
         TRACE("do_receive_data: netbuf error : got more parts\n");
+        netbuf_delete(buf);
         do_send_socket_error(ctx, ERR_BUF);
         return NativeContinue;
     }

Items reviewed and considered OK

  • netconn_set_nonblocking(listening conn, 1) never reset. Permanent non-blocking on the
    listening socket is consistent with the new accept design; acceptable (could be set once
    after listen for clarity).
  • do_accept resetting ready_connections = 0 on ERR_WOULDBLOCK. If the accept queue
    really held a connection, netconn_accept would return it; ERR_WOULDBLOCK means the hint
    was stale, so clearing it and registering an accepter is reasonable.
  • accepted_conn context creation moved after accept success. Correct — no Context is
    leaked on the ERR_WOULDBLOCK/early-return path.

Deeper fix (only if races persist)

If spurious accepts, lost ready events, or stuck netconn_recv remain after the above, stop
using the bare struct netconn * as the sole event identity. Wrap each socket in a
driver-owned object carrying a generation counter / state flag, and enqueue
{conn, generation, len} so stale or recycled events can be discarded deterministically.


Verification notes

These findings are from static review; the ESP32 socket driver was not built/run as part of
this review. Recommended validation: build the esp32 platform and exercise TCP
connect-failure, server accept under load (multiple simultaneous accept waiters), and
close-during-receive paths.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants