esp32: fix socket locks across TCP connects by pguyot · Pull Request #2332 · atomvm/AtomVM

pguyot · 2026-06-08T05:55:50Z

These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).

SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later

Signed-off-by: Paul Guyot <pguyot@kallisys.net>

petermm · 2026-06-08T09:07:18Z

AMP, pick and choose:

Code Review — `esp32: fix socket locks across TCP connects`

Commit: db8444df1ea43b3fb96e33e079a53983a05d5f05
File: src/platforms/esp32/components/avm_builtins/socket_driver.c

Summary

The commit correctly fixes the main use-after-free in socket_events_handler (copying
process_id out of the locked list before sending the message) and stops holding the
global socket-list lock across blocking netconn_connect / netconn_accept. The non-blocking
accept + re-register-accepter approach is a reasonable way to tolerate spurious/stale ready
events from lwIP.

However, the change introduces / leaves several real correctness holes:

#	Severity	Issue
1	High	Failed `do_connect` leaves a live port with `platform_data == NULL`; gen-call handlers will NULL-deref
2	High	`do_tcp_server_netconn_event` removes all accepters but only services/re-appends the last → leak + lost accepters
3	Medium	`do_receive_data` clears `conn = NULL` after delete + list removal → recycled-netconn race window
4	Medium	`do_listen` does not NULL-check `netconn_new(...)`, and holds the global socket lock across `bind`/`getaddr`
5	Low	`purge_ready_connections` matches on an already-freed `netconn *` (identity-token fragility under address reuse)
6	Low	Unchecked `malloc` in `socket_events_handler` / `do_accept`
7	Low	`netbuf` leaks on `do_receive_data` error paths (pre-existing, worth fixing while here)

The raw struct netconn * is used as the only event identity throughout; under heap/lwIP
address reuse this remains inherently fragile. Fixes below reduce the worst races but don't
make recycled pointers impossible to misclassify.

1. (High) Failed `connect` leaves a live port that will NULL-deref

do_connect sets ctx->platform_data = NULL on connect failure, but the only guard added is
in the netconn_event_internal branch of socket_consume_mailbox. Every gen-call handler
(do_close, do_send, do_recvfrom, do_sockname, …) dereferences ctx->platform_data
unconditionally. After a failed connect the port stays alive and unusable, so:

it leaks the Context / port, and
any subsequent close/send/recv/… on it crashes.

Fix — terminate the port when init fails, and guard non-init messages defensively:

         term cmd_name = term_get_tuple_element(gen_message.req, 0);

+        if (cmd_name != INIT_ATOM && IS_NULL_PTR(ctx->platform_data)) {
+            int32_t pid = term_to_local_process_id(gen_message.pid);
+            uint64_t ref_ticks = term_to_ref_ticks(gen_message.ref);
+
+            if (cmd_name == CLOSE_ATOM) {
+                if (UNLIKELY(memory_ensure_free(ctx, REPLY_SIZE) != MEMORY_GC_OK)) {
+                    AVM_ABORT();
+                }
+                do_send_reply(ctx, OK_ATOM, ref_ticks, pid);
+                return NativeTerminate;
+            }
+
+            do_send_error_reply(ctx, ERR_CONN, ref_ticks, pid);
+            mailbox_remove_message(&ctx->mailbox, &ctx->heap);
+            return NativeContinue;
+        }
+
         switch (cmd_name) {
             //TODO: remove this
             case INIT_ATOM:
                 TRACE("init\n");
                 do_init(ctx, &gen_message);
+                if (IS_NULL_PTR(ctx->platform_data)) {
+                    // Init failed (it already sent its error reply); don't leave a
+                    // live port whose handlers will NULL-deref platform_data.
+                    return NativeTerminate;
+                }
                 break;

The init error reply is sent synchronously by do_connect/do_listen before returning, so
the caller still receives the error before the port terminates (same pattern as do_close).

2. (High) `do_tcp_server_netconn_event` drops/leaks all but the last accepter

The loop removes every accepter from the list but only keeps the last in accepter; the
others are removed from the list and never freed/serviced → memory leak and lost accept
requests.

    MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
        accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
        list_remove(accepter_head);   // removes ALL, keeps only last
    }

Fix — pop a single accepter:

-    struct ListHead *accepter_head;
-    struct ListHead *tmp;
     struct TCPServerAccepter *accepter = NULL;
-    MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
+
+    if (!list_is_empty(&tcp_data->accepters_list_head)) {
+        struct ListHead *accepter_head = list_first(&tcp_data->accepters_list_head);
         accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
         list_remove(accepter_head);
     }

(list_first / list_is_empty are available in src/libAtomVM/list.h.)

3. (Medium) Recycled-netconn race in `do_receive_data`

do_receive_data deletes the netconn first, removes the socket from the list, and only then
sets socket_data->conn = NULL. During the window between netconn_delete and list_remove,
the freed (and possibly recycled) conn pointer is still published in platform->sockets, so
a concurrent socket_events_handler (different scheduler thread) can match a recycled pointer
to this dying socket. do_close is better (it sets conn = NULL first) but still writes it
outside the lock that the event handler reads under.

Fix — clear the published conn under the socket-list write lock before deleting, then
operate on a local copy:

         if (socket_data->type == TCPClientSocket) {
             struct ESP32PlatformData *platform = ctx->global->platform_data;
-            if (UNLIKELY(netconn_close(socket_data->conn) != ERR_OK)) {
+            struct netconn *closed_conn = socket_data->conn;
+
+            struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+            UNUSED(sockets_head);
+            socket_data->conn = NULL;
+            synclist_unlock(&platform->sockets);
+
+            if (UNLIKELY(netconn_close(closed_conn) != ERR_OK)) {
                 TRACE("do_receive_data: netconn_close failed\n");
             }
-            if (UNLIKELY(netconn_delete(socket_data->conn) != ERR_OK)) {
+            if (UNLIKELY(netconn_delete(closed_conn) != ERR_OK)) {
                 TRACE("do_receive_data: netconn_delete failed\n");
             }
-            struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+            sockets_head = synclist_wrlock(&platform->sockets);
             UNUSED(sockets_head);
             list_remove(&socket_data->sockets_head);
-            purge_ready_connections(platform, socket_data->conn);
+            purge_ready_connections(platform, closed_conn);
             synclist_unlock(&platform->sockets);
-            socket_data->conn = NULL;
         }

Apply the same "NULL under lock before delete" ordering in do_close and in the do_connect
failure path for consistency.

4. (Medium) `do_listen`: missing NULL-check + over-broad lock

do_listen calls netconn_bind(conn, …) without checking netconn_new_with_proto_and_callback
for NULL (unlike do_connect, which AVM_ABORT()s). It also holds the global socket lock
across bind / getaddr / listen, which is broader than necessary.

Keep the lock around listen → publish (an early accept event between a successful listen
and publishing the server socket would otherwise be misclassified as an unknown ready conn),
but move new/bind/getaddr out of it:

-    // Lock list of sockets before the event callback is called
-    struct ListHead *sockets = socket_data_preinit(platform);
     struct netconn *conn = netconn_new_with_proto_and_callback(NETCONN_TCP, 0, socket_callback);
+    if (IS_NULL_PTR(conn)) {
+        AVM_ABORT();
+    }

     err_t status = netconn_bind(conn, IP_ADDR_ANY, port);
     if (UNLIKELY(status != ERR_OK)) {
         netconn_delete(conn);
-        socket_data_postinit(platform);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }
@@
     status = netconn_getaddr(conn, &naddr, &nport, 1);
     if (UNLIKELY(status != ERR_OK)) {
         netconn_delete(conn);
-        socket_data_postinit(platform);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }

+    // Hold the socket-list lock from listen until the server socket is published,
+    // otherwise an early accept event could be misclassified as an unknown conn.
+    struct ListHead *sockets = socket_data_preinit(platform);
     status = netconn_listen_with_backlog(conn, backlog);
     if (UNLIKELY(status != ERR_OK)) {
-        netconn_delete(conn);
         socket_data_postinit(platform);
+        netconn_delete(conn);
         do_send_error_reply(ctx, status, ref_ticks, pid);
         return;
     }

5. (Low) `purge_ready_connections` matches a freed pointer

In all three callers (do_receive_data, do_connect failure, do_close),
purge_ready_connections runs after netconn_delete(conn) and compares
ready->netconn == conn. It never dereferences the pointer, so it's not a UAF — but using the
raw, freed pointer as an identity token is fragile: lwIP/heap can recycle the same address, so
a purge could theoretically remove a ready entry belonging to a brand-new connection at the
same address. The mitigation in #3 narrows the window. A complete solution needs a stronger
event identity (see "Deeper fix" below). Acceptable for now; worth a code comment.

6. (Low) Unchecked `malloc`

socket_events_handler (the ReadyConnection alloc) and do_accept (the TCPServerAccepter
alloc) dereference malloc results without checking. Match the codebase convention:

                 struct ReadyConnection *ready = (struct ReadyConnection *) malloc(sizeof (struct ReadyConnection));
+                if (IS_NULL_PTR(ready)) {
+                    AVM_ABORT();
+                }
                 ready->netconn = event.netconn;

         struct TCPServerAccepter *accepter = malloc(sizeof(struct TCPServerAccepter));
+        if (IS_NULL_PTR(accepter)) {
+            AVM_ABORT();
+        }
         accepter->accepting_process_pid = pid;

7. (Low) `netbuf` leaks on `do_receive_data` error paths (pre-existing)

The early-return error paths in do_receive_data leak buf:

     status = netbuf_data(buf, &data, &data_len);
     if (UNLIKELY(status != ERR_OK)) {
         TRACE("do_receive_data: netbuf_data error: %i\n", status);
+        netbuf_delete(buf);
         do_send_socket_error(ctx, status);
         return NativeContinue;
     }
@@
         if (addr_term == term_invalid_term()) {
             TRACE("do_receive_data: socket_addr_to_tuple error\n");
+            netbuf_delete(buf);
             do_send_socket_error(ctx, ERR_BUF);
             return NativeContinue;
         }
@@
     if (netbuf_next(buf) == 0) {
         TRACE("do_receive_data: netbuf error : got more parts\n");
+        netbuf_delete(buf);
         do_send_socket_error(ctx, ERR_BUF);
         return NativeContinue;
     }

Items reviewed and considered OK

netconn_set_nonblocking(listening conn, 1) never reset. Permanent non-blocking on the
listening socket is consistent with the new accept design; acceptable (could be set once
after listen for clarity).
do_accept resetting ready_connections = 0 on ERR_WOULDBLOCK. If the accept queue
really held a connection, netconn_accept would return it; ERR_WOULDBLOCK means the hint
was stale, so clearing it and registering an accepter is reasonable.
accepted_conn context creation moved after accept success. Correct — no Context is
leaked on the ERR_WOULDBLOCK/early-return path.

Deeper fix (only if races persist)

If spurious accepts, lost ready events, or stuck netconn_recv remain after the above, stop
using the bare struct netconn * as the sole event identity. Wrap each socket in a
driver-owned object carrying a generation counter / state flag, and enqueue
{conn, generation, len} so stale or recycled events can be discarded deterministically.

Verification notes

These findings are from static review; the ESP32 socket driver was not built/run as part of
this review. Recommended validation: build the esp32 platform and exercise TCP
connect-failure, server accept under load (multiple simultaneous accept waiters), and
close-during-receive paths.

esp32: fix socket locks across TCP connects

db8444d

Signed-off-by: Paul Guyot <pguyot@kallisys.net>

pguyot mentioned this pull request Jun 8, 2026

scheduler: avoid redundant scheduler wake on process handoff #2329

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

esp32: fix socket locks across TCP connects#2332

esp32: fix socket locks across TCP connects#2332
pguyot wants to merge 1 commit into
atomvm:release-0.7from
pguyot:w23/esp32-socket-locks

pguyot commented Jun 8, 2026

Uh oh!

petermm commented Jun 8, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

pguyot commented Jun 8, 2026

Uh oh!

petermm commented Jun 8, 2026

Code Review — esp32: fix socket locks across TCP connects

Summary

1. (High) Failed connect leaves a live port that will NULL-deref

2. (High) do_tcp_server_netconn_event drops/leaks all but the last accepter

3. (Medium) Recycled-netconn race in do_receive_data

4. (Medium) do_listen: missing NULL-check + over-broad lock

5. (Low) purge_ready_connections matches a freed pointer

6. (Low) Unchecked malloc

7. (Low) netbuf leaks on do_receive_data error paths (pre-existing)

Items reviewed and considered OK

Deeper fix (only if races persist)

Verification notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Code Review — `esp32: fix socket locks across TCP connects`

1. (High) Failed `connect` leaves a live port that will NULL-deref

2. (High) `do_tcp_server_netconn_event` drops/leaks all but the last accepter

3. (Medium) Recycled-netconn race in `do_receive_data`

4. (Medium) `do_listen`: missing NULL-check + over-broad lock

5. (Low) `purge_ready_connections` matches a freed pointer

6. (Low) Unchecked `malloc`

7. (Low) `netbuf` leaks on `do_receive_data` error paths (pre-existing)