esp32: fix socket locks across TCP connects#2332
Conversation
Signed-off-by: Paul Guyot <pguyot@kallisys.net>
|
AMP, pick and choose: Code Review —
|
| # | Severity | Issue |
|---|---|---|
| 1 | High | Failed do_connect leaves a live port with platform_data == NULL; gen-call handlers will NULL-deref |
| 2 | High | do_tcp_server_netconn_event removes all accepters but only services/re-appends the last → leak + lost accepters |
| 3 | Medium | do_receive_data clears conn = NULL after delete + list removal → recycled-netconn race window |
| 4 | Medium | do_listen does not NULL-check netconn_new(...), and holds the global socket lock across bind/getaddr |
| 5 | Low | purge_ready_connections matches on an already-freed netconn * (identity-token fragility under address reuse) |
| 6 | Low | Unchecked malloc in socket_events_handler / do_accept |
| 7 | Low | netbuf leaks on do_receive_data error paths (pre-existing, worth fixing while here) |
The raw struct netconn * is used as the only event identity throughout; under heap/lwIP
address reuse this remains inherently fragile. Fixes below reduce the worst races but don't
make recycled pointers impossible to misclassify.
1. (High) Failed connect leaves a live port that will NULL-deref
do_connect sets ctx->platform_data = NULL on connect failure, but the only guard added is
in the netconn_event_internal branch of socket_consume_mailbox. Every gen-call handler
(do_close, do_send, do_recvfrom, do_sockname, …) dereferences ctx->platform_data
unconditionally. After a failed connect the port stays alive and unusable, so:
- it leaks the
Context/ port, and - any subsequent
close/send/recv/… on it crashes.
Fix — terminate the port when init fails, and guard non-init messages defensively:
term cmd_name = term_get_tuple_element(gen_message.req, 0);
+ if (cmd_name != INIT_ATOM && IS_NULL_PTR(ctx->platform_data)) {
+ int32_t pid = term_to_local_process_id(gen_message.pid);
+ uint64_t ref_ticks = term_to_ref_ticks(gen_message.ref);
+
+ if (cmd_name == CLOSE_ATOM) {
+ if (UNLIKELY(memory_ensure_free(ctx, REPLY_SIZE) != MEMORY_GC_OK)) {
+ AVM_ABORT();
+ }
+ do_send_reply(ctx, OK_ATOM, ref_ticks, pid);
+ return NativeTerminate;
+ }
+
+ do_send_error_reply(ctx, ERR_CONN, ref_ticks, pid);
+ mailbox_remove_message(&ctx->mailbox, &ctx->heap);
+ return NativeContinue;
+ }
+
switch (cmd_name) {
//TODO: remove this
case INIT_ATOM:
TRACE("init\n");
do_init(ctx, &gen_message);
+ if (IS_NULL_PTR(ctx->platform_data)) {
+ // Init failed (it already sent its error reply); don't leave a
+ // live port whose handlers will NULL-deref platform_data.
+ return NativeTerminate;
+ }
break;The init error reply is sent synchronously by do_connect/do_listen before returning, so
the caller still receives the error before the port terminates (same pattern as do_close).
2. (High) do_tcp_server_netconn_event drops/leaks all but the last accepter
The loop removes every accepter from the list but only keeps the last in accepter; the
others are removed from the list and never freed/serviced → memory leak and lost accept
requests.
MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
list_remove(accepter_head); // removes ALL, keeps only last
}Fix — pop a single accepter:
- struct ListHead *accepter_head;
- struct ListHead *tmp;
struct TCPServerAccepter *accepter = NULL;
- MUTABLE_LIST_FOR_EACH (accepter_head, tmp, &tcp_data->accepters_list_head) {
+
+ if (!list_is_empty(&tcp_data->accepters_list_head)) {
+ struct ListHead *accepter_head = list_first(&tcp_data->accepters_list_head);
accepter = GET_LIST_ENTRY(accepter_head, struct TCPServerAccepter, accepter_head);
list_remove(accepter_head);
}(list_first / list_is_empty are available in src/libAtomVM/list.h.)
3. (Medium) Recycled-netconn race in do_receive_data
do_receive_data deletes the netconn first, removes the socket from the list, and only then
sets socket_data->conn = NULL. During the window between netconn_delete and list_remove,
the freed (and possibly recycled) conn pointer is still published in platform->sockets, so
a concurrent socket_events_handler (different scheduler thread) can match a recycled pointer
to this dying socket. do_close is better (it sets conn = NULL first) but still writes it
outside the lock that the event handler reads under.
Fix — clear the published conn under the socket-list write lock before deleting, then
operate on a local copy:
if (socket_data->type == TCPClientSocket) {
struct ESP32PlatformData *platform = ctx->global->platform_data;
- if (UNLIKELY(netconn_close(socket_data->conn) != ERR_OK)) {
+ struct netconn *closed_conn = socket_data->conn;
+
+ struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+ UNUSED(sockets_head);
+ socket_data->conn = NULL;
+ synclist_unlock(&platform->sockets);
+
+ if (UNLIKELY(netconn_close(closed_conn) != ERR_OK)) {
TRACE("do_receive_data: netconn_close failed\n");
}
- if (UNLIKELY(netconn_delete(socket_data->conn) != ERR_OK)) {
+ if (UNLIKELY(netconn_delete(closed_conn) != ERR_OK)) {
TRACE("do_receive_data: netconn_delete failed\n");
}
- struct ListHead *sockets_head = synclist_wrlock(&platform->sockets);
+ sockets_head = synclist_wrlock(&platform->sockets);
UNUSED(sockets_head);
list_remove(&socket_data->sockets_head);
- purge_ready_connections(platform, socket_data->conn);
+ purge_ready_connections(platform, closed_conn);
synclist_unlock(&platform->sockets);
- socket_data->conn = NULL;
}Apply the same "NULL under lock before delete" ordering in do_close and in the do_connect
failure path for consistency.
4. (Medium) do_listen: missing NULL-check + over-broad lock
do_listen calls netconn_bind(conn, …) without checking netconn_new_with_proto_and_callback
for NULL (unlike do_connect, which AVM_ABORT()s). It also holds the global socket lock
across bind / getaddr / listen, which is broader than necessary.
Keep the lock around listen → publish (an early accept event between a successful listen
and publishing the server socket would otherwise be misclassified as an unknown ready conn),
but move new/bind/getaddr out of it:
- // Lock list of sockets before the event callback is called
- struct ListHead *sockets = socket_data_preinit(platform);
struct netconn *conn = netconn_new_with_proto_and_callback(NETCONN_TCP, 0, socket_callback);
+ if (IS_NULL_PTR(conn)) {
+ AVM_ABORT();
+ }
err_t status = netconn_bind(conn, IP_ADDR_ANY, port);
if (UNLIKELY(status != ERR_OK)) {
netconn_delete(conn);
- socket_data_postinit(platform);
do_send_error_reply(ctx, status, ref_ticks, pid);
return;
}
@@
status = netconn_getaddr(conn, &naddr, &nport, 1);
if (UNLIKELY(status != ERR_OK)) {
netconn_delete(conn);
- socket_data_postinit(platform);
do_send_error_reply(ctx, status, ref_ticks, pid);
return;
}
+ // Hold the socket-list lock from listen until the server socket is published,
+ // otherwise an early accept event could be misclassified as an unknown conn.
+ struct ListHead *sockets = socket_data_preinit(platform);
status = netconn_listen_with_backlog(conn, backlog);
if (UNLIKELY(status != ERR_OK)) {
- netconn_delete(conn);
socket_data_postinit(platform);
+ netconn_delete(conn);
do_send_error_reply(ctx, status, ref_ticks, pid);
return;
}5. (Low) purge_ready_connections matches a freed pointer
In all three callers (do_receive_data, do_connect failure, do_close),
purge_ready_connections runs after netconn_delete(conn) and compares
ready->netconn == conn. It never dereferences the pointer, so it's not a UAF — but using the
raw, freed pointer as an identity token is fragile: lwIP/heap can recycle the same address, so
a purge could theoretically remove a ready entry belonging to a brand-new connection at the
same address. The mitigation in #3 narrows the window. A complete solution needs a stronger
event identity (see "Deeper fix" below). Acceptable for now; worth a code comment.
6. (Low) Unchecked malloc
socket_events_handler (the ReadyConnection alloc) and do_accept (the TCPServerAccepter
alloc) dereference malloc results without checking. Match the codebase convention:
struct ReadyConnection *ready = (struct ReadyConnection *) malloc(sizeof (struct ReadyConnection));
+ if (IS_NULL_PTR(ready)) {
+ AVM_ABORT();
+ }
ready->netconn = event.netconn; struct TCPServerAccepter *accepter = malloc(sizeof(struct TCPServerAccepter));
+ if (IS_NULL_PTR(accepter)) {
+ AVM_ABORT();
+ }
accepter->accepting_process_pid = pid;7. (Low) netbuf leaks on do_receive_data error paths (pre-existing)
The early-return error paths in do_receive_data leak buf:
status = netbuf_data(buf, &data, &data_len);
if (UNLIKELY(status != ERR_OK)) {
TRACE("do_receive_data: netbuf_data error: %i\n", status);
+ netbuf_delete(buf);
do_send_socket_error(ctx, status);
return NativeContinue;
}
@@
if (addr_term == term_invalid_term()) {
TRACE("do_receive_data: socket_addr_to_tuple error\n");
+ netbuf_delete(buf);
do_send_socket_error(ctx, ERR_BUF);
return NativeContinue;
}
@@
if (netbuf_next(buf) == 0) {
TRACE("do_receive_data: netbuf error : got more parts\n");
+ netbuf_delete(buf);
do_send_socket_error(ctx, ERR_BUF);
return NativeContinue;
}Items reviewed and considered OK
netconn_set_nonblocking(listening conn, 1)never reset. Permanent non-blocking on the
listening socket is consistent with the new accept design; acceptable (could be set once
afterlistenfor clarity).do_acceptresettingready_connections = 0onERR_WOULDBLOCK. If the accept queue
really held a connection,netconn_acceptwould return it;ERR_WOULDBLOCKmeans the hint
was stale, so clearing it and registering an accepter is reasonable.accepted_conncontext creation moved after accept success. Correct — noContextis
leaked on theERR_WOULDBLOCK/early-return path.
Deeper fix (only if races persist)
If spurious accepts, lost ready events, or stuck netconn_recv remain after the above, stop
using the bare struct netconn * as the sole event identity. Wrap each socket in a
driver-owned object carrying a generation counter / state flag, and enqueue
{conn, generation, len} so stale or recycled events can be discarded deterministically.
Verification notes
These findings are from static review; the ESP32 socket driver was not built/run as part of
this review. Recommended validation: build the esp32 platform and exercise TCP
connect-failure, server accept under load (multiple simultaneous accept waiters), and
close-during-receive paths.
These changes are made under both the "Apache 2.0" and the "GNU Lesser General
Public License 2.1 or later" license terms (dual license).
SPDX-License-Identifier: Apache-2.0 OR LGPL-2.1-or-later