
Conversation

@zarqman
Contributor

@zarqman zarqman commented Jan 11, 2026

Dnsruby is leaking sockets (both TCP and UDP). The sockets themselves are closed, but @socket is used as a key in several hashes in SelectThread, and those hashes are class-level and live forever, so the key/value pairs are never released. This PR ensures those key/value pairs are cleaned up.
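
For context, the pattern behind the leak looks roughly like this (a simplified sketch, not the actual SelectThread code; method names are illustrative):

# Simplified sketch: a class-level hash keyed by socket keeps the
# socket (and its value) reachable even after the socket is closed.
class SelectThread
  @@socket_hash = {}

  def register(socket, query)
    @@socket_hash[socket] = query
  end

  def close_socket(socket)
    socket.close
    # Without this delete, the closed socket can never be GC'd:
    @@socket_hash.delete(socket)
  end
end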

This has been tested in production here and shows a marked improvement in overall memory consumption. Sockets are being GC'd properly now.

A count of in-memory objects (via the sigdump gem):

# Before
97,841: UDPSocket
4: Socket    (TCP sockets)

# After
6: UDPSocket
0: Socket
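
For anyone wanting to reproduce the measurement without sigdump, a minimal equivalent (a sketch; sigdump's SIGCONT handler produces a much fuller report):

require 'socket'

# Force a full GC, then count live socket objects, similar to the
# per-class counts in sigdump's output.
GC.start
puts "UDPSocket: #{ObjectSpace.each_object(UDPSocket).count}"
puts "Socket:    #{ObjectSpace.each_object(Socket).count}"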

@keithrbennett
Contributor

I ran this through Codex (OpenAI GPT models) and got this as a message I could paste here:

Thanks for the fix! I checked other socket-close paths and found a couple of places where we still leave per-socket state behind:

  • lib/dnsruby/select_thread.rb:441 (EOF in tcp_read): closes the socket and removes it from @@sockets, but does not clear @@socket_hash, @@socket_remaining_queries,
    @@socket_is_persistent, or @@tcp_buffers.
  • lib/dnsruby/select_thread.rb:254 (clean_up_closed_sockets): removes closed sockets from @@sockets but never clears the per-socket hashes.
  • lib/dnsruby/select_thread.rb:231 (idle stop): closes non‑persistent sockets and clears @@socket_hash, but leaves @@socket_remaining_queries, @@socket_is_persistent, and @@tcp_buffers
    behind.

If the intent is to fully clean up per-socket state on any close, it might be worth reusing the new cleanup here too (or centralizing socket teardown).
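
For concreteness, a centralized teardown could look something like this (a hypothetical sketch; only the class-variable names are taken from the report above):

# Hypothetical single teardown point: remove a socket's entries
# from every per-socket class-level structure at once.
def self.forget_socket(socket)
  @@sockets.delete(socket)
  @@socket_hash.delete(socket)
  @@socket_remaining_queries.delete(socket)
  @@socket_is_persistent.delete(socket)
  @@tcp_buffers.delete(socket)
end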

@alexdalitz alexdalitz self-assigned this Jan 12, 2026
@alexdalitz alexdalitz merged commit 1a89d4c into alexdalitz:master Jan 12, 2026
4 checks passed
@alexdalitz
Owner

@zarqman Thanks for the patch!

@keithrbennett Thanks for this - however, all paths lead through the method that @zarqman has fixed, so it should be good.

@zarqman zarqman deleted the fix-leaking-sockets branch January 12, 2026 17:20
@keithrbennett
Contributor

@alexdalitz Are you sure? I pulled the newly updated master branch down to the local copy of my fork and asked Codex again, and it was pretty specific about its answer:


• I checked the local dnsruby checkout and the earlier issues are still present. The PR fix added cleanup in remove_id_from_mutex_synchronized_block, but there are still other close paths
that bypass it and don’t clear the per‑socket hashes/buffers.

  • lib/dnsruby/select_thread.rb:231 idle-stop block closes non‑persistent sockets and clears only @@socket_hash, leaving @@socket_remaining_queries, @@socket_is_persistent, and
    @@tcp_buffers.
  • lib/dnsruby/select_thread.rb:256 clean_up_closed_sockets removes from @@sockets but doesn’t clear the per‑socket hashes/buffers.
  • lib/dnsruby/select_thread.rb:445 tcp_read EOF path closes the socket and deletes from @@sockets, but doesn’t clear @@socket_hash, @@socket_remaining_queries, @@socket_is_persistent, or
    @@tcp_buffers.

@zarqman
Contributor Author

zarqman commented Jan 12, 2026

@keithrbennett, I won't speak for @alexdalitz, but the socket/selector code is pretty gnarly. Don't know if you've looked at it yourself.

It appears to me that the code paths with partial socket cleanups still end up being fully cleaned up later by remove_id_from_mutex_synchronized_block. TBH, based on my experience with LLMs, I strongly doubt any of them are capable of getting a full grasp of what's really going on inside Dnsruby.

If you're up for an experiment, ask Codex to write a script to demonstrate an actual memory leak using the code paths it has identified as concerns.
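
For example, something along these lines (a rough sketch of the idea, assuming the stock Dnsruby::Resolver API; the query count and domain are arbitrary):

require 'dnsruby'
require 'socket'

# Issue a batch of queries, then force GC and count surviving
# socket objects. If per-socket state is cleaned up, the count
# should stay near zero instead of growing with the query count.
resolver = Dnsruby::Resolver.new
200.times do
  begin
    resolver.query('example.com', 'A')
  rescue Dnsruby::ResolvError, Dnsruby::ResolvTimeout
    # Failures are fine; we only care about socket accounting.
  end
end

GC.start
puts "UDPSocket survivors: #{ObjectSpace.each_object(UDPSocket).count}"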

@keithrbennett
Contributor

@zarqman @alexdalitz I am mortified (but happy for both of you) to say that on further investigation, the Codex critique was bogus. I'm really sorry I wasted your time with this.

I'm sorry to say I did not look at the code myself. I have been using Codex for weeks to identify issues in my project's code base, and it's been super insightful and reliable, so I guess I was overly confident in it.

@zarqman It was a great idea to ask it to write a script to reproduce the error, and I did -- and the script proved that it had been wrong and the sockets are being cleaned up.

In case you're curious, the script is at https://gist.github.com/keithrbennett/4f29f770dc9051bd86a2f0e93764d529.

This was a big lesson for me. I had this interesting conversation with the AI agent after all this:


Me:

› Can we do a postmortem on this whole exercise? Why do you think you mistakenly claimed that the sockets were not being cleaned up properly?

Codex:

• Short postmortem:

  • I inferred a potential leak from reading the code paths (closed sockets removed from @@sockets but not necessarily @@socket_hash in the EOF path). That was a hypothesis, not a confirmed defect.
  • The first script runs didn’t actually exercise the intended path: the Errno::EINVAL connect errors and timeouts meant we weren’t reliably hitting the cleanup logic I was concerned about.
  • I should have been explicit that this was a suspected risk and framed it as “needs confirmation,” not “is leaking.”
  • Once the script was fixed to avoid the reuse/connect error and we added state summaries + final verdict, the evidence showed cleanup was happening.

In short: I over‑interpreted a static read without verifying the runtime path. The corrected script + your output show the sockets are being cleaned up (at least for the exercised path).

Me:

› So I guess in the future I should ask for a script to prove the hypothesis?

Codex:

• Yes — that’s the best move. A quick, targeted script (or test) that exercises the exact code path turns “looks risky” into evidence. A good prompt is: “Write a minimal repro script that hits path X and prints the specific internal state that would confirm or refute the leak.”

@zarqman
Contributor Author

zarqman commented Jan 13, 2026

@keithrbennett, thanks for sharing all that. Super interesting! I think we're all learning how to use LLMs the best, and I found your transcript quite informative in that regard. Thanks for the followup. 😄
