Conversation

@baranyaib90
Contributor

Hi Aaron,
I hope you are doing well! I suppose you won't like me now. I have a lot of things for review.
Content list:

  1. TCP support, to fix #186 (TCP support)
  2. UDP reply truncation feature, to fix #184 (https_dns_proxy + dnsmasq + systemd-resolved dnssec failure)
  3. Code quality stuff: more compiler warnings and warning fixes
  4. c-ares API modernization
  5. Reset HTTPS connection to avoid half-open TCP problems, to fix #189 (a persistent timeout occurs when the network changes: static IP changed or PPPoE reconnected)
  6. systemd improvements (mostly for my Ubuntu server)
  7. Version bump (I guess it is time)

@stangri if I may ask, please remove the file https://github.com/openwrt/packages/blob/master/net/https-dns-proxy/patches/010-cmakelists-remove-cflags.patch when you uplift the OpenWRT package after this pull request is merged. I have compiled the OpenWRT package myself and have not met any warnings. If there are any, please let me know. Thanks!

Take your time.
Best regards,
Balázs

Preparations for TCP serving
- added TCP client limit option, also can be used
  to disable TCP DNS server in case of issues
- readme and help pages updated
- combined API with UDP DNS server
- using minimally necessary public header
- added TCP only statistic counters
- added TCP functional tests
  - basic queries with valgrind
  - fragmented TCP request processing
  - testing 4k large response support
- started using stricter compiler warnings
  to ensure code quality and make life harder

Notes:
- chose to use enum or int (e.g. ares_status_t)
- CURLINFO_SSL_VERIFYRESULT was wrong
- https_fetch_ctx_cleanup's curl_result_code parameter
- int options range checked in options.c
If there aren't any successful requests.
Goal is to indicate service readiness after
bootstrap has finished.
Because of larger feature pack
@baranyaib90
Contributor Author

P.S.: I have been testing my changes for 2 months and have even asked issue reporters to test with a proxy built from my repo to confirm it works well. You can check the conversations on the bug reports. I suppose it will be good enough.

@aarond10
Owner

bugbot run

@cursor cursor bot left a comment


src/main.c Outdated
https_client_init(&https_client, &opt, (opt.stats_interval ? &stat : NULL), loop);

struct addrinfo *listen_addrinfo = get_listen_address(opt.listen_addr);
((struct sockaddr_in*) listen_addrinfo->ai_addr)->sin_port = htons((uint16_t)opt.listen_port);

Bug: DNS Server Fails IPv6 Port Handling

The DNS server incorrectly casts network address structures to sockaddr_in when accessing port numbers. This assumption breaks for IPv6 addresses, leading to memory corruption when setting ports and incorrect port values when extracting them, impacting both UDP and TCP operations.
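
A family-aware version of the port assignment might look like the sketch below. This is an illustration of the fix the bot suggests, not code from this PR; `set_port` is a hypothetical helper name.

```c
#include <arpa/inet.h>
#include <netdb.h>
#include <netinet/in.h>
#include <stdint.h>

/* Hypothetical helper: set the port on a resolved address without
 * assuming IPv4. Casting ai_addr to sockaddr_in is only valid for
 * AF_INET; AF_INET6 addresses need sockaddr_in6. */
static int set_port(struct addrinfo *ai, uint16_t port) {
  switch (ai->ai_family) {
  case AF_INET:
    ((struct sockaddr_in *)ai->ai_addr)->sin_port = htons(port);
    return 0;
  case AF_INET6:
    ((struct sockaddr_in6 *)ai->ai_addr)->sin6_port = htons(port);
    return 0;
  default:
    return -1; /* unsupported address family */
  }
}
```

With this shape, listen addresses like `127.0.0.1` and `::1` can share the same call site.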


if (len <= 0) {
if (len == 0 || errno == ECONNRESET) {
DLOG_CLIENT("Connection closed");
} else if (errno == EAGAIN && errno == EWOULDBLOCK) {

Bug: Buffer Type Mismatch and Error Handling Bug

In read_cb, the recv call uses &buf instead of buf for the buffer, which is a type mismatch. Also, the errno check for EAGAIN and EWOULDBLOCK uses && instead of ||. This condition will never be true, causing non-blocking errors to be mishandled.
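
The corrected error handling can be sketched as a self-contained classifier (hypothetical names; the real read_cb operates on a live socket):

```c
#include <errno.h>
#include <sys/types.h>

/* Sketch of the corrected read_cb error handling. len is the recv()
 * return value; err is the errno observed when len < 0. */
typedef enum { HAVE_DATA, CONN_CLOSED, WOULD_BLOCK, HARD_ERROR } recv_state;

static recv_state classify_recv(ssize_t len, int err) {
  if (len > 0)
    return HAVE_DATA;
  if (len == 0 || err == ECONNRESET)
    return CONN_CLOSED;                      /* peer closed or reset */
  if (err == EAGAIN || err == EWOULDBLOCK)   /* ||, not && */
    return WOULD_BLOCK;                      /* retry on next event */
  return HARD_ERROR;
}
```

(On Linux, EAGAIN and EWOULDBLOCK happen to have the same value, which can mask the `&&` bug there; `||` is the portable check.)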


@baranyaib90
Contributor Author

I have fixed the bugbot findings. Before merging, let's squash the fixups.
The sin_port trick was intentional, since the port sits at the same memory offset in both the sockaddr_in and sockaddr_in6 structs. But let's be safe then.

src/dns_server.c Outdated
/* recvfrom can write to addrlen */
socklen_t tmp_addrlen = d->addrlen;
ssize_t len = recvfrom(w->fd, buf, REQUEST_MAX, 0, (struct sockaddr*)&raddr,
char tmp_buf[UINT16_MAX]; // stack buffer for largest UDP packet to support EDNS
Owner


I worry a little about this. It's probably OK but might blow up on some small systems with limited stack space.
We copy it into a heap allocated buffer anyway, so what about

char *dns_req = (char *)malloc(UINT16_MAX);
recvfrom(...);
...
char *shrunk = (char *)realloc(dns_req, len);
if (shrunk == NULL) { ... }  /* keep dns_req valid if realloc fails */
dns_req = shrunk;

Contributor Author


Even on OpenWrt the stack limit is 8 MB per thread, and stack size grows only by usage. For my taste 64k is not a huge deal, and I would not mess with the heap for temporary buffers. I would also not reserve more on the heap (and shrink later), to avoid heap fragmentation. So I'm not expecting any issues with this.
Anyway: I will think about this.
Anyway: I will think about this.

return udp_size;
}

static void truncate_dns_response(char *buf, size_t *buflen, const uint16_t size_limit) {
Owner


Oh wow!

Contributor Author


Wow what? :D What a long, overkill function? Well, yes :(
I have tested this a lot manually with valgrind+dig to avoid mistakes, but I'm still not sure it is 100% OK. I hope calling into it will happen extremely rarely, since the EDNS0 UDP buffer limit is enough for most oversized responses.

Owner

@aarond10 aarond10 left a comment


These are some solid changes. Thanks! The UINT16_MAX stack allocation is the only thing that stands out as possibly problematic. Hope I didn't miss anything. :D

@baranyaib90
Contributor Author

You are welcome! Let's not hurry; check again in a few days. I know it is a large change to take in...

@baranyaib90
Contributor Author

I have been thinking about the stack buffer and I would like to keep it. My compromise is to lower the size from 64k to 4k. The limit before was 1500 bytes, so this should be more than enough, and now it is logged when it is not. In case of TCP this does not cause an issue; the read callback will just be called multiple times. I hope this is acceptable to you; if not, please let me know your proposal.
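
Handling a request split across multiple 4k reads works because DNS over TCP prefixes every message with a two-byte big-endian length (RFC 1035 section 4.2.2). A hypothetical accumulator sketch, not code from this PR:

```c
#include <stdint.h>
#include <string.h>

/* Reassemble a TCP DNS message that may arrive in several reads. */
typedef struct {
  uint8_t msg[UINT16_MAX + 2]; /* 2-byte length prefix + largest message */
  size_t have;                 /* bytes collected so far */
} tcp_accum;

/* Feed one read's worth of bytes; returns 1 once a complete message
 * (prefix included) is buffered, 0 if more reads are needed. */
static int tcp_accum_feed(tcp_accum *a, const uint8_t *data, size_t len) {
  if (a->have + len > sizeof(a->msg))
    len = sizeof(a->msg) - a->have; /* clamp; real code would log/error */
  memcpy(a->msg + a->have, data, len);
  a->have += len;
  if (a->have < 2)
    return 0; /* length prefix not yet complete */
  size_t want = (((size_t)a->msg[0] << 8) | a->msg[1]) + 2;
  return a->have >= want;
}
```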

@aarond10
Owner

aarond10 commented Aug 29, 2025 via email

@aarond10 aarond10 merged commit daa0d39 into aarond10:master Sep 1, 2025
2 checks passed
@baranyaib90
Contributor Author

@stangri do you plan to make the new command line arguments configurable through the luci page?

@stangri
Contributor

stangri commented Sep 1, 2025

@stangri do you plan to make the new command line arguments configurable through the luci page?

I've noticed the following changes, please let me know if I missed any:

  -T tcp_client_limit    Number of TCP clients to serve. (Default: 20, Disabled: 0, Min: 1, Max: 200)
  -q                     Use HTTP/3 (QUIC) only.
  -m max_idle_time       Maximum idle time in seconds allowed for reusing a HTTPS connection.
                         (Default: 118, Min: 0, Max: 3600)
  -L conn_loss_time      Time in seconds to tolerate connection timeouts of reused connections.
                         This option mitigates half-open TCP connection issue (e.g. WAN IP change).
                         (Default: 15, Min: 5, Max: 60)
  -s statistic_interval  Optional statistic printout interval.
                         (Default: 0, Disabled: 0, Min: 1, Max: 3600)
  -F log_limit           Flight recorder: storing desired amount of logs from all levels
                         in memory and dumping them on fatal error or on SIGUSR2 signal.
                         (Default: 0, Disabled: 0, Min: 100, Max: 100000)

I'll add support for uci configuration of these in the init script first. Like with some other packages, I think the less used (or more expert-like) settings should be separated into an advanced tab. Once the init script/uci support these options, it would be a trivial update (beyond adding tabs) to support them in the luci app too.

P.S. So with the binary you can set these options per-instance. The WebUI for the instance settings already looks overloaded, so if you're willing to provide your feedback/confirm which settings have more to do with the environment/OS and can be globalized (like the user/group, which should most likely be the same for all instances) and which settings might be more resolver-specific, I'd be all ears.

@baranyaib90
Contributor Author

Advanced tab sounds great!

Resolver specific:

  • HTTP/3 (QUIC) option if supported by openwrt libcurl
  • max_idle_time, since it depends on resolver server configuration

Other options are more like global.

@stangri
Contributor

stangri commented Sep 1, 2025

Advanced tab sounds great!

The more I thought about integrating more of the CLI options into the init script, the more I think the following makes sense:

  • the options can be defined either at the main config level or instance level
  • if the instance level option isn't defined, it's inherited from the main config level
  • the default value options are no longer added to the PROCD command

So for example:

config main 'config'
	option dnsmasq_config_update '*'
	option force_dns '1'
	list force_dns_port '53'

config https-dns-proxy
	option bootstrap_dns '1.1.1.1,1.0.0.1'
	option resolver_url 'https://cloudflare-dns.com/dns-query'
	option listen_addr '127.0.0.1'
	option listen_port '5053'
	option user 'nobody'
	option group 'nogroup'

config https-dns-proxy
	option bootstrap_dns '8.8.8.8,8.8.4.4'
	option resolver_url 'https://dns.google/dns-query'
	option listen_addr '127.0.0.1'
	option listen_port '5054'
	option user 'nobody'
	option group 'nogroup'

Can become more compact:

config main 'config'
	option dnsmasq_config_update '*'
	option force_dns '1'
	list force_dns_port '53'
	option listen_addr '127.0.0.1'
	option user 'nobody'
	option group 'nogroup'

config https-dns-proxy
	option bootstrap_dns '1.1.1.1,1.0.0.1'
	option resolver_url 'https://cloudflare-dns.com/dns-query'
	option listen_port '5053'

config https-dns-proxy
	option bootstrap_dns '8.8.8.8,8.8.4.4'
	option resolver_url 'https://dns.google/dns-query'
	option listen_port '5054'

And the PROCD command no longer includes -a 127.0.0.1 as it's a default value.

That gives flexibility in configuring and in adding options support to the WebUI.

That also means that I would have to update the init script in multiple places if some default values change upstream/here.

I've documented it at https://docs.openwrt.melmac.ca/https-dns-proxy/#global-settings-for-instances and CI is rebuilding 2025.09.01-r1 with the init file fixes.

@baranyaib90
Contributor Author

Nice page. The only thing I'm missing is the settings' value ranges, e.g. (Default: 0, Disabled: 0, Min: 1, Max: 3600).
Obviously it is up to you to include them or not.

@stangri
Contributor

stangri commented Sep 2, 2025

Nice page. The only thing I'm missing is the settings' value ranges, e.g. (Default: 0, Disabled: 0, Min: 1, Max: 3600). Obviously it is up to you to include them or not.

We're on the same page. I wanted to add another column with the value options, but the table is already about as wide as the theme allows. I might have to play with the theme/settings at some point to allow another column. All the docs are in this repo: https://github.com/stangri/docs.openwrt.melmac.ca

@stangri
Contributor

stangri commented Sep 10, 2025

@baranyaib90 on 24.10.2 (x86-64) I got the following in dmesg:

[2379127.133000] traps: https-dns-proxy[18318] trap divide error ip:404bde sp:7ffeae0f68d0 error:0 in https-dns-proxy[403000+7000]
[2380210.516055] traps: https-dns-proxy[19980] trap divide error ip:404bde sp:7fff96e5f0d0 error:0 in https-dns-proxy[403000+7000]
[2394799.142337] traps: https-dns-proxy[22169] trap divide error ip:404bde sp:7ffcb278bcd0 error:0 in https-dns-proxy[403000+7000]

@baranyaib90
Contributor Author

baranyaib90 commented Sep 10, 2025

It looks like a divide by 0 error, but the location is unclear to me. If you can reproduce the issue, could you run the proxy with debug logs? I would need the last ~1000 logs of the stopped proxy.
Please open an issue for this.

@stangri
Contributor

stangri commented Sep 10, 2025

Sorry, clarification: I was looking for something else in dmesg and saw these. I don't know how/when it happened; I'll try to reproduce, but I'm not sure if I can.

@baranyaib90
Contributor Author

Actually, you may not need to help me out with the reproduction. I have reviewed the new code I introduced, and I may have found the issue: answers could be 0 at 866a916#diff-593ced5c0d2c1e50378c33c17cd07cac9c6ba0856e3919474081ef40c68155abR142
I will try to come up with a fix tomorrow.
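
The fix presumably amounts to guarding the division. A generic sketch with illustrative names (the actual expression is in the linked diff, which I haven't seen):

```c
#include <stdint.h>

/* Hypothetical guard sketch: a response with an empty answer section
 * (only authority/additional records) yields answers == 0, so any
 * division by the answer count must be checked first. */
static uint32_t average_ttl(uint64_t ttl_sum, uint32_t answers) {
  if (answers == 0)
    return 0; /* avoids the 'divide error' trap seen in dmesg */
  return (uint32_t)(ttl_sum / answers);
}
```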

@baranyaib90
Contributor Author

I would like to ask you to build and test from my repo (https://github.com/baranyaib90/https_dns_proxy) to see whether the division by 0 issue got fixed. Thanks!

@stangri
Contributor

stangri commented Sep 12, 2025

Thanks @baranyaib90 I've built a binary from your repo. Any insight on how to try to trigger an event which previously resulted in a division by zero?

@baranyaib90
Contributor Author

Well, hard to tell actually. I suspect the div by 0 could have happened in two ways: the DNS response had:

  1. No answer, just a large authority and additional section, or
  2. Way too many answers, like a few hundred.

I would recommend instead to:

  1. Check dmesg with human-readable dates and try to recall what you did, e.g. check your browser history for which sites you were visiting and visit those sites again.
  2. Wait 1 week and check whether the crash happens again or not.

@baranyaib90
Contributor Author

Hi @stangri, are there any new crashes mentioned in dmesg since?
I'm planning to raise the pull request soon.

@stangri
Contributor

stangri commented Sep 20, 2025

Hey @baranyaib90, nothing in dmesg. Your change either worked or at least did no harm, so I'd vote in favour of PR/merging this code. I'll create PR for the OpenWrt package once your code is merged in this repo.

@baranyaib90 baranyaib90 mentioned this pull request Sep 21, 2025