Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DISABLE_ASYNCTCP_ACK_TIMEOUTS build flag. #1712

Closed
wants to merge 4 commits into from

Conversation

mmakaay
Copy link
Member

@mmakaay mmakaay commented Apr 22, 2021

What does this implement/fix?

This change makes it possible to disable AsyncTCP's TCP ACK timeout checks for the API server, which on some systems cause regular faulty API client disconnects. As a workaround, these checks can be disabled by setting the DISABLE_ASYNCTCP_ACK_TIMEOUTS build flag.

An api component configuration option is provided to make it easy to enable the workaround build flag from the YAML configuration.

Types of changes

  • Bugfix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Configuration change (this will require users to update their yaml configuration files to keep working)

Related issue or feature (if applicable): fixes

Pull request in esphome-docs with documentation (if applicable): esphome/esphome-docs#1129

Test Environment

  • ESP32
  • ESP8266
  • Windows
  • Mac OS
  • Linux

Example entry for config.yaml:

api:
  ack_timeout_workaround: true

Initially, I didn't create the api config option for enabling the workaround build flag. Reasons for this were:

  • Possibly disabling the ack timeouts is a good idea for any build. If that would be decided at some point, it would invalidate the configuration option for it, causing a backward incompatible config change.
  • Possibly, we might find that this fix is needed for an identifiable subset of devices (e.g. single core ESP32 varieties). In such case, having a build flag option to apply the fix is a flexible way of automating that fix when needed.
  • Moreover, disabling the TCP ACK timeout checks from the API-server would tightly couple the underlying TCP implementation to the more abstract config level of the API. IMO, such option would therefore be wildly out of place in the API config.

However, @jesserockz suggested adding an api option on Discord, so I updated the code to add one.
The coupling was my main concern, but it is not too bad at all. There are currently no other TCP implementations implemented, and even if there were, enabling the build flag would not hurt the build if there were any.

Explain your changes

The AsyncTCP library implements TCP ACK timeout checking in its server code, to check for disconnected clients. When an ACK timeout is spotted, the related client will be forcibly disconnected from the server.

Unfortunately, the implementation of these checks can be a bit buggy at times. This results in API client connections being evicted for no good reason. For users this shows up as devices that regularly get marked as unavailable within Home Assistant.

An attempt was made to fix an identified race condition in pull request #4 for the AsyncTCP fork that is in use by ESPHome. However, as you can read here, the integration of this patched version of AsyncTCP in ESPHome was reverted, because for somebody else, the fix actually resulted in extra disconnects. Although there are other reports that the fix does work, it is of no use when others get into troubles because of it.

Looking into other options, I realized that I could fully disable the ACK timeout checks by setting the timeout to zero. This is a built-in option for the AsyncTCP library. I tested my device with the unpatched version of AsyncTCP, but including the introduced build flag. The result was above expectations. I got a really stable device, which still could detect disconnected clients. For details, check out my little story on this.

Checklist:

  • The code change is tested and works locally.
  • Tests have been added to verify that the new code works (under tests/ folder).

If user exposed functionality or configuration variables are added/changed:

@probot-esphome
Copy link

Hey there @OttoWinter, mind taking a look at this pull request as its been labeled with an integration (api) you are listed as a codeowner for? Thanks!
(message by CodeOwnersMention)

@mmakaay
Copy link
Member Author

mmakaay commented May 5, 2021

After a long debug session yesterday, @glmnet and me might have ended up with a fix that would make this change no longer needed.
The fix for AsyncTCP was improved upon, possibly fixing the disconnect issues at the core.

I will check that fix on my devices. If things are fixed, I will close this pull request.
I will also modify the related docs change. The issue description can stay, but the solution will be different (use ESPHome version x.x.x or later).

@mmakaay
Copy link
Member Author

mmakaay commented May 12, 2021

Yes, I'm closing this pull request for now.
The AsyncTCP fix seems to work, and this supercedes the need for this special compile flag.
I will modify the related documentation pull request to explain that an upgrade to 1.18.0 fixes some category of disconnect issues.

@mmakaay mmakaay closed this May 12, 2021
@github-actions github-actions bot locked and limited conversation to collaborators Sep 14, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants