Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wi-Fi Radio [Dis-]Connection handling: exceptions, reconnection, logging #3837

Closed
anecdata opened this issue Dec 17, 2020 · 12 comments
Closed
Labels
enhancement espressif applies to multiple Espressif chips network
Milestone

Comments

@anecdata
Copy link
Member

anecdata commented Dec 17, 2020

I'm not sure if this really belongs as an issue (there are a couple related already), but I wonder if we need to handle more reasons for disconnection with auto-reconnect, meaningful exception message, more logging coverage, or other handling.

I'm not sure the full distinction between the <100 codes and the >=200 codes. The >=200 codes seem to be more fatal, but we do reconnect on a 205. It may help to explicitly itemize which reasons:

  • should or should not raise an exception
  • should or should not get auto-reconnected
  • should or should not get logged

Some users are experiencing failure to connect initially or to re-connect, reasons sometimes unknown. Clearly there are config and physical reasons why a connection may be frail, and I suppose we don't want to bog down the core with unnecessary strings (can we at least return the numeric codes?), but here's a survey of disconnection reasons, and where the core reacts to them (at least that I've found so far):

Disconnection reasons (code, docs):

typedef enum {
    WIFI_REASON_UNSPECIFIED              = 1,
    WIFI_REASON_AUTH_EXPIRE              = 2,    // ‡ event_handler: log & reconnect
    WIFI_REASON_AUTH_LEAVE               = 3,    // ‡
    WIFI_REASON_ASSOC_EXPIRE             = 4,    // ‡ event_handler: log & reconnect
    WIFI_REASON_ASSOC_TOOMANY            = 5,
    WIFI_REASON_NOT_AUTHED               = 6,    // ‡ event_handler: log & reconnect
    WIFI_REASON_NOT_ASSOCED              = 7,    // ‡
    WIFI_REASON_ASSOC_LEAVE              = 8,    // ‡
    WIFI_REASON_ASSOC_NOT_AUTHED         = 9,
    WIFI_REASON_DISASSOC_PWRCAP_BAD      = 10,
    WIFI_REASON_DISASSOC_SUPCHAN_BAD     = 11,
    WIFI_REASON_IE_INVALID               = 13,
    WIFI_REASON_MIC_FAILURE              = 14,
    WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT   = 15,   // ‡ event_handler: log & reconnect
    WIFI_REASON_GROUP_KEY_UPDATE_TIMEOUT = 16,
    WIFI_REASON_IE_IN_4WAY_DIFFERS       = 17,
    WIFI_REASON_GROUP_CIPHER_INVALID     = 18,
    WIFI_REASON_PAIRWISE_CIPHER_INVALID  = 19,
    WIFI_REASON_AKMP_INVALID             = 20,   // ‡
    WIFI_REASON_UNSUPP_RSN_IE_VERSION    = 21,
    WIFI_REASON_INVALID_RSN_IE_CAP       = 22,
    WIFI_REASON_802_1X_AUTH_FAILED       = 23,
    WIFI_REASON_CIPHER_SUITE_REJECTED    = 24,
    WIFI_REASON_INVALID_PMKID            = 53,
    WIFI_REASON_BEACON_TIMEOUT           = 200,  // ‡
    WIFI_REASON_NO_AP_FOUND              = 201,  // connect: exception
    WIFI_REASON_AUTH_FAIL                = 202,  // connect: exception
    WIFI_REASON_ASSOC_FAIL               = 203,
    WIFI_REASON_HANDSHAKE_TIMEOUT        = 204,
    WIFI_REASON_CONNECTION_FAIL          = 205,  // event_handler: log & reconnect
    WIFI_REASON_AP_TSF_RESET             = 206,
} wifi_err_reason_t;

‡ = disconnection reasons seen in the field with a small fleet of original esp32 over about a six-month period

@tannewt tannewt added enhancement espressif applies to multiple Espressif chips network labels Dec 22, 2020
@tannewt tannewt added this to the Long term milestone Dec 22, 2020
@tannewt
Copy link
Member

tannewt commented Dec 22, 2020

I'm all for expanding what we reconnect on. Perhaps we need to invert the test and only not reconnect on the reason we get when we've triggered the disconnect.

@anecdata
Copy link
Member Author

I'm not sure if there is a reason that is 1:1 with manual disconnect, so maybe that's better as a variable. There's also the possibility of manual disconnect coinciding with some network issue that also triggers a disconnect (probably rare).

This mentions Reasons for various disconnect causes, but doesn't mention a Reason associated with with a manual disconnect.

@tannewt
Copy link
Member

tannewt commented Dec 22, 2020

@anecdata I think the simplest thing would be to trigger a disconnect and see what reason you get back. :-)

@anecdata
Copy link
Member Author

anecdata commented Dec 22, 2020

I haven't found a way to do that from CircuitPython, so I suppose it's about time to begin the trek up the learning curve of how to make changes to the core ;-)

hmm, we never actually call esp_wifi_disconnect anywhere?

@tannewt
Copy link
Member

tannewt commented Dec 23, 2020

I haven't found a way to do that from CircuitPython, so I suppose it's about time to begin the trek up the learning curve of how to make changes to the core ;-)

hmm, we never actually call esp_wifi_disconnect anywhere?

The only way I can think of is to set the radio.enable = False.

@anecdata
Copy link
Member Author

anecdata commented Dec 30, 2020

WIFI_REASON_ASSOC_LEAVE == 8 seems to be the magic reason arising either by calling esp_wifi_stop() while connected but without disconnecting first (in the case of radio.enable = False; tested locally), or by calling esp_wifi_disconnect() while connected (tested locally).

There may be external causes for WIFI_REASON_ASSOC_LEAVE to arise, but since it does arise after explicit disconnect or stop, we shouldn't automatically reconnect in the handler for this reason. We may discover a future case where we want to distinguish code-initiated disconnect / stop from an external cause.

Most other disconnection reasons should be fair game for automatic reconnection attempts.

@anecdata
Copy link
Member Author

anecdata commented Jan 12, 2021

A couple of observations in testing so far:

  • Sometimes, despite a ConnectionError: Unknown failure exception raised to the user, there is a good connection and wifi proceeds normally. This may arise when the connect call completes with an Unknown failure, but a subsequent wi-fi event asynchronously triggers the event handler, which retries the connection. Not necessarily intuitive to the user, but maximizes opportunities to keep a connection stable.

  • A non-existent SSID works as expected: W (59331) wifi: reason 201 0xc9 [NO AP FOUND] raises ConnectionError: No network with that ssid to the user

  • An incorrect password is not as clean. There's not an explicit Reason for incorrect password (might expect a WIFI_REASON_AUTH_FAIL, but I haven't seen that). Instead, there will be some sequence of retries with various Reason codes, sometimes duplicates and in no particular order that I could discern, e.g.:

W (172361) wifi: reason 4 0x04 [WIFI_REASON_ASSOC_EXPIRE]
W (174431) wifi: reason 205 0xcd [WIFI_REASON_CONNECTION_FAIL]
W (187721) wifi: reason 15 0x0f [WIFI_REASON_4WAY_HANDSHAKE_TIMEOUT]

usually followed (currently) by Unknown failure.

I'm working on a PR to flow more disconnect reason info back to the user to help users and support folks to differentiate various failure scenarios within "Unknown failure".

The logging is reasonably thorough, but requires a DEBUG build. Maybe until ESP32-S2 is considered Stable, we could enable logging by default, at least for Warnings and more severe?

Addendum: it's useful to note in the docs which Reasons arise solely from the AP sending a frame with that reason, vs. Reasons that may arise from internally detecting a condition within the station.

@tannewt
Copy link
Member

tannewt commented Jan 13, 2021

The logging is reasonably thorough, but requires a DEBUG build. Maybe until ESP32-S2 is considered Stable, we could enable logging by default, at least for Warnings and more severe?

This sounds good to me!

Addendum: it's useful to note in the docs which Reasons arise solely from the AP sending a frame with that reason, vs. Reasons that may arise from internally detecting a condition within the station.

Would you mind adding them? I don't know the difference myself.

@anecdata
Copy link
Member Author

anecdata commented Jan 13, 2021

re: logging ...the build process is mostly a black box to me, but maybe there's a way to do it there in the automated builds. I can start to look into that, any guidance would be helpful.

Espressif does provide APIs to set a lower per-module severity threshold at runtime than the one set at compile time (I think that would involve a new circuitpython API).

re: Reason causes ...I'm not sure where you were thinking to add to the docs (always happy to), I only meant that in Espessif's docs), they itemize those Reasons that arise internally from low-level IDF logic vs. those that arise from standard IEEE 802.11 reasons (often corresponding to a specific management frame) vs. those that could arise from either cause.

@tannewt
Copy link
Member

tannewt commented Jan 14, 2021

re: logging ...the build process is mostly a black box to me, but maybe there's a way to do it there in the automated builds. I can start to look into that, any guidance would be helpful.

I think it's a setting you can set with menuconfig. However, I just realized we actually disable the debug UART in non-debug builds so that it can be used by user code. I think that's still best. Maybe we just need better error messages in circuitpython?

@anecdata
Copy link
Member Author

anecdata commented Jan 21, 2021

I'm inclined to close this issue. Current status:

Logging

  • All wifi Events are now logged from the event_handler (PR 3903), I think we should leave this in place as it's very useful for troubleshooting
  • All Disconnection reasons are logged from the event_handler, ditto my vote for keeping this in place
  • There is no logging in wifi outside of the event_handler in init.c. However, I'd suggest when major new features are added, we consider adding it in for some period of time until the feature is stable so DEBUG users can troubleshoot more easily

Auto-Reconnect vs. Exception

We now automatically try to re-connect for all but three disconnection reasons (PR 3992):

// wifi is stopped by user code: wifi.radio.enable = False,
// or otherwise disconnected in the core,
// or it is received from the AP
WIFI_REASON_ASSOC_LEAVE
 
// SSID not found in the connect() scan
WIFI_REASON_NO_AP_FOUND

 // unclear when this arises; wrong password gets other codes in my environments
WIFI_REASON_AUTH_FAIL

The first case is usually intended, but we may in the future want to distinguish between intended and unintended WIFI_REASON_ASSOC_LEAVE.

The latter two reasons will trigger exceptions.

We may want to tweak these reasons as we get more experience. Some codes that are currently retried may turn out to be fatal and should raise an exception.

@tannewt
Copy link
Member

tannewt commented Jan 22, 2021

Let's close for now. We can always open a new issue when we have additional specific improvements in mind.

@tannewt tannewt closed this as completed Jan 22, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement espressif applies to multiple Espressif chips network
Projects
None yet
Development

No branches or pull requests

2 participants