WiFi.status() is not reflecting the true state #7432

Closed
TD-er opened this issue Jul 7, 2020 · 20 comments · Fixed by #8607
@TD-er
Contributor

TD-er commented Jul 7, 2020

This has been an issue for quite some time and also lots of issues have been reported which are probably related to this incorrect state of the WiFi.

So this is merely a collection issue to gather all insights and link topics, as I keep finding my own replies in lots of those topics over and over again, while still feeling lost in this problem.

Related issues:

And lots more.
In essence all calls that may check the WiFi.status() and base their actions on it may run into these problems.

First let's have a look at the enum-mapping performed here:

wl_status_t ESP8266WiFiSTAClass::status() {
    station_status_t status = wifi_station_get_connect_status();

    switch(status) {
        case STATION_GOT_IP:
            return WL_CONNECTED;
        case STATION_NO_AP_FOUND:
            return WL_NO_SSID_AVAIL;
        case STATION_CONNECT_FAIL:
        case STATION_WRONG_PASSWORD:
            return WL_CONNECT_FAILED;
        case STATION_IDLE:
            return WL_IDLE_STATUS;
        default:
            return WL_DISCONNECTED;
    }
}

typedef enum {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
} station_status_t;

Note that STATION_CONNECTING has no case of its own, so it falls through to the default and is reported as WL_DISCONNECTED.

What I'm observing on some nodes (really hard to reproduce on some and happening almost always on others) is this:

Initial attempt to connect is stuck forever, as the WiFi status never gets to WL_CONNECTED
I checked by calling wifi_station_get_connect_status() and see the state is stuck at STATION_CONNECTING.

However the web server may serve pages and the WiFiEventStationModeGotIP event has fired.
So all seems to be working already, but the state is not updated.
In one issue it was mentioned to call WiFi.setAutoReconnect(true); to fix this, but that's not the magic fix here.

My work-around for this is to keep track of how long it takes to get a successful connection and if that times out, I call my own resetWiFi() function.

void resetWiFi() {
  WifiDisconnect();
  initWiFi();
}

void initWiFi()
{
#ifdef ESP8266

  // See https://github.com/esp8266/Arduino/issues/5527#issuecomment-460537616
  // FIXME TD-er: Do not destruct WiFi object, it may cause crashes with queued UDP traffic.
  //  WiFi.~ESP8266WiFiClass();
  //  WiFi = ESP8266WiFiClass();
#endif // ifdef ESP8266

  WiFi.persistent(false); // Do not use SDK storage of SSID/WPA parameters
  WiFi.setAutoReconnect(false);
  // The WiFi.disconnect() ensures that the WiFi is working correctly. If this is not done before receiving WiFi connections,
  // those WiFi connections will take a long time to make or sometimes will not work at all.
  WiFi.disconnect();
  setWifiMode(WIFI_OFF);

#if defined(ESP32)
  WiFi.onEvent(WiFiEvent);
#else
  // WiFi event handlers
  stationConnectedHandler = WiFi.onStationModeConnected(onConnected);
  stationDisconnectedHandler = WiFi.onStationModeDisconnected(onDisconnect);
  stationGotIpHandler = WiFi.onStationModeGotIP(onGotIP);
  stationModeDHCPTimeoutHandler = WiFi.onStationModeDHCPTimeout(onDHCPTimeout);
  APModeStationConnectedHandler = WiFi.onSoftAPModeStationConnected(onConnectedAPmode);
  APModeStationDisconnectedHandler = WiFi.onSoftAPModeStationDisconnected(onDisconnectedAPmode);
#endif
}

// ********************************************************************************
// Disconnect from Wifi AP
// ********************************************************************************
void WifiDisconnect()
{
  #if defined(ESP32)
  WiFi.disconnect();
  #else // if defined(ESP32)
  ETS_UART_INTR_DISABLE();
  wifi_station_disconnect();
  ETS_UART_INTR_ENABLE();
  #endif // if defined(ESP32)
}

The initWiFi() is also called as one of the first functions in my setup()
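
The timeout part of this work-around could be sketched roughly as below. This is a minimal, host-testable sketch and not the actual ESPEasy code: ConnectWatchdog, begin(), markConnected() and shouldReset() are hypothetical names, and the caller is assumed to pass in millis() and to call resetWiFi() itself when shouldReset() returns true.

```cpp
#include <cstdint>

// Hypothetical helper: tracks how long a connection attempt has been running.
// Timestamps are passed in so the logic can be tested off-target;
// on the ESP, pass millis().
class ConnectWatchdog {
public:
  explicit ConnectWatchdog(uint32_t timeoutMs) : _timeoutMs(timeoutMs) {}

  // Call when a connection attempt starts.
  void begin(uint32_t nowMs) {
    _startMs = nowMs;
    _pending = true;
  }

  // Call from the "got IP" event handler rather than from a status poll,
  // since the polled status is what is suspected to be stale.
  void markConnected() { _pending = false; }

  // True when an attempt is pending and has run longer than the timeout.
  // Unsigned subtraction keeps this correct across millis() rollover.
  bool shouldReset(uint32_t nowMs) const {
    return _pending && static_cast<uint32_t>(nowMs - _startMs) > _timeoutMs;
  }

private:
  uint32_t _timeoutMs;
  uint32_t _startMs = 0;
  bool _pending = false;
};
```

In a sketch this would be polled from loop(), for example: if (watchdog.shouldReset(millis())) { resetWiFi(); }, followed by a new connection attempt and a fresh begin().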

The WiFi status is also incorrect when the unit gets disconnected.
For example when the ESP node is kicked from the access point (MikroTik AP allows you to disconnect a specific client via the web interface) or whatever other reason there may be to disconnect a node.

This is the code I use to detect if I have an IP-address:

#ifdef CORE_POST_2_5_0
# include <AddrList.h>
#endif // ifdef CORE_POST_2_5_0


bool hasIPaddr() {
#ifdef CORE_POST_2_5_0
  bool configured = false;

  for (auto addr : addrList) {
    if ((configured = (!addr.isLocal() && (addr.ifnumber() == STATION_IF)))) {
      /*
         Serial.printf("STA: IF='%s' hostname='%s' addr= %s\n",
                    addr.ifname().c_str(),
                    addr.ifhostname(),
                    addr.toString().c_str());
       */
      break;
    }
  }
  return configured;
#else // ifdef CORE_POST_2_5_0
  return WiFi.isConnected();
#endif // ifdef CORE_POST_2_5_0
}

N.B. the CORE_POST_2_5_0 define is set by me when compiling with a specific core version.

Sometimes, when the node gets disconnected, the WiFiEventStationModeDisconnected event is fired, but the WiFi state and/or the presence of the IP-address remains.
The only way to get out of this, is to call my resetWiFi() function and start over to create a connection.

For some reason, TCP/IP traffic is not causing crashes in this WiFi limbo state, but UDP is causing crashes.

So it would be really helpful if we could either fix this or at least explain it, so we can use work-arounds that don't feel like "don't know why, but it makes issues harder to reproduce", which has been the main modus operandi for the last 2 years with these WiFi issues.

@TD-er
Contributor Author

TD-er commented Jul 8, 2020

For those who may have access to the (closed source) SDK, or at least know what is happening in there: it would be nice if my hypothesis could be confirmed or disproved.

My hypothesis:
It looks like the internals of the SDK also act on events to switch the WiFi status state machine.

The enum values somewhat suggest the order of how events should happen:

typedef enum {
    STATION_IDLE = 0,
    STATION_CONNECTING,
    STATION_WRONG_PASSWORD,
    STATION_NO_AP_FOUND,
    STATION_CONNECT_FAIL,
    STATION_GOT_IP
} station_status_t;

What if the events of STATION_CONNECTING and STATION_GOT_IP are processed out of order?
For example maybe both events are present and processed in the same loop, but in the wrong order which only makes a difference if processed in the same loop.
This could be a timing issue which only needs a slight difference in timings to give this different behavior.
Such a difference can be caused by a slightly better tuned WiFi radio, the quality of the crystal or a different flash chip, so it is plausible this can make a difference among ESP nodes.

Also, different builds of the SDK can introduce some extra delays somewhere.

And now for the possible fix.
Is it possible to add a function to correct this internal state? Or even better, to make a new build of the SDK which does show the correct state of the WiFi?

@Misiu

Misiu commented Jul 31, 2020

Can we get some info/help here?
There are countless issues reported about WiFi, and many great projects like ESPEasy suffer because of this.
@earlephilhower, @d-a-v, @devyte sorry for tagging you directly, but maybe you guys can shed some light on this?

d-a-v self-assigned this Aug 2, 2020
d-a-v added this to the 3.0.0 milestone Aug 2, 2020
@d-a-v
Collaborator

d-a-v commented Aug 2, 2020

There are #6680 and #7391 pending.
If lwIP is truly made aware of disconnections from firmware, then we can use more / new / controlled callbacks (current ones are closed source). For example we could forbid "connected" when there is no valid IP address (or until we receive the connected callback).

@TD-er
Contributor Author

TD-er commented Aug 2, 2020

Well, I'm not entirely sure that will be the magic fix either, as it is the closed source part that's reporting the wrong state and perhaps some parts in there also use that wrong state.

As a matter of fact, there are more bugs hidden in there, which I'm not yet able to fully detect, but I know they are there.
For example, on some nodes the responsiveness to network requests is fine in some builds and terribly (unworkably) slow in other builds.
I thought it could be 'fixed' by using just another SDK build, but my latest disappointment was seeing that theory shattered, which makes it almost a matter of a "random build" that may or may not work on those units.
It can be as simple as the linking order of objects causing some extra flash activity, which may be just enough of a difference in timing to cause these issues.

So it is all very good to have a more uniform interface to "network" regardless of the physical interface, but I am afraid it won't fix the WiFi part here as it does appear to have fundamental issues in the closed source part.

@d-a-v
Collaborator

d-a-v commented Aug 2, 2020

Sometimes, when the node gets disconnected, the WiFiEventStationModeDisconnected event is fired, but the WiFi state and/or the presence of the IP-address remains.
The only way to get out of this, is to call my resetWiFi() function and start over to create a connection.

What would be nice is to read the commented firmware output in debug mode.

For some reason, TCP/IP traffic is not causing crashes in this WiFi limbo state, but UDP is causing crashes.

Because TCP buffers.
If we could reproduce, we could add extra checks (valid interface) in core or in lwip2.

However the web server may serve pages and the WiFiEventStationModeGotIP event has (not?) fired.
So all seems to be working already, but the state is not updated.

That can be fixed with the above PRs in which an event is triggered when an IP is assigned.

@TD-er
Contributor Author

TD-er commented Aug 2, 2020

That can be fixed with the above PRs in which an event is triggered when an IP is assigned.

But how does LWIP know the IP has been assigned?
Which part sends out the GOT_IP event?
Is that LWIP? Is sending that event based on the state reported by the closed source library?
I sometimes do get that event multiple times (very quickly after the first), but still the WiFi status is reported as not "Connected".

The events do seem to work OK, or at least more reliably than the WiFi status.

@d-a-v
Collaborator

d-a-v commented Aug 4, 2020

But how does LWIP know the IP has been assigned?

The logic is:

Link layer (driver) calls a lwIP function when link is up (netif_set_link_up)
Then two cases:

  • static IP: lwIP uses the link layer callback to call another callback to set status
  • DHCP: lwIP sends the dhcp request, later receives the IP (it's a callback), and calls another callback to set status
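
As a rough illustration of the two cases above, here is a toy model of the flow. It is not lwIP's API and not the core's code: ToyNetif, linkIsUp(), dhcpBound() and onStatus are hypothetical names, and addresses are plain integers where 0 means "no address".

```cpp
#include <cstdint>
#include <functional>

// Toy model only: mimics the order of callbacks, not lwIP itself.
struct ToyNetif {
  bool useDhcp = true;
  bool linkUp = false;
  uint32_t ip = 0;  // 0 means "no address yet"

  // Stand-in for the "set status" callback: fired when an address is usable.
  std::function<void(uint32_t)> onStatus;

  // The link layer (driver) reports that the link is up.
  void linkIsUp() {
    linkUp = true;
    if (!useDhcp && ip != 0 && onStatus) {
      onStatus(ip);  // static IP: status can be reported right away
    }
    // DHCP: a request would be sent here; nothing is reported yet.
  }

  // The DHCP client received a lease.
  void dhcpBound(uint32_t leased) {
    if (!linkUp || !useDhcp) {
      return;  // ignore a lease that arrives without a live DHCP link
    }
    ip = leased;
    if (onStatus) {
      onStatus(ip);  // DHCP: status is reported only once the IP is known
    }
  }
};
```

The point of the model is that "has a usable IP" is signalled by a callback that application code can own, independent of what the SDK's own status call reports.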

Which part sends out the GOT_IP event?

It's nonos-sdk. But we can add our own callbacks (open source, full control) and use them.

What has to be done will be more clear after #6680 is merged (so we can make/fix things for any kind of interface).
Goal is to have wifi, ethernet (, ... ppp) and keep compatible with the current api (wifi.status()).

@TD-er
Contributor Author

TD-er commented Aug 4, 2020

Goal is to have wifi, ethernet (, ... ppp) and keep compatible with the current api (wifi.status()).

That's a sensible goal :)

@mcspr
Collaborator

mcspr commented Mar 24, 2021

Have noticed the same issue while testing out what happens after the following on a Linux-based AP:

$ iw dev wlan0 station del es:p8::26:6m:ac

WiFi.isConnected() is still true even while none of the networking actually works for the last ~10 minutes

Should the status do the same that the current LwipIntfDev (for the various ethernet devices) does and simply check ip availability? fwiw the PRs above are merged.


As a workaround... For example, SDK does send disconnection event pretty reliably e.g.

    WiFi.persistent(false);
    WiFi.setAutoConnect(false);
    WiFi.setAutoReconnect(false);
    // ... do the setup for the ssid & pass ...
    static auto disconnected = WiFi.onStationModeDisconnected([](const auto&) {
        notifyDisconnected(); // <-- connection is dead, reconnect
    });

By doing the above, I've noticed that station IP settings become un-set (localIP(), gatewayIP(), subnetMask(), but not dnsIP() for some reason) and the associated lwip's netif, aka STATION_IF by looking through the netif_list, is brought down with both FLAG_LINK_UP and FLAG_UP un-set. Not sure if it is an SDK doing, or something set up by the Core / lwip that resets the interface.

Also re. above, NONOS pdf specifically mentions that wifi_station_get_connect_status depends on the event system & autoreconnect / reconnection policy setting, and it seems it simply tracks the latest event it understands, but only for the connection routine so we should only track it when there is an actual connection in progress. And without disabling reconnection policy, SDK would try to be helpful with reconnections in the background, so any manual setup / loop that checks connectivity needs to disable both of those.
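
One way to act on this is to keep an application-side state that is driven only by the events, and to ignore the polled status outside of a connection attempt. A minimal sketch, assuming the three station events arrive as described in this thread; LinkState and EventLinkTracker are hypothetical names, not part of the core:

```cpp
#include <cstdint>

// Hypothetical application-side connection state, driven only by events.
enum class LinkState : uint8_t { Down, Associated, HasIp };

class EventLinkTracker {
public:
  // Wire to the "station connected" event.
  // Only moves Down -> Associated, so a late or repeated "connected"
  // event cannot downgrade a state that already has an IP.
  void onConnected() {
    if (_state == LinkState::Down) {
      _state = LinkState::Associated;
    }
  }

  // Wire to the "got IP" event. Repeated events are harmless.
  void onGotIp() { _state = LinkState::HasIp; }

  // Wire to the "station disconnected" event.
  void onDisconnected() { _state = LinkState::Down; }

  // The network is only considered usable once an IP has been assigned.
  bool usable() const { return _state == LinkState::HasIp; }

  LinkState state() const { return _state; }

private:
  LinkState _state = LinkState::Down;
};
```

On the ESP8266 the three methods would be called from the handlers registered with WiFi.onStationModeConnected(), WiFi.onStationModeGotIP() and WiFi.onStationModeDisconnected(). The sketch cannot detect a disconnect for which no event is delivered at all.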

@leifclaesson

leifclaesson commented Aug 9, 2021

I was just about to post this same issue and then found this post. This is still an issue even in 3.0.2.
After having suffered from this issue with my deployed devices forever, I finally managed to reproduce it reliably -- simply selecting "Reconnect Device" in UniFi access points kicks the esp8266 off the network in a way that isConnected() still returns true.

The workaround I came up with is simple, and similar to what @mcspr proposed above:

void OnWiFiDisconnectedEvent(const WiFiEventStationModeDisconnected & event)
{
	(void)event;
	WiFi.disconnect();
}

WiFi.onStationModeDisconnected(OnWiFiDisconnectedEvent);

Regardless of the inner workings of the core libraries, this makes isConnected and getStatus correctly return the connection loss!
It's a bandaid, but wouldn't it be better to have a fix like this implemented at the esp8266-arduino library level rather than to just leave it in its current broken state, even now in 3.0.2?

This particular bug has been a huge annoyance and significant time-waster to me before I finally isolated and tracked it down the other day... and now I have a couple hundred WiFi lights to update. A band-aid would have been greatly preferable in my opinion; I would never even have noticed the issue.

@TD-er
Contributor Author

TD-er commented Aug 9, 2021

Do you have any idea how often this disconnect event happens in succession?
I think it would for sure help, as an explicit disconnect is mentioned in a lot of places by users and I also use it in my code (not at the disconnect event though), but I'm still not sure it fixes all situations where the WiFi status is not reflecting the true situation.

And I do feel your pain in lost hours, as I've spent way over 1000 hours on this, maybe even over 2000 hours, as this has been an issue since 2.4.x (for sure 2.5.x, but I've got the feeling it has been bugging me for longer).
The WiFi code in ESPEasy has become extremely complex just to get around all kinds of WiFi issues, with lots and lots of checks.
So it would be really great if there was a simple work-around.

@sblantipodi

@TD-er I'm experiencing the exact same problem, and a disconnection happens every 5 or 6 hours, more or less.

@TD-er
Contributor Author

TD-er commented Aug 20, 2021

5 or 6 hours sounds like a regular interval.
There can be a lot of reasons why it disconnects, but it is even more frustrating if you can't detect it has been disconnected.

@sblantipodi

5 or 6 hours sounds like regular interval.
This can have a lot of reasons why it disconnects, but it is even more frustrating if you can't detect it has been disconnected.

yes the problem is that we cannot detect it.

@sblantipodi

@TD-er I still have some disconnections that I can't detect. My MQTT client enters the reconnection loop and does not know that it can't connect because the WiFi is disconnected.
Do you have any suggestions for working around the problem while we are waiting for an official fix, please?

All suggestions will be really appreciated. :) Thank you!

@TD-er
Contributor Author

TD-er commented Aug 27, 2021

What I do is that I count the number of failed connection attempts.
If this exceeds some user defined threshold, the WiFi will be turned off and a reconnect is initiated.
This counter is lowered over time, with each check for MQTT connection and/or other network connection attempts.
So it will not cause a reconnect after N fluke reconnect attempts in a long time.
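
A decaying counter like this could look roughly as follows. This is a minimal sketch of the idea only, not the ESPEasy implementation: LeakyFailureCounter, recordFailure() and decay() are hypothetical names, and how often decay() is called is left to the application.

```cpp
#include <cstdint>

// Hypothetical "leaky" failure counter: failures add up quickly,
// but old failures are forgotten over time.
class LeakyFailureCounter {
public:
  explicit LeakyFailureCounter(uint8_t threshold) : _threshold(threshold) {}

  // Record one failed connection attempt.
  // Returns true when the threshold is exceeded; the caller should then
  // turn WiFi off and start a reconnect. The counter restarts from zero.
  bool recordFailure() {
    if (_count < UINT8_MAX) {
      ++_count;
    }
    if (_count > _threshold) {
      _count = 0;
      return true;
    }
    return false;
  }

  // Call on each periodic connection check to lower the counter,
  // so a few isolated failures spread over a long time never trip it.
  void decay() {
    if (_count > 0) {
      --_count;
    }
  }

  uint8_t count() const { return _count; }

private:
  uint8_t _threshold;
  uint8_t _count = 0;
};
```

With a threshold of 3, the fourth failure that arrives before enough decay() calls have happened triggers the reset; failures that are separated by several checks do not.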

@leifclaesson

@TD-er Holy crap, 2000 hours?? That is a LOT of hours.

Update: Relying on the event does not work for all disconnections! It seems it's possible for the WiFi stack to hang up / get stuck in an inconsistent state in some situations.

But, I have found a different workaround that has worked perfectly for me for a couple of weeks now. @sblantipodi

When WiFi is in this bad state (thinks it's connected but it has actually been disconnected), WiFi.RSSI() returns a positive value!! In fact, it returns 31.. instead of the negative values we usually see.

So, I now check RSSI() and if it's positive, I call WiFi.disconnect(false) and then I wait for a second (actually I let other things in the loop run) and then I attempt to reconnect.

So far no sustained connection losses. They always come back on their own now, so far.
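
The check-disconnect-wait-reconnect sequence can be written without blocking the loop. A minimal sketch, assuming the positive RSSI reading is only treated as suspicious while the status still claims to be connected; RecoveryAction and StaleLinkRecovery are hypothetical names, and the RSSI, status and time values are passed in so the logic can be tested off-target:

```cpp
#include <cstdint>

// What the caller should do after each poll.
enum class RecoveryAction : uint8_t { None, Disconnect, Reconnect };

// Hypothetical non-blocking helper for the "RSSI is positive" work-around.
class StaleLinkRecovery {
public:
  explicit StaleLinkRecovery(uint32_t waitMs) : _waitMs(waitMs) {}

  // Call periodically from loop(). On the ESP the arguments would be
  // WiFi.RSSI(), WiFi.isConnected() and millis().
  RecoveryAction poll(int32_t rssi, bool claimsConnected, uint32_t nowMs) {
    if (_waiting) {
      // Give the WiFi stack some time after the disconnect.
      if (static_cast<uint32_t>(nowMs - _sinceMs) >= _waitMs) {
        _waiting = false;
        return RecoveryAction::Reconnect;
      }
      return RecoveryAction::None;
    }
    // A positive RSSI while "connected" is treated as a stale link.
    if (claimsConnected && rssi > 0) {
      _waiting = true;
      _sinceMs = nowMs;
      return RecoveryAction::Disconnect;
    }
    return RecoveryAction::None;
  }

private:
  uint32_t _waitMs;
  uint32_t _sinceMs = 0;
  bool _waiting = false;
};
```

On RecoveryAction::Disconnect the sketch would call WiFi.disconnect(false); on RecoveryAction::Reconnect it would start a new connection attempt. Everything else in loop() keeps running during the wait.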

@TD-er
Contributor Author

TD-er commented Sep 14, 2021

There are more situations where you might get a positive value for RSSI.
The +31 value is just a "generic" error code for the RSSI (and also the country code for the Netherlands ;) )

You can also get this value during a connection process and even when performing a scan, or when running in AP only mode with no client connected to it.
It is just the value when there is no known signal strength from a connected AP.

The key ingredients of why it is now working in your test setup are the explicit disconnect and a wait.
Calling delay(100); is somewhat of a magic fix to give the WiFi some time to get things done.
Another thing to test is whether or not you are able to switch WiFi modes.
If not, call delay and try again.

@bwjohns4

Is there any update on this? In my application, I have built an ensureWiFi() that pings the gateway WiFi.gatewayIP() and if after a few tries cannot reach it, the WiFi is retried, then disconnected and retried, etc. Does anyone see anything wrong with this process? It seems to go completely around the ESP's knowledge of the WiFi stack and directly try to reach the WLAN. I'd love to hear better ideas, etc. Is there any type of "Layer 2 ping" that could see if the WiFi AP's MAC address is reachable without going to Layer 3?
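
The escalation described here could be sketched as a small policy object. This is only an illustration under assumptions: GatewayProbePolicy, WiFiStep and onProbeResult() are hypothetical names, and the probe itself (the ping to WiFi.gatewayIP()) is left outside the sketch, which only consumes its result.

```cpp
#include <cstdint>

// What to do with the WiFi after a probe result.
enum class WiFiStep : uint8_t { Nothing, Retry, DisconnectAndRetry };

// Hypothetical escalation policy: a few missed probes trigger a plain
// retry first, and a full disconnect plus retry if that did not help.
class GatewayProbePolicy {
public:
  explicit GatewayProbePolicy(uint8_t maxMisses) : _maxMisses(maxMisses) {}

  // Feed in the result of each gateway probe.
  WiFiStep onProbeResult(bool reachable) {
    if (reachable) {
      // Any successful probe clears both the miss count and the escalation.
      _misses = 0;
      _escalated = false;
      return WiFiStep::Nothing;
    }
    if (++_misses < _maxMisses) {
      return WiFiStep::Nothing;  // tolerate a few lost probes
    }
    _misses = 0;
    if (!_escalated) {
      _escalated = true;
      return WiFiStep::Retry;
    }
    return WiFiStep::DisconnectAndRetry;
  }

private:
  uint8_t _maxMisses;
  uint8_t _misses = 0;
  bool _escalated = false;
};
```

Keeping the policy separate from the probe makes it possible to swap the ping for another reachability check later without touching the escalation logic.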

@TD-er
Contributor Author

TD-er commented Jan 25, 2022

I would not consider checking on layer-2 only, as it could still lead to crashes if attempting to connect to some host where the IP stack is not ready.
You could try to send an ARP request asking who has the IP of your gateway, which is layer 2.
