Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorrect Assumption There Is A TCP Connection When There May Not Be #2750

Closed
kentaylor opened this issue Dec 9, 2016 · 54 comments
Closed
Assignees

Comments

@kentaylor
Copy link

kentaylor commented Dec 9, 2016

Basic Infos

Where there is an unreliable network connection the ESP8266 using the Arduino Core can get stuck in an endless loop waiting for a message that will never arrive.

Hardware

Hardware: ESP-12E
Core Version: All

Description

The effect of the problem is described in another thread that links to several instance reports. It happens often for me when the ESP is connected to a WiFi AP that uses a mobile phone network for its internet connectivity.

The failure is a consequence of an incorrect assumption that the ESP8266 knows whether it has a TCP connection. The assumption is correct when the TCP connection is created and destroyed in accordance with this state diagram TCP Connection flow diagram
but can be invalid if there is packet loss once the TCP connection gets to the ESTABLISHED state shown in green in the diagram.

When lost packets are recovered in the TCP layer the state diagram is followed and when packets sent from the ESP are not acknowledged the ESP should know the TCP connection is broken but when the remote socket is closed incorrectly, perhaps through a power outage or packets have been sent from a remote socket but not received by the ESP it can not know. If packets are sent from the remote socket but not received, the remote machine will retry and eventually close its socket. If the ESP is waiting for incoming packets it will still think the TCP connection exists when it doesn't.

The Arduino core relies on the ESP knowing whether it has a TCP connection. An example is WiFiClient::connected() in WiFiClient.cpp where it concludes that there is a TCP connection when the TCP state is ESTABLISHED but ESTABLISHED means there has been a TCP connection, not necesarily that it still exists. In the WiFiClient.ino example client.available() is used to decide if a TCP connection exists and this function can return true when the TCP connection has been lost. Similarly in BasicHttpClient.ino it calls http.GET() which calls sendRequest which calls handleHeaderResponse() in which a loop is controlled with the erroneous connected() function that may return true when there is no TCP connection and hence loop forever.

@mtnbrit
Copy link

mtnbrit commented Dec 9, 2016

Does it still respond to ping when its stuck in this loop forever state for you? if so, I have seen this issue where only a power cycle will get it back. Kicking it from the AP and having it reconnect does not help wake it back up. it does respond to ping.

@kentaylor
Copy link
Author

kentaylor commented Dec 10, 2016

@mtnbrit that is interesting. Yes it will still respond to ping. The packets can be lost anywhere between the remote machine and the ESP device, in my case most often in the Telco mobile network. If the packets are lost beyond the router then it will work fine when pinged from the local network and the packet loss can be temporary so it could be pinged from elsewhere as well once the network is working again. However, a failed TCP connection will not be healed so it will remain in the endless loop referred to regardless of the network healing.

What is interesting is where @mtnbrit says:-

Kicking it from the AP and having it reconnect does not help

because without a WiFi network it should know the TCP connection is broken. How do you kick it from the AP, is it by a command on the router? Also how long is it disconnected from the router because it should be disconnected for a while before it gives up. The TCP layer can deal with intermittent packet loss and needs a few seconds to try and recover from a short term failure before it classifies packet loss as a failed TCP connection. You could try something like blocking the ESP MAC address from connecting to your router for 10 seconds or so and see if it is still stuck in an endless loop or the TCP connection has failed gracefully.

@objarni
Copy link

objarni commented Feb 14, 2017

Any progress on this one? WiFi down should definitely trigger a connection closed signal on a device where WiFi is the only connectivity.

@igrr
Copy link
Member

igrr commented Feb 14, 2017

Disconnection from an AP does indeed close all active WiFiClient connections.

The point from the original issue is a valid one. My assumption was that lwip will kill inactive PCBs after a while. I will check whether this indeed happens.

@kentaylor
Copy link
Author

kentaylor commented Feb 15, 2017

I'm thrilled to get a response from @igrr. In a real system that is expected to run for long periods this is a major flaw so I was a little disappointed there was so little interest when I first raised the issue. It seems to me to be so major that commercial products would be insufficiently reliable to be viable with this library. I have had devices run for weeks in the stuck condition so I'm confident lwip does not pick up the absence of a remote socket. In the TCP layer there is a concept of keep alive packets but these are often not implemented at all and the relevant RFS says they shouldn't be sent at less than 2 hour intervals see http://ltxfaq.custhelp.com/app/answers/detail/a_id/1512/~/tcp-keepalives-explained .

You will see all over the internet people stressing the importance of large capacitors with ESP8266's to keep them reliable. I think that claim is not in general true and mostly large capacitors are used to overcome this software error. Radio transmit increases current draw a lot which can cause a voltage drop sufficient for the processor to continue to work but the radio to fail temporarily. Then you can get into the state described above where the device is stuck waiting for a packet that will never come. I can reproduce this reliably by powering an ESP8266 that sends data at 1 minute intervals off alkaline batteries until they become flat. Alkaline batteries have a high internal resistance making the voltage sensitive to current draw and while the exact battery voltage where the ESP will get stuck is unpredictable it almost always does, long before the batteries are flat and usually from about 2/3 flat with 3 cells in series. Rebooting makes the ESP functional again on the same batteries. I'm not a hundred percent sure of this because I am assuming the processor can recover from a brownout satisfactorily. What I have checked though is that the processor is still running when in this state as the current draw fluctuates and it appears to be fluctuating enough to be still using the radio and it will still flash the led on a NodeMCU occasionally.

Most applications will work OK with an occasional reboot but stuckness is fatal and that is what, as far as I can tell, most large capacitors are added to avoid.

Looking some more, I came across the statement "LWIP_TCP_KEEPALIVE controls compiling in support for TCP keepalives and by default each connection has keepalives off." So it seems keep alives can be enabled in lwip but I hope the interval is much less than 2 hours.

@objarni
Copy link

objarni commented Feb 15, 2017

Thanks for this information, @kentaylor. I have been experiencing TCP connections from node MCU units getting into a bad state (unit thinks it is still connected via WiFiClient.connected() API, but it cannot get any more data from server) where a power cycle of the unit, or dropping the connection from the server, has been the only way to make communication work again. On server side the connections are eventually detected to be broken (exact exception in python is TimeoutException).

My workaround so far is dropping TCP connections from node MCU units after awhile, for example never keeping it for more than five minutes, by closing it on the server. It does feel really bad though when TCP sockets are supposed to be super reliable.

The read on keep alive packets was good, it seems like a good balance between reliability and low transmission overhead.

@igrr
Copy link
Member

igrr commented Feb 15, 2017

We can check for (in)activity on any given PCB and close it after certain timeout. This however would break applications which use persistent TCP connections without application-level keepalives (like MQTT does). For instance, I have an application with a TCP connection which is expected to stay up for hours, without any data being passed in either direction.

Would it be best to add some reasonable timeouts at application layer? For instance HTTPClient library can have some timeout associated with the client.available() loop, so that if the connection is for some reason lost, we will eventually bail out from reading. Looking at the source code, this is what curl library does. In this case WiFiClient.ino sample also needs to be updated to include some timeout logic.

@objarni
Copy link

objarni commented Feb 15, 2017

Updating the WiFiclient example to include connection reliability would be good.

What do you mean by "at the application layer"?

@vlast3k
Copy link

vlast3k commented Feb 15, 2017

@kentaylor "Most applications will work OK with an occasional reboot but stuckness is fatal and that is what, as far as I can tell, most large capacitors are added to avoid." - as far as i remember from some posts, is that one possible issue is that the Flash operates on 3V on most of the devices. So using 2 alkaline batteries is already on it's low level.

@objarni
Copy link

objarni commented Feb 15, 2017

@igrr My application does not use battery for power (status lamps in software development offices). So I prefer reliability in favour of low power consumption or low bytes-per-minute.

  1. Is there any way to enable keep-alive packets as described by articles linked by @kentaylor?
  2. If not, I guess one workaround would be a behaviour where a byte is sent every minute from the NodeMCU, and then wait for an answer byte from the server within a specific time period, say 15 seconds. If the answer byte does not arrive within 15 seconds, I could assume that the connection is broken, in which case a ESP.restart(); could be performed. Does this make sense? Is there any flaw in this reasoning?

@pwolfman
Copy link

Does anyone know how I can enable keepalive for TCP communications with the ESP8266/Arduino ? I see some mention by @igrr and @kentaylor in this chain. My issue is I have multiple clients connecting and sometimes a client can drop off and the TCP socket remains open. If I could implement a keepalive function, I could free the socket. I have found "espconn_set_keepalive" function in the espconn.h, but I have no idea how to implement it in the Arduino environment.

@drmpf
Copy link

drmpf commented Mar 30, 2017

Basically your clients need to keep sending something say every 5 sec if nothing else is happening and your server needs to reset a timer on data received. When the timer times out, the server should close the socket on its end.

See the code in this project
http://www.forward.com.au/pfod/CheapWifiShield/ESP2866_01_WiFi_Shield/index.html
for an example.

Of course your clients must send the keepalive messages :-)

@pwolfman
Copy link

pwolfman commented Mar 30, 2017

Unfortunately I have no control over the attached client and I have no way of knowing if it drops from the network. The ESP8266 as the server to the client needs to send the keepalive message every minute or so to see if the client has dropped from the network so I can stop the client connection. The client may be attached for 1 minute or 1 or mores hours without sending any data, so setting a timer for lack of communications would be unpredictable. Without releasing the sockets when a client drops off the network unexpectedly I run out of sockets and the clients cannot attach to me.

@vitotai
Copy link

vitotai commented Apr 28, 2017

I seems to be able to reproduce this issue easily.
In my application, there are two ESP8266s, server and client, obviously.
The server runs ESPAsyncWebServer while the client connects by HTTPClient or WiFiClient.
The client reports status in a period, and between the reports, it deep sleeps.
If the deep sleep call follows stop call immediately, the server will become inaccessible quickly - but ping-able.
Adding a delay of 50 to 100ms seems to alleviate this situation. However, kentaylor makes a very good point. This is more serious issue than crash. If the device crashes and restarts, users may not be aware of it, but users will definite know when the device stops response.

@kentaylor
Copy link
Author

kentaylor commented Jul 23, 2017

@igrr says:
We can check for (in)activity on any given PCB and close it after certain timeout.

Yes. Keepalives do that. Here is a better reference.

@igrr says:
This however would break applications which use persistent TCP connections without application-level keepalives (like MQTT does).

I don't think it will, An application will see a failed socket and should be robust to this. To quote from the reference "Keepalive is non-invasive, and in most cases, if you're in doubt, you can turn it on without the risk of doing something wrong." The only downside is increased network traffic.

@igrr says: I have an application with a TCP connection which is expected to stay up for hours, without any data being passed in either direction.

An application like this will usually fail when it is behind a router that does network address translation (NAT) without keepalives. You need them in this case for a different reason to that described above. The NAT will drop the mapping after a while and the device on the internet side of the NAT will no longer be able to initiate messages to the device behind the NAT. See reference.

@igrr says:
Would it be best to add some reasonable timeouts at application layer?

Assuming the application layer includes libraries like http client then yes, timeouts should be used rather than relying only on the WiFiClient::connected() to indicate there is no longer a connection to the remote machine. ESTABLISHED, without keepalives, means there has been a connection to the remote machine that hasn't been closed but it may nevertheless, no longer exist.

In my view keepalives should also be used as they will maintain NAT mapping, cause sockets open a long time to lose their ESTABLISHED status when the far end is lost, and would be sufficient on their own to eliminate stuckness caused by the existing libraries. User programs that made the same errors as the current libraries would also benefit. The downside is extra network traffic. In the MQTT application mentioned by @igrr, if messages were sent rarely and the interval is short the keepalives could be most of the traffic. On wired networks the extra traffic is insignificant. On mobile phone networks data is expensive. The Telcos in Australia charge in 1KB increments, they used to round to 1MB and some resellers still do. They start a new increment if the time between data transmissions is too long (I'm not sure how long). I find this annoying because I've found on some low bandwidth applications most messages sent, even a few bytes count as 1KB, previously 1MB, and data volume as measured by the Telco is much higher than expected. It's becoming less of an issue as data costs decline though. A keepalive interval shorter than the Telco new connection charging interval could actually reduce data costs on mobile networks.

@devyte
Copy link
Collaborator

devyte commented Oct 6, 2017

@kentaylor is this issue still valid with latest git, or better yet, with #3215 ?
@igrr I assume that lwip keepalives are still not enabled?

@devyte devyte added the waiting for feedback Waiting on additional info. If it's not received, the issue may be closed. label Oct 6, 2017
@igrr
Copy link
Member

igrr commented Oct 6, 2017

Keepalives are enabled in LwIP, but the timeout is still the default one:

#define  TCP_KEEPIDLE_DEFAULT     120000UL /* Default KEEPALIVE timer in milliseconds */

We can manipulate keepidle value for each TCP connection. The question is, where should the timeout come from? Should we introduce yet another ESP8266-specific API to WiFiClient class, or let applications retrieve struct tcp_pcb* from WiFiClient, and set keepidle value manually?

@devyte
Copy link
Collaborator

devyte commented Oct 12, 2017

@igrr my humble opinion:
yet another ESP8266-specific API to WiFiClient class : yes
applications retrieve struct tcp_pcb* : a big no

The fact that a specific API is added isn't a problem, as long as it's not necessary to call it with normal usage. That means that there is a default set under the hood. In this case, I it would be called only by those users who actually need it a value other than the default.

@kentaylor
Copy link
Author

kentaylor commented Nov 29, 2017

Keepalives have been added but the initial problem I reported "Where there is an unreliable network connection the ESP8266 using the Arduino Core can get stuck in an endless loop waiting for a message that will never arrive" still exists according to my testing. I have been testing over a mobile phone data connection where the the poor connection quality triggers this problem regularly.

In the tcp_keepalive function I can see that, when debugging the keep alive acknowledgements are printed.

I don't fully understand the logic but this comment implies keepalives are only intended to be sent over an idle connection. Stuckness is happening when a socket is waiting for input but will never receive it. The decision to close a socket in the absence of a keepalive acknowlegement is made here but the pcb->tmr is set equal to tcp_ticks here while the socket is waiting for data from the remote socket. Therefore, the socket is never closed while waiting for data from a remote socket even though it may never come.

I'm finding the logic in the code difficult to follow but Espressif is struggling too. It is now 12 months since I reported the problem, 9 months since @igrr acknowledged it and Espressif's attempts to fix it have not yet suceeded.

@d-a-v
Copy link
Collaborator

d-a-v commented Nov 29, 2017

Have you tried latest master with lwip-v2 enabled ?
lwip-v2 also addresses stuckness in tcp connections.
For the record, keepalive discussion also happens in #3886.

@kentaylor
Copy link
Author

kentaylor commented Nov 30, 2017

@d-a-v I'm not sure if I've used the latest implementation and I thought keepalives are enabled by default. Aren't they? I updated to the latest code in the latter half of October and tested it for a bit over a month over a Vodafone mobile networked WIFi access point. It failed regularly at first but has not suffered stuckness in the last two weeks. Should I update again?

Previously I have found a stuckness failure every few months on a wired internet connection, frequent stuckness on the Telstra network over a period of a couple of weeks and on the Vodafone Telco network it has run for weeks without failure then regular failure for a few weeks then back to no failure for a while. I suspect reliability depends on the amount of traffic in the Telco cell but I speculate. Whatever the cause in the Telco network, it shouldn't matter, it should not get stuck.

There is a work around of using an additional hardware based watchdog timer in the application layer as described here.

There is a fundamental issue that Espressif seems to be ignoring. It is impossible to know whether a remote machine will respond in the expected way. Therefore implementations need to be robust to any response/non response from the remote end and this needs timeouts whenever waiting for something. Some of this robustness can be provided in the libraries and where possible it should be. But some can only be provided in the application layer. For example where a HTTP client is waiting for a response there should be a timer that will cause the HTTP request to fail after a reasonable, ideally configurable, amount of time but where an application is maintaining an open TCP connection it is impossible for the library to know what is a reasonable time to wait. In that case the application should have some periodical interaction with the remote end to ensure it is still behaving as expected. Keepalives provide assurance that the socket is still open as well as maintaining routes through NATS but they don't provide assurance that the application using that socket will send what it should or send anything at all.

Keepalives operate below the TCP layer and do not interfere with the message flow in TCP so their use should not depend on what the TCP connection is doing at any moment. According to this good reference "Keepalive is non-invasive, and in most cases, if you're in doubt, you can turn it on without the risk of doing something wrong."

@d-a-v
Copy link
Collaborator

d-a-v commented Nov 30, 2017

@kentaylor

About stuckness problem, and because of your:

but when the remote socket is closed incorrectly

In your original post above (sorry I did not read that thread before), I think that this is similar to #2925 and addressed by lwip2 which is only available in master.

(I was not talking about keepalive, I rekon my message was confusing)

@kentaylor
Copy link
Author

The problem identified in this issue is not similar to #2995 - "New Simple Easily Reproducable Catastrophic TCP Stack Failure Results in Low Heap Crash shown with WiFiServer, Latency > 100ms, and Larger Send Bytes". This is a problem where the ESP8266 can get stuck in an endless loop waiting for a message that will never arrive as more fully described in the first post of this issue.

It is caused by not timing out when the ESP is waiting for a response from a webserver.

@kentaylor
Copy link
Author

As pointed out by @EnricoElCartmanez, here, keepalives are not enabled by default but must be turned on for each socket. Therefore, the testing reported above is invalid as keepalives were not enabled

Looking again at the test of whether a socket should be closed due to failure to receive a keep alive acknowledgement, I expect it will succeed in eliminating stuckness. However, keepalives should not be required to fix stuckness. The root cause is that the ESP is not timing out while waiting for a response that will never come. For example this loop will never return if the ESP thinks the socket exists even though the socket may no longer exist on the remote server.

As @d-a-v is a contributor could this be fixed?

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 1, 2017

As I tried to explain, keepalive code is enabled with defines, then must be set for every TCP connection in lwip's raw pcb in our case, I guess ClientContext.h is the common location for that. @kentaylor do we all agree?

I am a contributor because I submitted a pull request which, by chance, was accepted. I have no more rights than you have.

I can do the fix, you can too, but before that we need to know if such a pull-request has a chance to be accepted by members of this repo. Enabled keepalive is already in sources. Setting automatically it for all connection is not, but it can be done by hand in sketches, so to prove the fix works as expected and is necessary.

We also need to know if this must always be enabled, and what keep-alive values would have to be set, or what new API is desired for setting parameters

Proofs and coherency are needed for members to accept such a PR.

You can also try lwip2 at no cost even if your problem is not similar to #2995, even if automatic keep-alive may be interesting or needed for this project. As I say lwip2 addresses problems with tcp stuckness that I experimented, not only me. TCP stuckness problems is the primary reason why lwip2 has been ported here.

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 1, 2017

About stuckness problem, and it is to my mind not exactly but strongly related with your references to the test and loop above and your original issue too: lwip2 compared to previously-default-lwip1.4 also fixes the following behaviour:

  • running tcp server on esp8266, running tcp client on PC
  • esp8266 is reset by hand, hence the tcp server is started again
  • lwip1.4: PC's tcp client remains stuck with lwip1.4 as tcp peer (must be canceled by hand).
  • lwip2: PC's tcp client is immediately canceled by standard tcp stack with lwip2 as tcp peer.

You can reproduce that with the TelnetSerial example.
Does this example match your initial post above ?

@objarni
Copy link

objarni commented Dec 1, 2017

The use case is every http request made using the http client library. At the moment this error means a http request using the http client library will occasionally cause the esp8266 to get stuck in the http client library and never return while still feeding the watchdog timer.

That is broken behaviour. I'm quite surprised it's not happening more in general...? Is there a clear way to reproduce the behaviour?

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 1, 2017

I don't agree with the use case, or I don't understand it.
People will look at me badly at some point because of me constantly repeating lwip2 all the time.
As I explained above, the tcp client will gracefully reset/close whenever the tcp server restarts (with connected()==false). TCP has internal algorithms, namely one is called retransmit-timeout, which is part of the protocol and always enabled in any decent implementation. For unknown reason to me, lwip-1.4 implementation by espressif is broken, here comes lwip2.
If your client gets no immediate answer from the tcp server (no ack for instance), then the tcp layer itself starts a retransmit timeout, enabled, internal, hidden to user, that re-triggers peer (the server, but it is actually a symmetric bi-directionnal behaviour) to get an answer.
If the server goes OFF and is not restarted, then yes, the client will wait longer to finally conclude that the server is down, but that's what your web browser does too. If the server comes back, then the browser (your PC's tcp layer) and also lwip(2 only)'s tcp, by sending its internal peeks will realize that a new server has come and will cancel (connected()==false or your browser's "something is wrong" message) the current connection. It is client's application responsibility (you clicking shift-reload, or your sketch restarting an HTTPClient session) to initiate a new one with the restarted server.

@igrr
Copy link
Member

igrr commented Dec 1, 2017

If your client gets no immediate answer from the tcp server (no ack for instance), then the tcp layer itself starts a retransmit timeout

I think in the case of HTTPClient library this may not happen, as at this point the client is not sending anything to the server. Assume that the client has received a packet, sent an ack, and is waiting for the next packet. At this point, plug is pulled from the server. The client will faithfully sit waiting for the next packet, which will never arrive.

This can be either solved using keepalives, or by fixing the logic of HTTPClient (i.e. taking some timeout into account). I would probably consider the second option safer (less possible side effects) at the moment.

@kentaylor
Copy link
Author

Thankyou @igrr. Yes, that is what happens. I would only add that fixing the logic of HTTPClient should be done regardless of whether keepalives are enabled even though enabling keepalives should fix the problem as well.

The principle is, that it is not possible to know if a TCP connection still exists so when awaiting data there always needs to be a timeout. Therefore a timeout should be added to this loop. This is not the only place in the Arduino code where the device does not have a timeout while waiting for data from a TCP socket but fixing this one would be a great start.

In response to @objarni, yes it is broken behaviour and it has been broken for a long time. In this thread I have linked to lots of reports where the problem being reported is most likely caused by this error. It is also, I suspect, why lots of people think large capacitors are needed to make the esp8266 reliable for the reasons explained here.

I can reproduce the problem quite reliably when I power a esp8266 from 3 alkaline batteries as described, I see it often when using mobile phone networks to connect to the internet, rarely on wired internet connections and once when the Thingspeak server was faulty devices connected to Thingspeak that were kilometers apart on different networks all got stuck at the same time. Apart from powering with alkaline batteries I can't reproduce stuckness reliably as it requires dropping IP packets at the correct moment and I don't know how to do that.

@objarni
Copy link

objarni commented Dec 1, 2017

@kentaylor thanks for keeping at this very real issue. I hope some kind of fix which does not involve "turn the world upside down" is merged is in place for 2.4 release.

An explanation of why my application - a internet connected "information radiator/orb" (physical pixel?) RGB led - does not suffer is because we do not use HTTPClient. Instead, we use raw TCP sockets and control connection state ourselves (reboot device if 90 seconds without any information from server seen).

@devyte
Copy link
Collaborator

devyte commented Dec 2, 2017

Alright, in the words of my country: Somebody Has To Cut The Cake (paraphrase), so I will.
There are some cases where keepalive is desired. Cases where connections with long lifetimes and infrequent packets transmissions are in fact a use case. An example is when there are NATs or some types of proxy between the server and client, and if you go too long without sending something, you may lose the route, and therefore the connection dies.
There are other cases where socket-level keepalives are not desired. I believe an example is MQTT servers, because keepalives there are part of the protocol, and hence you don't want the duplicate traffic, but in any case, there are many other examples. My own app is one, where I have 100+ devices on the same network doing all sorts of stuff, and I definitely don't want the extra traffic.

Conclusion:
Keepalives are needed. Right now, my understanding is that they can be enabled at build time, but not configured at runtime (per-socket), as they have only default values. Therefore, we need a per-socket api. There is already a PR under discussion for that. Let's not discuss it again here.

About keep-alives being disabled: If during a comm one end waits for a packet, but the other end dies unexpectedly, then the packet may never arrive. This wait state is not a transmit-wait-for-ack, where a retransmit algorithm within the tcp layer would help. Consider the following sequence (simplified):

Assume TCP connection is in ESTABLISHED state
Send packet with some data representing a request
Receive ack
Receive packet
Send ack
Wait for next packet
Receive packet
Send ack
Wait for next packet
other side dies

The waiting is at above socket level, while the packet send-ack receive is at tcp level. In this case, if the low level tcp layer does not have an inactivity timer (which I do believe is our case), the app will wait forever. The connection has been lost, the tcp layer won't detect it, because there is nothing pending to transmit or ack pending to be received. Here, _tcp-> connected() returns true, which means that:

bool HTTPClient::connected()
{
    if(_tcp) {
        return (_tcp->connected() || (_tcp->available() > 0));
    }
    return false;
}

wold return true, which means that a loop that depends on HTTPClient::connected() would be stuck forever. I believe @kentaylor refers to this case.

In that case, a timeout of some sort is needed. And, in fact, in the while(connected()) loop that @kentaylor links to, there is already a timeout implemented:

while(connected()) {
        size_t len = _tcp->available();
        if(len > 0) {
           //do something with the data
          ...
        } else {
            //check for timeout: we're waiting for data, but give up after timeout.
            if((millis() - lastDataTime) > _tcpTimeout) {
                return HTTPC_ERROR_READ_TIMEOUT;
            }
            delay(0);
        }
}

And so, in that case, there won't be an infinite loop. There other cases, such as the ThinkSpeakClient in the esp8266-weather-station implementation:

while(client.connected()) {
    while((size = client.available()) > 0) {
      c = client.read();
      if (c == '{' || c == '[') {
        isBody = true;
      }
      if (isBody) {
        parser.parse(c);
      }
    }
  }  

where there is no timeout implemented, hence the danger of infinite loop: client.connected() returns true, client.available() will always return 0, because the other side died.

Conclusion:
Some timeout mechanism is needed.

The arguable point is: in which level should it be implemented? Assuming I'm correct about what I see in the code, I see the following possibilities:

  1. Within _tcp->connected(). That would make it part of our socket layer, and yet it would remain above the low level tcp layer which is part of lwip.
    Pros::
  • It would be easiest to implement, as it must be done in just one place.
  • All apps that use WiFiClient would receive the functionality right away
    Cons:
  • It could have high impact, as all apps that rely on WiFiClient could be affected, i.e.: apps could timeout on long connections, if the app is not changed to configure that timeout. That means that many apps that currently work would require code changes to continue working.
  1. Delegate it to the higher layer, e.g.: HTTPClient. This is the case now.
    Pros:
  • Little or no impact to those apps that use the higher layer protocol (e.g.: HTTPClient)
    Cons:
  • All apps that use WiFiClient directly would need to implement the timeout mechanism, or risk the infinite loop.
  • The timeout mechanism would need to be implemented in many places within our repo code.
  1. Delegate it to the app
    Pros:
  • no impact to the code in this repo, except maybe a doc entry and an example or two
    Cons:
  • All apps would need to implement the mechanism in their sketch

Conclusion:
I think we can discard point 3 right off the bat.
We implement point 2. first, at least for those cases in the code within our repo. Next release is 2.4, but we are trying to reduce scope to get it out the door, so we plan this for release 2.5.
We implement point 1. for release 3.0, as it is a behavior change. When the PR for point 1. is made, we mark it as part of the 3.0 project for merging for that release. We open a new issue to remove the timeouts of point 2. (no longer needed at that point), and mark that for release 3.0 as well.

Is everyone ok with this plan?

Now, having said all of the above, @d-a-v is correct about lwip2. There are a whole lot of issues in lwip1.4 that cause what seem to be infinite loops, or at least some form of s(t)uckness during a socket communication. I've lost count of how many issues have been addressed by it. Therefore, his request to try building with lwip2 is very valid, independently of whether this issue is really true or not.
@kentaylor please test with lwip2 and report back.

@aaro668
Copy link

aaro668 commented Dec 4, 2017

@devyte
I have been having issues with the ESP8266 hanging/unreliable connections as well, and I would like to try this "lwip2" that you recommend to see if my troubles go away then.
I am very new to this Github and I am not sure how to go about to install "lwip2" into my IDE so I can recompile my sketch and test it.....can you give me any help with this? Sorry for the newb questions.....

@devyte
Copy link
Collaborator

devyte commented Dec 5, 2017

@aaro668 uninstall from board manager, then install git. Instructions are on readthedocs. Then restart Arduino IDE. In the menu, you should see lwip options.

@aaro668
Copy link

aaro668 commented Dec 5, 2017

Thanks @devyte , this did not fix my problem though. It seems that wifiClient.connected() is returning true for a while even after I unplug the AP router....

@devyte
Copy link
Collaborator

devyte commented Dec 5, 2017

@aaro668 I assume you mean that you still have an unreliable connection even with lwjp2.
Please be more specific. How long is "a while", or what exactly do you mean? Do you mean that it returns true for some time, and then afterwards it returns false?
Are you sure it is this same issue, and not something else, like the arp issue, or mem corruption, or out of heap at some point in your code?

@kentaylor
Copy link
Author

Thanks for the lengthy response. Where @devyte.says:

I believe @kentaylor refers to this case.

Yes that is correct.
And this is correct too.

And so, in that case, there won't be an infinite loop.

I was taken aback when @devyte correctly pointed out I had linked to a loop that does have a timeout and is therefore OK.

It is interesting that @devyte mentioned the

ThinkSpeakClient in the esp8266-weather-station implementation:

as having this problem because that was where I first discovered it.

Here is a list of locations in the Arduino repository where I think a loop depends on the connected() function and will, under the circumstances confirmed by @devyte, get stuck forever.

  1. Docs - Example sketch https://github.com/esp8266/Arduino/blob/d7044eceab4d4453e4d73ac49dcbbc8b8d0c9eb1/doc/esp8266wifi/server-examples.rst#put-it-together
  2. StreamHttpClient.ino
    while(http.connected() && (len > 0 || len == -1)) {
  3. WebServer.ino
  4. Parsing.cpp
    while(!client.available() && client.connected())
  5. Docs - Get Response From Server https://github.com/esp8266/Arduino/blob/d7044eceab4d4453e4d73ac49dcbbc8b8d0c9eb1/doc/esp8266wifi/client-secure-examples.rst#get-response-from-the-server
  6. BarometricPressureWebServer.ino

Have I got these correct?

@devyte says

The arguable point is: in which level should it be implemented?

, provides 3 alternatives and favours option 1 as a final solution. I'd argue that options 2 and 3 are best. Option 1 is undesirable for the @devyte reason

Cons:
It could have high impact, as all apps that rely on WiFiClient could be affected, i.e.: apps could timeout on long connections, if the app is not changed to configure that timeout. That means that many apps that currently work would require code changes to continue working.

plus

  • most use cases are covered by option 2
  • where option 2 is not applicable the timeout is application specific so a default value can not reasonably be provided.
  • socket timeouts on reads is transparently provided by keepalives which also provide other useful functionality.
  • it adds complexity without adding functionality that isn't better provided by another means.
  • any application that is waiting for data should be resilient to the possibility it may never come.

Options 2 and 3 aren't exclusive so should both be done.

I will test with lwip2. I expect it will take a few weeks as stuckness only occurs in certain circumstrances. In any case the problems identified here should be fixed.

@d-a-v
Copy link
Collaborator

d-a-v commented Dec 7, 2017

As @kentaylor shows it, every sketch currently uses WiFiClient::connected() method to check on still active connection.
@devyte mentions in 1)Cons):

It could have high impact, as all apps that rely on WiFiClient could be affected, i.e.: apps could timeout on long connections, if the app is not changed to configure that timeout. That means that many apps that currently work would require code changes to continue working.

I'm not sure that long connections - idle or not - would be affected by TCP keepalives.
If the link is up and running, keepalive won't interfere. In case of a failing connection, keepalive would directly affect ::connected() result since this one directly comes from the underlying TCP, the same one which is running keepalive packets.
Applications/Sketches already take actions when ::connected() becomes false, and with keepalives, ::connected() will behave as the authors would expect it does.

To make things short:

  1. Is it agreed that tcp-keepalive would solve this issue and all the sitting while (client.connected()) loops ?
    (you all think so)
  2. Is tcp-keepalive harmless even on already existing applications that are not aware of it ?
    (it's what it is made for)
  3. Should it be set by default in every tcp connection ?
    it can be done with a few lines in ClientContext.h
    (regarding 2. I vote the same as 1.)

@devyte
Copy link
Collaborator

devyte commented May 29, 2018

@d-a-v @kentaylor I sort of lost track on this. I understand that the keepalive is merged. Is the issue resolved? Is there anything pending, or can this be closed?

@dalbert2
Copy link
Contributor

The issue is not resolved, at least not for the default case. Using 2.4.1, I periodically get a lockup running this code:
httpClient.begin(fwVersionURL); int httpResult = httpClient.GET();
where I've checked previously for WiFi.connected() (and it is). If someone has a workaround example for this, I'd be grateful if you'd post it.

I'd also like to second the notion that it is a cardinal sin for a library function to stroke the watchdog inside a loop that might block indefinitely; such behavior makes the entire platform unreliable for embedded applications.

@d-a-v
Copy link
Collaborator

d-a-v commented Jun 14, 2018

@devyte I had lost track of it too.

@dalbert2 Are you able to reproduce this bug with latest git version (which is pre-2.4.2) ?
Keepalive api is now available, and disabled by default. If the problem persists, would you agree to test by simply enabling it with myclient.keepAlive(5,5,5) ?

@ALL
It is agreed at least by @devyte and @kentaylor that

  • (2) a timeout should be added inside core's HTTPClient
  • (3) all examples spotted by @kentaylor (there) should be fixed by using an app timeout

All of these timeouts beeing linked to ::connected(), what about a new ::connectedWithTimeout()-like api, that would have an internal configurable 5000ms delay that'd be self reset at every single byte received ?
Even if I think keepalive should automagically solve this problem, the question of what default values should be used remains (currently it's 2hours by default, I propose 5s above, and I'm afraid this will be another point of everlasting discussion). So I agree too, fwiw, that we could implement something, for instance the proposal above.

@dalbert2
Copy link
Contributor

Thank you for the quick response and the amazing level of help you provide in this forum! I am happy to test things, but unfortunately, I am not able to reliably reproduce the failure yet; in fact I saw it for the first time today in code that had been running for a long time; I found this thread when looking into it. My app can (currently) only build with LWIP 1.4 due to shortage of IRAM with LWIP2; if I can reproduce reliably, would the results be useful with LWIP 1.4?

Thanks to Ken for identifying library calls that can loop indefinitely without timeout; if it's at all possible, those should be eliminated; no library code that can block should ever (effectively) disable the watchdog.

@d-a-v
Copy link
Collaborator

d-a-v commented Jun 15, 2018

What's stated above is not specific to lwip2, so yes you can try keepalive.
Please note that core 2.4.1 (the current release) has known issues about memory leaking (unrelated with lwip2), and issues about httpclient failing after some time have already been opened. Leaks are fixed in current git version.
(and if you can find some spare time please open a new issue about the ram shortage you have specifically with lwip2)

@dalbert2
Copy link
Contributor

For those struggling with this issue, an easy work-around is to implement a software watchdog using Ticker. Because the interrupt system is still running in this case, if the ticker expires without the watchdog getting stroked (e.g. because it is stuck inside LWIP), your ticker ISR can reset the system. Not perfect, but far better than a permanent lockup. If what I'm describing is not obvious, someone did a nice write-up here:
onlineshouter.com/how-to-create-a-watchdog-to-stabilise-your-esp8266/

@d-a-v d-a-v self-assigned this Aug 29, 2018
@d-a-v d-a-v removed the waiting for feedback Waiting on additional info. If it's not received, the issue may be closed. label Aug 29, 2018
@d-a-v
Copy link
Collaborator

d-a-v commented Aug 29, 2018

@dalbert2 Have you had chance to try myclient.keepAlive(5,5,5) ?

@dalbert2
Copy link
Contributor

dalbert2 commented Oct 16, 2018

@d-a-v I'm sorry, I lost track of this issue and just now got back to it. I'm using core 2_4_2 with LWIP 2.0.3 (low memory version).

I added a call to keepAlive(5,5,5) to my code. Unfortunately, it just causes an Exception 28 on the call to keepAlive. The problem is that keepAlive() de-references the client's tcp protocol control block which doesn't exist until after the call to connect(). Unfortunately, the call to connect() is where things are getting hung-up. It works most of the time, but sometimes the call locks up the ESP8266 for hours. millis() doesn't advance, but it does eventually recover. So I've since moved the keepAlive() call after the connect() call and it might help with subsequent GETs, but it seems that the problem is either in the call to dns_gethostbyname() or tcp_connect(). I don't see an obvious problem with the code leading up to either call.

I've been trying to eliminate more code (e.g. using WiFiClient instead of HttpClient) to narrow down the problem. Next, I'm going to try using the server IP address instead of the hostname to count out the problem being in the DNS lookup.

static WiFiClient wifi_client;
static const char http_server[] = "myServer.com";
static const int  http_port = 80;

int ICACHE_FLASH_ATTR myServer_http_get(char *bufP, int buf_size) {
	// connect to server
	wifi_client.setTimeout(2000);
	wifi_client.setNoDelay(true); // send messages immediately (no Nagle)
	wifi_client.keepAlive(5,5,5);
	Serial.printf_P(PSTR("Connecting to server..."));   // last msg displayed when locked up
	delay(5); // allow time for message to get out
	if (!wifi_client.connect(http_server, http_port)) {
		Serial.printf_P(PSTR("Server connection failed.\r\n"));
		return 0;
	}
	// send GET request...an hour or more may elapse before this happens
	Serial.printf_P(PSTR("sending GET...\r\nCLIENT: "));
  ...

@dalbert2
Copy link
Contributor

A bit more information: this happens with either LWIP2 configuration and it seems to only happen when connection requests to the server are failing (which points back to the original problem in this post). After several failed connection attempts, a connection attempt will just hang either for hours or indefinitely. I'm able to reproduce it more often now and hope to soon be able to repeat it reliably.

@devyte
Copy link
Collaborator

devyte commented Nov 10, 2019

@kentaylor I think this issue was addressed with the keepalive feature.
I am closing because the issue is old and a whole lot has changed since it was reported, including relevant stability fixes. If anyone here considers the problem to be still relevant, please open a new issue and follow the template instructions.

@devyte devyte closed this as completed Nov 10, 2019
@mdries14887
Copy link

@igrr My application does not use battery for power (status lamps in software development offices). So I prefer reliability in favour of low power consumption or low bytes-per-minute.

  1. Is there any way to enable keep-alive packets as described by articles linked by @kentaylor?
  2. If not, I guess one workaround would be a behaviour where a byte is sent every minute from the NodeMCU, and then wait for an answer byte from the server within a specific time period, say 15 seconds. If the answer byte does not arrive within 15 seconds, I could assume that the connection is broken, in which case a ESP.restart(); could be performed. Does this make sense? Is there any flaw in this reasoning?

If you had a cheap pc with inbuilt NIC that supported W.O.L or ''magic packet message'' you could implement what i did with my wifi monitor to reboot my open wrt router supply to hard reset it when my complicated network loaded up to much at home. I came across same timing loop issue which i solved with turning an old open wrt wifi router into a wifi repeater and to repeat magic packet message to NIC when node mcu lost connection.and the nic's output gate relay toggled a relay rebooting mains power to my nbn router

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests