Esp8266 IP Address not reachable after a while #2330

Open
thehellmaker opened this Issue Jul 26, 2016 · 287 comments

Comments

thehellmaker commented Jul 26, 2016

Hi All,
The ESP intermittently becomes unavailable after some time, with this error:
Connection to http://192.168.1.4:80 refused
org.apache.http.conn.HttpHostConnectException: Connection to http://192.168.1.4:80 refused

The same happens from the web browser, and then it randomly starts working again. The logs on the ESP device itself show no crash.

Am I missing something in the server setup that would keep it alive all the time?

I thought this was a webserver problem, but it seems like an ESP issue.
Here is the related issue: me-no-dev/ESPAsyncWebServer#54

Cheers,
Akash A

Author

thehellmaker commented Jul 26, 2016

A similar issue, #1137, was reported in Dec 2015.

Author

thehellmaker commented Jul 26, 2016

A similar issue was also reported at
http://internetofhomethings.com/homethings/?p=426

Author

thehellmaker commented Jul 26, 2016

As mentioned in me-no-dev/ESPAsyncWebServer#54, I have already tried the approach at http://www.esp8266.com/viewtopic.php?p=12809 and it's still not working.

Now I will analyse it myself using Wireshark.

Author

thehellmaker commented Jul 26, 2016

Wireshark is receiving an ARP broadcast from the module every second because of the fix.

Here is the packet content:
100 7.995514 Espressi_1a:66:47 Broadcast ARP 42 Gratuitous ARP for 192.168.1.6 (Request)
Frame 100: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: Espressi_1a:66:47 (5c:cf:7f:1a:66:47), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
    Destination: Broadcast (ff:ff:ff:ff:ff:ff)
        Address: Broadcast (ff:ff:ff:ff:ff:ff)
        .... ..1. .... .... .... .... = LG bit: Locally administered address (this is NOT the factory default)
        .... ...1 .... .... .... .... = IG bit: Group address (multicast/broadcast)
    Source: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (request/gratuitous ARP)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    [Is gratuitous: True]
    Sender MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Sender IP address: 192.168.1.6
    Target MAC address: 00:00:00_00:00:00 (00:00:00:00:00:00)
    Target IP address: 192.168.1.6
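The decode above shows the defining properties of a gratuitous ARP request: Ethernet destination ff:ff:ff:ff:ff:ff, an all-zero target MAC, and a sender IP equal to the target IP. The following is a minimal sketch in standard C++ (so it runs off-target) that builds such a 42-byte frame from a MAC/IP pair and checks the sender-IP-equals-target-IP property. The function names are hypothetical; this is not lwIP's etharp code, only an illustration of the frame layout.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;

// 14-byte Ethernet II header + 28-byte ARP body (IPv4 over Ethernet) = the
// "42 bytes on wire" shown in the capture.
constexpr std::size_t kArpFrameLen = 42;
using Frame = std::array<std::uint8_t, kArpFrameLen>;

// Builds an Ethernet II + ARP frame; multi-byte fields are big-endian.
Frame buildArp(const Mac& ethDst, const Mac& ethSrc, std::uint16_t opcode,
               const Mac& senderMac, const Ip4& senderIp,
               const Mac& targetMac, const Ip4& targetIp) {
    Frame f{};
    std::memcpy(&f[0], ethDst.data(), 6);    // Ethernet destination
    std::memcpy(&f[6], ethSrc.data(), 6);    // Ethernet source
    f[12] = 0x08; f[13] = 0x06;              // EtherType: ARP (0x0806)
    f[14] = 0x00; f[15] = 0x01;              // hardware type: Ethernet (1)
    f[16] = 0x08; f[17] = 0x00;              // protocol type: IPv4 (0x0800)
    f[18] = 6;                               // hardware size
    f[19] = 4;                               // protocol size
    f[20] = static_cast<std::uint8_t>(opcode >> 8);
    f[21] = static_cast<std::uint8_t>(opcode & 0xff);
    std::memcpy(&f[22], senderMac.data(), 6);
    std::memcpy(&f[28], senderIp.data(), 4);
    std::memcpy(&f[32], targetMac.data(), 6);
    std::memcpy(&f[38], targetIp.data(), 4);
    return f;
}

// Gratuitous ARP request: broadcast, target MAC zeroed, sender IP == target IP.
Frame buildGratuitousArp(const Mac& mac, const Ip4& ip) {
    const Mac broadcast{0xff, 0xff, 0xff, 0xff, 0xff, 0xff};
    const Mac zero{};
    return buildArp(broadcast, mac, 1 /* request */, mac, ip, zero, ip);
}

// True for an ARP frame whose sender IP equals its target IP.
bool isGratuitous(const Frame& f) {
    const bool isArp = f[12] == 0x08 && f[13] == 0x06;
    return isArp && std::memcmp(&f[28], &f[38], 4) == 0;
}
```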

Author

thehellmaker commented Jul 26, 2016

For a device that is responding to ARP, here is the sequence:

  1. Request
    1732 208.713855 IntelCor_c5:37:30 Espressi_1a:66:47 ARP 42 Who has 192.168.1.6? Tell 192.168.1.7
  2. Response
    1733 208.734013 Espressi_1a:66:47 IntelCor_c5:37:30 ARP 42 192.168.1.6 is at 5c:cf:7f:1a:66:47
  3. Request Body
Frame 1732: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
    Interface id: 0 (\Device\NPF_{641ED2C7-4125-43D0-BEF1-205ACE40B627})
    Encapsulation type: Ethernet (1)
    Arrival Time: Jul 26, 2016 21:05:15.359512000 India Standard Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1469547315.359512000 seconds
    [Time delta from previous captured frame: 0.374458000 seconds]
    [Time delta from previous displayed frame: 0.374458000 seconds]
    [Time since reference or first frame: 208.713855000 seconds]
    Frame Number: 1732
    Frame Length: 42 bytes (336 bits)
    Capture Length: 42 bytes (336 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: IntelCor_c5:37:30 (18:5e:0f:c5:37:30), Dst: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Destination: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        Address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (request)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: request (1)
    Sender MAC address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Sender IP address: 192.168.1.7
    Target MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Target IP address: 192.168.1.6
  4. Response Body
Frame 1733: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
    Interface id: 0 (\Device\NPF_{641ED2C7-4125-43D0-BEF1-205ACE40B627})
    Encapsulation type: Ethernet (1)
    Arrival Time: Jul 26, 2016 21:05:15.379670000 India Standard Time
    [Time shift for this packet: 0.000000000 seconds]
    Epoch Time: 1469547315.379670000 seconds
    [Time delta from previous captured frame: 0.020158000 seconds]
    [Time delta from previous displayed frame: 0.020158000 seconds]
    [Time since reference or first frame: 208.734013000 seconds]
    Frame Number: 1733
    Frame Length: 42 bytes (336 bits)
    Capture Length: 42 bytes (336 bits)
    [Frame is marked: False]
    [Frame is ignored: False]
    [Protocols in frame: eth:ethertype:arp]
    [Coloring Rule Name: ARP]
    [Coloring Rule String: arp]
Ethernet II, Src: Espressi_1a:66:47 (5c:cf:7f:1a:66:47), Dst: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Destination: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        Address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Source: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        Address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
        .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
        .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
    Type: ARP (0x0806)
Address Resolution Protocol (reply)
    Hardware type: Ethernet (1)
    Protocol type: IPv4 (0x0800)
    Hardware size: 6
    Protocol size: 4
    Opcode: reply (2)
    Sender MAC address: Espressi_1a:66:47 (5c:cf:7f:1a:66:47)
    Sender IP address: 192.168.1.6
    Target MAC address: IntelCor_c5:37:30 (18:5e:0f:c5:37:30)
    Target IP address: 192.168.1.7
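Frames 1732 and 1733 are the ordinary request/reply pair: the reply flips the opcode from 1 to 2, swaps sender and target, and fills in the responder's own MAC as the sender MAC. A minimal sketch of that transformation on a raw 42-byte frame, assuming IPv4-over-Ethernet ARP with no VLAN tag (the names are hypothetical; this is not the lwIP implementation):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;
using Frame = std::array<std::uint8_t, 42>;  // Ethernet II header + ARP body

// Byte offsets in an untagged Ethernet II + ARP (IPv4 over Ethernet) frame.
constexpr std::size_t kEthDst = 0, kEthSrc = 6, kEthType = 12, kOpcode = 20,
                      kSha = 22, kSpa = 28, kTha = 32, kTpa = 38;

// Returns the reply to an ARP request for `myIp`, or std::nullopt if `req`
// is not such a request (mirrors frame 1732 -> frame 1733 in the capture).
std::optional<Frame> makeArpReply(const Frame& req, const Mac& myMac, const Ip4& myIp) {
    const bool isArp = req[kEthType] == 0x08 && req[kEthType + 1] == 0x06;
    const bool isRequest = req[kOpcode] == 0x00 && req[kOpcode + 1] == 0x01;
    if (!isArp || !isRequest || std::memcmp(&req[kTpa], myIp.data(), 4) != 0) {
        return std::nullopt;
    }
    Frame rep = req;                                  // bytes 12..19 are unchanged
    std::memcpy(&rep[kEthDst], &req[kEthSrc], 6);     // unicast back to the asker
    std::memcpy(&rep[kEthSrc], myMac.data(), 6);
    rep[kOpcode + 1] = 0x02;                          // opcode: reply (2)
    std::memcpy(&rep[kSha], myMac.data(), 6);         // "<myIp> is at <myMac>"
    std::memcpy(&rep[kSpa], myIp.data(), 4);
    std::memcpy(&rep[kTha], &req[kSha], 6);           // target = original sender
    std::memcpy(&rep[kTpa], &req[kSpa], 4);
    return rep;
}
```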

Author

thehellmaker commented Jul 26, 2016

Found a very interesting thing.
I am using Windows 7 to debug this issue, and here are the findings:

  1. The ESP responds to ARP queries whose destination is the ESP's MAC address:
    1918 151.364565 IntelCor_c5:37:30 Espressi_1a:66:47 ARP 42 Who has 192.168.1.6? Tell 192.168.1.7
    1919 151.371335 Espressi_1a:66:47 IntelCor_c5:37:30 ARP 42 192.168.1.6 is at 5c:cf:7f:1a:66:47
  2. The ESP does not respond to broadcast ARP pings sent with nmap:
    3459 254.010073 IntelCor_c5:37:30 Broadcast ARP 42 Who has 192.168.1.6? Tell 192.168.1.7

I will look into the ARP responder code in the codebase.
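The only difference between the answered and the unanswered request above is the Ethernet destination: frame 1918 is addressed to the ESP's MAC, while frame 3459 goes to the broadcast address. Wireshark's "IG bit" (the least-significant bit of the first octet) is what separates the two, and its "LG bit" is the next bit up. A small sketch, with hypothetical names, that classifies a destination MAC, which can help when tallying which kind of ARP request a capture shows being answered:

```cpp
#include <array>
#include <cstdint>

using Mac = std::array<std::uint8_t, 6>;

enum class DestKind { Unicast, Multicast, Broadcast };

// I/G bit = least-significant bit of the first octet:
// 0 = individual (unicast); 1 = group (multicast, or broadcast if all ones).
DestKind classifyDestination(const Mac& dst) {
    if ((dst[0] & 0x01) == 0) return DestKind::Unicast;
    for (std::uint8_t b : dst) {
        if (b != 0xff) return DestKind::Multicast;
    }
    return DestKind::Broadcast;
}

// U/L bit (Wireshark's "LG bit") = second-least-significant bit of the first octet.
bool isLocallyAdministered(const Mac& mac) { return (mac[0] & 0x02) != 0; }
```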

Author

thehellmaker commented Jul 26, 2016

It looks like ARP requests are handled entirely by the lwIP project, which this project depends on.
@me-no-dev it looks like you imported it as a dependency 4 months back. I diffed it against the latest lwIP version, 1.4.1, and it seems some broadcast functionality was added that is not in the imported version. Did you import the latest version?

Collaborator

me-no-dev commented Jul 26, 2016

lwIP comes from Espressif, not me :) I just tweaked some stuff here and there (not broadcast, but multicast). The latest lwIP is a WIP :)

Author

thehellmaker commented Jul 28, 2016

Upgraded from the 1.3.2 port to open-source lwIP (1.4.1), as suggested by @igrr. The module is still responding to ARP requests; waiting to see if it stops.

Member

igrr commented Aug 1, 2016

Can you make a diff between 1.3.2 and 1.4.1 in the part which deals with ARP? Maybe we can backport the fix instead of updating all of lwip for now.

@igrr igrr added this to the 2.5.0 milestone Aug 1, 2016

Author

thehellmaker commented Aug 2, 2016

It stopped responding to ARP requests on 1.4.1 as well.
The gratuitous ARP being sent is not being handled by Android devices. Deep diving into the codebase to debug further.

51056 1693.352539 Espressi_88:7f:7e Broadcast ARP 42 Gratuitous ARP for 192.168.1.12 (Request)
Frame 51056: 42 bytes on wire (336 bits), 42 bytes captured (336 bits) on interface 0
Ethernet II, Src: Espressi_88:7f:7e (5c:cf:7f:88:7f:7e), Dst: Broadcast (ff:ff:ff:ff:ff:ff)
Address Resolution Protocol (request/gratuitous ARP)

Author

thehellmaker commented Aug 2, 2016

However, after a restart the module takes a new IP address and starts responding to ARP requests again.

Author

thehellmaker commented Aug 7, 2016

Now I am seeing that the IP address is also in use by another device, which is to be expected since the ESP didn't respond to ARP requests. But the ESP has been sending gratuitous ARPs; here is the Wireshark capture:

5495591 37123.051429 00:e1:40:46:09:6c Broadcast ARP 42 Gratuitous ARP for 192.168.1.5 (Request) (duplicate use of 192.168.1.5 detected!)
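Wireshark raises the "duplicate use ... detected" warning when it sees two different MAC addresses claiming the same IPv4 address in ARP traffic. A minimal sketch of that bookkeeping, assuming only the sender fields of observed ARP packets are fed in; `IpMacTracker` is a hypothetical name, not a Wireshark or lwIP API:

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <optional>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;

// Remembers which MAC last claimed each IPv4 address in an ARP sender field.
class IpMacTracker {
public:
    // Records (ip, mac). Returns the previously seen MAC if a *different* MAC
    // already claimed this IP, i.e. a duplicate-address conflict.
    std::optional<Mac> observe(const Ip4& ip, const Mac& mac) {
        auto it = table_.find(ip);
        if (it == table_.end()) {
            table_.emplace(ip, mac);
            return std::nullopt;
        }
        if (it->second == mac) return std::nullopt;  // same owner, no conflict
        Mac previous = it->second;
        it->second = mac;  // keep the most recent claimant, as an ARP cache would
        return previous;
    }

private:
    std::map<Ip4, Mac> table_;  // std::array provides operator<, so it works as a key
};
```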

Author

thehellmaker commented Aug 12, 2016

This is not an issue with ARP, as most people have pointed out; it has something to do with the stability of the wireless connection.

Right after the module stops responding to ARP, I see these debug logs:
wifi evt: 7
add 1
aid 1
station: 40:88:05:b1:29:eb join, AID = 1
wifi evt: 5
wifi evt: 7
bcn_timout,ap_probe_send_start

This seems to be the root cause. I have attached the full log here.
https://drive.google.com/open?id=0B8DXcb9GfNuARFZGdy1USGNPbFk

Author

thehellmaker commented Aug 12, 2016

Attaching the enums that the event numbers map to.


Both map to the same enum values.
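The enum screenshots did not survive. As an assumption to verify against the `user_interface.h` of the SDK actually in use, the non-OS SDK numbers its Wi-Fi events with the station-mode events first (0 to 4), followed by the soft-AP events: 5 = a station connected to the ESP's soft AP, 6 = a station disconnected, 7 = a probe request was received. Under that assumption, the `station: ... join` / `wifi evt: 5` pairs in the log above are soft-AP events. A small decoder for such log lines:

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <string_view>

// ASSUMPTION: numbering of the non-OS SDK Wi-Fi event enum (0..7);
// check it against user_interface.h of the SDK you build with.
constexpr std::array<std::string_view, 8> kEventNames = {
    "STAMODE_CONNECTED",           // 0
    "STAMODE_DISCONNECTED",        // 1
    "STAMODE_AUTHMODE_CHANGE",     // 2
    "STAMODE_GOT_IP",              // 3
    "STAMODE_DHCP_TIMEOUT",        // 4
    "SOFTAPMODE_STACONNECTED",     // 5
    "SOFTAPMODE_STADISCONNECTED",  // 6
    "SOFTAPMODE_PROBEREQRECVED",   // 7
};

// Finds "wifi evt: N" anywhere in a log line (the debug output sometimes runs
// several prints together) and returns the event name, or "" if none/unknown.
std::string decodeWifiEvent(std::string_view line) {
    constexpr std::string_view marker = "wifi evt: ";
    const std::size_t pos = line.find(marker);
    if (pos == std::string_view::npos) return {};
    std::size_t i = pos + marker.size();
    if (i >= line.size() || line[i] < '0' || line[i] > '9') return {};
    std::size_t n = 0;
    for (; i < line.size() && line[i] >= '0' && line[i] <= '9'; ++i) {
        n = n * 10 + static_cast<std::size_t>(line[i] - '0');
        if (n >= kEventNames.size()) return {};  // unknown event number
    }
    return std::string(kEventNames[n]);
}
```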

mtnbrit commented Aug 12, 2016

What make, model and firmware is your AP? Have you tried a different brand or model of Wi-Fi AP? They are far from all equal.

ClaudioHutte commented Aug 12, 2016

So you think the Wi-Fi connection instability leads to the inability to respond to ARP broadcast requests?
A reception problem, since transmission seems to be OK, correct?

Author

thehellmaker commented Aug 13, 2016

@ClaudioHutte You are partially right. Here are my observations:

  1. The receiver seems to be mainly affected: after this point, gratuitous ARPs from the other module are not received either, although gratuitous ARPs are still being sent to the other modules.
  2. However, if you look at the log below, the module tries to rejoin and gets wifi evt: 5, which is connected; after that it receives gratuitous ARPs from the other modules for just a few seconds, and then it disconnects with
err already associed!
station: 98:0c:a5:b8:de:91 leave, AID = 1

Log

add 1
aid 1
station: 98:0c:a5:b8:de:91 join, AID = 1
wifi evt: 5
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpGot ARP Input 
nwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpGot ARP Input 
nHere for arpwifi evt: 7
Got ARP Input 
nHere for arpwifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
wifi evt: 7
err already associed!
station: 98:0c:a5:b8:de:91 leave, AID = 1
rm 1
wifi evt: 6
add 1
aid 1
station: 98:0c:a5:b8:de:91 join, AID = 1
  3. Somewhere between multiple join and leave attempts you'll also see
    max connection!
  4. And of course a bunch of
bcn_timout,ap_probe_send_start
bcn_timout,ap_probe_send_start

Just to explain my setup, I have two ESP8266-12F modules:
http://www.thaieasyelec.com/products/wireless-modules/wifi-modules/esp8266-12f-wifi-serial-transceiver-module-detail.html

  1. I set up gratuitous ARP to broadcast ARP announcements into the network every second, as @ClaudioHutte pointed out in the beginning.
  2. When the modules connect for the first time, the module prints this every second:
Got ARP Input 
nHere for arp
  3. Along with this ARP reception there are other wifi events.
  4. At some point the logs mentioned in point 2 stop (after close to 48 hours), and a bunch of other events happen before this terminates.

I have attached the complete log to the link
https://drive.google.com/file/d/0B8DXcb9GfNuARFZGdy1USGNPbFk/view

ClaudioHutte commented Aug 13, 2016

I never tested two units as you did, though I ran into the same trouble with an ESP8266-12 and a TP-Link router located quite far away (two stories below mine). I would like to run some tests the same way you did, but I will be busy with other work for the next two weeks.
What happens if the every-second gratuitous ARP workaround is stopped/skipped?

Author

thehellmaker commented Aug 13, 2016

Before you mentioned gratuitous ARP, I hadn't added it to the codebase. Even then the module stopped responding, as we discussed in me-no-dev/ESPAsyncWebServer#54.

I haven't collected logs without gratuitous ARP, but I'm sure it's the same issue.

It always takes 36+ hours for the module to eventually stop responding.

alex00971 commented May 8, 2017

Upgrading to SDK 2.1.0 (#3215) will solve this problem.

pouriap commented Jun 20, 2017

Thanks for the efforts to create the update_sdk_2.1.0 branch.

But I'm still having the ARP issue even when using that branch:

[Screenshot: esp8266_arp]

Can anyone confirm that their ARP issue has been resolved by using that branch?

IvanBayan commented Jun 20, 2017

I tried the new SDK and still have the ARP issue.

vks007 commented Jun 25, 2017

I also have this same issue.
I am running a webserver on the ESP, which connects to my router in STA mode. The router assigns a fixed IP to the ESP (192.168.1.54). Everything works well, but after some time (typically a few hours to a day) the ESP webserver stops responding. I tried pinging the IP address at this point and it's unreachable. To check the memory footprint, I added logging on the ESP that calls a Google Sheets URL and logs all relevant info. All of that keeps working fine, and the memory footprint is also normal. So while my ESP is able to reach the internet, its IP address is not reachable from within the local network.
If I reset my ESP or power-cycle my modem (so the IP address is assigned again), the issue goes away for a few hours.
I have tried the simple webserver from the examples and it behaves the same way, so my program is not what is causing this. I have also tried this on a SONOFF, an Electrodragon ESP relay module, an ESP-01 module and an ESP-12E module; they all behave the same.
Can somebody guide me on what I should be looking at here?

mtnbrit commented Jun 25, 2017

vks007 commented Jun 26, 2017

Hi @mtnbrit, I am in the process of trying out a different router; it will take a few days to verify this.
Also, while the above condition happened sometimes, in my testing since yesterday I am able to ping the ESP most of the time while the webserver does not respond. At other times I get a response from my iPhone browser while the desktop browser throws a connection reset error.
I also started recycling the webserver every 10 minutes (I mean recreating the webserver object via the reset method), and I was still able to get into a state where the ESP behaves normally but the webserver stops responding.
Is there some way I can debug the state of the webserver object, so I can log it and figure out what is not responding? Maybe I can tweak the source files to read some members and log them (if they are private, make them public just for this purpose). But I am not sure which things would tell me something about the webserver object.

Collaborator

devyte commented Oct 2, 2018

There is no known solution for this at the moment, and confirmation of the proposed workarounds is still pending. Pushing milestone back.

Collaborator

d-a-v commented Oct 6, 2018

Please have a look at #5210

mateuszdrab commented Oct 6, 2018

I've been able to work around this issue in my environment by creating a Python ARP responder script which basically responds to ARP requests from the firewall. I've not had a single ping alert since, or at least they happen once a week for 1 or 2 boards; previously I'd get 15-20 alerts a day. Once in a while one of the boards disconnects, but considering I have 30+ of them at home, I blame it on channel congestion. I'd still prefer a solution that doesn't need workarounds, so I'm happy to try the new SDK on one of the units.
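The script itself is not shown in the thread. As a hypothetical illustration of the decision a proxy-ARP responder like this has to make (the class name, the single allowed requester and the IP-to-MAC table are all assumptions), written in C++ for consistency with the rest of the thread:

```cpp
#include <array>
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

using Mac = std::array<std::uint8_t, 6>;
using Ip4 = std::array<std::uint8_t, 4>;

// Hypothetical proxy-ARP policy: answer ARP requests on behalf of known boards,
// but only when the request comes from one specific host (e.g. the firewall).
class ProxyArpPolicy {
public:
    ProxyArpPolicy(Ip4 allowedRequester, std::map<Ip4, Mac> boards)
        : allowed_(allowedRequester), boards_(std::move(boards)) {}

    // Given the sender IP and target IP of an incoming ARP request, returns the
    // MAC to put in the reply, or std::nullopt to stay silent.
    std::optional<Mac> answerFor(const Ip4& senderIp, const Ip4& targetIp) const {
        if (senderIp != allowed_) return std::nullopt;  // not a requester we proxy for
        auto it = boards_.find(targetIp);
        if (it == boards_.end()) return std::nullopt;   // not one of our boards
        return it->second;
    }

private:
    Ip4 allowed_;
    std::map<Ip4, Mac> boards_;
};
```

The table has to stay in sync with the boards' real addresses (static IPs or DHCP reservations); otherwise the responder would advertise a stale MAC. It also masks the symptom rather than fixing reception on the board.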

Collaborator

devyte commented Oct 9, 2018

Everyone, #5210 is merged. Given the explanation in Espressif's doc (quoted in comments in the PR), it is clear that the ESP could miss broadcasts when using light sleep with the sleep level set to max. It is possible that at some point Espressif "improved" power usage by internally changing the sleep level to max, which can miss broadcasts; that could explain the symptoms in this issue. In the SDK version integrated in the PR, the setting can be controlled, and it is set explicitly in the core internals.
Please retest with the latest git and report back here.
Oh, and cross your fingers...
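To make the "can miss broadcasts" argument concrete: in 802.11 power save, the AP buffers broadcast and multicast frames (which include ARP requests) and sends them right after DTIM beacons. If, as assumed here, a station at the maximum sleep level wakes every `wakeEvery` beacons instead of at every DTIM beacon, it only hears the DTIM beacons that coincide with its wake-ups. A simplified model of that arithmetic (aligned wake-ups, no retransmissions, hypothetical function names; this is not Espressif's code):

```cpp
#include <numeric>

// Simplified 802.11 power-save model (an assumption):
// - the AP sends buffered broadcast/multicast frames right after each DTIM
//   beacon, i.e. every `dtimPeriod` beacons;
// - the station wakes every `wakeEvery` beacons, starting on a DTIM beacon.
//   (If wake-ups are offset from the DTIMs, the share can be lower, even zero.)
// Returns the share of DTIM beacons the station is awake for.
double fractionOfDtimHeard(unsigned wakeEvery, unsigned dtimPeriod) {
    if (wakeEvery == 0 || dtimPeriod == 0) return 0.0;  // invalid configuration
    // Wake-ups and DTIMs coincide every lcm(wakeEvery, dtimPeriod) beacons, so the
    // share heard is dtimPeriod / lcm(...) = gcd(wakeEvery, dtimPeriod) / wakeEvery.
    return static_cast<double>(std::gcd(wakeEvery, dtimPeriod)) / wakeEvery;
}

// Time between two broadcast bursts the station actually hears, in ms.
// The beacon interval is given in time units (1 TU = 1024 microseconds).
double msBetweenHeardBroadcasts(unsigned wakeEvery, unsigned dtimPeriod,
                                unsigned beaconIntervalTu) {
    if (wakeEvery == 0 || dtimPeriod == 0) return 0.0;
    const unsigned beacons = std::lcm(wakeEvery, dtimPeriod);
    return beacons * beaconIntervalTu * 1.024;
}
```

For example, with a DTIM period of 1 and a wake-up every 3 beacons, only 1 in 3 broadcast bursts is heard, one every lcm(3, 1) = 3 beacons, about 307 ms at a 100 TU beacon interval. Waking at every beacon (`wakeEvery = 1`) hears all of them for any DTIM period.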

Collaborator

d-a-v commented Oct 19, 2018

If the issue is still there, I have put up a pingAlive example that uses WIFI_MODEM_SLEEP mode.
The gateway ping runs at a 5 s interval.
The maximum unreachable time has been 7 seconds in 15 hours of testing (it just jumped to 10 s after I put my finger on the antenna).
I don't have a power meter accurate enough for current measurement.

date (UTC): Fri Oct 19 07:56:11 2018
delta:      25118 ms
delta-max:  30143 ms
            (should not be more than (ping)5000 + (refresh)20000 = 25000 ms)

gateway ping stats: 11019 sent - 11019 received

will be refreshed in 16 seconds

javot commented Nov 11, 2018

Hi all! I had a similar issue: my ESP8266 doesn't respond after 5 minutes. I added the pingAlive code and the issue was resolved (at least my ESP8266 has been responding for 3 hours). I don't know how that code impacts my energy consumption.

(Well... I had to edit this post after 5 hours... IT DOESN'T WORK!! I wanted to log in to the webserver on the ESP8266 and it didn't respond! It is strange, because if I ping it, it responds, but when I try to connect on port 80 nothing happens.) How is that possible?

klaasdc commented Dec 23, 2018

For me, the issue was solved after #5210. My Wemo D1 stays reachable for many weeks now.

mateuszdrab commented Dec 25, 2018

> For me, the issue was solved after #5210. My Wemo D1 stays reachable for many weeks now.

So I just need to rebuild from source using the latest SDK? 2.4.0 or 2.5.0?

klaasdc commented Dec 25, 2018

> For me, the issue was solved after #5210. My Wemo D1 stays reachable for many weeks now.

> So just need to rebuild from source using latest SDK? 2.4.0 or 2.5.0?

Yes, just a rebuild. I used a git version from a few days after Oct 9, when devyte mentioned the merge. I suppose it is now in the 2.5 betas.

DirtyHairy commented Jan 17, 2019

I am in the same boat; my ESP stops responding to ARP requests as well, so I am basically losing connectivity after my ARP cache gets flushed. FWIW, I am using a Ubiquiti Unifi AP. For me the issue persists even with #5210. From what I observe and what I have read on this thread, my impression is that this is really a bug in lwip's ARP handling, which is triggered by some behaviour of the AP or by packets sent from some other devices on the same network.

As a workaround, I settled on sending gratuitous ARP broadcasts every 5 seconds with the following code (I am using the scheduler):

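```cpp
#include <lwip/netif.h>
#include <lwip/etharp.h>

// ... SNIP ...

// Scheduler task: announce each interface's IP with a gratuitous ARP every
// 5 seconds, so peers refresh their ARP caches even if the ESP misses
// their broadcast ARP requests.
void GratuitousARPTask::loop() {
    // netif_list is lwIP's linked list of network interfaces (STA and/or soft AP).
    for (netif *n = netif_list; n != nullptr; n = n->next) {
        // Only announce on interfaces that are up and use ARP (skips e.g. loopback).
        if (netif_is_up(n) && (n->flags & NETIF_FLAG_ETHARP)) {
            etharp_gratuitous(n);  // broadcast "<my IP> is at <my MAC>"
        }
    }

    delay(5000);  // wait 5 s before the next round of announcements
}

// ... SNIP ...
```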

I can see the ARP broadcast sent every five seconds in Wireshark, and it reliably restores connectivity after I flush my laptop's ARP table. While a direct response to ARP broadcasts would arguably be better, this is a viable workaround for me.
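The task above relies on the scheduler, so `delay(5000)` only holds up this task. In a plain `loop()` without a scheduler, a blocking delay would stall everything else, so the usual Arduino pattern is to compare elapsed `millis()` using unsigned subtraction. A hypothetical `IntervalTimer` sketch of that timing logic, written against a timestamp parameter so it runs off-target:

```cpp
#include <cstdint>

// Hypothetical non-blocking replacement for delay(intervalMs) inside loop().
// Feed it a free-running 32-bit millisecond counter (millis() on Arduino).
class IntervalTimer {
public:
    explicit IntervalTimer(std::uint32_t intervalMs) : interval_(intervalMs) {}

    // Returns true on the first call and then at most once per interval.
    // Unsigned subtraction keeps (now - last_) correct even after the counter
    // wraps past 0xFFFFFFFF (about every 49.7 days for millis()).
    bool due(std::uint32_t nowMs) {
        if (!started_ || static_cast<std::uint32_t>(nowMs - last_) >= interval_) {
            last_ = nowMs;
            started_ = true;
            return true;
        }
        return false;
    }

private:
    std::uint32_t interval_;
    std::uint32_t last_ = 0;
    bool started_ = false;
};
```

On the ESP this would be used as `if (arpTimer.due(millis())) { ... }`, with the `netif_list` loop from the code above inside the braces.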

Collaborator

d-a-v commented Jan 17, 2019

This solution is nice.

> From what I observe and what I have read on this thread, my impression is that this is really a bug in lwip's ARP handling

I'm not sure about that. lwIP has a wide audience.

Could you use netdump and check, once your ESP is not responding, whether you can see incoming ARP requests from your AP on the serial console?

DirtyHairy commented Jan 18, 2019

> I'm not sure about that. lwIP has a wide audience.

Mmmh, I guess you're right; I should have read a bit deeper into lwIP's background. I agree, it is unlikely that such a bug would have gone unnoticed.

> Could you use netdump and check, once your esp is not responding, if you can read incoming arp requests from your AP on the serial console?

That's a cool idea, I will do so this weekend; I am curious what I'll find. I did another test and sent ARP requests systematically with arping. It seems that, in my case, the ESP answers ARP requests only sporadically, even immediately after boot. For example, there's the initial gratuitous broadcast on boot, then a stretch of unanswered ARP requests, then five answered ones, then nothing again for 30 seconds or so, and so on.
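An arping run like this is easier to compare across placements, APs and firmware versions when reduced to a couple of numbers, such as the share of requests answered and the longest run of consecutive misses. A minimal sketch, assuming the per-request outcomes have already been extracted from arping's output as one bool per request (the parsing is left out, and the names are hypothetical):

```cpp
#include <cstddef>
#include <vector>

struct ProbeSummary {
    std::size_t sent = 0;
    std::size_t answered = 0;
    std::size_t longestSilence = 0;  // longest run of consecutive unanswered probes
};

// Summarises one entry per probe (true = a reply came back).
ProbeSummary summarise(const std::vector<bool>& replies) {
    ProbeSummary s;
    std::size_t run = 0;  // length of the current run of misses
    for (bool ok : replies) {
        ++s.sent;
        if (ok) {
            ++s.answered;
            run = 0;
        } else if (++run > s.longestSilence) {
            s.longestSilence = run;
        }
    }
    return s;
}
```

With arping sending one request per second (its usual default), `longestSilence` is roughly the longest stretch, in seconds, during which the ESP could not be resolved.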

DirtyHairy commented Jan 19, 2019

OK, I have done some experiments with netdump and arping. The result: I don't think this is a bug at all, but a reception issue. ARP requests that are not answered are not received by the device at all. However, I notice that the placement of the module and the wiring around it have a noticeable effect on its tendency to receive and answer ARPs. In particular, I can get a significant improvement in received packets just by touching the antenna trace on the PCB, and nearly all packets get answered if I move close to the AP.

I am not sure why ARP packets are so badly affected while IP seems to be fine, but it might just be the small packet length that causes the device to mistake ARP packets for noise. In addition, now that I am scrutinising connectivity more closely, I notice that ICMP ping times are pretty inconsistent where I usually sit, ranging from 10 ms to 200 ms, with an occasional dropped packet.

mtnbrit commented Jan 19, 2019

DirtyHairy commented Jan 19, 2019

> Have you tried a different AP? Not all APs are created equal. Try a totally different platform, not just a different model of the same manufacturer. For me I found Mikrotik worked but Ubnt didn’t, this was couple years ago though.

Thanks for the hint. As this seems (at least in my case) to be a reception issue, I would even expect different APs to lead to different reliability; a different AP will have different characteristics, transmit at a different power and differ in a myriad of other details.

However, switching APs is not really an option for me; I am quite happy with our UniFi, and I have it wall-mounted in our house. The workaround of broadcasting gratuitous ARPs at fixed intervals feels a bit clumsy, but it is totally sufficient for me, much more so than switching APs 😏

mateuszdrab commented Jan 20, 2019

I think you might be right about reception, guys. I have about 20 of those at home and only some of them have the ARP issue. I get by with the Python script, but I might implement the gratuitous ARP solution. With the ARP script I pretty much never have ping issues with the ESPs, but sometimes the ESPs struggle to reconnect for a long time, and recently one of my boards started disconnecting after a couple of hours on the network. I am going to test whether it's a location/placement issue by plugging it in nearer the AP. Switching APs is no solution for me either ;)

cziter15 commented Feb 5, 2019

I just ran into exactly the same problem. My ESP-12E drops its connection to the MQTT server, and "no reppy arp from x.x.x" messages appear.

My router is a D-Link DWR-921.

timmpo commented Feb 10, 2019

Hello all, time to tell my story! I have the same problem here.
I have 10 ESP8266s in my network. When they go down I cannot reach them from either my computer or my phone, but they still keep communicating with my Raspberry and the internet. If I wait, they all come up again after a few hours or a day.

I have tried 5 different routers with different results.
With my D-Link DIR-809 the connection problem occurs every day. I have tried the simple server from the examples with the same results.
Their (ESP) logs show that they never rebooted or dropped Wi-Fi.
When they go down they don't respond to ping or arping.
They are all only meters from the access point, with good reception.
My PC and Raspberry never have problems on the same WLAN, so I can't blame only the routers.
The problem seems to be worse when I have more ESPs on my network.

But with my old Thomson router and only 3 ESPs, over 6 months they stayed connected for weeks.
But if it is 100% router-related, why do the Raspberry and PC always stay connected?

timmpo commented Feb 17, 2019

I disabled Multicast Streams in my router, and now my ESPs have not stopped responding in days!
https://support1.bluesound.com/hc/en-us/articles/200639793-D-Link-Router-General-Setup
"The Local Multicasting is being blocked or over-prioritized by this outgoing, internet-based Multicasting."

cziter15 commented Feb 20, 2019

For me it looks like lwIP 1.4 is more stable than 2.0 (both Higher memory variants).
On lwIP 2 my MQTT connection times out a few times per hour, while on lwIP 1.4 it times out a few times per week. I don't know what causes this, though.

(On the 2.5.0 release.)

timmpo commented Feb 24, 2019

I wrote a bash script that pings all my devices, sending 1 packet to each device once per minute, and I get ~1 packet drop per hour in at least one of the devices (not only ESP8266s). That indicates my WLAN isn't foolproof, but before I changed the Multicast Streams setting I got many more packet drops over the WLAN.
A new problem occurred with that setting disabled: with lwIP 1.4 the ESPs can't send big web pages outside the LAN (Error: content_length_mismatch), so I had to switch the ESPs with big pages to lwIP v2, which for me means slower load times on pictures.
But finally my ESPs don't drop out anymore!

Contributor

TD-er commented Mar 1, 2019

I am now looking into the gratuitous ARP option suggested above.
What is a good interval for such an ARP packet? 5 seconds is suggested, but I was hoping someone had already found a more dynamic way of sending such a packet.
Is there some way to see how much traffic has been sent/received in the IP stack? (Also useful for other purposes.)
Is there a good rule of thumb for how often the ARP table in a switch is cleared? (Possibly also related to the number of nodes in the network and the ARP table size.)
