Network stack never recovers without a hard reset after receiving closely-spaced larger UDP packets #2899

Closed
ssilverman opened this issue Jun 17, 2019 · 25 comments
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@ssilverman

ssilverman commented Jun 17, 2019

Hardware:

Board: Adafruit ESP32 Feather
Core Installation version: Core v1.0.2 in Arduino 1.8.9, espressif32@1.8.0 and framework-arduinoespressif32@2.10002.190416 in PlatformIO
IDE name: Arduino v1.8.9 and PlatformIO v3.6.7 in VSCode v1.35.1
Flash Frequency: 80MHz
PSRAM enabled: Don't know
Upload Speed: 921600
Computer OS: Mac OSX v10.14.5

Description:

The network stack never recovers after receiving just a few tens of UDP packets larger than 1024 bytes. This detail got lost in the discussion of #2871, so I'm creating this new bug.

I see this problem almost every time I send somewhere between 20 and 80 larger UDP packets in non-softAP mode. In softAP mode I see it only rarely. Sometimes, if I repeatedly run the program, see the network stack crash, and then hard reset the device, the failure doesn't occur and the stack recovers from seeing a bunch of large packets. But only sometimes.

To reiterate: The network stack does not recover after receiving a bunch of large packets. Some people see that it recovers just fine if the packets stop, but I never see this. Yes, the device stops seeing packets if it's flooded with them, and yes, some people will see the network stack recover after a short period, but I never do. A hard reset is required.

To continuously send closely-spaced packets, this Bash script is useful:

while true; do echo -n $(printf '.%.0s' {1..1400}) > /dev/udp/192.168.1.9/8000; sleep 0.05; done

The 1400 is the UDP payload size; I find that sizes of 1025 and greater cause the problem for me. Also, change the IP address and sleep (in seconds) to play with different network loads.
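For anyone who prefers a parameterized sender over the Bash one-liner, here is a minimal sketch of the same idea using POSIX sockets in C++. The names `make_payload` and `send_burst` are made up for this illustration and are not part of any library mentioned in this issue; it assumes an IPv4 target and a POSIX host (Linux or macOS).

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <string>
#include <thread>

// Builds a payload of `size` '.' characters, like the printf in the Bash one-liner.
std::string make_payload(std::size_t size) { return std::string(size, '.'); }

// Sends `count` datagrams of `size` bytes to ip:port, pausing `interval_ms` after each.
// Returns the number of datagrams fully handed to the kernel, or -1 on a setup error.
// (Hypothetical helper for illustration only.)
int send_burst(const char* ip, std::uint16_t port, std::size_t size, int count,
               int interval_ms) {
  assert(count >= 0 && interval_ms >= 0);

  sockaddr_in dest{};
  dest.sin_family = AF_INET;
  dest.sin_port = htons(port);
  if (inet_pton(AF_INET, ip, &dest.sin_addr) != 1) return -1;  // not a valid IPv4 address

  const int fd = socket(AF_INET, SOCK_DGRAM, 0);
  if (fd < 0) return -1;

  const std::string payload = make_payload(size);
  int sent = 0;
  for (int i = 0; i < count; ++i) {
    const ssize_t n = sendto(fd, payload.data(), payload.size(), 0,
                             reinterpret_cast<const sockaddr*>(&dest), sizeof(dest));
    if (n == static_cast<ssize_t>(payload.size())) ++sent;
    std::this_thread::sleep_for(std::chrono::milliseconds(interval_ms));
  }
  close(fd);
  return sent;
}
```

Calling `send_burst("192.168.1.9", 8000, 1400, 80, 50)` would roughly mirror 80 iterations of the loop above; adjust the address, size, and interval to match your own setup.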

The crux is this: I see onPacket never called again once a bunch of large packets are received, even if no more packets are sent. I've tried on three different ESP32 Feathers and three different network setups for non-softAP mode.

The effect: The device becomes permanently unusable on the network, without a hard reset, if it sees lots of larger UDP packets in a row.

Sketch:

#include <AsyncUDP.h>
#include <Esp.h>
#include <WiFi.h>

constexpr char kAPName[] = "ChangeMe";
constexpr char kAPPassword[] = "ChangeMe";
constexpr bool isSoftAP = false;  // Change to true for SoftAP mode

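AsyncUDP udp;

// Forward declaration so the sketch also builds as a plain .cpp file (e.g. in PlatformIO),
// where the Arduino IDE's automatic prototype generation is not available.
void onPacket(AsyncUDPPacket &packet);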

void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  Serial.println("Starting.");

  if (isSoftAP) {
    Serial.println("Starting SoftAP...");
    if (WiFi.softAP(kAPName, kAPPassword)) {
      Serial.print("    IP: ");
      Serial.println(WiFi.softAPIP());
    } else {
      Serial.println("ERROR: Starting SoftAP!");
    }
  } else {
    if (WiFi.begin(kAPName, kAPPassword)) {
      while (!WiFi.isConnected()) {
        delay(500);
      }
      Serial.print("    IP: ");
      Serial.println(WiFi.localIP());
      Serial.print("    Subnet: ");
      Serial.println(WiFi.subnetMask());
      Serial.print("    Gateway: ");
      Serial.println(WiFi.gatewayIP());
    } else {
      Serial.println("    ERROR: Connecting to AP!");
    }
  }

  if (!udp.listen(8000)) {
    Serial.println("ERROR: Starting UDP server!");
  }
  udp.onPacket(onPacket);
}

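int counter = 0;

// Called by AsyncUDP for each received datagram; prints a running count and the payload size.
void onPacket(AsyncUDPPacket &packet) {
  // length() is unsigned, so print it with %u rather than %d.
  Serial.printf("%d: %u\n", ++counter, static_cast<unsigned>(packet.length()));
}

void loop() {
  // Print some status every 5s
  Serial.printf("Free heap: %u\n", static_cast<unsigned>(ESP.getFreeHeap()));
  delay(5000);
}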
@r1dd1ck

r1dd1ck commented Jun 17, 2019

As I already wrote in the other thread - this is most probably not solely UDP related, as I'm getting the same behavior even with TCP. And packet cadence is the only thing that seems to matter here.

The problem looks to be related to packet reception / buffering (memory leak?), as the free heap size drops a non-trivial amount when this permanent "choke" happens..

A few pointers:
• sending a "large" packet (larger than the network MTU size) causes the packet to be fragmented - i.e. the payload is sent in multiple smaller packets instead.
• your "non-softAP" mode is actually called CLIENT mode 😉
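To put numbers on the fragmentation pointer, here is a small sketch that counts how many IPv4 fragments one UDP datagram needs. It assumes a 20-byte IPv4 header without options and the 8-byte UDP header, and it looks only at IP-layer fragmentation; the actual MTU on the networks in this issue is not stated, so the 1500 in the examples is an assumption. `ipv4_fragment_count` is a name made up for this illustration.

```cpp
#include <cassert>
#include <cstddef>

// Number of IPv4 fragments needed to carry one UDP datagram of `udp_payload` bytes.
// Assumes a 20-byte IPv4 header (no options) plus the 8-byte UDP header.
std::size_t ipv4_fragment_count(std::size_t udp_payload, std::size_t mtu) {
  const std::size_t kIpHeader = 20;
  const std::size_t kUdpHeader = 8;
  assert(mtu >= kIpHeader + 8);  // room for the header and at least one 8-byte unit

  const std::size_t ip_payload = udp_payload + kUdpHeader;
  const std::size_t room = mtu - kIpHeader;        // data bytes one packet can carry
  if (ip_payload <= room) return 1;                // fits unfragmented

  // Every fragment except the last must carry a multiple of 8 data bytes;
  // the last fragment may use the full room.
  const std::size_t per_fragment = (room / 8) * 8;
  const std::size_t rest = ip_payload - room;      // bytes not covered by the last fragment
  return 1 + (rest + per_fragment - 1) / per_fragment;
}
```

Under these assumptions and a 1500-byte MTU, `ipv4_fragment_count(1400, 1500)` is 1 and the count only becomes 2 once the payload exceeds 1472 bytes, so whether the test packets are actually fragmented depends on the MTU of the path in question.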

Oh yea - and the issue should be best moved over to the ESP-IDF repo, as the source of the problem most probably lies there.

@ssilverman
Author

ssilverman commented Jun 17, 2019

Yes, you're right. I should move this to ESP-IDF.

Yeah, I noticed the fragmentation and surmised in #2871 that this might be the area where some memory corruption occurs. (Or, hrm... maybe I didn't, but I meant to.)

@ssilverman
Author

I just filed this bug: espressif/esp-idf#3646

@ssilverman
Author

@r1dd1ck Do you feel like weighing in to the esp-idf bug with things you've tried? I can't be the only one seeing these issues...
espressif/esp-idf#3646

@ssilverman
Author

So the commenter at espressif/esp-idf#3646 is suggesting that the iperf example handles large bandwidth just fine, and that it's possible the Arduino sdkconfig options need to be tweaked.

@r1dd1ck

r1dd1ck commented Jun 18, 2019

@ssilverman
Yes, for our application they certainly do.

But the point is that it should not be possible to crash the WiFi stack just because of some non-optimal buffer settings. They should fix it regardless of whether increasing the buffers "fixes" our issue or not.

@negativekelvin

negativekelvin commented Jun 19, 2019

@ssilverman Your test works fine after reverting this commit
e9389e3#diff-a012459a2fc7708c407bdc0d3d081a29

or by changing your test

https://gist.github.com/negativekelvin/877d8f285e61583956ee3d0f8c8072bf/revisions

@ssilverman
Author

ssilverman commented Jun 19, 2019

Changing my test to use the copy instead of the reference, per that gist change, results in this:

assertion "pbuf_free: p->ref > 0" failed: file "/Users/ficeto/Desktop/ESP32/ESP32/esp-idf-public/components/lwip/lwip/src/core/pbuf.c", line 765, function: pbuf_free
abort() was called at PC 0x400f0d8b on core 1

As well, looking at the source my PlatformIO install is using: AsyncUDP.cpp has code similar to before that commit, where the pbuf_free isn't inside an else.

@negativekelvin

Hmm well it is discussed here #2685

Your code runs forever for me with that change, latest Arduino as idf component

@ssilverman
Author

ssilverman commented Jun 19, 2019

I'm running the 1.0.2 core from the Arduino IDE. Which core and IDE are you using?

Before reading the following, I'd like you to try this: Use the latest Arduino 1.8.9 with the 1.0.2 ESP32 core. Or PlatformIO with the latest ESP32 core. Then change the onPacket function from my example to use a copy instead of a reference. Then you'll see this: assertion "pbuf_free: p->ref > 0" failed. Your setup is either very different or you're not using the same code here. I want you to see this fail because it feels like you don't believe me. (@r1dd1ck, if you happen to have tried this step, could you confirm you're seeing this error too?)

And I'm not sure what you mean by "latest Arduino as idf component". Are you saying you built your own core and are running it from PlatformIO/Arduino, or are you saying you built arduino-esp32 and are running it independently from somewhere else?

Help me out here so that we can figure this out. "It works for me" doesn't help solve the main issue that the network stack crashes with the publicly-released code. I've tried both changes. That pbuf_free line not inside the else leads to the network stack crashing. Changing onPacket to use the copy instead of the reference also leads to a crash: assertion "pbuf_free: p->ref > 0" failed. Neither of your suggestions works. Something here isn't obvious and we're clearly operating from different assumptions. Hence, "Which core are you using, and can you clarify 'latest Arduino as idf component'?"

More specifically, what are the specific steps you're doing that I'm not? You see it crashing, then you make a change, then you rebuild something (I'm not sure what), then you run it, and then it works. What are these steps? Knowing them will lead more constructively to solving this.

If it runs for you and not for me, clearly we're doing something different. I'd like to figure out what that is; this is the point of this bug report, to find out why this is failing on the publicly-released version and to find a fix. The bottom line is that the network stack completely crashes.

Having said all that, thank you for your time in helping track this down. :)

@negativekelvin

My setup is this
https://docs.espressif.com/projects/esp-idf/en/stable/get-started/
Plus this
https://github.com/espressif/arduino-esp32/blob/master/docs/esp-idf_component.md

As stated in #2685 there is a relationship between the commit and copy/reference. When I first tried your code, it was freezing because the pbufs never got freed. This is because my codebase included that commit. When I made the change, it ran forever with no problems. You may be experiencing some other issue based on your version of the codebase.
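The two failure modes described in this exchange (buffers that never get freed versus the "pbuf_free: p->ref > 0" assertion) are both symptoms of unbalanced reference counting. The following is a toy model only: `ToyBuf`, `toy_ref`, and `toy_free` are hypothetical names, this is not the AsyncUDP or lwIP code, and how the real handler and packet destructor split the frees is not shown in this thread.

```cpp
#include <cassert>

// Toy stand-in for a reference-counted packet buffer; NOT lwIP's struct pbuf.
struct ToyBuf {
  int ref = 1;            // a freshly allocated buffer starts with one owner
  bool released = false;  // set once the last owner lets go
};

// Takes an extra reference, as a copy that shares the buffer would need to.
void toy_ref(ToyBuf& b) {
  assert(b.ref > 0);  // cannot share a buffer that is already gone
  ++b.ref;
}

// Drops one reference. Returns false in the situation where lwIP would
// hit its "p->ref > 0" assertion instead.
bool toy_free(ToyBuf& b) {
  if (b.ref <= 0) return false;
  if (--b.ref == 0) b.released = true;
  return true;
}

// Nobody frees: the buffer is never released, so a fixed-size pool drains.
bool leak_scenario() {
  ToyBuf b;
  return b.released;  // stays false
}

// Two owners each free, but only one reference exists: the second free fails.
bool double_free_scenario() {
  ToyBuf b;
  const bool first = toy_free(b);
  const bool second = toy_free(b);
  return first && !second;
}

// The copy takes its own reference, so the two frees balance out.
bool balanced_scenario() {
  ToyBuf b;
  toy_ref(b);
  const bool both_ok = toy_free(b) && toy_free(b);
  return both_ok && b.released;
}
```

In this model, which scenario you land in depends on whether the number of frees matches the number of references taken. That would be consistent with the reference and copy variants of onPacket behaving differently on different versions of the codebase, but it is an illustration, not a diagnosis.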

@ssilverman
Author

Thank you for those getting started links.

I'm 100% certain that all the public release stuff (Arduino 1.8.9 plus 1.0.2 of the ESP32 core; latest PlatformIO plus latest ESP32 core downloaded via the IDE) experiences a network stack freeze. That's why I filed this bug. If it's since been fixed, I'll try to confirm with the latest arduino-esp32 codebase and report back here when I've tested.

@atanisoft
Collaborator

atanisoft commented Jun 19, 2019

@ssilverman if you want to use the latest arduino-esp32 in PIO you can use this in your platformio.ini file:
platform=https://github.com/platformio/platform-espressif32.git#feature/stage

and it will pull the tip of this repo.

@ssilverman
Author

ssilverman commented Jun 19, 2019

I can confirm that the staging version of arduino-esp32 from
platform=https://github.com/platformio/platform-espressif32.git#feature/stage still crashes the network stack when I send just a few larger packets.

Here's the bash script I'm using to send 1400-byte packets 20 times a second:

while true; do echo -n $(printf '.%.0s' {1..1400}) > /dev/udp/192.168.1.9/8000; sleep 0.05; done

Simply run this for a little bit of time (changing the IP address, of course, and running with the companion ESP32 program above), shut it down, and then the ESP32 is unresponsive to the network until a hard reset.
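For a sense of scale, the load this script generates is small. A quick sketch of the arithmetic, ignoring protocol headers and the time the shell loop itself takes (so the results are upper bounds); the function names are made up for this illustration:

```cpp
#include <cassert>

// Upper bound on datagrams per second for a given pause between sends.
constexpr long long packets_per_second(long long interval_ms) {
  return 1000 / interval_ms;
}

// Upper bound on UDP payload bits per second (headers not counted).
constexpr long long payload_bits_per_second(long long payload_bytes, long long interval_ms) {
  return payload_bytes * 8 * 1000 / interval_ms;
}

// The values from the Bash loop: 1400-byte payloads, sleep 0.05 (50 ms).
static_assert(packets_per_second(50) == 20, "20 packets per second");
static_assert(payload_bits_per_second(1400, 50) == 224000, "about 224 kbit/s");
```

So the failure shows up at roughly 224 kbit/s of payload, far below what the link itself can carry.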

I'm really confused. I tried on several different ESP32s and with several different network setups. This can't be just me. Is nobody else seeing this (well, hardly anybody)?

@r1dd1ck

r1dd1ck commented Jun 19, 2019

@ssilverman
espressif/esp-idf#3646 (comment)
Try that. It should "fix" your issues until someone cares to really fix it (if ever) 💩

@negativekelvin

Yeah it is definitely an sdkconfig problem. Use the sdkconfig from Arduino = freeze, use the default esp-idf settings = ok. I cannot figure out which setting is the problem though.

@ssilverman
Author

Thank you both!

@negativekelvin

negativekelvin commented Jun 20, 2019

I think it is this:
CONFIG_ESP32_WIFI_RX_BA_WIN=6
vs
CONFIG_ESP32_WIFI_RX_BA_WIN=16

CONFIG_ESP32_WIFI_RX_BA_WIN is not supposed to be higher than ESP32_WIFI_STATIC_RX_BUFFER_NUM which is set to 8

@atanisoft
Collaborator

ping @me-no-dev can you update sdkconfig to reset CONFIG_ESP32_WIFI_RX_BA_WIN to default from IDF for v1.0.3?

@r1dd1ck

r1dd1ck commented Jun 20, 2019

@negativekelvin
Well, the iperf example defaults use:
WIFI_STATIC_RX_BUFFER_NUM=16
WIFI_RX_BA_WIN=32

and it does not "freeze" ... so what gives? 🙄

Meanwhile, after some excessive stress testing, I can confirm that compiling with the ESP-IDF defaults really does "fix" the freezing WiFi stack issue 👍

The diff from current arduino-esp32 sdkconfig (arduino-esp32 value -> ESP-IDF default):
WIFI_STATIC_RX_BUFFER_NUM = 8 -> 10
WIFI_DYNAMIC_RX_BUFFER_NUM = 10 -> 32
WIFI_RX_BA_WIN = 16 -> 6
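The rule quoted earlier in this thread (RX_BA_WIN should not be higher than STATIC_RX_BUFFER_NUM) can be checked against the three sets of values mentioned here. This is only a sketch of that one rule as stated by the commenter; `WifiRxConfig` and `ba_win_within_static_buffers` are names made up for this illustration, and the values are the ones quoted in this issue.

```cpp
#include <cassert>

// The two sdkconfig values the quoted rule relates to each other.
struct WifiRxConfig {
  int static_rx_buffer_num;  // ..._WIFI_STATIC_RX_BUFFER_NUM
  int rx_ba_win;             // ..._WIFI_RX_BA_WIN
};

// Rule as stated in this thread: the BA window should not exceed the static RX buffer count.
constexpr bool ba_win_within_static_buffers(const WifiRxConfig& c) {
  return c.rx_ba_win <= c.static_rx_buffer_num;
}

// Values as quoted in this issue.
constexpr WifiRxConfig kArduinoSdkconfig{8, 16};  // freezes
constexpr WifiRxConfig kIdfDefaults{10, 6};       // reported to work
constexpr WifiRxConfig kIperfExample{16, 32};     // reported not to freeze

static_assert(!ba_win_within_static_buffers(kArduinoSdkconfig), "arduino-esp32 values break the rule");
static_assert(ba_win_within_static_buffers(kIdfDefaults), "ESP-IDF defaults satisfy the rule");
```

Note that `kIperfExample` also breaks the rule yet is reported not to freeze, which is exactly the open question above: the rule alone may not fully explain the behavior.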

@collin80
Contributor

I was also having this basic problem. In my case I was sending a lot of data over TCP on port 23 (pretending to be telnet) and it would lock things up if I ramped the data rate up too high. I could see a WINDOW error in wireshark saying that it exhausted the TCP window. With the fix from this issue it no longer does that. So, it does appear to be a reasonable fix. I would also like to see it merged.

@atanisoft
Collaborator

Could this also be part of the unexplained lockup in ESPAsyncWebServer?

@Adrianotiger

Adrianotiger commented Jul 31, 2019

Wroom32 - Arduino IDE - 1.03-rc1:
I have the same problem with the NtpClientLib.h library to get the time over the web.
This bug was not present in the 1.02-rc2 version.

Found a temporary fix:
gmag11/NtpClient#99

@stale

stale bot commented Sep 29, 2019

[STALE_SET] This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Sep 29, 2019
@stale

stale bot commented Oct 13, 2019

[STALE_DEL] This stale issue has been automatically closed. Thank you for your contributions.

@stale stale bot closed this as completed Oct 13, 2019