Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Calling WiFiUDP::parsePacket() too often eventually causes it to return zero always #2871

Closed
ssilverman opened this issue Jun 7, 2019 · 20 comments
Labels
Status: Stale Issue is stale stage (outdated/stuck)

Comments

@ssilverman
Copy link

ssilverman commented Jun 7, 2019

Hardware:

Board: Adafruit ESP32 Feather
IDE: PlatformIO + VSCode (framework-arduinoespressif32 v2.10002.190416)
IDE: Arduino IDE 1.8.9 (ESP32 v1.0.2)

Description:

I’m discovering that calling parsePacket() too often when receiving UDP packets from an external source causes it to eventually return zero always. Note that in the code below, the free heap is always printed every 5 seconds, it's just the parsePacket() that fails to return non-zero after it receives a bunch of packets.

You can test this code by sending a bunch of packets to port 8000, say with some random TouchOSC UI on a mobile device, and dragging one of the sliders so that packets are being sent often.

Sketch:

#include <Esp.h>
#include <WiFi.h>

WiFiUDP udp;

constexpr char kAPName[] = "esp32test";
constexpr char kAPPassword[] = "password";

void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  Serial.println("Starting.");

  Serial.println("Starting SoftAP...");
  if (WiFi.softAP(kAPName, kAPPassword)) {
    Serial.print("    IP     : ");
    Serial.println(WiFi.softAPIP());
  } else {
    Serial.println("ERROR: Starting SoftAP!");
  }

  if (!udp.begin(8000)) {
    Serial.println("Error starting UDP server.");
  }
}

int counter = 0;

void loop() {
  int size = udp.parsePacket();
  if (size > 0) {
    Serial.printf("%d: %d\n", ++counter, size);
    udp.flush();
  }

  // Print some status every 5s
  static unsigned long lastT = 0;
  unsigned long t = millis();
  if (t - lastT >= 5000) {
    Serial.printf("Free heap: %d\n", ESP.getFreeHeap());
    lastT = t;
  }
}

Debug Messages:

After system initialization, there are no debug messages, even at the Verbose level, except for this one after a while, and after UDP receive has already stopped returning non-zero from parsePacket():

[D][WiFiGeneric.cpp:336] _eventCallback(): Event: 16 - AP_STADISCONNECTED
@ssilverman
Copy link
Author

ssilverman commented Jun 7, 2019

I'll add: Sometimes it doesn't fail, but it fails most of the time for me. Adding delay(1) above parsePacket() helps, but doesn't guarantee there's no future failure, it just may take longer. delay(10) helps even more, but again, it just takes longer to fail, and going much higher decreases the application's responsiveness.

I'm stuck because the purpose of parsePacket() is just to check if there's a packet waiting to be read, and choosing some delay isn't very rigorous, especially since the choice of that delay seems arbitrary and would affect the responsiveness.

@lbernstone
Copy link
Contributor

Trying to collect UDP packets in the main loop is eventually going to have you losing packets or starving the rest of your processing. AsyncUDP will execute a callback when a message is received, and run in a separate task.

@ssilverman
Copy link
Author

I switched over to AsyncUDP and it works more of the time. Specifically, it stops receiving at a much later time when using a softAP, but it still stops receiving after about 30-60 packets received when I'm joining a network and not using the softAP.

It helps, but just by a little bit, especially for the non-softAP case. Any other UDP "gotchas" I need to watch out for? Thank you for your help.

@ssilverman
Copy link
Author

For posterity, here is my test code for the non-softAP case. Change isSoftAP to false and fill in the network parameters just above that.

#include <AsyncUDP.h>
#include <Esp.h>
#include <WiFi.h>

AsyncUDP udp;

constexpr char kAPName[] = "esp32test";
constexpr char kAPPassword[] = "password";
constexpr bool isSoftAP = true;

void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  Serial.println("Starting.");

  if (isSoftAP) {
    Serial.println("Starting SoftAP...");
    if (WiFi.softAP(kAPName, kAPPassword)) {
      Serial.print("    IP     : ");
      Serial.println(WiFi.softAPIP());
    } else {
      Serial.println("ERROR: Starting SoftAP!");
    }
  } else {
    if (WiFi.begin(kAPName, kAPPassword)) {
      while (!WiFi.isConnected()) {
        delay(500);
      }
      Serial.print("    IP: ");
      Serial.println(WiFi.localIP());
      Serial.print("    Subnet: ");
      Serial.println(WiFi.subnetMask());
      Serial.print("    Gateway: ");
      Serial.println(WiFi.gatewayIP());
    } else {
      Serial.println("    ERROR: Connecting to AP!");
    }
  }

  if (!udp.listen(8000)) {
    Serial.println("Error starting UDP server.");
  }
  udp.onPacket(onPacket);
}

int counter = 0;

void onPacket(AsyncUDPPacket &packet) {
  Serial.printf("%d: %d\n", ++counter, packet.length());
}

void loop() {
  // Print some status every 5s
  static unsigned long lastT = 0;
  unsigned long t = millis();
  if (t - lastT >= 5000) {
    Serial.printf("Free heap: %d\n", ESP.getFreeHeap());
    lastT = t;
  } 
}

@ssilverman
Copy link
Author

ssilverman commented Jun 7, 2019

Some more information: The softAP case seems more reliable now if, after uploading my code to the board, I press the hardware reset button at least once. Then receive appears to work without stopping after about 30-60 packets received.

I tried this as well for the non-softAP case, and now it appears like one needs to press the reset button after all firmware reloads for Wi-Fi to work reliably. I wonder why?

Any thoughts on what settings to use or calls to make to avoid this, especially if the board is not very physically accessible?

@lbernstone
Copy link
Contributor

Serial is slow. Change the output in your callback to log_i. Put a delay in your loop so it doesn't steal so many cycles- I used delay(500). I was able to send thousands of packets with a 50ms delay from bash. With no delay, after about 1200 packets it choked (on the vsnprintf).

@ssilverman
Copy link
Author

Curious, would calling yield() instead of delay(x) be just as effective?

@lbernstone
Copy link
Contributor

lbernstone commented Jun 8, 2019

Yield is going to give up control, but it will come right back and process the loop again. That is essentially the same thing as letting the loop function complete. See main.cpp
Don't waste cycles if you don't need to. I often delete the loop task and use timers and tasks to get more fine-grained control.

@ssilverman
Copy link
Author

Thanks for the suggestions. The Arduino execution model isn’t the best for robust systems. Time to write this code properly and abandon loop().

@ssilverman
Copy link
Author

ssilverman commented Jun 9, 2019

I'm sad to report that even with AsyncUDP, UDP reception still eventually fails (but not all the time, and with station mode) after a small number of packets. :(

@ssilverman
Copy link
Author

ssilverman commented Jun 11, 2019

Let me add some simplified example code that uses a delay instead of time checking. Maybe the firmware has trouble with a busy access point? I'm using a random TouchOSC UI to send packets. The device stops receiving packets after about 30-60 of them.

#include <AsyncUDP.h>
#include <Esp.h>
#include <WiFi.h>

AsyncUDP udp;

constexpr char kAPName[] = "ChangeMe";
constexpr char kAPPassword[] = "ChangeMe";

void setup() {
  Serial.begin(115200);
  while (!Serial && millis() < 4000) {
    // Wait for Serial
  }
  Serial.println("Starting.");

  if (WiFi.begin(kAPName, kAPPassword)) {
    while (!WiFi.isConnected()) {
      delay(500);
    }
    Serial.print("    IP: ");
    Serial.println(WiFi.localIP());
    Serial.print("    Subnet: ");
    Serial.println(WiFi.subnetMask());
    Serial.print("    Gateway: ");
    Serial.println(WiFi.gatewayIP());
  } else {
    Serial.println("    ERROR: Connecting to AP!");
  }

  if (!udp.listen(8000)) {
    Serial.println("Error starting UDP server.");
  }
  udp.onPacket(onPacket);
}

int counter = 0;

void onPacket(AsyncUDPPacket &packet) {
  Serial.printf("%d: %d\n", ++counter, packet.length());
}

void loop() {
  // Print some status every 5s
  Serial.printf("Free heap: %d\n", ESP.getFreeHeap());
  delay(5000);
}

@lbernstone
Copy link
Contributor

lbernstone commented Jun 11, 2019

I can't troubleshoot your network for you. I can tell you that with your exact code (other than the creds), I can receive thousands of packets using the following generator script on linux: while true; do echo $c > /dev/udp/192.168.1.125/8000; c=$((c+1)); sleep 0.05; done

@ssilverman
Copy link
Author

ssilverman commented Jun 12, 2019

I’ve now tried this on several networks and in all cases, the ESP32 stops being able to receive without a reset, after about 30-60 packets. While it may be true that it works on your network connections, I know unequivocally that it doesn’t work on at least some network connections.

At the very least, I think that it might be useful to figure out under which network conditions this fails, and under which it works. At least then we could say, “don’t use with network configuration type X.”

Or maybe it’s certain versions of the hardware or core firmware? There’s definitely an issue here, even if it’s just documenting under what conditions failure happens. I’m also noticing that failure sometimes happens even with SoftAP mode.

I’m using Feather ESP32’s (I’ve tried two) with the latest core from PlatformIO as of today.

I was reading recently that the Things from SparkFun have some issue with networks and that there was potentially a hardware flaw due to the choice of crystal frequency. That’s why I’m proposing that there’s a chance it’s the hardware. Grasping at straws here.

Thanks for testing this with me.

@lbernstone
Copy link
Contributor

lbernstone commented Jun 12, 2019

Here's some sequencing code to go along the with the bash script from earlier. I suggest you test with a known good source like this. Good luck.

long counter = 0;

void onPacket(AsyncUDPPacket &packet) {
  long c = strtol((const char*)packet.data(), NULL, 10);
  if (c > counter) {
    Serial.printf("missed packet: %lu\n", counter);
  }
  if (c < counter) {
    Serial.printf("slow packet: %lu\n", c);
  }
  counter = c+1;
}

@ssilverman
Copy link
Author

ssilverman commented Jun 13, 2019

I have more information. Your script generating just 6 characters per packet works just fine, so then I realized that it was TouchOSC generating many events at once, probably generating larger packets, that might be the cause. Sure enough, whenever I threw that into the mix, UDP reception would just stop.

So then I tried various packet sizes until a failure happened, and it turns out that packet sizes larger than 1024 generate multiple packets (that is, at least on my network; I don't know where the greater-than-1024-byte split is happening) and also halts UDP reception. For example, 1400 produces one 1024-byte packet followed by a 376-byte packet. When the device receives a few of these, the network stack appears to halt. Here's my modified script that shows the failure (try numbers larger than 1400, depending on your network).

while true; do echo -n $(printf '.%.0s' {1..1400}) > /dev/udp/192.168.1.9/8000; sleep 0.05; done

I'm using the following in onPacket:

Serial.printf("%d: %d\n", ++counter, packet.length());

Sometimes the failure happens after 80 packets, and sometimes after just 20. Sometimes failure occurs without ever receiving a packet, and sometimes it never fails. Different behaviours happen every time I press the reset button.

Could this be related to packet reassembly/splitting or just simply larger packets or something?

Update: If I leave the script running and don't press the reset button after a failure, I see this in the console: -bash: echo: write error: Host is down. A clue perhaps?

Update 2: Changing the inter-packet sleep time helps; I can't get it to fail at 0.2, but it fails at 0.1. Looks like there's some trouble if it's being bombarded by UDP packets... I suppose this might be expected, but I don't expect the network stack to stop functioning. Is there anything I can do to help mitigate this on the device?

@lbernstone
Copy link
Contributor

You are just running out of processing cycles. I can reproduce your choking issue at that rate. However, if I then stop the tx, and back off for a few seconds, the system will recover. AFAIK, this would be expected behavior for UDP- there is no guarantee on a datagram, so packets should always be sequenced and sized.

@ssilverman
Copy link
Author

ssilverman commented Jun 13, 2019

I think the main difference we're seeing between your setup and mine is that your device eventually starts calling onPacket again and mine does not. When the choke happens for me, I shut down the shell script sending the packets, wait at least a few tens of seconds, and then restart the script with a small packet size and a larger inter-packet delay. onPacket is never called again unless I do a hard reset on my device.

To be clear, my issue isn't that UDP reception on my device chokes, it's that the network stack never recovers without a hard reset.

I'm curious, which core version are you using? I'm using the latest stable 1.0.2 inside the Arduino IDE from https://dl.espressif.com/dl/package_esp32_index.json. I've also tested with espressif32@1.8.0 and framework-arduinoespressif32@2.10002.190416 inside PlatformIO.

@r1dd1ck
Copy link

r1dd1ck commented Jun 17, 2019

I am getting the "choke" problem even with TCP, so I believe this issue may not be specific to UDP nor the onPacket callback.

Earlier I was able to receive 100+ packets/sec without any stability issues, and now I can hardly push 30/sec without the WiFi stack totally choking up at some point 😕

Packet size does not seem to matter much (until you exceed the network MTU size & the packets get fragmented, of course), and whether the packets are handled, or just dropped without any processing doesn't seem to make a difference either.

I can run same code on an ESP8266 without making it break a sweat.. 😐

I guess this must be related to some "optimization" changes in the WiFi stack @ IDF, because I don't remember having to deal with such problems until I took the risk & upgraded to the latest core recently (after enjoying a 9 months long "if-it-aint-broke-dont-fix-it" grace period) 😫

Well, shit.. 💩 I hope it gets fixed soon.

@stale
Copy link

stale bot commented Aug 16, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the Status: Stale Issue is stale stage (outdated/stuck) label Aug 16, 2019
@stale
Copy link

stale bot commented Aug 30, 2019

This stale issue has been automatically closed. Thank you for your contributions.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Stale Issue is stale stage (outdated/stuck)
Projects
None yet
Development

No branches or pull requests

3 participants