DPP Auth Failed on dpp-enrollee example (IDFGH-9228) #10615
Comments
Update on this: Hmm. |
I'm having the same issue (dpp-enrollee yields ESP_ERR_DPP_TX_FAILURE). For me it is strangely persistent but unpredictable. That is, I've had it work correctly in the past, but somehow once a given device gets into this state it will yield this error over and over no matter what I do, even if I reset the flash. I believe @knight-ryu12 is describing something similar where a seemingly unrelated change "fixed it". Looks like others have also experienced this: https://esp32.com/viewtopic.php?t=28573. Needless to say this unpredictability makes it impossible to actually ship something that uses DPP, so we need to find some predictable way of detecting and working around this condition. For reference, my ESP32-C3-DevKit-M1U is working, my ESP32-C3-DevKit-M1 isn't. Both are using v4.4.4. |
Hi @jasta , we are trying to reproduce it locally. Are you using the example? Please share your sdkconfig as well as IDF version commit. |
Yes, I'm able to reproduce consistently with the unmodified example on the current v4.4.4 tag of esp-idf. I am using the default settings (channel 6, no provided key/device info). Here's the sdkconfig: https://gist.github.com/42ef0f07990ca812bba8b541685ef798 |
The bug really smells like a race condition IMO. I had it all working perfectly and changed nothing of significance about my app and it just started giving me ESP_ERR_DPP_TX_FAILURE over and over. Then no amount of reverting my code could fix it, including reverting all the way back to the unmodified example which is where I started. It of course used to work with the sample and even my considerably more robust full app. I suspect that the "condition" that changes to cause it to become persistently broken is literally in the air -- something about my Wi-Fi setup must be able to consistently reproduce an "unlikely" race condition outcome. No hardware has been deliberately modified or replaced since it was once working, so the only possibilities in my view are environmental (wireless signals themselves changing) or through automated software updates of either my router (Google WiFi infrastructure) or my phone (Android Pixel 6) |
Well that's interesting, even though I get ESP_ERR_DPP_TX_FAILURE pretty much every time, I now just got ESP_ERR_DPP_INVALID_ATTR. Adds quite a bit of evidence to my theory that this is a race :) I swear it is seeming like the difference between the two errors is whether I have the esp32 on my desk (DPP_INVALID_ATTR) or in my component drawer (DPP_TX_FAILURE). Bizarre :) |
As an aside, I am the one working on adding DPP support to Rust and ran into this issue maturing the implementation even though I didn't change anything functionally interesting with respect to esp_* calls: esp-rs/esp-idf-svc#228 . |
I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results) |
Please black out wifi cred from the logs and share the rest. |
I can't actually tell what is and isn't sensitive about this, but I'm interested enough in getting this solved that I'll risk it hehe: https://gist.github.com/09fe320e7b549967b37088170e59c5cb. Here is the updated sdkconfig after I enabled logging: https://gist.github.com/8ab02d9c9ca064861e9e1cdf22261545 I confirmed again this morning it still repros. I seemingly have a 100% reliable repro (it's happened at least the last 20 or 30 times I've tried) on a devkit module that once worked just fine. One maybe important detail though is that in the example I'm unable to scan the QR code in the console (my phone never recognizes it as a QR code), so I'm copying the QR text into qtqr and generating one there. I can confirm that if I fudge the QR text I get a different error ("No matching own bootstrapping key found as responder - ignoring message"), and then it yields ESP_ERR_DPP_INVALID_ATTR (different than the INVALID_ATTR case I got when I was randomly moving the device around physically). |
I don't know for sure, but it hard-crashes the Wi-Fi event handler afterwards. I cannot receive any events after DPP has failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my conditions being bad...) |
I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference. |
From my testing today, I've found that if the channel the esp is currently listening on happens to match whichever channel my phone was connected to the AP on, the chance of So from what I can tell, the issue comes down to purely a channel mismatch somehow, with the esp not wanting to send on a different channel than the one it's listening on (just a wild guess on that last part though). And it seems that there's a fix that works perfectly (from my testing at least): all it takes is enabling Multi Band Support in your sdkconfig.
Instead of only doing This fix seems to work well enough that I can even have the esp listening only on channel 10, my phone can be sitting on channel 149, and I can get the esp to connect to a network on channel 1 with no issue |
Afaik it doesn't actually crash anything, it just sets From my testing, the wifi driver and event loop are all still fully functioning even when that error happens, it's just that there are no events actually going on to show up in the log but if you set Also I doubt lwip actually has anything to do with the issue since I don't remember any mention of lwip in esp_dpp.c, and iirc lwip is a TCP/IP stack, and during DPP configuration it wouldn't make sense for there to be a TCP/IP stack involved, considering only a limited number of raw frames are sent between the two devices anyway. |
Ahh good catch, however what I'm seeing is that @kapilkedawat, so at the very least we've identified one clear bug that needs fixing: @gayafhannah, I'm going to go try your MBO workaround now and report back, awesome sleuthing BTW! |
@jasta I've also just noticed that since i'm not testing on the master branch and am instead on, in the master branch it seems to be called |
@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still. Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf. |
I know that without MBO support, I was able to get it to avoid the error more consistently (although not 100%) if the only channel DPP was configured to listen on was the channel of the wifi network, with my phone also connected to that network and therefore on that same channel, which is what led me to MBO support when snooping around menuconfig. |
Disabling MBO and setting channels to 1,6 worked once, but then I rebooted the device and tried one more time and it failed. Then rebooted and tried again several times until it finally worked (which I can only do without rebooting because of my patch in I've attached an abridged log comparing the failed attempt vs the successful attempt (no reboot occurred, I just tried again on the phone): https://gist.github.com/ca10e95e786b79fe847c3a13b297e732 The interesting bit from the logs is that seemingly nothing is different about the failure case and successful one. Both using Channel-6, exchanging seemingly the same data with the same timings, etc. |
Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp |
Looks good, you might want to also include my fix to esp_supp_dpp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).
Planning to retest all this on v5.0.1 but was sticking to v4.4.4 because that's where it once worked then stopped so I figured we could learn the most about what exact environmental difference actually matters. |
I'll have to give your fix a shot in the morning. Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since, like I said in earlier comments, setting the listen channel to be the same as the AP and phone channel gives maybe an 80-90% success rate at least, instead of the not even 1% chance if it wasn't the same. A fun way to test it was to put all channels 1 through 13 into the channel listen list that you pass into one of the functions. I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing of pressing the button on my phone just right to land on channel 1 on the esp (channel 1 also being what the AP and my phone were on) then it'd almost always connect fine, but if I missed that channel (difficult to get right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time. Also, if I remember correctly, when setting the channel to be near the AP's channel with an AP channel width of 40MHz, I noticed that even being +-2 channels from the centre channel had moderate success rates. Definitely not the same as setting the listen channel directly on top of the centre channel though. Also 40MHz-wide ones were slightly less successful on the centre channel than 20MHz-wide ones (probably because 20MHz is literally just 1 channel wide). |
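For anyone wanting to repeat the "listen on every channel" experiment described above, here is a minimal sketch of building such a list. It assumes the listen list is a comma-separated string of channel numbers (the form the example's default of "6" and the "1,6" setting mentioned earlier suggest); `build_channel_list` is a hypothetical helper, not part of ESP-IDF.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write "first,first+1,...,last" into buf.
 * Returns the number of characters written (excluding the NUL),
 * or -1 if buf is too small to hold the whole list. */
static int build_channel_list(char *buf, size_t len, int first, int last)
{
    assert(buf != NULL && len > 0);
    size_t used = 0;
    buf[0] = '\0';
    for (int ch = first; ch <= last; ch++) {
        /* Prefix every entry except the first with a comma. */
        int n = snprintf(buf + used, len - used, "%s%d",
                         (ch == first) ? "" : ",", ch);
        if (n < 0 || (size_t)n >= len - used) {
            return -1; /* output would have been truncated */
        }
        used += (size_t)n;
    }
    return (int)used;
}
```

Calling `build_channel_list(buf, sizeof(buf), 1, 13)` fills `buf` with the channels 1 through 13; the resulting string would then be handed to whichever call takes the listen channel list in your IDF version.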
I'm not observing any strange side effects from the fix FWIW. And more importantly I'm noticing that my phone will actually automatically retry now with this setting on, indicating that the DPP standard actually expects some intermittent failures (at least due to RF interference would be my guess, but maybe even this channel weirdness). That suggests to me that the lack of working retry is a big part of the root cause here. Imagine what would happen to our global TCP infrastructure if you disabled packet retransmits, for example...
Definitely plausible and consistent with my experience. If I just sit there and hit Retry on my Android phone over and over with my fix it will eventually work. Not ideal UX by any stretch, though, and I'd be quite surprised if we (or maybe just Espressif) couldn't come up with a more reliable way to fix this.
I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel. |
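To check which channels a given QR code actually advertises, something like the sketch below can pull them out of the QR text. It assumes the text follows the usual DPP bootstrapping URI shape, `DPP:C:<class>/<channel>,...;K:<key>;;`, with fields separated by `;`; `dpp_uri_channels` is a hypothetical helper and the sample URIs used with it are made up, not taken from the logs in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: extract channel numbers from the "C:" field of a
 * DPP bootstrapping URI. The operating class before each '/' is skipped.
 * Stores at most max channels in out and returns how many were stored,
 * or -1 if the URI is malformed or has no "C:" field. */
static int dpp_uri_channels(const char *uri, int *out, size_t max)
{
    assert(uri != NULL && out != NULL);
    if (strncmp(uri, "DPP:", 4) != 0) {
        return -1;
    }
    const char *p = uri + 4;
    /* Walk the ';'-separated fields until the empty field that ends the URI. */
    while (*p != '\0' && *p != ';') {
        const char *end = strchr(p, ';');
        if (end == NULL) {
            return -1;
        }
        if (p[0] == 'C' && p[1] == ':') {
            int count = 0;
            p += 2;
            while (p < end && (size_t)count < max) {
                char *next;
                (void)strtol(p, &next, 10);      /* operating class, ignored */
                if (next == p || *next != '/') {
                    return -1;
                }
                p = next + 1;
                long ch = strtol(p, &next, 10);  /* channel number */
                if (next == p) {
                    return -1;
                }
                out[count++] = (int)ch;
                p = next;
                if (*p == ',') {
                    p++;
                } else if (p != end) {
                    return -1;
                }
            }
            return count;
        }
        p = end + 1;
    }
    return -1;
}
```

If the list printed from the QR text matches the channel the esp32 is listening on, that supports the point above that the configurator already knows where to send.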
I noticed the channels being listed in the QR code as well, and your exact reasons are why I'm hesitant to say my guess is actually correct. It's just that the pattern being there made it really seem like it has to have something to do with the channel being wrong. Especially because the error afaik is caused by the esp failing to send the packet "offchannel" and not by a failure to receive any packets, as even with the error happening, it still successfully receives the first packet/frame from the phone just fine. |
@kapilkedawat The plot thickens trying to work around these bugs. It looks like if you try to work around this issue by calling esp_supp_dpp_deinit/init again and just starting all over, you will quickly run into a race condition in esp_dpp.c in which this assert fails: https://gist.github.com/a067c88e511dfff52e7f704d469e157f I am using unmodified ESP-IDF 4.4.4 to cause this behaviour. The reason appears to be because of an inherent race condition in which esp_supp_dpp_init makes no effort to check if the previous s_dpp_evt_queue and associated task is drained and cleaned up before proceeding to start up esp_dpp_task again. Illustrated as such:
It seems to me like the fix would be to have a condition variable and associated state that deinit sets true, SIG_DPP_DEL_TASK sets false, and init checks for / waits until it is false. A workaround that seems reliable enough is to sleep for 1s between deinit/init though of course this isn't really a fix. This also suggests to me the proposed fix of clearing s_dpp_stop_listening in esp_supp_dpp_start_listen is even more important. EDIT: Filed a separate issue for this, #10879 |
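The handshake proposed above can be sketched on a host machine as follows. This is only a model under stated assumptions: pthreads stand in for the FreeRTOS task and event queue, `s_delete_requested` stands in for posting SIG_DPP_DEL_TASK, and every `model_*` name is hypothetical rather than taken from esp_dpp.c.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t s_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  s_cond = PTHREAD_COND_INITIALIZER;
static bool s_teardown_pending = false; /* set by deinit, cleared by the task */
static bool s_delete_requested = false; /* models posting SIG_DPP_DEL_TASK */
static int  s_live_tasks = 0;           /* tasks created and not yet cleaned up */
static int  s_max_live_tasks = 0;       /* high-water mark; should stay at 1 */

/* Models the DPP task: waits for the delete request, cleans up, then
 * announces that teardown has finished. */
static void *model_task(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&s_lock);
    while (!s_delete_requested) {
        pthread_cond_wait(&s_cond, &s_lock);
    }
    s_delete_requested = false;
    s_live_tasks--;
    s_teardown_pending = false;
    pthread_cond_broadcast(&s_cond);
    pthread_mutex_unlock(&s_lock);
    return NULL;
}

/* Models init: blocks until any previous teardown has completed before
 * starting a new task. Returns 0 on success, -1 on failure. */
static int model_init(void)
{
    pthread_t t;
    pthread_mutex_lock(&s_lock);
    while (s_teardown_pending) {
        pthread_cond_wait(&s_cond, &s_lock);
    }
    s_live_tasks++;
    if (s_live_tasks > s_max_live_tasks) {
        s_max_live_tasks = s_live_tasks;
    }
    pthread_mutex_unlock(&s_lock);
    if (pthread_create(&t, NULL, model_task, NULL) != 0) {
        pthread_mutex_lock(&s_lock);
        s_live_tasks--;
        pthread_mutex_unlock(&s_lock);
        return -1;
    }
    pthread_detach(t);
    return 0;
}

/* Models deinit: marks teardown as pending and asks the task to exit,
 * without waiting for it. Does nothing if no task is live. */
static void model_deinit(void)
{
    pthread_mutex_lock(&s_lock);
    if (s_live_tasks > 0) {
        s_teardown_pending = true;
        s_delete_requested = true;
        pthread_cond_broadcast(&s_cond);
    }
    pthread_mutex_unlock(&s_lock);
}
```

With the wait in `model_init`, back-to-back deinit/init calls never overlap two tasks, which is the property the 1s sleep only approximates. Removing that wait loop lets `s_max_live_tasks` reach 2 under the same call pattern.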
This fixes a subtle bug in which ESP_ERR_DPP_TX_FAILURE errors would call esp_supp_dpp_stop_listen which sets the s_dpp_stop_listening flag to true. Subsequent attempts to restart listening with esp_supp_dpp_start_listen then only attempt to listen once more for 500ms before reading the s_dpp_stop_listening flag again and giving up. This contributes greatly to espressif#10615, but the fix here is still largely a work-around as it sometimes requires manually retrying a couple times before it works. Without this fix, any number of retries by deinit/init again will seemingly not work, as the retries fail for currently unknown reasons.
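The flag behaviour described in this commit message can be illustrated with a small stand-alone model. It is a deliberate simplification and not the code in esp_dpp.c: `listen_model_t` and the functions below are hypothetical, and each "window" represents one 500ms listen period.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct {
    bool stop_listening;   /* models s_dpp_stop_listening */
    int  windows_listened; /* number of listen windows started */
} listen_model_t;

/* Models the stop call made after ESP_ERR_DPP_TX_FAILURE. */
static void stop_listen(listen_model_t *m)
{
    assert(m != NULL);
    m->stop_listening = true;
}

/* Models start_listen: always opens one listen window. With reset_flag
 * set (the proposed fix) it also clears the stale stop flag first. */
static void start_listen(listen_model_t *m, bool reset_flag)
{
    assert(m != NULL);
    if (reset_flag) {
        m->stop_listening = false;
    }
    m->windows_listened++;
}

/* A listen window expired: open another one unless stop was requested.
 * Returns true if a new window was started. */
static bool on_listen_timeout(listen_model_t *m)
{
    assert(m != NULL);
    if (m->stop_listening) {
        return false; /* give up */
    }
    m->windows_listened++;
    return true;
}

/* Start listening and process up to max_timeouts window expiries.
 * Returns how many listen windows were opened in total. */
static int run_retry(listen_model_t *m, bool reset_flag, int max_timeouts)
{
    assert(m != NULL);
    m->windows_listened = 0;
    start_listen(m, reset_flag);
    for (int i = 0; i < max_timeouts; i++) {
        if (!on_listen_timeout(m)) {
            break;
        }
    }
    return m->windows_listened;
}
```

In this model, a retry after a failure listens for a single window when the flag is left stale, and keeps listening for as long as timeouts are processed when the flag is reset on start.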
@gayafhannah Did you ever test on the master branch? I started prepping my separate s_dpp_stop_listening fix and noticed that while v5.0.1 still has issues, the master branch is actually working correctly! I checked in wireshark on a promiscuous third party device and discovered that in fact the OTA packets are distinct between v4.4.4/v5.0.1 and master. For the former, there are several action frame transmits in response to the initial auth request from my phone but they are all ignored by my phone. The phone eventually times out and gives up, despite in theory the packets being made available. For master, there is exactly one transmit of auth response from the esp32 and the phone picks it up right away and the whole process completes successfully really fast. The only difference I can think of here is if the wifi libs got updated in master and fixed some subtle but important bug in the action frame tx path. Next step I was going to try setting up wpa_supplicant on my PC so I can increase the debug logs and understand why the packets I can see sent from the esp32 are not being accepted/received by the peer, but this is slow going :) |
Hey @gayafhannah @knight-ryu12 @jasta we are actively looking into this issue. For us, it is a little bit hard to reproduce one consistent behavior. Also @gayafhannah, regarding your PR, we are trying to make more changes and make the state machine cleaner. Meanwhile, you can try out these changes
|
Note that I was able to make things work with #10865. I also dug into Wireshark a bit and saw some fairly surprising results (packets going out but not picked up by the configurator phone) which made me think this is why they changed the way the initial listening phase works in DPP V2. I think the right path to go down now is to upgrade wpa_supplicant. |
This fixes a subtle bug in which ESP_ERR_DPP_TX_FAILURE errors would call esp_supp_dpp_stop_listen which sets the s_dpp_stop_listening flag to true. Subsequent attempts to restart listening with esp_supp_dpp_start_listen then only attempt to listen once more for 500ms before reading the s_dpp_stop_listening flag again and giving up. This contributes greatly to #10615, but the fix here is still largely a work-around as it sometimes requires manually retrying a couple times before it works. Without this fix, any number of retries by deinit/init again will seemingly not work, as the retries fail for currently unknown reasons. Signed-off-by: Shreyas Sheth <shreyas.sheth@espressif.com> Closes #10865
Hey, I am experiencing the same issue. DPP fails to connect. Is there a workaround without problems? |
Can you try with a version that has my patches applied (looks like espressif backported this to v4.3, v4.4 and v5.0 branches, but I don't think any versioned releases were made with the fix)? What you should see is that there are many automatic retries if you test with the dpp-enrollee sample and it will eventually work. In either case, can you report your findings here? |
Thanks for pointing out that you patched this. We decided to go with smartConfig instead. Maybe I will return to this in the future. |
Hmm I'm also experiencing the same issue when running the dpp-enrollee sample, which has the fix, using:
I got it properly working on the master branch (V5.2), but I had to return to V5.0 because of incompatibility of other dependencies.
Shouldn't it auto-retry here (at least 5 times in the example?)? I also tried to set the config options:
But that didn't resolve it either. Any ideas? |
Answers checklist.
IDF version.
v4.4.3
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
PowerShell
Development Kit.
ESP32-WROVER-DevkitC
Power Supply used.
USB
What is the expected behavior?
I expected it to correctly authenticate with the access point.
What is the actual behavior?
It does not authenticate with the access point at all.
It seems to fail with the WPA log showing
D (66148) wpa: Mgmt Tx Status - 1, Cookie - 0x400e036c
and returns ESP_ERR_DPP_TX_FAILURE no matter which access point is used.
Steps to reproduce.
Code
It is 1:1 to the example, except I removed the key part, which should just work without it.
Debug Logs.
More Information.
It seems to crash the Wi-Fi module afterwards.