-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
nlerror 19 & 18 on peer nodes when new node joins network or refresh key #22
Comments
This is with debug 31: Thu Dec 17 16:29:51 2015 recv'd COMMIT from 00:0c:42:18:d1:29 while in NOTHING |
The memory leak is in install_key. Mon Dec 21 12:05:37 2015 set plink state (seq num=1450699574) |
On Mon, Dec 21, 2015 at 04:09:27AM -0800, tecoboot wrote:
send_nlmsg() frees msg... not very intuitive, I'll grant you. The error paths in install_key are a mess, but aside from this:
...I'm not seeing any other leaks. Bob Copeland %% http://bobcopeland.com/ |
I'm not even a C beginner. |
Maybe get rid of the two warnings first, and see if leaks disappeared too? |
Because nla_put_failure means send_nlmsg() was not called. Yes, it would be much cleaner if send_nlmsg didn't free msg and then the error and success path would look the same. Anyway, this fixes that one leak: |
Posted separate issue for memory leak, so this one is for the two nl errors. |
I tried to find the reason for the nl errors. It looks to me the errors are responses on nl commands that were applied earlier, in separate thread. I guess the errors are not provided as return codes, but as asynchronous nl reply messages. Right? Because the frequency of errors is influenced by key lifetime, it is related. I added simple debug lines on various places, to verify where the errors come from. See attachment. I think the error on cmd 19 (top of logfile) is from peer_created, timer initiated. I don't understand why this function is called for existing peers. I think the error on cmd 18 (bottom of logfile) is from set_supported_rates. I don't understand why a key refresh would need adjusting rates. |
I believe 18 is NL80211_ATTR_STA_LISTEN_INTERVAL and 19 is NL80211_ATTR_STA_SUPPORTED_RATES. Pretty sure the code for those two calls has been added in 2011, and the following patch showed up in 2012 - http://marc.info/?l=linux-wireless&m=135712403121537&w=2 Essentially, mac80211 will refuse to accept those two commands if VIF is not in AP mode. Perhaps we should remove both of these calls. |
There's also this patch that added them back to mac80211 for stations that have been created but not yet in associated state - it went in fairly recently. http://www.spinics.net/lists/linux-wireless/msg142702.html |
For reference, running trunk of OpenWRT with mac80211 from early December I only get errors on 18 (NL80211_ATTR_STA_SUPPORTED_RATES) but not 19 as I would expect given the patch in October. |
I currently use kernel 4.1.15 (latest longterm). 4.2.6 has same issue. root@R-090: Are those two calls needed for fist creation of peers? |
This patch (dec 2012) blocks updates of existing stations: The cmd 18 error can be fixed with attached patch, removing the set_supported_rates functions and skip the call in estab_peer_link. No side effects seen yet. Fixing cmd 19 is harder, it looks like process_authentication_frame as "new peer" is called when a peer restarted the sae procedure (e.g. ifdown wlan0 ; ifup wlan0 ). |
I have no problem with solve-error-netlink-cmd-18.patch.txt patch. As far as 19, I guess your approach of having a flag doesn't sound too bad. We could also add another parameter for create_candidate() that indicates if it was called for an existing station, like for example in do_reauth, vs say handle_new_station (that parameter is then passed on to peer_created to suppress LISTEN_INTERVAL and SUPPORTED_RATES calls or not). |
@alexgrin The recent patch you talk about is e4208427247ecc7306c8f71ab3c5c08e08cf9fda (in upstream Linux) |
I've submitted PR #27 to fix netlink error 18 |
I'm not deep enough into the code to fix netlink error 19, I'll leave that up to you guys/gals |
Tested the PR #27, results are great !! Only error 19 left for improvement. Edited: first see if memory leak is solved... |
This is in essence what I was thinking about - it compiles, but I don't know if it works. |
It is not working, I'll try to debug. |
The flag works. I changed the debug line based on is_new. The nl call in peer_created is the problem, in output below the call with seqno 1452242306 results in the error.
Existing peer candidate 00:0c:42:18:d1:29 (seq num=1452242306) recv'd COMMIT from 00:0c:42:18:d1:29 while in NOTHING assigning group 19 to peer, the size of the prime is 32 COMMIT received for unknown peer 00:0c:42:18:d1:29, committing and confirming peer 00:0c:42:18:d1:29 in NOTHING, sending COMMIT (no token), len 128, group 19 tx_frame(0x805f400, 0xbef1f640, 128) tx frame (seq num=1452242307) ---------- tx frame hexdump 00000000 b0 00 00 00 00 0c 42 18 d1 29 4c 5e 0c 13 bd ca 00000010 00 0c 42 18 d1 29 00 00 03 00 01 00 00 00 13 00 00000020 0d 56 2a db f9 2d cf 2f 03 c3 ec 5c b2 7a 03 7c 00000030 8b df 88 de 1c f1 a8 5c 8e 87 ac 03 c3 75 a6 6b 00000040 1b fa 62 76 26 8a b6 44 18 e2 7a 31 bf 69 a9 7c 00000050 2f 77 ee 1f 39 c3 14 6d a2 0a ce d7 f2 73 17 06 00000060 2a 4d 3b 32 ca ea ba 34 84 a6 49 fa 07 b9 73 bc 00000070 06 e8 a8 99 f6 69 2d 45 46 5f 4b a2 5b f9 21 d9 ---------- 00:0c:42:18:d1:29 in NOTHING, sending CONFIRM (sc=1), len 64 tx_frame(0x805f400, 0xbef1f64c, 64) tx frame (seq num=1452242308) ---------- tx frame hexdump 00000000 b0 00 00 00 00 0c 42 18 d1 29 4c 5e 0c 13 bd ca 00000010 00 0c 42 18 d1 29 00 00 03 00 02 00 00 00 01 00 00000020 4f 7b 42 90 3d db 8e 87 31 95 e2 1e 23 36 60 44 00000030 dd 51 db 50 33 4b 74 29 ee 5c 45 4a 4c 65 04 3c ---------- state of 00:0c:42:18:d1:29 is now (2) CONFIRMED --4565-- REDIR: 0x42b6970 (libc.so.6:stpcpy) redirected to 0x402f170 (stpcpy) nlerror, cmd 19, seq 1452242306: Invalid argument |
I can't find the reason for the -EINVAL. |
Here's what's going on - when we do reauth, or get a beacon and don't have a struct for that peer, we create a new candidate. Since we're not certain that we have perfect sync between kernel and meshd peer state, we blindly call CMD_NEW_STATION every time just in case. More likely than not it will return -EEXISTS for reauth. Failing to create new station when one exists appears to be harmless, though it generates an error -EEXIST for cmd 19 (CMD_NEW_STATION). It is possible to suppress the error by inserting "nlcfg.supress_error = -EEXIST;" right in front of send_nlmsg in peer_created(). Another way is to interrogate kernel of station existence before issuing CMD_NEW_STATION, but that seems like it will add unnecessary complexity. My previous patch is completely bogus, CMD_NEW_STATION will return -EINVALID if LISTEN_INTERVAL is not specified. |
On Sat, Jan 09, 2016 at 03:54:37PM -0800, Alexis Green wrote:
Yes, just inserting the station and ignoring -EEXIST is the right Bob Copeland %% http://bobcopeland.com/ |
OK, I better understand now. Tested on single node. No repeating File exists anymore. I have a single Fri Jan 15 07:42:43 2016 Unexpected error -22 (expected -17). Could have to do with peer that has a problem. I have a node that has unidirectional link with two peers. I see repeating errors: So something went wrong and adding the peer doesn't work. |
OK, much cleaner now. Not in stress test mode yet, lifetime is 3600000 (1000h). Some messages, probably caused by out-of-sync (restart peers): peer 1, ath5k: peer 5, ath9k: peer 6, ath9k: peer 7, ath9k: |
OK, no problems seen so far. The new error messages would have been there before, but were not detected due to the tons of cmd 18 & 19 messages. Thank you guys !! |
Fix for nlerror 19 on key refresh. Issue #22
I have seen huge differences in memory consumption on different nodes.
During investigation, I see error messages (debug=1):
Thu Dec 17 16:24:20 2015 nlerror, cmd 19, seq 1450368158: File exists
Thu Dec 17 16:24:20 2015 nlerror, cmd 18, seq 1450368171: Invalid argument
I'll post results with higher debug level
The text was updated successfully, but these errors were encountered: