Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Subscriptions, why can't these two nodes see each other? (CON-1110) #891

Open
jonsmirl opened this issue Mar 31, 2024 · 9 comments
Open

Comments

@jonsmirl
Copy link
Contributor

jonsmirl commented Mar 31, 2024

My commissioner node can talk to both nodes, why can't they see each other?

esp-matter node

I (666882) read_command: Subscribe to AB0DE00000000001
I (666883) read_command: JDS my_send_subscribe_attr_command
I (666883) read_command: send_command AB0DE00000000001 fabric 2
I (666885) read_command: Error 0
I (666885) ep: >>> JDSEvent Exit 0 4852 >>>>
I (666901) chip[DIS]: UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540: new best score: 3
I (666902) chip[DIS]: UDP:[FDBA:A4B0:2A67:43ED:DADF:E851:D193:3BA2%st1]:5540: score has not improved: 3
I (666903) chip[DIS]: UDP:[FD71:92A5:EC02:171E:2916:FA94:342:294C%st1]:5540: new best score: 5
I (666904) chip[DIS]: UDP:[FD71:92A5:EC02:171E:3E6F:940A:EEF8:AB51%st1]:5540: score has not improved: 5
I (666905) chip[DIS]: UDP:[FD71:92A5:EC02:171E:190D:E0E1:76FF:D6E4%st1]:5540: score has not improved: 5
I (666906) chip[DIS]: Checking node lookup status after 22 ms
I (666907) chip[DIS]: Keeping DNSSD lookup active
I (666908) chip[DIS]: UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540: score has not improved: 3
I (666909) chip[DIS]: UDP:[FDBA:A4B0:2A67:43ED:DADF:E851:D193:3BA2%st1]:5540: score has not improved: 3
I (666910) chip[DIS]: UDP:[FD71:92A5:EC02:171E:2916:FA94:342:294C%st1]:5540: score has not improved: 5
I (666911) chip[DIS]: UDP:[FD71:92A5:EC02:171E:3E6F:940A:EEF8:AB51%st1]:5540: score has not improved: 5
I (666912) chip[DIS]: UDP:[FD71:92A5:EC02:171E:190D:E0E1:76FF:D6E4%st1]:5540: score has not improved: 5
I (666913) chip[DIS]: Checking node lookup status after 29 ms
I (666913) chip[DIS]: Keeping DNSSD lookup active
I (667084) chip[DIS]: Checking node lookup status after 200 ms
I (667086) chip[SC]: Initiating session on local FabricIndex 2 from 0xAB0DE00000000004 -> 0xAB0DE00000000001
I (667159) chip[EM]: <<< [E:19986i S:0 M:8072319] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FD71:92A5:EC02:171E:2916:FA94:342:294C]:5540] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
I (667162) chip[SC]: Sent Sigma1 msg
I (667212) chip[EM]: >>> [E:19986i S:0 M:163439504 (Ack:8072319)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
I (667214) chip[EM]: <<< [E:19986i S:0 M:8072320 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (667216) chip[SC]: Received Sigma2 msg
E (667396) chip[SC]: The device does not support GetClock_RealTimeMS() API: 6c.  Falling back to Last Known Good UTC Time
I (668216) chip[EM]: <<< [E:19986i S:0 M:8072321 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:32 (SecureChannel:CASE_Sigma3)
I (668218) chip[SC]: Sent Sigma3 msg
E (668220) chip[DL]: Long dispatch time: 1008 ms, for event type 3
I (668221) chip[EM]: >>> [E:19986i S:0 M:163439504 (Ack:8072319)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
I (668222) chip[EM]: <<< [E:19986i S:0 M:8072322 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (668225) chip[EM]: >>> [E:19986i S:0 M:163439504 (Ack:8072319)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
I (668227) chip[EM]: <<< [E:19986i S:0 M:8072323 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
E (668229) chip[EM]: Non-crit err 3000001 sending solitary ack for MessageCounter:163439504 on exchange 19986i
I (668535) chip[EM]: >>> [E:19986i S:0 M:163439504 (Ack:8072319)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
I (668537) chip[EM]: <<< [E:19986i S:0 M:8072324 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
E (668539) chip[EM]: Non-crit err 3000001 sending solitary ack for MessageCounter:163439504 on exchange 19986i
I (668567) chip[EM]: Retransmitting MessageCounter:8072321 on exchange 19986i Send Cnt 1
E (668569) chip[EM]: Ignoring transient send error: 3000001 on exchange 19986i
I (668933) chip[EM]: Retransmitting MessageCounter:8072321 on exchange 19986i Send Cnt 2
E (668935) chip[EM]: Ignoring transient send error: 3000001 on exchange 19986i
I (669571) chip[EM]: Retransmitting MessageCounter:8072321 on exchange 19986i Send Cnt 3
I (669670) chip[EM]: >>> [E:19986i S:0 M:163439504 (Ack:8072319)] (U) Msg RX from 0:0000000000000000 [0000] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
I (669672) chip[EM]: <<< [E:19986i S:0 M:8072325 (Ack:163439504)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[FDBA:A4B0:2A67:43ED:1978:F86:3097:B359%st1]:5540] --- Type 0000:10 (SecureChannel:StandaloneAck)
I (670519) chip[EM]: Retransmitting MessageCounter:8072321 on exchange 19986i Send Cnt 4
E (672021) chip[EM]: Failed to Send CHIP MessageCounter:8072321 on exchange 19986i sendCount: 4 max retries: 4
I (681892) app_main: Dump #34 Free internal 46771 biggest block 29680 total 46771 PSRAM free 1442424

target node, linux running the themostat example

[1711850501.071962][2710431:2710431] CHIP:EM: >>> [E:19986r S:0 M:8072319] (U) Msg RX from 0:2DAE095676574125 [0000] --- Type 0000:30 (SecureChannel:CASE_Sigma1)
[1711850501.072029][2710431:2710431] CHIP:EM: Handling via exchange: 19986r, Delegate: 0x56294e682500
[1711850501.072067][2710431:2710431] CHIP:IN: CASE Server received Sigma1 message . Starting handshake. EC 0x5629504af5c0
[1711850501.072102][2710431:2710431] CHIP:SC: Received Sigma1 msg
[1711850501.072147][2710431:2710431] CHIP:SC: Found MRP parameters in the message
[1711850501.072183][2710431:2710431] CHIP:SC: Peer assigned session key ID 20807
[1711850501.073354][2710431:2710431] CHIP:SC: CASE matched destination ID: fabricIndex 2, NodeID 0xAB0DE00000000001
[1711850501.075679][2710431:2710431] CHIP:EM: <<< [E:19986r S:0 M:163439504 (Ack:8072319)] (U) Msg TX to 0:0000000000000000 [0000] [UDP:[fd71:92a5:ec02:171e:7204:1dff:fecd:9da0%enp3s0f1]:5540] --- Type 0000:31 (SecureChannel:CASE_Sigma2)
[1711850501.075831][2710431:2710431] CHIP:SC: Sent Sigma2 msg
[1711850501.485458][2710431:2710431] CHIP:EM: Retransmitting MessageCounter:163439504 on exchange 19986r Send Cnt 1
[1711850501.825189][2710431:2710431] CHIP:EM: Retransmitting MessageCounter:163439504 on exchange 19986r Send Cnt 2
[1711850502.374047][2710431:2710431] CHIP:EM: Retransmitting MessageCounter:163439504 on exchange 19986r Send Cnt 3
[1711850503.419708][2710431:2710431] CHIP:EM: Retransmitting MessageCounter:163439504 on exchange 19986r Send Cnt 4
[1711850504.827512][2710431:2710431] CHIP:EM: Failed to Send CHIP MessageCounter:163439504 on exchange 19986r sendCount: 4 max retries: 4
[1711850535.299046][2710431:2710431] CHIP:SC: CASESession timed out while waiting for a response from the peer. Current state was 2
[1711850535.299091][2710431:2710431] CHIP:IN: CASE Session establishment failed: src/protocols/secure_channel/CASESession.cpp:559: CHIP Error 0x00000032: Timeout
[1711850535.299103][2710431:2710431] CHIP:IN: SecureSession[0x5629504cf9f0]: Released - Type:2 LSID:40264
[1711850535.299137][2710431:2710431] CHIP:IN: SecureSession[0x5629504cf9f0]: Allocated Type:2 LSID:40265
[1711850535.299152][2710431:2710431] CHIP:SC: Allocated SecureSession (0x5629504cf9f0) - waiting for Sigma1 msg
@github-actions github-actions bot changed the title Subscriptions, why can't these two nodes see each other? Subscriptions, why can't these two nodes see each other? (CON-1110) Mar 31, 2024
@jonsmirl
Copy link
Contributor Author

I am able to successfully establish a subscription between two ESP based nodes. So this is a problem talking to a Linux based node? I don't know, I was just using the Linux node for testing, actual target is ESP based.

@wqx6
Copy link
Contributor

wqx6 commented Apr 1, 2024

Looks like the error occurs when the ESP32 is sending the Sigma3 message to the Linux themostat and the error code is 0x3000001 (LwIP error ERR_MEM). Since the heap is enough in the log, I guess that the Wi-Fi TX buffer might be not enough. Could you try to increase the Wi-Fi Dynamic TX buffer in menuconfig?

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Apr 1, 2024

That's my SDK config. I have 16 static TX buffers.
sdkconfig.defaults.txt

Note that if I point the same node at an ESP based device it works.

I (27879484) app_main: Dump #1392 Free internal 48655 biggest block 29680 total 48655 PSRAM free 1437392
Task: main -> MALLOC_CAP_INTERNAL: 108020 MALLOC_CAP_SPIRAM: 371740 Highwater 4852
Task: lua -> MALLOC_CAP_INTERNAL: 41364 MALLOC_CAP_SPIRAM: 104464 Highwater 27204
Task: CHIP -> MALLOC_CAP_INTERNAL: 37104 MALLOC_CAP_SPIRAM: 34548 Highwater 3452
Task: tiT -> MALLOC_CAP_INTERNAL: 528 MALLOC_CAP_SPIRAM: 660 Highwater 136
Task: wifi -> MALLOC_CAP_INTERNAL: 45948 MALLOC_CAP_SPIRAM: 60040 Highwater 2912
Task: nimble_host -> MALLOC_CAP_INTERNAL: 52 MALLOC_CAP_SPIRAM: 360 Highwater 1840
Task: sys_evt -> MALLOC_CAP_INTERNAL: 184 MALLOC_CAP_SPIRAM: 96 Highwater 1836
Task: httpd -> MALLOC_CAP_INTERNAL: 260 MALLOC_CAP_SPIRAM: 4 Highwater 536
Task: Pre-Scheduler allocs -> MALLOC_CAP_INTERNAL: 3660 MALLOC_CAP_SPIRAM: 0 Highwater 0
Task: ipc0 -> MALLOC_CAP_INTERNAL: 21944 MALLOC_CAP_SPIRAM: 0 Highwater 688
Task: ipc1 -> MALLOC_CAP_INTERNAL: 24 MALLOC_CAP_SPIRAM: 0 Highwater 708
Task: btController -> MALLOC_CAP_INTERNAL: 7416 MALLOC_CAP_SPIRAM: 0 Highwater 1100
Task: esp_timer -> MALLOC_CAP_INTERNAL: 100 MALLOC_CAP_SPIRAM: 0 Highwater 2984

@wqx6
Copy link
Contributor

wqx6 commented Apr 2, 2024

May you add some logs at wlanif.c:109 so that we can know whether the lack of Wi-Fi TX buffer causes this issue?

Note that the 16 TX buffers could be not enough for Matter since Matter uses UDP and it may deliver frames faster than WiFi layer can transmit. If you are worry about the memory usage, you could switch to dynamic TX buffer so that you can have more buffer number.

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Apr 2, 2024

How do you do dynamic TX buffers in a system with PSRAM? Enabling PSRAM disables the dynamic TX option in menuconfig.

Shouldn't chip throttle when wifi runs out of TX buffers? It can't work under the assumption that it can put as many UDP packets into wifi as it wants.

Of course, today when I rebuilt everything, it initially works against Linux and fails in a different place.
esp-fail.txt
linux-log.txt
--- wait, that is the session to my phone. My phone went to sleep. Node 0 is phone, node 1 is Linux thermostat example, node 2 is ESP.

Did not hit this....

    if (ret == ESP_ERR_NO_MEM) {
    	printf("JDS -- ESP_ERR_NO_MEM at low_level_output");
        return ERR_MEM;
    }

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Apr 2, 2024

This is not the first time I have seen behavior like this. When testing groups sometimes I will lose connectivity to various nodes until I reboot them. You get the same failure -- out of packet buffers. I think this is because something happened down in a wifi stack somewhere and the nodes stopped seeing each other.

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Apr 2, 2024

Maybe when there is wifi interference TX buffers are being leaked? The failed transmits are not getting put back into the pool?

@wqx6
Copy link
Contributor

wqx6 commented Apr 3, 2024

How do you do dynamic TX buffers in a system with PSRAM? Enabling PSRAM disables the dynamic TX option in menuconfig.

You can select SPIRAM_USE_CAPS_ALLOC to enable dynamic TX buffers. With SPIRAM_USE_MALLOC enabled, the chip might allocate TX buffer memory on PSRAM, and this will cause a decrease in Wi-Fi performance. That's why we will disable dynamic TX buffers when SPIRAM_USE_MALLOC enabled.

This is not the first time I have seen behavior like this. When testing groups sometimes I will lose connectivity to various nodes until I reboot them. You get the same failure -- out of packet buffers. I think this is because something happened down in a wifi stack somewhere and the nodes stopped seeing each other.

Yes, for Matter over UDP, the failure -- out of packet buffers is more likely to be encountered when the network is complicated, as the Matter will post more packets to Wi-Fi TX radio. As you see in the log, we will not treat this failure as a critical issue since the Matter has re-transfer mechanism. I guess this might be improved when TCP is introduced to Matter in Matter1.4(or later).

Maybe when there is wifi interference TX buffers are being leaked? The failed transmits are not getting put back into the pool?

I don't think so. If there are some buffers leaked, the usable buffers will be less and less and we will have no buffer for Wi-Fi TX. At least we didn't see the situation when the chip has no buffer to use anymore.

@jonsmirl
Copy link
Contributor Author

jonsmirl commented Apr 3, 2024

I am recompiled and running with dynamic TX buffers limit 32. I used to not be able to do that but I have been adding CAP mallocs all over.

Looking at my debug output there is obviously still some stuff I can move into PSRAM.

I (461557) app_main: Dump #23 Free internal 52928 biggest block 31728 total 52928 PSRAM free 1472908
Task: main -> MALLOC_CAP_INTERNAL: 75280 MALLOC_CAP_SPIRAM: 371776 Highwater 4948
Task: lua -> MALLOC_CAP_INTERNAL: 48156 MALLOC_CAP_SPIRAM: 71632 Highwater 27200
Task: CHIP -> MALLOC_CAP_INTERNAL: 54272 MALLOC_CAP_SPIRAM: 34988 Highwater 2304

These allocs are in code I didn't write, I have been systematically forcing all of my allocs into PSRAM.

So is there a magic tool which can track these allocations and tell me which lines in the code are doing the worst of it? I have turned on heap tracking before but it generates many thousands of events most of which are just noise (tiny allocs and frees). I have plenty of PSRAM free to use for heap tracking. What I'd like to do is turn on this magic tracking and then make a call which would tell me the line number which made every internal alloc over 100 bytes. I don't care about tracking all of the alloc/frees, I just want to do a snapshot in time and know who made made every internal alloc over 100 bytes and to filter out the PSRAM ones of which there are thousands.

What about the CHIP task? 54,272 bytes there which I am sure could go into PSRAM.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants