
IoT Edge stops sending telemetry after WSSD Agent and Vmmem CPU and EFLOW.vhdx Disk usage climb to excessive levels. #49

Closed
dmaxwe22 opened this issue Jul 18, 2021 · 19 comments

@dmaxwe22

After about 20 minutes, CPU and disk usage climb very high, IoT Edge stops sending telemetry, and PowerShell, the Linux connection, and other interfaces stop responding.

This pattern of excessive CPU and disk usage, stopped telemetry, and unresponsiveness repeats even with a fresh installation of Windows 10 IoT Enterprise 2019 LTSC and EFLOW. The only constant is the Azure DPS X509 enrollment with an IoT Hub group deployment of two modules (a simulated temperature sensor and a Streaming Analytics reset module). The same deployment has run on four other Ubuntu Linux computers and two Raspberry Pis for three months without issues.

Rebooting the computer restarts telemetry and all is calm for another 20 minutes or so, until the excessive CPU and disk usage starts back up again.

  1. WSSD Agent Service CPU usage 25% initially. Then 28% after 20 minutes. Not much change.

  2. Vmmem CPU usage <1% initially. Then 25% after 20 minutes. Big change.

  3. AzureIoT EdgeForLinux-v1-EFLOW.vhdx 48Kbps initially. Then 14 Mbps after 20 minutes then 160 Mbps after 25 minutes. Big change.

  4. Is there a leak or something building up that is not properly managed behind the scenes?

  5. I am not sure how to find out what is causing the high disk and memory usage and the effective shutdown of telemetry.

Even after telemetry stops sending, the disk keeps running at nearly full capacity for an hour or more until rebooted. The CPU and disk keep operating at very high capacity, with the SSD LED on the front panel lit solid.

What steps could be taken to troubleshoot? Since I cannot see logs when the computer is not responding, can we store logs on the computer to view after booting back up? Would it be possible to eliminate the deployment, delete the containers, and start with just the two runtime modules, edgeAgent and edgeHub, to see if the problem is related to the containers?
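One approach that might work (a sketch, assuming the IoT Edge 1.1 service logs to journald inside the EFLOW VM and that the home directory survives a reboot) is to dump logs to files before the machine becomes unresponsive:

# save the iotedge service log to a file that can be read after reboot
sudo journalctl -u iotedge --no-pager > ~/iotedge-service.log
# save per-module logs as well (module name taken from this deployment)
sudo iotedge logs edgeAgent > ~/edgeAgent.log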

I will post screen clippings after submitting this issue.

@dmaxwe22 dmaxwe22 reopened this Jul 18, 2021
@dmaxwe22
Author

EFLOW Performance Screen Capture - 1
EFLOW Performance Screen Capture - 2

@dmaxwe22 dmaxwe22 changed the title IoT Edge stops sending telemetry and disk usage for the EflowVM goes to 100 percent after 15 minutes and stays busy IoT Edge stops sending telemetry. CPU and Disk usage are excessive. The EFLOW.vhdx usage got to higher than 70 Mbps. PowerShell non responsive Jul 18, 2021
@dmaxwe22 dmaxwe22 changed the title IoT Edge stops sending telemetry. CPU and Disk usage are excessive. The EFLOW.vhdx usage got to higher than 70 Mbps. PowerShell non responsive IoT Edge stops sending telemetry after WSSD Agent and Vmmem CPU and EFLOW.vhdx Disk usage climb to excessive levels. Jul 18, 2021
@TerryWarwick
Contributor

@dmaxwe22,

Thank you for the very thorough report. We have a fix for the excessive CPU utilization of WSSDAgent and will make it available in our next servicing update. If all goes well, this update will be available the week of July 26th.

Terry Warwick
Microsoft

@dmaxwe22
Author

OK. I noticed, by the way, that things seemed calm when just the SimulatedTemperatureSensor was running alone. When I added the Azure Streaming Analytics module to do a simple reset of the temperature module, the trouble began: after 15 or 20 minutes the entire PowerShell window is useless and telemetry stops.

In the week to 10 days until this update comes out, I will experiment with other types of modules to see if that makes a difference.

@samuelbertrand

I had this exact issue on the public preview (before the GA). Increasing the virtual machine memory from 1 GB (default) to 2 GB (or more) fixed the issue for me. The issue seems to be related to the virtual machine excessively using the Linux swap space on the virtual disk because of the lack of available memory. However, I was not able to reproduce the issue on the GA release.
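To confirm swap pressure inside the VM (a quick sketch; assumes the standard free and swapon utilities are present in the EFLOW distribution), you could run:

# show memory and swap totals; a large swap 'used' value indicates memory pressure
free -m
# list active swap devices and how much of each is in use
swapon --show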

@nealpeters86

@TerryWarwick we are encountering the same issue. Is there a workaround available?

@dmaxwe22
Author

@samuelbertrand your tip seems to have solved this problem of everything coming to a halt. I should have thought of that but didn't. The virtual machine memory upgrade also solved an additional problem: the Vmmem CPU utilization is very low now, and the disk usage is nearly 0% after an hour of running with the original two modules (SimulatedTemperatureSensor and my Azure Streaming Analytics temperature reset module).

Here are the steps I took to get to this point (taken from the PowerShell functions for IoT Edge for Linux on Windows page: https://docs.microsoft.com/en-us/azure/iot-edge/reference-iot-edge-for-linux-on-windows-functions):

Stop-EflowVM
Set-EflowVM -memoryInMB 4096
Start-EflowVM

That redeployed the Linux virtual machine with 4GB of RAM allocated.

I don't know how to find out how much memory the virtual machine has while it is running, so I am not sure what it was before doing this. @samuelbertrand mentioned 1 GB (default) above. Now it is set to 4 GB.
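(One way to check, judging by the Get-EflowVM output shown later in this thread: the TotalMemMb field under SystemStatistics reflects the VM's current allocation.)

# print just the VM's memory/storage/CPU statistics
(Get-EflowVM).SystemStatistics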

After doing that, I also did the following:
sudo systemctl stop iotedge
sudo docker image list
sudo docker rmi "idofdockerimage"

Repeated the removal of docker images until all docker images were deleted.
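(The repeated removal can probably be collapsed into one line; an untested sketch, where -q makes docker print only the image IDs:)

# remove all images in one pass
sudo docker rmi $(sudo docker image list -q)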

I then went into the Azure IoT Edge deployments and changed the priority of the desired deployment containing the Azure Streaming Analytics module, so that when the iotedge service restarts it downloads the correct containers and the edgeAgent and edgeHub runtimes.

Then

sudo systemctl start iotedge

Then, in the Azure portal, I went to the IoT Hub, then IoT Edge devices, clicked on this device, and clicked on "Troubleshoot". This has become a very valuable addition to the portal, and I appreciate it. I was able to see the logs of each module.

After an hour all is calm.

However, the WSSD Agent CPU utilization is still at 26%, so the upcoming release that reduces it will be appreciated.

@ms-vincent

@dmaxwe22 What version of EFLOW are you using?

@dmaxwe22
Author

@ms-vincent can you tell me how to find the EFLOW version? I downloaded it around the 15th of July. Why do you ask?
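(One way that might work to check the installed EFLOW version, assuming EFLOW registers as an MSI product; this is a guess on my part, not confirmed:)

# query installed products for the EFLOW entry and print its version
Get-WmiObject -Class Win32_Product | Where-Object { $_.Name -like "*IoT Edge*" } | Select-Object Name, Version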

@dmaxwe22
Author

PS C:\WINDOWS\system32> Get-EflowVM | Format-List

VmConfiguration : @{ID=1b07306e9a1fc5c; name=DESKTOP-G5K33IS-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.3; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics : @{TotalMemMb=3747; UsedMemMb=847; AvailableMemMb=2749; TotalStorageMb=9900; UsedStorageMb=1441; AvailableStorageMb=8041; CpuCount=4;
                   KernelVersion=5.10.37.1-1.cm1 #1 SMP Fri Jun 4 11:14:43 UTC 2021}

@fcabrera23
Contributor

Hi @dmaxwe22

We have released our latest EFLOW update, which fixes the high CPU usage by WSSD Agent. Thank you for the detailed information about the high memory usage; it's good to know that you were able to solve it by assigning more memory to the EFLOW VM. For more information on the update, see the 1.1.2107.0 Release Notes.

Thanks,
Francisco

@nenright

nenright commented Sep 7, 2021

Is there any other way to validate that the update has been deployed? I'm seeing high CPU even after installing 1.1.2107.0. There is no deployment set, and the only module running is edgeAgent. I just want to make sure I have the bits deployed correctly before opening a new issue:

Screen Shot 2021-09-07 at 11 31 49 AM
Screen Shot 2021-09-07 at 11 30 30 AM
Screen Shot 2021-09-07 at 11 28 02 AM

Not sure if it's related or not, but the edgeAgent logs look like this:

Screen Shot 2021-09-07 at 11 34 34 AM

@fcabrera23
Contributor

Hi @nenright,

Can you run Get-EflowVM | Format-List in PowerShell so that we can understand your VM configuration?

Thanks,
Francisco

@nenright

nenright commented Sep 9, 2021

Does the version upgrade require a machine reboot? After rebooting the host, things appear to be back to normal; however, I don't know if it's something that will creep back over time.

Get-EflowVM | Format-List


VmConfiguration    : @{ID=13dfb745fe82122; name=PSFD-NENRIGHT-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.4; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus  : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics   : @{TotalMemMb=787; UsedMemMb=278; AvailableMemMb=450; TotalStorageMb=9900; UsedStorageMb=652;
                     AvailableStorageMb=8831; CpuCount=1; KernelVersion=5.10.42.1-3.cm1 #1 SMP Mon Jun 28 13:00:04 UTC
                     2021}

@fcabrera23
Contributor

@nenright,

Thanks for the information. I see that you're using a 1 GB VM. Can you assign 2 GB to the VM using the Set-EflowVm command?
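(For example, reusing the sequence dmaxwe22 posted earlier in this thread, with 2048 MB:)

Stop-EflowVM
Set-EflowVM -memoryInMB 2048
Start-EflowVM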

Also, could you please share your Windows version? You can run winver directly in Windows to get the full version information.

Thanks,
Francisco

@nenright

nenright commented Sep 9, 2021

I'm no longer experiencing the behavior on my test device after the reboot; it is currently working as expected even with 1 GB. The test device is on 21H1, OS Build 19043.1165.

We did see this in the field on a device configured with 2 cores and 4 GB. After a reboot of that device, we think it resolved itself as well. I may have another device in the field in the high CPU/memory state, but I'm not 100% sure (I currently don't have a good way to monitor the VM host remotely). If this is of interest to you, I can try to coordinate with the client to get some downtime, remote in, and gather details.

If a reboot is needed after Edge is updated through Windows Update going forward, we'll just need to know that so we can coordinate.

@nenright

@fcabrera23, I believe my test machine is back in the high CPU/memory state. I haven't done anything with it since my previous post. It's provisioned, but no deployments are set; edgeAgent is the only running module. The memory and CPU load appear to get worse over time.

Windows Update indicates that KB5005565 was installed and there is a pending restart. I assume that if I restart, memory and CPU will return to normal. Let me know if there is anything I can collect for you while it is in this state.

Screen Shot 2021-09-20 at 8 23 41 AM

@fcabrera23
Contributor

Hi @nenright,

Can you try running Get-EflowVm | Format-List and share the output? We are seeing growth in the Docker log files that can lead to this situation.

Thanks,
Francisco

@nenright

Some more info... it looks like the antimalware scan was what was causing things to spike. After 20-ish minutes things calmed down, although every ~1 second the Hyper-V compute and wssdagent processes cycle through a ~3-second high-CPU period, so they go up and down. Maybe the metrics scrape (it's trying to scrape the metrics for edgeHub, but edgeHub isn't deployed)?

Here is the Get-EflowVm output:

PS C:\Users\nenright> Get-EflowVm | Format-List

VmConfiguration    : @{ID=13dfb745fe82122; name=PSFD-NENRIGHT-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.4; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus  : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics   : @{TotalMemMb=787; UsedMemMb=269; AvailableMemMb=453; TotalStorageMb=9900; UsedStorageMb=903;
                     AvailableStorageMb=8579; CpuCount=1; KernelVersion=5.10.42.1-3.cm1 #1 SMP Mon Jun 28 13:00:04 UTC 2021}

@fcabrera23
Contributor

Hi @nenright,

After more investigation, we didn't see any modules generating excessive journal logs, so we revisited how journald is designed. It turns out that journald's memory usage is bounded by the maximum size of the active log file (the one currently being written to).


This value correlates with the "SystemMaxFileSize" setting: the lower the value, the less working memory journald will take. By default, "SystemMaxFileSize" is 1/8 of the maximum total log size (1 GB in the EFLOW VM by default), so hypothetically journald will map at most 250 MB (125 MB for the active system journal + 125 MB for the active user journal) into working memory. For a resource-constrained device, this behavior may not be ideal.

Customers who do not want this behavior can modify "/etc/systemd/journald.conf" to add a "SystemMaxFileSize" entry, which indirectly caps journald's memory usage.

Just for testing purposes, one can modify the file and then do the following without a VM reboot (an example journald.conf snippet follows the steps):

  1. sudo systemctl daemon-reload
  2. sudo systemctl restart systemd-journald
  3. sudo journalctl --rotate
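For example (a sketch; the 64M value is illustrative, not a tested recommendation), the journald.conf entry would look like:

[Journal]
# cap each active journal file, which in turn caps journald's mapped memory
SystemMaxFileSize=64M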
