
IoT Edge stops sending telemetry after WSSD Agent and Vmmem CPU and EFLOW.vhdx Disk usage climb to excessive levels. #49

Closed
dmaxwe22 opened this issue Jul 18, 2021 · 19 comments

@dmaxwe22

After about 20 minutes, CPU and disk usage climb very high, IoT Edge stops sending telemetry, and PowerShell, the Linux connection, and other interfaces stop responding.

This pattern of excessive CPU and disk usage, stopped telemetry, and unresponsiveness repeats even with a fresh installation of Windows 10 IoT Enterprise 2019 LTSC and EFLOW. The only constant is the Azure DPS X509 enrollment with an IoT Hub group deployment of two modules (a simulated temperature sensor and a Streaming Analytics reset module). The same deployment has run on four other Ubuntu Linux computers and two Raspberry Pis for three months without issues.

Rebooting the computer restarts telemetry and all is calm for another 20 minutes or so, until the excessive CPU and disk usage starts back up again.

  1. WSSD Agent Service CPU usage 25% initially. Then 28% after 20 minutes. Not much change.

  2. Vmmem CPU usage <1% initially. Then 25% after 20 minutes. Big change.

  3. AzureIoT EdgeForLinux-v1-EFLOW.vhdx 48Kbps initially. Then 14 Mbps after 20 minutes then 160 Mbps after 25 minutes. Big change.

  4. Is there a leak or something building up that is not properly managed behind the scenes?

  5. I am not sure how to find out what is causing the high disk and memory usage and the effective shutdown of telemetry.

Even after telemetry stops sending, the disk keeps running at nearly full capacity for an hour or more until rebooted. The CPU and disk keep operating at very high capacity, with the SSD LED on the front panel lit solid.

What steps could be taken to troubleshoot? Since I cannot see logs when the computer is not responding, can we store logs on the computer to view after booting back up? Would it be possible to eliminate the deployment, delete the containers, and start with just the two runtime modules, edgeAgent and edgeHub, to see if the problem is related to the containers?
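One approach that might work (a sketch, assuming the IoT Edge 1.1 service logs to journald inside the EFLOW VM and that the home directory survives a reboot) is to dump logs to files before the machine becomes unresponsive:

# save the iotedge service log to a file that can be read after reboot
sudo journalctl -u iotedge --no-pager > ~/iotedge-service.log
# save per-module logs as well (module name taken from this deployment)
sudo iotedge logs edgeAgent > ~/edgeAgent.log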

I will post screen clippings after submitting this issue.

@dmaxwe22 dmaxwe22 reopened this Jul 18, 2021
@dmaxwe22
Author

EFLOW Performance Screen Capture - 1
EFLOW Performance Screen Capture - 2

@dmaxwe22 dmaxwe22 changed the title IoT Edge stops sending telemetry and disk usage for the EflowVM goes to 100 percent after 15 minutes and stays busy IoT Edge stops sending telemetry. CPU and Disk usage are excessive. The EFLOW.vhdx usage got to higher than 70 Mbps. PowerShell non responsive Jul 18, 2021
@dmaxwe22 dmaxwe22 changed the title IoT Edge stops sending telemetry. CPU and Disk usage are excessive. The EFLOW.vhdx usage got to higher than 70 Mbps. PowerShell non responsive IoT Edge stops sending telemetry after WSSD Agent and Vmmem CPU and EFLOW.vhdx Disk usage climb to excessive levels. Jul 18, 2021
@TerryWarwick
Contributor

@dmaxwe22,

Thank you for the very thorough report. We have a fix for the excessive CPU utilization of WSSDAgent and will make it available in our next servicing update. If all goes well, this update will be available the week of July 26th.

Terry Warwick
Microsoft

@dmaxwe22
Author

OK. I noticed, by the way, that things seemed calm when just the SimulatedTemperatureSensor was running alone. When I added the Azure Streaming Analytics module to do a simple reset of the temperature module, the trouble began: after 15 or 20 minutes the entire PowerShell window is useless and telemetry stops.

In the week to 10 days until this update comes out, I will experiment with other types of modules to see if that makes a difference.

@samuelbertrand

I had this exact issue on the public preview (before the GA). Increasing the virtual machine memory from 1 GB (default) to 2 GB (or more) fixed the issue for me. The issue seems to be related to the virtual machine excessively using the Linux swap space on the virtual disk because of the lack of available memory. However, I was not able to reproduce the issue on the GA release.
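To confirm swap pressure inside the VM (a quick sketch; assumes the standard free and swapon utilities are present in the EFLOW distribution), you could run:

# show memory and swap totals; a large swap 'used' value indicates memory pressure
free -m
# list active swap devices and how much of each is in use
swapon --show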

@nealpeters86

@TerryWarwick we are encountering the same issue. Is there a workaround available?

@dmaxwe22
Author

@samuelbertrand your tip seems to have solved this problem of everything coming to a halt. I should have thought of that but didn't. The virtual machine memory upgrade also solved an additional problem: the Vmmem CPU utilization is very low now, and the disk usage is nearly 0% after an hour of running with the original two modules (SimulatedTemperatureSensor and my Azure Streaming Analytics temperature reset module).

Here are the steps I took to get to this point (taken from the PowerShell functions for IoT Edge for Linux on Windows page: https://docs.microsoft.com/en-us/azure/iot-edge/reference-iot-edge-for-linux-on-windows-functions):

Stop-EflowVM
Set-EflowVM -memoryInMB 4096
Start-EflowVM

That redeployed the Linux virtual machine with 4GB of RAM allocated.

I don't know how to find out how much memory the virtual machine has while it is running, so I am not sure what it was before doing this. @samuelbertrand mentioned 1 GB (default) above. Now it is set to 4 GB.
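(One way to check, judging by the Get-EflowVM output shown later in this thread: the TotalMemMb field under SystemStatistics reflects the VM's current allocation.)

# print just the VM's memory/storage/CPU statistics
(Get-EflowVM).SystemStatistics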

After doing that, I also did the following:
sudo systemctl stop iotedge
sudo docker image list
sudo docker rmi "idofdockerimage"

Repeated the removal of docker images until all docker images were deleted.
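(The repeated removal can probably be collapsed into one line; an untested sketch, where -q makes docker print only the image IDs:)

# remove all images in one pass
sudo docker rmi $(sudo docker image list -q)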

I then went into the Azure IoT Edge deployments and changed the priority of the desired deployment containing the Azure Streaming Analytics module, so that when the iotedge service restarts it downloads the correct containers and the edgeAgent and edgeHub runtimes.

Then

sudo systemctl start iotedge

Then, in the Azure portal, I went to the IoT Hub, then IoT Edge devices, clicked on this device, and clicked on "Troubleshoot". This has become a very valuable addition to the portal, and I appreciate it. I was able to see the logs of each module.

After an hour all is calm.

However, the WSSD Agent CPU utilization is still at 26%, so the upcoming release that reduces it will be appreciated.

@ms-vincent

@dmaxwe22 What version of EFLOW are you using?

@dmaxwe22
Author

@ms-vincent can you tell me how to find the EFLOW version? I downloaded it around the 15th of July. Why do you ask?
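(One way that might work to check the installed EFLOW version, assuming EFLOW registers as an MSI product; this is a guess on my part, not confirmed:)

# query installed products for the EFLOW entry and print its version
Get-WmiObject -Class Win32_Product | Where-Object { $_.Name -like "*IoT Edge*" } | Select-Object Name, Version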

@dmaxwe22
Author

PS C:\WINDOWS\system32> Get-EflowVM | Format-List

VmConfiguration : @{ID=1b07306e9a1fc5c; name=DESKTOP-G5K33IS-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.3; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics : @{TotalMemMb=3747; UsedMemMb=847; AvailableMemMb=2749; TotalStorageMb=9900; UsedStorageMb=1441; AvailableStorageMb=8041; CpuCount=4;
                   KernelVersion=5.10.37.1-1.cm1 #1 SMP Fri Jun 4 11:14:43 UTC 2021}

@fcabrera23
Contributor

Hi @dmaxwe22

We have released our latest EFLOW update, which fixes the high CPU usage by WSSD Agent. Thank you for the detailed information about the high memory usage; it's good to know that you were able to solve it by assigning more memory to the EFLOW VM. For more information on the update, see the 1.1.2107.0 Release Notes.

Thanks,
Francisco

@nenright

nenright commented Sep 7, 2021

Is there any other way to validate that the update has been deployed? I'm seeing high CPU even after installing 1.1.2107.0. There is no deployment set, and the only module running is edgeAgent. I just want to make sure I have the bits deployed correctly before opening a new issue:

Screen Shot 2021-09-07 at 11 31 49 AM
Screen Shot 2021-09-07 at 11 30 30 AM
Screen Shot 2021-09-07 at 11 28 02 AM

Not sure if it's related or not, but the edgeAgent logs look like this:

Screen Shot 2021-09-07 at 11 34 34 AM

@fcabrera23
Contributor

Hi @nenright,

Can you run Get-EflowVM | Format-List in PowerShell so that we can understand your VM configuration?

Thanks,
Francisco

@nenright

nenright commented Sep 9, 2021

Does the version upgrade require a machine reboot? After rebooting the host, things appear to be back to normal; however, I don't know if it's something that will creep back over time.

Get-EflowVM | Format-List


VmConfiguration    : @{ID=13dfb745fe82122; name=PSFD-NENRIGHT-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.4; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus  : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics   : @{TotalMemMb=787; UsedMemMb=278; AvailableMemMb=450; TotalStorageMb=9900; UsedStorageMb=652;
                     AvailableStorageMb=8831; CpuCount=1; KernelVersion=5.10.42.1-3.cm1 #1 SMP Mon Jun 28 13:00:04 UTC
                     2021}

@fcabrera23
Contributor

@nenright,

Thanks for the information. I see that you're using a 1 GB VM. Can you assign 2 GB to the VM using the Set-EflowVm command?
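(For example, reusing the sequence dmaxwe22 posted earlier in this thread, with 2048 MB:)

Stop-EflowVM
Set-EflowVM -memoryInMB 2048
Start-EflowVM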

Also, could you please share your Windows version? You can run winver directly in Windows to get the full version information.

Thanks,
Francisco

@nenright

nenright commented Sep 9, 2021

I'm no longer experiencing the behavior on my test device after the reboot; it is currently working as expected even with 1 GB. The test device is on 21H1, OS Build 19043.1165.

We did see this in the field on a device configured with 2 cores and 4 GB. After a reboot of that device, we think it resolved itself as well. I may have another device in the field in the high CPU/memory state, but I'm not 100% sure (I currently don't have a good way to monitor the VM host remotely). If this is of interest to you, I can try to coordinate with the client to get some downtime, remote in, and gather details.

If a reboot is needed after Edge is updated through Windows Update going forward, we'll just need to know that so we can coordinate.

@nenright

@fcabrera23, I believe my test machine is back in the high CPU/memory state. I haven't done anything with it since my previous post. It's provisioned, but no deployments are set; edgeAgent is the only running module. The memory and CPU load appear to get worse over time.

Windows Update indicates that KB5005565 was installed and there is a pending restart. I assume that if I restart, memory and CPU will return to normal. Let me know if there is anything I can collect for you while it is in this state.

Screen Shot 2021-09-20 at 8 23 41 AM

@fcabrera23
Contributor

Hi @nenright,

Can you try running Get-EflowVm | Format-List and share the output? We are seeing growth in the Docker log files that can lead to this situation.

Thanks,
Francisco

@nenright

Some more info... it looks like the antimalware scan was what was causing things to spike. After 20-ish minutes things calmed down, although every ~1 second the Hyper-V compute and wssdagent processes cycle through a ~3-second high-CPU period, so they go up and down. Maybe the metrics scrape (it's trying to scrape the metrics for edgeHub, but edgeHub isn't deployed)?

Here is the Get-EflowVm output:

PS C:\Users\nenright> Get-EflowVm | Format-List

VmConfiguration    : @{ID=13dfb745fe82122; name=PSFD-NENRIGHT-EFLOW; properties=; tags=}
EdgeRuntimeVersion : @{IotEdgeVersion=1.1.4; MobyEngineVersion=19.03.15+azure; MobyCliVersion=19.03.15+azure}
EdgeRuntimeStatus  : @{SystemCtlStatus=System.Object[]; ModuleList=System.Object[]}
SystemStatistics   : @{TotalMemMb=787; UsedMemMb=269; AvailableMemMb=453; TotalStorageMb=9900; UsedStorageMb=903;
                     AvailableStorageMb=8579; CpuCount=1; KernelVersion=5.10.42.1-3.cm1 #1 SMP Mon Jun 28 13:00:04 UTC 2021}

@fcabrera23
Contributor

Hi @nenright,

After more investigation, we didn't see any modules generating excessive journal logs, so we revisited how journald is designed. It turns out that journald's memory usage is bounded by the maximum size of the active log file (the one currently being written to).


This value correlates with the "SystemMaxFileSize" setting: the lower the value, the less working memory journald will take. By default, "SystemMaxFileSize" is 1/8 of the maximum total log size (1 GB in the EFLOW VM by default), so hypothetically journald will map at most 250 MB (125 MB for the active system journal + 125 MB for the active user journal) into working memory. For a resource-constrained device, this behavior may not be ideal.

Customers who do not want this behavior can modify "/etc/systemd/journald.conf" to add a "SystemMaxFileSize" entry, which indirectly caps journald's memory usage.

Just for testing purposes, one can modify the file and then do the following without a VM reboot (an example journald.conf snippet follows the steps):

  1. sudo systemctl daemon-reload
  2. sudo systemctl restart systemd-journald
  3. sudo journalctl --rotate
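For example (a sketch; the 64M value is illustrative, not a tested recommendation), the journald.conf entry would look like:

[Journal]
# cap each active journal file, which in turn caps journald's mapped memory
SystemMaxFileSize=64M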
