Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ClientClusterManifestProvider has serious performance problem,When there are multiple Silos and clients, it will cause network IO and CPU to increase by a multiple of the number of clients. #8722

Closed
HermesNew opened this issue Nov 13, 2023 · 14 comments · Fixed by #8728

Comments

@HermesNew
Copy link
Contributor

For example:
The cluster hava 3silos,3clients.

After the cluster is started, the CPU and network IO will rapidly increase, especially network IO, where the amount of data transmitted per second to ClusterManifest reaches 80MB.CPU is burdened by the network and also increases with it.
1

After analysis, we have identified the problem.
2

@ghost ghost added the Needs: triage 🔍 label Nov 13, 2023
@HermesNew
Copy link
Contributor Author

Equivalent to adding a 500ms dead loop, continuously executing GetClusterManifest(), resulting in a large amount of invalid network IO.

@HermesNew
Copy link
Contributor Author

A dictionary of Manifest should be maintained to check for any version changes, and only perform version changes if there are any. At the same time, it should be the same version of Silo for comparison, rather than the versions of Silo polled in the cluster. Otherwise, the more Silos there are, the higher the network overhead and CPU usage.

@HermesNew
Copy link
Contributor Author

This is our modified code:

	private ConcurrentDictionary<SiloAddress, MajorMinorVersion> siloAddressVersionMap = new ConcurrentDictionary<SiloAddress, MajorMinorVersion>();
	private async Task RunAsync()
        {
            try
            {
                var grainFactory = _services.GetRequiredService<IInternalGrainFactory>();
                var cancellationTask = _cancellation.Token.WhenCancelled();
                while (!_cancellation.IsCancellationRequested)
                {
                    var gateway = _gatewayManager.GetLiveGateway();
                    try
                    {
                        var provider = grainFactory.GetGrain<IClusterManifestSystemTarget>(SystemTargetGrainId.Create(Constants.ManifestProviderType, gateway).GrainId);

                        MajorMinorVersion version = _current.Version;
                        if(siloAddressVersionMap.ContainsKey(gateway))
                        {
                            version = siloAddressVersionMap[gateway];
                        }
                        var refreshTask = provider.GetClusterManifest(version).AsTask();
                        var task = await Task.WhenAny(cancellationTask, refreshTask).ConfigureAwait(false);

                        if (ReferenceEquals(task, cancellationTask))
                        {
                            return;
                        }
                        
                        var resultClusterManifestResult = await refreshTask;
                        if (resultClusterManifestResult.IsVersionChange)
                        {
                            siloAddressVersionMap[gateway] = resultClusterManifestResult.ClusterManifest.Version;
                        }

                        if (!resultClusterManifestResult.IsVersionChange)
                        {
                            await Task.WhenAny(cancellationTask, Task.Delay(_typeManagementOptions.TypeMapRefreshInterval));
                            continue;
                        }
                        

                        if (!_updates.TryPublish(resultClusterManifestResult.ClusterManifest))
                        {
                            await Task.Delay(StandardExtensions.Min(_typeManagementOptions.TypeMapRefreshInterval, TimeSpan.FromMilliseconds(500)));
                            continue;
                        }

                        _initialized.TrySetResult(true);

                        if (_logger.IsEnabled(LogLevel.Debug))
                        {
                            _logger.LogDebug("Refreshed cluster manifest");
                        }

                        await Task.WhenAny(cancellationTask, Task.Delay(_typeManagementOptions.TypeMapRefreshInterval));
                    }
                    catch (Exception exception)
                    {
                        _logger.LogWarning(exception, "Error trying to get cluster manifest from gateway {Gateway}", gateway);
                        await Task.Delay(StandardExtensions.Min(_typeManagementOptions.TypeMapRefreshInterval, TimeSpan.FromSeconds(5)));
                    }
                }
            }
            finally
            {
                _initialized.TrySetResult(false);

                if (_logger.IsEnabled(LogLevel.Debug))
                {
                    _logger.LogDebug("Stopped refreshing cluster manifest");
                }
            }
        }

@HermesNew
Copy link
Contributor Author

@ReubenBond We have submitted the PR, please review and merge it into the 7. x branch as soon as possible, and release a fixed version. Thank you.

@HermesNew
Copy link
Contributor Author

@benjaminpetit
After upgrading the project from 3. x to 7. x, the CPU has doubled. The reason for the growth is the feedback issue in this issue. After fixing it, the CPU is basically the same as before.

We have submitted the Optimized code, please review and merge it into the 7. x branch as soon as possible, and release a fixed version. Thank you.

@ReubenBond
Copy link
Member

Did this only start with v7.2.3, or did you see it with v7.2.2 and prior as well?

@HermesNew
Copy link
Contributor Author

@ReubenBond
7.2.3 This issue still exists. Compared to version 7.2.2, the memory leak issue in 7.2.3 has been improved. However, the issues with CPU and network IO remain.

@HermesNew
Copy link
Contributor Author

There are serious performance problem with CPU, memory, and network IO in v7.2.2.

@HermesNew
Copy link
Contributor Author

HermesNew commented Nov 17, 2023

1

This picture is from v7.2.3, and compared to v7.2.2, the memory growth is slightly slower.

@HermesNew
Copy link
Contributor Author

In short, all versions above 7.0 have this problem.

@ReubenBond
Copy link
Member

Thank you for your investigation and the PR. I am looking into this. I may push some changes to your PR branch before we merge it.

After fixing it, the CPU is basically the same as before.

In other words, your PR fixes your issues, greatly reducing CPU usage for your hosts?

@HermesNew
Copy link
Contributor Author

Yes, the CPU and memory have all returned to the level of the previous 3. x version. After repair, everything is normal as before.
We fixed and tested it on our own branch from v7.2.2. The difference between the submitted PR code and v7.2.3 may require you to check if there is a conflict with the v7.2.3 version code.

Thank you for your investigation and the PR. I am looking into this. I may push some changes to your PR branch before we merge it.

After fixing it, the CPU is basically the same as before.

In other words, your PR fixes your issues, greatly reducing CPU usage for your hosts?

@HermesNew
Copy link
Contributor Author

HermesNew commented Nov 18, 2023

After fixing this issue, the CPU has been reduced from 18% to 5%. In addition, the memory of the client has also been reduced by half.

In other words, your PR fixes your issues, greatly reducing CPU usage for your hosts?

@ReubenBond
Copy link
Member

Thanks, I pushed changes to the PR and will probably merge this if our tests pass. Thank you again for the fix and the investigation. I hope we can give you a new release soon

453873 added a commit to 453873/orleans that referenced this issue Nov 18, 2023
ReubenBond pushed a commit to 453873/orleans that referenced this issue Nov 30, 2023
ReubenBond pushed a commit to 453873/orleans that referenced this issue Nov 30, 2023
ReubenBond pushed a commit to 453873/orleans that referenced this issue Dec 13, 2023
ReubenBond pushed a commit to 453873/orleans that referenced this issue Dec 13, 2023
ReubenBond pushed a commit to ReubenBond/orleans that referenced this issue Dec 13, 2023
@ghost ghost added the Status: Fixed label Dec 15, 2023
@github-actions github-actions bot locked and limited conversation to collaborators Jan 15, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants