
Orleans gets slower over time #4505

Closed
SebastianStehle opened this Issue Apr 24, 2018 · 84 comments

SebastianStehle commented Apr 24, 2018

Hi,

it is a little bit vague, but I have the impression that Orleans gets slower over time.

My specs are:

  • .NET Core 2.0
  • Running on Kubernetes in GCloud, 2 Nodes/Silos
  • Co-Hosted with ASP.NET Core

I have a very simple grain, that just returns the snapshot (my own state system):

https://github.com/Squidex/squidex/blob/master/src/Squidex.Domain.Apps.Entities/Schemas/SchemaGrain.cs#L305
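For context, the method in question is essentially a read-only accessor of roughly this shape (a minimal sketch with hypothetical names, not the actual SchemaGrain code):

```csharp
using System.Threading.Tasks;
using Orleans;

public interface ISchemaGrain : IGrainWithStringKey
{
    Task<SchemaSnapshot> GetSnapshotAsync();
}

public class SchemaSnapshot
{
    public string Name { get; set; }
    public int Version { get; set; }
}

public class SchemaGrain : Grain, ISchemaGrain
{
    // In-memory state only (populated elsewhere by the grain's own state system).
    private readonly SchemaSnapshot snapshot = new SchemaSnapshot();

    // No I/O and no awaits: the grain-side work itself is effectively free.
    public Task<SchemaSnapshot> GetSnapshotAsync()
    {
        return Task.FromResult(snapshot);
    }
}
```

Because nothing in the method body blocks or does I/O, any growth in latency has to come from somewhere else in the call pipeline.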

I use my own serializer with JSON.NET, and my benchmarks show that the serialization usually takes 2-3 ms. I also tested it with 100x larger states than expected and the performance is great.

I added some profiling on the client side and observed that the call takes 4-5 ms right after deployment, but several days later it takes up to 200 ms or more (the maximum I have seen is 10 seconds).
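The client-side profiling described here amounts to timing the call end to end, along these lines (a sketch, assuming an already-connected IClusterClient and the hypothetical interface above):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;
using Orleans;

public static class GrainCallTimer
{
    // Measures end-to-end latency of one grain call as seen by the client.
    // This includes client-side serialization, the network hop, and silo-side
    // queueing, which is why it can grow even when the grain body stays fast.
    public static async Task<TimeSpan> MeasureAsync(IClusterClient client, string schemaId)
    {
        var grain = client.GetGrain<ISchemaGrain>(schemaId);

        var stopwatch = Stopwatch.StartNew();
        await grain.GetSnapshotAsync();
        stopwatch.Stop();

        return stopwatch.Elapsed;
    }
}
```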

I also checked my MongoDB logs where I save all slow queries and there is nothing related.

I am a little bit lost.

@xiazen

This comment has been minimized.

Copy link
Contributor

xiazen commented Apr 24, 2018

That seems odd. Do you have a memory dump from the server? If the logs cannot give us any good insight, maybe a memory dump will, because this looks like something slowly degrading in the server.

Or do you monitor the memory and CPU use on the server? That might be an easier starting point than a memory dump.

xiazen self-assigned this Apr 24, 2018

SebastianStehle commented Apr 24, 2018

Sorry, it is a Kubernetes cluster and I have no idea how to get a memory dump from the process. I do monitor memory and CPU usage, but there is nothing special there.

SebastianStehle commented Apr 24, 2018

There was also a short discussion on gitter and @DarkCow said:

I have noticed increasing timeouts on orleans-2.0.0. I have deployed orleans-2.0.0-ci.20180417.4, which includes the replacement threading model by @dVakulen, and that seems to solve the problem.

As far as I can tell, what I am seeing is that a thread will deadlock/stall, and it is hard to find which Orleans function is stalling. But the new threading model seems to be a work-stealing queue. #3792
Yes, it has been in production since 4/19. This is the first time I am seeing other people having performance problems after hours of runtime...

Yeah, I went through the pipeline; everything in Orleans is really fast, but there is something else stalling the execution. Are you seeing warning logs like Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread followed by Response did not arrive on time in 00:00:30 for message?

I don't see these warning messages for the first 12+ hours of runtime, then they start cropping up. orleans-2.0.0-ci.20180417.4 appears to have solved it and/or masked it...

DarkCow commented Apr 24, 2018

I am seeing a similar issue. I have a 17-VM cluster running on Windows Server 2012 R2 and .NET Core 2.0. I am using the standard code-gen serializer, ADO.NET for clustering/storage/reminders, and SimpleMessageStreaming for streams.

The issue
After 12+ hours of runtime, Orleans begins to have problems with grain calls stalling. These grain calls are fine T<12 hours, but stall T>12 hours. Some calls are fast, but they are interleaved with slow grain calls.

  • Task ... took elapsed time 0:00:00.4702115 for execution, which is longer than 00:00:00.2000000. Running on thread System.Threading.Thread
  • Response did not arrive on time in 00:00:30 for message ...

Things I've investigated

  • Grain calls are 100% async/await; there are no blocking I/O calls.
  • Traced grain calls to specific grain code. There is no correlation between timeout warnings and code complexity; some calls are simply return Task.FromResult(State);
  • Checked RAM and CPU usage; there are no leaks or CPU spikes.
  • A timeout warning does not correlate with 100% CPU usage.
  • Investigated whether my code is causing the stall; as far as I can tell there is no correlation between a stall and the grain call. It feels like work is queued to run, but the TaskScheduler/thread is blocked.
  • Ran through the Orleans calls on the stack; everything runs/returns very fast.

Things I've tried

  • Increased core/RAM provisions for the VMs.
    • Didn't help, because I am at <10% CPU and <50% RAM.
  • Changed to .NET Framework 4.7.1.
    • This helps, so does this mean that .NET Core handles threads/locks differently?
    • If Framework handles a blocked/stalled thread differently, that might explain it.
  • Upgraded to orleans-2.0.0-ci.20180417.4, which is post #3792.
    • This seems to have fixed/masked the problem. The insight I gathered is that it uses a work-stealing pool.
    • Work stealing leads me to think the blocked/stalled thread is handled better.

stephenlautier commented Apr 26, 2018

We get similar issues to what @DarkCow explained, especially timeouts which don't get resolved within 30 seconds. One major thing we have observed is that when warnings of Queue wait time of xxx start adding up, things start slowing down and timing out.

Our setup

  • AWS EC2 instances (8x)
  • 3 clusters, one for each environment (dev, staging, prod)
  • DynamoDB for persistence state
  • Persistent streams are on DynamoDB as well (we are changing this to SMS on the next deploy to test it out)
  • We have 4 apps using Orleans
  • 4-8 silo instances (not all apps are the same)
  • Currently using Docker Swarm (Kubernetes soon)
  • The Docker base image for the silo is microsoft/dotnet:2.0-runtime, which is Linux
  • Running on .NET Core 2
  • Orleans version 2.0.0-rc2

Observations/info

  • Our CPU usage is quite low (<15%)
  • Our network in/out seems a bit high (most of the time one particular node is ~2.5x higher, and two seem to have no load at all)
  • We have 4 apps, and most of them get the same issue (especially the 2 of them which have higher load)
  • All environments get the same problems; fortunately prod seems to be more stable (its instance types are one level bigger)
  • We don't have too many consumers yet (so requests aren't high)
  • We do need to reboot at least once weekly, especially for 2 specific apps (on prod)
  • We have alerts for the Queue wait time of warning log, and when that gets triggered we generally need to reboot, or things keep slowing down
  • Timeouts don't seem to happen only in our grains but also on PubSubRendezvousGrain

PubSubRendezvousGrain - Timeout exception log (screenshot)

PubSubRendezvousGrain - Queue wait time warning log (screenshot)

Machine network monitoring (screenshot)

NoMercy82 commented Apr 26, 2018

Just to bring this in here as suggested in gitter, we have been experiencing these timeouts for a while now:
#4060

To recap:

  • 8 VMs
  • 1 silo/1 client on each
  • The client pulls from MSMQ, hardens to file, pushes to a grain in the silo, awaits processing, then deletes the file
  • Clients randomly 'stop' being able to talk to silos (anywhere from 1-10 days after startup) and throw Response did not arrive on time errors
  • Clients don't all stop at the same time, but when the first one stops, all others usually stop within 12 hours
  • This sometimes happens sporadically from within Orleans itself (silo to silo); we can see this when running as a console app

One of our infrastructure guys mentioned it may be network related, in that something isn't being cleaned up/released properly. Whether that's ports or something else, I'm not sure.

SebastianStehle commented Apr 27, 2018

Is there something we can do to provide you with more data?

sergeybykov commented Apr 30, 2018

I'm catching up with this thread.

Just to make sure: what level of logging are you running Orleans at? We assume that it's normally Info, and that Debug is only used during development/testing and for investigations.

I can tell that all our load tests currently run on full .NET Framework, not on .NET Core. We are going to get to running on .NET Core, but it'll take us some time. That's one variable I see that seems to be consistently different from our test environment.

If you could share memory dumps from silos in the "slow" state, that might provide additional insight.

SebastianStehle commented Apr 30, 2018

Until a few days ago I was using Info level, but then I changed it to Warning, because Info does not provide that much helpful information and generates a lot of noise.
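For anyone following along, switching the minimum level to Warning in Orleans 2.0 is typically done through the standard logging builder; a minimal sketch (the exact hosting wiring will differ per application):

```csharp
using Microsoft.Extensions.Logging;
using Orleans.Hosting;

public static class SiloLoggingExtensions
{
    // Sketch: raise the minimum log level to Warning, which still surfaces the
    // "Queue wait time" and "Response did not arrive on time" messages while
    // dropping the noisy Info output.
    public static ISiloHostBuilder UseWarningLogging(this ISiloHostBuilder builder)
    {
        return builder.ConfigureLogging(logging =>
        {
            logging.AddConsole();
            logging.SetMinimumLevel(LogLevel.Warning);
        });
    }
}
```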

ReubenBond assigned ReubenBond and unassigned xiazen Apr 30, 2018

ReubenBond added this to the Triage milestone Apr 30, 2018

ReubenBond commented Apr 30, 2018

@SebastianStehle @DarkCow @stephenlautier do you have log files which you're able to share with us?
Above you mentioned that 12h is the point at which this happens, @DarkCow - is it suspiciously close to the 12 hour mark or is that a rough estimate?

A memory dump from a slow/stalled process may also prove useful.

DarkCow commented May 1, 2018

I have now been running orleans-2.0.0-ci.20180417.4, which is post #3792, for ~2 weeks, and the problem is solved or masked. But I will revert the staging environment back to the worst-case scenario, which was Orleans 2.0.0 and .NET Core 2.0. I will be back with data, hopefully by Thursday.

@ReubenBond There is no correlation with a specific time; it just averages out to having to restart my system every 12 hours.

@sergeybykov I have noticed that the problem is solved/masked on .NET Framework compared to .NET Core.

sergeybykov commented May 1, 2018

We'll prioritize load testing on .NET Core higher to try to repro this in house. In the meantime, any relevant logs and memory dumps would be very helpful.

AClumsy commented May 2, 2018

The longer the running time, the higher this value becomes (see the screenshot below). When it reaches 30 or so, this problem occurs. Initially it is only about 0.3.
(screenshot)

SebastianStehle commented May 2, 2018

I experienced the same; in particular, methods like GetTotalActivationCount() are extremely slow.
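For reference, GetTotalActivationCount() lives on the management grain, so the call path looks roughly like this (a sketch; IManagementGrain is the Orleans interface, the wrapper is made up for illustration):

```csharp
using System.Threading.Tasks;
using Orleans;
using Orleans.Runtime;

public static class ClusterStats
{
    // Asks the management grain (a system grain hosted by the silos) for the
    // total activation count. When silo work queues back up, even this trivial
    // call can take hundreds of milliseconds, matching the symptom above.
    public static Task<int> GetTotalActivationCountAsync(IClusterClient client)
    {
        var management = client.GetGrain<IManagementGrain>(0);
        return management.GetTotalActivationCount();
    }
}
```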

richorama commented May 2, 2018

I wonder if this is dashboard related? Are people experiencing this without the dashboard enabled?

stephenlautier commented May 2, 2018

@richorama we don't have the dashboard installed, so it's not related to that.

AClumsy commented May 3, 2018

@sergeybykov @ReubenBond How is the internal testing of this problem going? Our system is live and the situation is urgent. Is there a plan to solve this problem?

ReubenBond commented May 3, 2018

@aqa510415008 if the situation is urgent, would you consider trying to use a nightly build from https://dotnet.myget.org/gallery/orleans-ci? @DarkCow has reported that the issue does not appear to affect the nightly builds.

If you are able to help by providing logs or a memory dump then that may help us resolve this sooner.

richorama commented May 3, 2018

I ran the TestHost project in the OrleansDashboard project overnight. It's an Orleans 2.0.0, 3-silo local cluster with memory providers, running on an i7 quad-core with 16 GB RAM.

At the start of the run you can see calls to the management grain take <1ms.

After ~11 hours, the management grain is averaging 24ms per request.

It's interesting to note that other grains are not showing signs of running slowly.

Please ignore TestGrain; it has a Task.Sleep to slow it down.

stephenlautier commented May 3, 2018

If someone can guide us on how to get the memory dump that you'd like to have, we can help with that.

Again, we are using Docker (Linux image).

ReubenBond commented May 5, 2018

Unsure how to do a memory dump on Linux, but you ought to be able to execute a command in a running Docker container to dump the dotnet process, which may require installing something first. AFAIK the dotnet images run on Alpine Linux, but I'm not familiar with that.

Status update: I was unable to reproduce this using the dashboard testhost app running a single silo. I'm trying again with a 2-silo app without the dashboard + a simple load generator. I'm running both a .NET Core and .NET 4.6.2 build to verify that this doesn't repro against Framework. I'm also taking memory dumps and logging statistics. Hopefully we can at least repro this tonight. I have one memory dump from @DarkCow which I've been taking a look at, but so far I don't see anything too suspicious: the threads seem to be functioning. I'll keep digging.

ReubenBond commented May 6, 2018

24 hours in using the new test app, I still see no change in latency in either .NET Core or .NET Framework. If there is some random event which causes this issue, then it hasn't been encountered on my host yet. More likely, there is some difference in environments (maybe providers, including telemetry, etc) which explains why I have not been able to reproduce the issue yet. Any thoughts on what those differences might be?

For now, the work-around would be to run the latest nightly, which apparently does not exhibit this problem.

The test app in its entirety is here: https://gist.github.com/ReubenBond/20aad47ce739c5d2a07f12a1a3d8c168 - maybe someone can spot what's missing. The fact that the silos/client are hosted in the same AppDomain might be an issue, but it seems unlikely (still communicating using TCP, but over localhost)

AClumsy commented May 7, 2018

@ReubenBond How can I give you a memory dump file?

guysegal commented Jun 27, 2018

@sergeybykov when will the nightly build be published as a proper NuGet package? I hope I understand this right.

SebastianStehle commented Jun 30, 2018

@jballin82 Can you make the example public?

ReubenBond commented Jul 2, 2018

Update time.

I'm looking at this today. Apologies for the delay, I had issues running the repro from @jballin82 and had no Internet access - fixed now (bought a cellular modem while I wait for Telstra to get their act together.)

The repro uses ASP.NET and I thought that might be somehow important. I see that ASP.NET does create a whole bunch of threads and seems to slow things down on .NET Core significantly... However, ADO.NET is not required to reproduce this and removing the ADO.NET storage provider (swap for in-mem) and running @jballin82's example repo in Release mode allows me to reproduce this more quickly - within about 2 minutes, consistently.

I have not root-caused this yet. I see a very strong correlation between the failure and a (caught) InvalidOperationException from BlockingCollection<T>.TryTakeFromAny(...) (called from WorkQueue.Get()) or BlockingCollection<T>.TryTake() (called from WorkQueue.GetSystem()).

The 'interesting' thing is that the InvalidOperationException has the message "The underlying collection was modified from outside of the BlockingCollection<T>", but the underlying collection is ConcurrentBag<T> and it's constructed directly in the constructor call to the BlockingCollection<T>, i.e: new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>()). In other words, we don't have any direct access to the underlying collection.

So I'm suspicious of changes to .NET Core's ConcurrentBag<T> implementation. I know a lot of performance work went into optimizing it for .NET Core. Of course, I haven't ruled out either that we're holding it wrong or that this is a red herring and the issue is in some totally separate location.
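For readers following along, the pattern under suspicion is roughly the following (a simplified sketch, not the actual Orleans WorkQueue source):

```csharp
using System.Collections.Concurrent;

public interface IWorkItem { }

// Simplified sketch of the pattern described above: a ConcurrentBag<T> wrapped
// in a BlockingCollection<T>, with all access going through the
// BlockingCollection API, never through the bag itself.
public class WorkQueueSketch
{
    private readonly BlockingCollection<IWorkItem> mainQueue =
        new BlockingCollection<IWorkItem>(new ConcurrentBag<IWorkItem>());

    public void Add(IWorkItem item) => mainQueue.Add(item);

    // On .NET Core, the (caught) InvalidOperationException "The underlying
    // collection was modified from outside of the BlockingCollection<T>" was
    // observed on calls like this under contention, even though the bag is
    // never touched directly.
    public IWorkItem Get(int millisecondsTimeout)
    {
        return mainQueue.TryTake(out IWorkItem item, millisecondsTimeout) ? item : null;
    }
}
```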

@jballin82 if you're happy for me to share a modified version of your repro, let me know.

EDIT: s/ConcurrentQueue/ConcurrentBag

xref dotnet/corefx#30781 < I opened an issue in CoreFx for the BlockingCollection behavior.

jballin82 commented Jul 2, 2018

Please do, @ReubenBond

ReubenBond commented Jul 2, 2018

https://blogs.msdn.microsoft.com/dotnet/2017/06/07/performance-improvements-in-net-core/

both ConcurrentQueue and ConcurrentBag were essentially completely rewritten for .NET Core 2.0, in PRs dotnet/corefx #14254 and dotnet/corefx #14126, respectively

To be clear, though, I have not confirmed that this is the root cause - it's just strongly correlated. I will continue with the investigation tomorrow.

Repro: StallingExample.zip

Note that different silo projects run on different package versions and you might need to delete one or more to make it build.

ReubenBond commented Jul 3, 2018

I can confirm that replacing the underlying collection in BlockingCollection<T> with ConcurrentQueue<T> rectifies the issue in the repro. With that, I consider that corefx bug to be the root cause of this issue. I can also confirm that this bug causes items which were added to the queue to become lost from the perspective of TryTake, as demonstrated in this updated gist:
(screenshot)

The question now is: do we release 2.0.4 with this fix or do we aim for 2.1?

I do not expect much impact (outside of fixing this bug) when switching from ConcurrentBag<T> to ConcurrentQueue<T>, but there may be subtleties which are not yet apparent to me.

guysegal commented Jul 3, 2018

It would be great to have it in 2.0.4.

ifle commented Jul 3, 2018

Maybe it is better to copy the previous implementation of BlockingCollection into the Orleans source and use it for as long as the bug exists in .NET Core.

ReubenBond commented Jul 3, 2018

I've opened #4736 with a workaround. It uses a runtime check so that we can preserve the current behavior when running on .NET Framework.
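Conceptually, the workaround amounts to choosing the backing collection at runtime, something like this sketch (illustrative only, not the code in #4736):

```csharp
using System.Collections.Concurrent;
using System.Runtime.InteropServices;

public static class WorkQueueBacking
{
    // Sketch of the idea: keep ConcurrentBag<T> on .NET Framework (preserving
    // the original behavior) and fall back to ConcurrentQueue<T> on .NET Core,
    // where the bag-backed BlockingCollection misbehaved.
    public static IProducerConsumerCollection<T> Create<T>()
    {
        bool isFullFramework = RuntimeInformation.FrameworkDescription
            .StartsWith(".NET Framework");

        return isFullFramework
            ? (IProducerConsumerCollection<T>)new ConcurrentBag<T>()
            : new ConcurrentQueue<T>();
    }
}

// Usage: new BlockingCollection<IWorkItem>(WorkQueueBacking.Create<IWorkItem>());
```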

ReubenBond modified the milestones: 2.1.0, 2.0.4 Jul 3, 2018

ReubenBond commented Jul 4, 2018

Another update: I needed to make a few changes to the build scripts so that we can create the right 2.0.4 packages. I'll open a PR for those changes tomorrow and try to start the release process so that we can attempt to release this week or early next week.

ilyalukyanov commented Jul 13, 2018

Hey @ReubenBond, have you guys managed to decide on 2.0.4? Even a rough ETA would help. If more development needs to be done and you need hands, I'd be happy to help.

ReubenBond commented Jul 13, 2018

We are hoping to release it this week or early next. I misunderstood my teammates' schedules with the last message. Apologies for that.

pfrendo commented Jul 18, 2018

@ReubenBond any update on this? We really need this fix on production.

ReubenBond commented Jul 20, 2018

We have published v2.0.4 to nuget.org. It includes the workaround for the ConcurrentBag issue. Please try it and chime in to let us know if it resolves the issue for you.

A future .NET Core 2.1 servicing release will include the fix for ConcurrentBag, but currently the workaround is needed.

EDIT: apologies for the silence, @pfrendo

srollinet commented Jul 21, 2018

For my part, I can confirm that version 2.0.4 solves a huge performance issue I had; it was appearing in less than one minute under load.
Now everything works fine.
Thanks!

lwansbrough commented Jul 22, 2018

Ditto on the performance improvements. Our system was slowing down by about +5 seconds per day. Now it's running smoothly.

lwansbrough commented Jul 23, 2018

Is anyone experiencing memory pressure issues? We started seeing out of memory exceptions after this update (after ~2 days of uptime). Could be my code.

ReubenBond commented Aug 2, 2018

Closing this because the issue appears to have been successfully resolved by the workaround in 2.0.4 😁

Thank you, everyone, so much for your patience and assistance in tracking this down and getting it resolved!

@lwansbrough please feel free to open an issue regarding memory pressure issues

ReubenBond closed this Aug 2, 2018

pfrendo commented Aug 6, 2018

@ReubenBond I can confirm it's working well: 5-6+ days with the same deployment and it's still running fine. No memory leaks, and none of the exceptions that @lwansbrough mentioned.

ilyalukyanov commented Sep 13, 2018

Working for us too! Thanks a lot for fixing!
