
triggerTimeouts performance issue #500

Closed
KrishnaPG opened this issue Sep 3, 2019 · 8 comments

Comments


@KrishnaPG KrishnaPG commented Sep 3, 2019

On an idle page, triggerTimeouts is causing CPU load and, more importantly, heap load. Below is a screenshot of memory constantly increasing on every tick. Over a period of minutes this slows the system down and runs it low on memory. I observed memory going up to 690 MB+ before GC finally kicked in and reduced it to 39 MB. Too costly for a simple heartbeat.

[screenshot: heap usage climbing on every tick]
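For readers unfamiliar with the pattern being discussed: the registry drives all pending timeouts from one shared interval rather than one native timer each. Below is a minimal, hypothetical sketch of such an interval-driven registry (names are illustrative, not deepstream's actual implementation); note that the scan runs every tick even when nothing is due, so any per-tick temporaries become garbage for the GC to reclaim.

```javascript
// Hypothetical sketch of an interval-driven timeout registry: one shared
// interval scans all registered timeouts every tick. Names are illustrative.
class IntervalTimeoutRegistry {
  constructor (resolutionMs) {
    this.timeouts = new Map()
    this.nextId = 0
    this.interval = setInterval(() => this.triggerTimeouts(), resolutionMs)
  }

  add (durationMs, callback) {
    const id = this.nextId++
    this.timeouts.set(id, { due: Date.now() + durationMs, callback })
    return id
  }

  remove (id) {
    this.timeouts.delete(id)
  }

  // Runs every tick even when nothing is due; any per-tick allocation
  // (iterators, temporary entries) becomes short-lived garbage.
  triggerTimeouts () {
    const now = Date.now()
    for (const [id, entry] of this.timeouts) {
      if (entry.due <= now) {
        this.timeouts.delete(id)
        entry.callback()
      }
    }
  }

  destroy () {
    clearInterval(this.interval)
  }
}
```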

@yasserf yasserf commented Sep 3, 2019

Amazing catch, thank you! Interesting — the timeout registry was created to avoid the overhead of calling setTimeout individually. Will look into it ASAP!

@yasserf yasserf commented Sep 4, 2019

Okay, I've been running an open connection for ten minutes now and there hasn't been any more than 152 bytes of memory used.

This is just running a vanilla server and client where client logs in.

Can you please update with what I'm missing?

@yasserf yasserf commented Sep 4, 2019

Update: even running with a timer resolution of 0 (which triggers in a loop), there seems to be zero memory growth, and the CPU is idle other than the function being called every tick.
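A rough sketch of how heap growth around an idle connection could be measured in Node.js (helper names are my own, not from the issue). Forcing GC before sampling (via `node --expose-gc`) gives more stable readings, since otherwise short-lived garbage inflates the numbers.

```javascript
// Sample heap usage before and after a workload to estimate retained growth.
// Only process.memoryUsage() is used; global.gc exists only with --expose-gc.
function sampleHeap (label) {
  if (global.gc) global.gc() // stabilize the reading when GC is exposed
  return { label, heapUsed: process.memoryUsage().heapUsed }
}

function heapDelta (before, after) {
  return after.heapUsed - before.heapUsed
}
```

Usage: take a sample, let the connection idle for a few minutes, take another, and compare the delta — a steadily growing delta across forced GCs indicates retained memory rather than mere GC laziness.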

@KrishnaPG KrishnaPG commented Sep 4, 2019

Thank you. Yes, if the GC gets triggered, then the memory is reclaimed, so we will not see much difference. Today when I re-ran the page, I saw the memory staying constant; the call logs confirmed the GC was triggered multiple times. Seems like today the GC was in a good mood all over the world — but as you know, it is not guaranteed to run predictably, and yesterday was one of those days where the page ended up piling up hundreds of MBs before GC kicked in.

[screenshot: memory staying constant, with multiple GC runs visible in the call log]


And when you profile, you can see that the timer function is consuming 90% of it.

[screenshot: profiler showing the timer function at ~90% of CPU time]

@yasserf yasserf commented Sep 4, 2019

> Today when I re-ran the page, I saw the memory staying constant

What page is that exactly? Are you just doing

const client = deepstream(url)
client.login()

I find it really hard to believe a timeout every second or so creates 600 MB of garbage after just a couple of minutes. I have worked on applications running thousands of functions every second via immutable object creation, with nowhere near that. Also, a leak growing from 160 bytes to 600 MB seems like quite a drastic mood swing for a garbage collector.

I'm going to park this issue until I run into it myself. If you manage to get your GC to go crazy again, a memory heap dump and a saved performance profile sent over would be brilliant.

@KrishnaPG KrishnaPG commented Sep 4, 2019

Browsers are not known for their memory conservation — especially Chrome. Develop an SPA designed to run for longer than a day and you will hit these kinds of scenarios.

It is not about leaking a byte vs 100 bytes or MBs. It is the fact that there is a problem in the code. Along with the deepstream client, we have other clients (e.g. MQTT) and none of them has this problem: 90% of profile time occupied by the TimerRegistry versus a mere 3% for a similar MQTT heartbeat.

Timers have well designed native implementations. It is not advisable to reinvent the wheel.

My suggestion is to fix the problem, since the cause is well known, rather than wait for memory dumps.

I am closing this. You are welcome to ignore it.

Sorry for saying this, but we are seeing a trend in deepstream of late (especially from 4.0 onwards) where too many localized design choices are being made without focus on the overall big picture (e.g. reinventing the logging mechanism, modifying read-only config objects, and now these customized timers that add no benefit). Each of these breaks well-established software engineering principles, and if the trend continues we may soon have to ask whether this is the right kind of software library/package to build on top of. Thanks for your time, though.

@KrishnaPG KrishnaPG closed this Sep 4, 2019
@yasserf yasserf commented Sep 4, 2019

Sorry you feel that way. For reference for those who stumble on this issue:

  1. The timer registry can be swapped out for the native one if needed; V4 is a modular service. It was implemented because setting a timeout individually is actually extremely expensive when creating thousands of records in a loop.

  2. Some extremely popular Node.js logger projects implement a should-log check. And even if they didn't, I'm okay with changing the norm to save seconds of useless string generation. For example, the deepstream logger just happened to map directly onto pino: https://deepstream.io/tutorials/custom-plugins/logger/

  3. While read-only objects are correct, it was never actually a feature we provided.
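The should-log idea in point 2 can be sketched as follows — a minimal, hypothetical example (names are illustrative, not deepstream's or pino's API) where the level is checked before the message string is ever built, so disabled levels cost almost nothing:

```javascript
// Sketch of a should-log guard: check the level before paying for message
// construction. Level names/values are illustrative.
const LEVELS = { debug: 0, info: 1, warn: 2, error: 3 }

class Logger {
  constructor (minLevel) {
    this.minLevel = LEVELS[minLevel]
  }

  shouldLog (level) {
    return LEVELS[level] >= this.minLevel
  }

  log (level, buildMessage) {
    // buildMessage is a function, so the (possibly expensive) string is
    // only generated when the message will actually be emitted
    if (this.shouldLog(level)) {
      console.log(`[${level}] ${buildMessage()}`)
    }
  }
}
```

Usage: `logger.log('debug', () => JSON.stringify(bigState))` — at `warn` level the stringify never runs.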

I'm sorry you don't feel that the quality of this project was kept up, especially since all the design decisions made have been via weeks of optimisations sponsored by actual users, with real comparisons.

Best of luck with whatever you pick, and you're welcome!


@yasserf yasserf commented Sep 5, 2019

So I spent some time today morning and yesterday trying to:

a) simulate the memory leak
b) recreate the results from the interval system that led to it being created in the first place

Unfortunately I wasn't able to do either. It seems that now we use true bulk subscriptions (one timeout per bulk subscription), it's hard to expose the benefits; and when running thousands of timeouts in a loop via the registry directly, both approaches fare equally terribly since the Node event loop gets clogged.

I'm not sure where your last comment went (I got the notification via email), but you are right in the sense that I didn't try this out with mobile energy usage in mind. Generally I am convinced by my results this morning that you are correct: using the native approach by default makes more sense. I just released v4.1, which uses the native setTimeout API by default and can be switched between the two. For users I find who actually benefit from the interval approach, I'll be switching it on (possibly Node.js).

There are some more improvements that can still be done in terms of removing context binding in the native implementation (which is responsible for most of the GC pressure), but I'll leave that for another time.
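To illustrate the context-binding cost mentioned above (this is my own sketch of the pattern, not the actual deepstream code): each `.bind()` call allocates a fresh bound-function object that becomes garbage once the timeout fires, whereas `setTimeout` can pass the context through as trailing arguments instead.

```javascript
// Allocates a new bound function per scheduled timeout:
function scheduleWithBind (registry, entry) {
  return setTimeout(registry.onTimeout.bind(registry, entry), entry.duration)
}

// setTimeout(callback, delay, ...args) forwards args to the callback,
// so one shared static function suffices and nothing extra is allocated:
function scheduleWithArgs (registry, entry) {
  return setTimeout(onTimeoutStatic, entry.duration, registry, entry)
}

function onTimeoutStatic (registry, entry) {
  registry.onTimeout(entry)
}
```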

One of the other intended uses of the central interval registry (never actually implemented) was to be time-slippage aware: if a timer hasn't been triggered for a while, the client would be smart enough to know it's because of a heavy event loop, rather than spewing ack errors all over the place — which is what most people run into when they request a huge load of data upfront. But again, new V4 features have rendered these V3 mitigations mostly obsolete, and arguably clients should be more careful about how much load they put up during load time.
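The slippage-awareness idea could look roughly like this — a hypothetical sketch (names and threshold are mine): compare the actual fire time against the scheduled one, and if the timer fired far too late, treat it as event-loop congestion instead of raising an ack error.

```javascript
// Sketch of a slippage-aware timeout: if the timer fires much later than
// scheduled, the event loop was likely blocked, so report slippage rather
// than a genuine remote timeout. Tolerance value is illustrative.
function slippageAwareTimeout (durationMs, onTimeout, onSlippage, toleranceMs = 100) {
  const expected = Date.now() + durationMs
  return setTimeout(() => {
    const slippage = Date.now() - expected
    if (slippage > toleranceMs) {
      onSlippage(slippage) // clogged event loop; don't raise an ack error
    } else {
      onTimeout()
    }
  }, durationMs)
}
```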

Edit:

I'll make sure to upload the associated performance logs here when I run into the issue again — or possibly post them on a blog somewhere and link it.
