Deleting old completed invocations #74
Related to #4
Partially fixed in #103 - the workflow engine will no longer stay subscribed to channels of completed invocations. TODO
If it's not affecting users and not breaking workflows, let's move this to a later milestone.
One of the problems we hit frequently is this, and we've resolved it with workarounds that don't feel nice. Here is our feedback: after NATS is up and fission-workflows is under constant use, NATS starts to use more and more RAM (8-16 GB). To solve this, we first switched to the file store type of NATS, but then we realized it still keeps everything in memory. So we decided to use the memory store for some time, because the file store is impossible to clean up when there is an error and NATS restarts. Our workaround for keeping memory usage within some boundary is using Kubernetes memory limits, which restart NATS. However, this way we sometimes hit the 16 GB memory limit too fast. Then we realized the default values in the NATS config are not that good. We first made all limits unlimited, because we don't want to hit them (as explained in #4):
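The settings were along these lines (a sketch of a NATS Streaming `store_limits` block, where 0 means unlimited; exact values from our deployment are not shown here):

```
store_limits: {
    # 0 disables the limit entirely
    max_channels: 0
    max_msgs: 0
    max_bytes: 0
    max_subs: 0
}
```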
Then we also enabled deletion of inactive channels. Sometimes it worked and cleared the memory, and sometimes it did not (even after waiting more than 10 minutes, completed invocation channels were not cleared):
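The inactive-channel cleanup was roughly this (a sketch; `max_inactivity` tells NATS Streaming to delete a channel once it has had no subscriptions and no new messages for the given duration — the 10m value is what we picked, not a recommended default):

```
store_limits: {
    # delete channels with no subscriptions and no new messages
    # for this long; should reclaim completed invocation channels
    max_inactivity: "10m"
}
```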
So there may be some subscriber somewhere sending NATS a heartbeat or something that keeps a completed invocation's channel active and doesn't let the channel be deleted. Then we also enforced deletion of messages older than 10 minutes, because in our use case an invocation should fail or succeed within 10 minutes anyway.
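The age-based cleanup can be expressed in the same kind of stanza (sketch; `max_age` drops messages older than the given duration regardless of whether the channel is still considered active):

```
store_limits: {
    # drop messages older than 10 minutes even if the channel stays active
    max_age: "10m"
}
```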
I wish setting up max inactivity were enough to clear up old invocations, but somehow it is not. Maybe this is a bug, @erwinvaneyk, but I am not sure. At the moment we have 2 node pools. One has big-memory nodes and runs NATS, fission core, and everything else outside of Also, as mentioned in #4, we need a more production-ready NATS. We probably need the clustering (I am not sure if this is just replication of messages or distribution of them), partitioning (for distributing RAM usage), and persistence (for recovery from failure) features of NATS to make sure all these memory and failure problems are solved. In that case, it means maintaining NATS instances and making sure they run on bigger nodes with enough resources. I am sure not everyone wants to maintain a running message queue at scale; maybe adding cloud message queue (Google Pub/Sub) support is less trouble overall. Thanks for reading our long feedback 👍
Hi @thenamly - this is a great analysis. I am writing this reply while reading through it.
This sounds like a bug in NATS Streaming; you might want to file it there. As I mentioned above, you might want to try using the
It is an interesting issue, which you can attack in a couple of ways.
What are your thoughts on these options?
What I understand from their docs is: "memory/file/sql are simple interfaces. If you don't like them, you can add your own."
Performance tweaking in our case is mostly about garbage collection and distributing RAM usage. I think they did a fair job by providing all these options. What we could do beyond that is use the partitioning etc. options of NATS to go further, but I see a diminishing-returns problem the more time I invest in NATS. Also, maintaining and running NATS instances will probably cost something in total.
I think this is a complex solution to a simple problem that should be solved by more specialized services (message queues) outside of the workflow engine. It may be redundant to add this and maintain it over time with all the bugs it brings. The only benefit is probably decreased latency, but in our use case the latency NATS brings to the table is nothing at all. We could spend the same effort on "doing one thing well" instead of trying to implement a communication/message storage layer.
Google Pub/Sub and other cloud message queue support is probably the option that gets the most benefit with the least effort of the three. I am not sure about their channel/subscription creation latency either, but message delivery should be fast enough. What we can do before starting an implementation is benchmark specific operations (channel/subscription creation, message delivery from publisher to subscriber...) and compare them with NATS, for example. Overall, I don't expect it to be unacceptably bad, and we'd get rid of the headache of maintaining it, running it on big nodes to scale to higher numbers of invocations, etc. We could invest the same time to improve other bits of fission-workflows. We are probably OK with short-term workarounds until it's solved. However, I know there are also people out there running Fission on bare metal, so we still need to consider improving the default NATS deployment for them.
I checked some benchmarks of Google Pub/Sub from Spotify and others. Maybe it's not that good :(
Mmmh, I did not know that it was a per-channel limit. It's weird that they don't provide a max_bytes for the overall limit. It seems to me like a common configuration you would want to set for any NATS cluster — or do they have other features to protect against OOMs?
Agreed!
There might be reports or research papers that have done this benchmarking before.
Interesting! Then I think we should start thinking about option 3: a hybrid solution that does not depend solely on the message queue. There are a lot of options in that space.
We need a way to either manually or automatically clean up old workflow invocation data.