vmcluster: rerouting enhancement #4922
Comments
@hagen1778 , thanks for the very detailed analysis! The following additional interesting details can be extracted from this analysis:

For example, if the cluster contains 10 vmstorage nodes with 4 CPU cores each and it handles 100M active time series, then the duration of the data ingestion slowdown when one…

The calculations above assume that every… The duration of the data ingestion slowdown can be reduced if the restarted…
@hagen1778 , thanks for the very detailed analysis!

During our usage, we have encountered similar challenging issues to those described above. Furthermore, we have noticed that the timeout duration for… When a… (see `app/vminsert/netstorage/netstorage.go`, lines 326 to 356 at f78d8b9) the error message: …
@aierui , this issue has been addressed in the pull request #4423, which will be included in the next release. This pull request reduces the network timeout for unavailable vmstorage nodes. In the meantime you can build vminsert from the latest source code.
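For illustration only (this is not the code from #4423; `storageNode`, its fields, and the timeout values are all hypothetical), the general idea of failing fast on nodes already known to be unavailable could look like this:

```go
package netstoragesketch

import (
	"net"
	"sync/atomic"
	"time"
)

// storageNode is a hypothetical stand-in for vminsert's per-vmstorage state.
type storageNode struct {
	addr   string
	broken atomic.Bool // set when the node recently failed to respond
}

// dialTimeout returns a much shorter timeout for nodes already known to be
// unavailable, so vminsert fails fast and reroutes instead of blocking for
// the full handshake timeout on every retry.
func (sn *storageNode) dialTimeout() time.Duration {
	if sn.broken.Load() {
		return time.Second
	}
	return 5 * time.Second
}

// dial connects to the vmstorage node and updates its health flag.
func (sn *storageNode) dial() (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", sn.addr, sn.dialTimeout())
	if err != nil {
		sn.broken.Store(true) // subsequent dials fail fast until the node recovers
		return nil, err
	}
	sn.broken.Store(false)
	return conn, nil
}
```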
Thank you very much for your reply, @valyala! It looks great 👍. Upon my initial review of the code changes in the pull request #4423, I have a question: why did we not utilize the…
@aierui as you already noted…
Thank you very much @wjordan for providing a detailed explanation.
One approach to minimize the impact of the slowdown / increased resource usage caused by rerouting metrics to alternate storage nodes would be to buffer pending data for unhealthy storage nodes instead of immediately rerouting. A file buffer could be used (as in vmagent) to keep worst-case memory usage low once pending data exceeds the 30MiB max vminsert packet size. Some reasons against this solution were raised in #791 (comment); here are some ideas on how to address them: …
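A minimal sketch of the buffering idea under the assumptions above (hypothetical types and thresholds, not actual vminsert code; the drain/replay path once the node recovers is omitted):

```go
package pendingbuf

import (
	"os"
	"sync"
)

// maxInMemoryBytes mirrors the ~30MiB max vminsert packet size mentioned above.
const maxInMemoryBytes = 30 << 20

// Buffer accumulates pending data for a single unhealthy storage node.
// A small backlog stays in memory; once it outgrows maxInMemoryBytes the
// backlog spills to a temporary file, keeping worst-case memory usage low.
type Buffer struct {
	mu    sync.Mutex
	mem   []byte
	spill *os.File // lazily created on first overflow
}

// Write appends p to the buffer, spilling to disk when the in-memory
// backlog gets too big.
func (b *Buffer) Write(p []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spill == nil && len(b.mem)+len(p) <= maxInMemoryBytes {
		b.mem = append(b.mem, p...)
		return nil
	}
	if b.spill == nil {
		f, err := os.CreateTemp("", "pending-*.buf")
		if err != nil {
			return err
		}
		// Flush the in-memory backlog to disk first so ordering is preserved.
		if _, err := f.Write(b.mem); err != nil {
			f.Close()
			return err
		}
		b.mem = nil
		b.spill = f
	}
	_, err := b.spill.Write(p)
	return err
}
```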
The problem with the buffer is that it would only play well for low-loaded installations. But everything works well for such installations anyway, as they don't experience that much pressure. They might not even notice rerouting storms, just as our playground doesn't.
That's one of the options, yes. But it results in data loss. Despite the current issues with rerouting, even if the cluster is unstable, the ingested data will be buffered on the client (e.g. vmagent) and will still be delivered eventually.
Afaik, the impact of this is that recording or alerting rules may misbehave for some period of time. But in other terms, data won't be lost. I agree that misbehaving rules are severe, but dropping data on the floor has the same or even bigger severity level.
One of the simplest ideas to gradually reroute data is to make vmstorage close accepted connections one by one over a specified time interval. Let's say we have 6 vminserts sending data to 10 storage nodes. If we want to reboot 1 vmstorage gracefully, we send it SIGTERM as usual, but instead of closing all the connections at once, vmstorage could spread this action over a configured gracefulShutdownInterval. This would make vminserts start rerouting one by one, not all together; see the sketch below. This feature should be easy to implement and test, since it doesn't require changes to vminsert or the intercommunication protocol. I hope @zekker6 will try to test it and report his findings. The downside of this feature is that it won't work well for a low number of vminserts: since each vminsert establishes only 1 connection to vmstorage, the minimum amount of rerouted data in one step is equal to 1/len(vminserts).
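A rough sketch of this idea (hypothetical code, not the final implementation; the function name and the `shutdownInterval` parameter are assumptions):

```go
package ingestserver

import (
	"net"
	"time"
)

// closeConnsGradually closes the accepted vminsert connections one by one,
// spreading the closes evenly over shutdownInterval. Each close forces a
// single vminsert to start rerouting, instead of all of them at once.
func closeConnsGradually(conns []net.Conn, shutdownInterval time.Duration) {
	if len(conns) == 0 {
		return
	}
	pause := shutdownInterval / time.Duration(len(conns))
	for i, c := range conns {
		_ = c.Close() // the affected vminsert detects this and reroutes its share
		if i < len(conns)-1 {
			time.Sleep(pause) // no extra wait after the last close
		}
	}
}
```

With 6 vminserts each holding one connection to the node, every close step reroutes 1/6 of this node's incoming traffic, which matches the 1/len(vminserts) granularity mentioned above.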
Tested the option suggested by @hagen1778.

Test setup: …

Here are the results for v1.94.0: …

Version with the gradual drop of vminsert connections: …

The only downside I can see is that a rolling restart will now either be harder to perform or take more time for a full restart of 5 nodes.
@zekker6 shared with me another round of tests. He also changed the way we test the changes: now we run two clusters (patched and unpatched) concurrently and can see how both are affected by the vmstorage restart. On the screenshot we see two vmstorage jobs: …

According to the screenshot above, the load was indeed smoother: CPU usage and ingestion speed were less anomalous for the patched version. But what is also very interesting is the vminserts' behavior: …

It is clear that vminserts writing to the patched storage nodes experienced lower connection saturation and memory usage during the vmstorage restart. Lower saturation also means smaller queue build-up, which results in higher data freshness for read queries. I personally think we should proceed with adding this feature to the upstream.
* app/vmstorage: close vminsert connections gradually before stopping storage

  Implements the graceful shutdown approach suggested here - #4922 (comment). Test results for this can be found here - #4922 (comment).

* app/vmstorage: update graceful shutdown logic

  - close connections from vminsert in deterministic order
  - update flag description
  - lower default timeout to 25 seconds. The 25-second value was chosen because the lowest default value used in default configuration deployments is 30s (the default value in Kubernetes and ansible-playbooks).

* docs/cluster: add information about re-routing enhancement during restart

* docs/changelog: add entry for new command-line flag

* {app/vmstorage,lib/ingestserver}: address review feedback

* docs/cluster: add note to update workload scheduler timeout

* wip

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The commit f783476 enables gradual closing of vminsert connections during vmstorage graceful shutdown. This commit will be included in the next release of VictoriaMetrics. In the meantime it is possible to build vmstorage from the latest source code.
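For reference, the commit message above describes a new vmstorage command-line flag with a 25-second default for this behavior. Assuming the flag introduced by the linked pull request is named `-storage.vminsertConnsShutdownDuration` (verify against your release's docs), usage would look like `vmstorage -storage.vminsertConnsShutdownDuration=25s`. Per the commit message, the value should stay below the workload scheduler's termination timeout (e.g. the 30s defaults in Kubernetes and ansible-playbooks).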
Previously there was an off-by-one error, which resulted in logging len(conns)-1 connections instead of len(conns). Updates #4922
Closing this feature request as done.
Is your question related to a specific component?
VictoriaMetrics cluster is a distributed system. Its "heart" is its storage. In the cluster, storage is represented by the `vmstorage` component - the only stateful component according to https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#architecture-overview.

It is expected for distributed system components to go temporarily offline due to network issues, maintenance, hardware failure, etc. But the distributed system is still expected to remain available during such events.
The VictoriaMetrics approach for remaining available when a vmstorage component goes offline is rerouting. Rerouting means that if vminsert is unable to push data to a vmstorage node, it redirects/reroutes the payload to another alive vmstorage node. In practice, this means that if the cluster has 5 vmstorage nodes and one of them "dies", vminserts spread its traffic across the remaining 4 storage nodes. If each of the 5 alive vmstorage nodes was processing 20% of the traffic, then after one node dies, each remaining node must absorb an extra 5% of the total traffic - a 25% relative load increase per node.
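To make the arithmetic concrete, here is a tiny stand-alone calculation (illustrative only, no VictoriaMetrics code involved):

```go
package main

import "fmt"

func main() {
	const nodes = 5.0
	before := 100 / nodes      // each node handles 20% of total traffic
	after := 100 / (nodes - 1) // 25% per node once one node dies
	fmt.Printf("per-node share of total traffic: %.0f%% -> %.0f%%\n", before, after)
	fmt.Printf("relative load increase on each survivor: %.0f%%\n", (after-before)/before*100)
}
```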
The rerouting approach is super simple. It doesn't require coordination, extra components, or manual actions from the user. Everything happens automatically. However, it also means that the load increase on the remaining nodes may result in cluster instability. This ticket is supposed to aggregate thoughts on the problem of VM cluster instability during rerouting events.
Describe the question in detail
What do you mean by instability?
In this specific issue, I am mostly concerned about data ingestion slowdown. We've received reports that highly loaded VM installations experience data ingestion slowdown during vmstorage restarts.
Why does the cluster become unstable during rerouting?
In short, because of sharding.
Each vmstorage node in the cluster setup is represented as a shard. A shard holds only a fraction of the data. During ingestion, the vminsert component consistently shards data across the available vmstorage nodes, making sure that each unique time series always ends up on the same shard/vmstorage. For example, series `foo` and `bar` ingested into a 2-shard cluster will end up on separate vmstorages, no matter how many vminserts were doing the ingestion. Such an approach provides the benefit of better data locality, and improves search speed, compression ratio, memory usage for caches, page cache, etc.

However, this also means that during rerouting, vmstorages start to receive metrics that are new to them. These rerouted metrics aren't present in vmstorage caches, indexes, or page cache. Registering new metrics or missing the cache is orders of magnitude slower than accepting already-seen metrics.
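As an illustration of the sharding scheme (a minimal sketch only; vminsert's actual implementation relies on consistent hashing with a different algorithm, so the function below is hypothetical):

```go
package sharding

import "hash/fnv"

// shardIdx returns the index of the vmstorage node that must receive the
// given time series. Hashing the canonical labels string guarantees that
// every vminsert maps the same series to the same node without coordination.
func shardIdx(seriesLabels string, nodesCount int) int {
	h := fnv.New64a()
	h.Write([]byte(seriesLabels))
	return int(h.Sum64() % uint64(nodesCount))
}
```

When a node disappears, its series get remapped to nodes that have never seen them before, which is exactly the cache-miss problem described above.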
What can be done?
We assume two pain points lead to the ingestion slowdown:
The moment when rerouting starts
When one vmstorage goes offline, its traffic gets rerouted to the rest of the vmstorage nodes. As described above, this hits the problem that the other vmstorage nodes have no prior knowledge of the rerouted metrics and start to lag. But we should remember that the load here is spread across all remaining vmstorage nodes. So with enough resources, the "hiccup" in ingestion should be unnoticeable.
The moment when restarted vmstorage recovers
A restarted vmstorage usually gets back online after a few minutes of delay. In high-churn environments, or environments where the vmstorage node could have been re-scheduled to another instance, it may be missing some caches on startup. So when it starts, it goes through the expensive path of registering previously unseen metrics. And in contrast to p1, here the ingestion speed of the whole cluster is limited by the capabilities of this specific vmstorage node.
Experiment
To verify which pain point is the most problematic, I ran an experiment:

- The moment when rerouting started is when I stopped the vmstorage.
- The moment when rerouting ended is when I started the vmstorage.

The charts below show memory and CPU usage, as well as the SlowInserts (cache misses) and PendingDatapoints panels: …
From this data, it looks like p1 is more harmful: …
The data stream into the cluster is generated by vmagent. From the vmagent perspective, the remote-write situation over the same time interval looked like the following: …

We see that saturation increased significantly on vmstorage shutdown, and wasn't affected by its startup.
Experiment summary
My conclusion is that p1 is a more harmful situation than p2 and should be optimized first.
However, the experiment was run in an environment with a low churn rate. It could be that in environments with a much higher churn rate, starting the vmstorage after a few minutes of downtime could result in a lot of cache misses and slow down the ingestion.
Troubleshooting docs