vmcluster: rerouting enhancement #4922
Comments
@hagen1778 , thanks for the very detailed analysis! The following additional interesting details can be extracted from this analysis:

For example, if the cluster contains 10 vmstorage nodes with 4 CPU cores each and it handles 100M active time series, then the duration of the data ingestion slowdown when one…

The calculations above assume that every… The duration of the data ingestion slowdown can be reduced if the restarted…
@hagen1778 , thanks for the very detailed analysis!

During our usage, we have encountered similar challenging issues to those described above. Furthermore, we have noticed that the timeout duration for… When a… (see `app/vminsert/netstorage/netstorage.go`, lines 326 to 356 at f78d8b9) the error message: …
@aierui , this issue has been addressed in the pull request #4423, which will be included in the next release. This pull request reduces the network timeout for unavailable vmstorage nodes. In the meantime you can build vminsert from the latest source code.
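For illustration only (this is not the code from #4423; `storageNode`, its fields, and the timeout values are all hypothetical), the general idea of failing fast on nodes already known to be unavailable could look like this:

```go
package netstoragesketch

import (
	"net"
	"sync/atomic"
	"time"
)

// storageNode is a hypothetical stand-in for vminsert's per-vmstorage state.
type storageNode struct {
	addr   string
	broken atomic.Bool // set when the node recently failed to respond
}

// dialTimeout returns a much shorter timeout for nodes already known to be
// unavailable, so vminsert fails fast and reroutes instead of blocking for
// the full handshake timeout on every retry.
func (sn *storageNode) dialTimeout() time.Duration {
	if sn.broken.Load() {
		return time.Second
	}
	return 5 * time.Second
}

// dial connects to the vmstorage node and updates its health flag.
func (sn *storageNode) dial() (net.Conn, error) {
	conn, err := net.DialTimeout("tcp", sn.addr, sn.dialTimeout())
	if err != nil {
		sn.broken.Store(true) // subsequent dials fail fast until the node recovers
		return nil, err
	}
	sn.broken.Store(false)
	return conn, nil
}
```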
Thank you very much for your reply, @valyala! It looks great 👍. Upon my initial review of the code changes in the pull request #4423, I have a question: why did we not utilize the…
@aierui as you already noted…
Thank you very much @wjordan for providing a detailed explanation.
One approach to minimize the impact of the slowdown / increased resource usage caused by rerouting metrics to alternate storage nodes would be to buffer pending data for unhealthy storage nodes instead of immediately rerouting. A file buffer could be used (as in vmagent) to keep worst-case memory usage low once pending data exceeds the 30MiB max vminsert packet size. Some reasons against this solution were raised in #791 (comment); here are some ideas on how to address them: …
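A minimal sketch of the buffering idea under the assumptions above (hypothetical types and thresholds, not actual vminsert code; the drain/replay path once the node recovers is omitted):

```go
package pendingbuf

import (
	"os"
	"sync"
)

// maxInMemoryBytes mirrors the ~30MiB max vminsert packet size mentioned above.
const maxInMemoryBytes = 30 << 20

// Buffer accumulates pending data for a single unhealthy storage node.
// A small backlog stays in memory; once it outgrows maxInMemoryBytes the
// backlog spills to a temporary file, keeping worst-case memory usage low.
type Buffer struct {
	mu    sync.Mutex
	mem   []byte
	spill *os.File // lazily created on first overflow
}

// Write appends p to the buffer, spilling to disk when the in-memory
// backlog gets too big.
func (b *Buffer) Write(p []byte) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.spill == nil && len(b.mem)+len(p) <= maxInMemoryBytes {
		b.mem = append(b.mem, p...)
		return nil
	}
	if b.spill == nil {
		f, err := os.CreateTemp("", "pending-*.buf")
		if err != nil {
			return err
		}
		// Flush the in-memory backlog to disk first so ordering is preserved.
		if _, err := f.Write(b.mem); err != nil {
			f.Close()
			return err
		}
		b.mem = nil
		b.spill = f
	}
	_, err := b.spill.Write(p)
	return err
}
```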
The problem with the buffer is that it would only play well for low-loaded installations. But everything works well for such installations anyway, as they don't experience that much pressure. They might not even notice rerouting storms, just as our playground doesn't.
That's one of the options, yes. But it results in data loss. Despite the current issues with rerouting, even if the cluster is unstable, the ingested data will be buffered on the client (e.g. vmagent) and will still be delivered eventually.
Afaik, the impact of this is that recording or alerting rules may misbehave for some period of time. But in other terms, data won't be lost. I agree that misbehaving rules are severe, but dropping data on the floor has the same or even bigger severity level.
One of the simplest ideas to gradually reroute data is to make vmstorage close accepted connections one by one over a specified time interval. Let's say we have 6 vminserts sending data to 10 storage nodes. If we want to reboot 1 vmstorage gracefully, we send it SIGTERM as usual, but instead of closing all the connections at once, vmstorage could spread this action over a configured gracefulShutdownInterval. This would make vminserts start rerouting one by one, not all together; see the sketch below. This feature should be easy to implement and test, since it doesn't require changes to vminsert or the intercommunication protocol. I hope @zekker6 will try to test it and report his findings. The downside of this feature is that it won't work well for a low number of vminserts: since each vminsert establishes only 1 connection to vmstorage, the minimum amount of rerouted data in one step is equal to 1/len(vminserts).
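A rough sketch of this idea (hypothetical code, not the final implementation; the function name and the `shutdownInterval` parameter are assumptions):

```go
package ingestserver

import (
	"net"
	"time"
)

// closeConnsGradually closes the accepted vminsert connections one by one,
// spreading the closes evenly over shutdownInterval. Each close forces a
// single vminsert to start rerouting, instead of all of them at once.
func closeConnsGradually(conns []net.Conn, shutdownInterval time.Duration) {
	if len(conns) == 0 {
		return
	}
	pause := shutdownInterval / time.Duration(len(conns))
	for i, c := range conns {
		_ = c.Close() // the affected vminsert detects this and reroutes its share
		if i < len(conns)-1 {
			time.Sleep(pause) // no extra wait after the last close
		}
	}
}
```

With 6 vminserts each holding one connection to the node, every close step reroutes 1/6 of this node's incoming traffic, which matches the 1/len(vminserts) granularity mentioned above.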
Tested the option suggested by @hagen1778.

Test setup: …

Here are the results for v1.94.0: …

Version with the gradual drop of vminsert connections: …

The only downside I can see is that a rolling restart will now either be harder to perform or take more time for a full restart of 5 nodes.
@zekker6 shared with me another round of tests. He also changed the way we test the changes: now we run two clusters (patched and unpatched) concurrently and can see how both are affected by the vmstorage restart. On the screenshot we see two vmstorage jobs: …

According to the screenshot above, the load was indeed smoother: CPU usage and ingestion speed were less anomalous for the patched version. But what is also very interesting is the vminserts' behavior: …

It is clear that vminserts writing to the patched storage nodes experienced lower connection saturation and memory usage during the vmstorage restart. Lower saturation also means smaller queue build-up, which results in higher data freshness for read queries. I personally think we should proceed with adding this feature to the upstream.
* app/vmstorage: close vminsert connections gradually before stopping storage

  Implements the graceful shutdown approach suggested here - #4922 (comment). Test results for this can be found here - #4922 (comment).

* app/vmstorage: update graceful shutdown logic

  - close connections from vminsert in deterministic order
  - update flag description
  - lower default timeout to 25 seconds. The 25-second value was chosen because the lowest default value used in default configuration deployments is 30s (the default value in Kubernetes and ansible-playbooks).

* docs/cluster: add information about re-routing enhancement during restart

* docs/changelog: add entry for new command-line flag

* {app/vmstorage,lib/ingestserver}: address review feedback

* docs/cluster: add note to update workload scheduler timeout

* wip

Signed-off-by: Zakhar Bessarab <z.bessarab@victoriametrics.com>
Co-authored-by: Aliaksandr Valialkin <valyala@victoriametrics.com>
The commit f783476 enables gradual closing of vminsert connections during vmstorage graceful shutdown. This commit will be included in the next release of VictoriaMetrics. In the meantime it is possible to build vmstorage from the latest source code.
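For reference, the commit message above describes a new vmstorage command-line flag with a 25-second default for this behavior. Assuming the flag introduced by the linked pull request is named `-storage.vminsertConnsShutdownDuration` (verify against your release's docs), usage would look like `vmstorage -storage.vminsertConnsShutdownDuration=25s`. Per the commit message, the value should stay below the workload scheduler's termination timeout (e.g. the 30s defaults in Kubernetes and ansible-playbooks).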
Previously there was an off-by-one error, which resulted in logging len(conns)-1 connections instead of len(conns). Updates #4922
Closing this feature request as done.
Is your question related to a specific component?
VictoriaMetrics cluster is a distributed system. Its "heart" is its storage. In the cluster, storage is represented by the `vmstorage` component - the only stateful component according to https://docs.victoriametrics.com/Cluster-VictoriaMetrics.html#architecture-overview.

It is expected for distributed system components to go temporarily offline due to network issues, maintenance, hardware failure, etc. But the distributed system is still expected to remain available during such events.
The VictoriaMetrics approach for remaining available when a vmstorage component goes offline is rerouting. Rerouting means that if vminsert is unable to push data to a vmstorage node, it redirects/reroutes the payload to another alive vmstorage node. In practice, this means that if the cluster has 5 vmstorage nodes and one of them "dies", vminserts spread its traffic across the remaining 4 storage nodes. If each of the 5 alive vmstorage nodes was processing 20% of the traffic, then after one node dies, each remaining node must absorb an extra 5% of the total traffic - a 25% relative load increase per node.
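To make the arithmetic concrete, here is a tiny stand-alone calculation (illustrative only, no VictoriaMetrics code involved):

```go
package main

import "fmt"

func main() {
	const nodes = 5.0
	before := 100 / nodes      // each node handles 20% of total traffic
	after := 100 / (nodes - 1) // 25% per node once one node dies
	fmt.Printf("per-node share of total traffic: %.0f%% -> %.0f%%\n", before, after)
	fmt.Printf("relative load increase on each survivor: %.0f%%\n", (after-before)/before*100)
}
```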
The rerouting approach is super simple. It doesn't require coordination, extra components, or manual actions from the user. Everything happens automatically. However, it also means that the load increase on the remaining nodes may result in cluster instability. This ticket is supposed to aggregate thoughts on the problem of VM cluster instability during rerouting events.
Describe the question in detail
What do you mean by instability?
In this specific issue, I am mostly concerned about data ingestion slowdown. We've received reports that highly loaded VM installations experience data ingestion slowdown during vmstorage restarts.
Why does the cluster become unstable during rerouting?
In short, because of sharding.
Each vmstorage node in the cluster setup is represented as a shard. A shard holds only a fraction of the data. During ingestion, the vminsert component consistently shards data across the available vmstorage nodes, making sure that each unique time series always ends up on the same shard/vmstorage. For example, series `foo` and `bar` ingested into a 2-shard cluster will end up on separate vmstorages, no matter how many vminserts were doing the ingestion. Such an approach provides the benefit of better data locality, and improves search speed, compression ratio, memory usage for caches, page cache, etc.

However, this also means that during rerouting, vmstorages start to receive metrics that are new to them. These rerouted metrics aren't present in vmstorage caches, indexes, or page cache. Registering new metrics or missing the cache is orders of magnitude slower than accepting already-seen metrics.
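As an illustration of the sharding scheme (a minimal sketch only; vminsert's actual implementation relies on consistent hashing with a different algorithm, so the function below is hypothetical):

```go
package sharding

import "hash/fnv"

// shardIdx returns the index of the vmstorage node that must receive the
// given time series. Hashing the canonical labels string guarantees that
// every vminsert maps the same series to the same node without coordination.
func shardIdx(seriesLabels string, nodesCount int) int {
	h := fnv.New64a()
	h.Write([]byte(seriesLabels))
	return int(h.Sum64() % uint64(nodesCount))
}
```

When a node disappears, its series get remapped to nodes that have never seen them before, which is exactly the cache-miss problem described above.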
What can be done?
We assume two pain points lead to the ingestion slowdown:
The moment when rerouting starts
When one vmstorage goes offline, its traffic gets rerouted to the rest of the vmstorage nodes. As described above, this hits the problem that the other vmstorage nodes have no prior knowledge of the rerouted metrics and start to lag. But we should remember that the load here is spread across all remaining vmstorage nodes. So with enough resources, the "hiccup" in ingestion should be unnoticeable.
The moment when restarted vmstorage recovers
A restarted vmstorage usually gets back online after a few minutes of delay. In high-churn environments, or environments where the vmstorage node could have been re-scheduled to another instance, it may be missing some caches on startup. So when it starts, it goes through the expensive path of registering previously unseen metrics. And in contrast to p1, here the ingestion speed of the whole cluster is limited by the capabilities of this specific vmstorage node.
Experiment
To verify which pain point is the most problematic, I ran an experiment:

- The moment when rerouting started is when I stopped the vmstorage.
- The moment when rerouting ended is when I started the vmstorage.

The charts below show memory and CPU usage, as well as the SlowInserts (cache misses) and PendingDatapoints panels: …
From this data, it looks like p1 is more harmful: …
The data stream into the cluster is generated by vmagent. From the vmagent perspective, the remote-write situation over the same time interval looked like the following: …

We see that saturation increased significantly on vmstorage shutdown, and wasn't affected by its startup.
Experiment summary
My conclusion is that p1 is a more harmful situation than p2 and should be optimized first.
However, the experiment was run in an environment with a low churn rate. It could be that in environments with a much higher churn rate, starting the vmstorage after a few minutes of downtime could result in a lot of cache misses and slow down the ingestion.
Troubleshooting docs