Allow to seal an index #10032
In a lot of use cases indices are used for indexing only for a limited amount of time. E.g., in the daily index use case, indices are created with a relatively high number of shards to scale out indexing, and after a couple of days these indices are idle in terms of writing. Yet we still keep all the resources open since we accept writes at any time. This is not necessary in a lot of cases and would allow for a large number of optimizations:
@kimchy I think we can. To me it's a two-stage process: first we switch the index to read-only using the flag we have, and once the cluster state has been published we seal the index, which causes the actual optimizations to happen. So I think we can just reuse it?
Adding this is also going to make full-cluster restarts as well as rolling restarts likely instant. Even for the non-timeseries data / logging case, for full restarts we can seal all the indices, shut down, restart & unseal. This also works for rolling restarts if folks can afford having read-only indices, which I think is reasonable in most cases since the restart will be pretty fast. Very promising!
This is great! Our use case wouldn't allow us to seal an index outside of a rolling restart window or some other temporary maintenance action but we can absolutely get away with sealing them all for an hour or so.
So this is a great solution for us!
Yes, it's absolutely possible to unseal, and the operation should be very fast, i.e. essentially a cluster state update.
We had some internal discussions about how to implement this and I wanted to make sure they are recorded here on the issue. Sealing an index basically happens on two levels: the index level and the shard level.
Index Level sealing
On the index level we use a write block together with a seal ID recorded in the index metadata.
This also requires the entire cluster to be on a version that supports and understands index sealing, otherwise this feature will not be available (we have the ability to check this).
The seal process is essentially a cluster state update (setting the block and the seal ID) that waits for all shards to respond. This is very similar to how deleting an index works today: we issue the cluster state update, which subsequently gets propagated to all the nodes in the cluster.
Once the seal operation is issued we set the block and record the seal ID.
This is also very similar to the delete logic as it is currently implemented.
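The index-level flow above can be sketched roughly like this. This is a toy model; `ClusterState`, `IndexMetadata`, and `seal_index` are made-up names for illustration, not Elasticsearch's actual classes:

```python
import uuid
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class IndexMetadata:
    name: str
    write_blocked: bool = False
    seal_id: Optional[str] = None


@dataclass
class ClusterState:
    version: int = 0
    indices: Dict[str, IndexMetadata] = field(default_factory=dict)


def seal_index(state: ClusterState, index: str) -> str:
    """One cluster state update: set the write block and a fresh seal ID,
    then publish the new state version to all nodes."""
    meta = state.indices[index]
    meta.write_blocked = True          # no new writes once the state is published
    meta.seal_id = uuid.uuid4().hex    # identifies this sealed generation
    state.version += 1                 # publication to all nodes
    return meta.seal_id


def seal_complete(shard_acks: list) -> bool:
    """Like index deletion, the seal only completes once every shard responded."""
    return len(shard_acks) > 0 and all(shard_acks)
```

The key point mirrored from the description: sealing is a single acknowledged cluster state change, not a per-shard RPC fan-out that the caller manages by hand.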
Shard Level sealing
For shard level sealing we are currently planning to use ref-counting of in-flight operations.
The good news is that due to the cluster block (read-only setting) no new indexing operations can be issued, such that we will reach 0 eventually. Certainly this requires reference counting.
At that point the index is sealed and no write operation can be submitted to the index anymore.
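A minimal sketch of that ref-counting idea, with assumed names (the real implementation lives in Elasticsearch's engine code): each write takes a reference, the seal blocks new writes, and then waits for the count to drain to zero.

```python
import threading


class ShardOps:
    """Toy model of in-flight operation tracking on a single shard."""

    def __init__(self):
        self._count = 0
        self._blocked = False
        self._lock = threading.Lock()
        self._drained = threading.Condition(self._lock)

    def try_acquire(self) -> bool:
        """Called at the start of each indexing operation."""
        with self._lock:
            if self._blocked:          # the cluster block rejects new writes
                return False
            self._count += 1
            return True

    def release(self):
        """Called when an in-flight operation completes."""
        with self._lock:
            self._count -= 1
            if self._count == 0:
                self._drained.notify_all()

    def seal(self):
        """Block new operations, then wait for in-flight ones to finish."""
        with self._lock:
            self._blocked = True
            while self._count != 0:
                self._drained.wait()
```

Because the block is set before waiting, the count can only go down, so `seal()` terminates once the last in-flight write releases its reference.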
The unseal operation pretty much reverses the sealing. We process a cluster state update that removes the block and marks the index as writable again.
On a shard level we basically reverse the process and accept writes again.
Today recovery is very resource-heavy and often super slow since we don't know if two shards are identical on a document level, i.e. did all operations reach the replica or not. We can tell on a Lucene segment level, but the segments are different on all replicas unless we copy them over, which takes a huge amount of time. With index sealing we basically mark the replicas and the primary with the same seal ID, so we can tell that they are identical on the document level.
Luckily, implementing fast recovery on top of the sealing is very straightforward. Basically what we need is an extension of the recovery logic that compares the seal IDs of source and target and skips copying files over when they match.
For safety reasons, if any operations exist in the transaction log we can't utilize the seal ID for fast recovery. Any operation in the translog indicates an illegal state in the context of the seal ID, or in other words, it breaks the seal. For instance, if an old replica is started on a node that was sealed before but the primary is already accepting writes again, we could in theory recover from the transaction log alone, but for the initial iteration we should skip this optimization. In the future we might even be able to extend this process to issue seal commits on a per-shard level while accepting writes.
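Putting the two safety checks above together, the recovery-time decision might look like this (a sketch with an assumed function name, not the actual implementation):

```python
from typing import Optional


def can_skip_file_copy(primary_seal_id: Optional[str],
                       replica_seal_id: Optional[str],
                       translog_ops: int) -> bool:
    """Phase 1 of recovery (copying Lucene files) may only be skipped when
    both copies carry the same seal ID and the transaction log is empty;
    any translog operation breaks the seal."""
    if translog_ops != 0:          # pending operations invalidate the seal
        return False
    if primary_seal_id is None or replica_seal_id is None:
        return False               # at least one side was never sealed
    return primary_seal_id == replica_seal_id
```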
Optimizing / Force Merge on a Sealed index
For the time-based indices use case it's important to run a force merge (optimize) to reduce the number of segments once an index is sealed.
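One consequence worth noting (my reading of the heading above, not stated explicitly in the comment): a force merge rewrites segments, so the commit it produces no longer matches the old seal ID, and the index has to be re-sealed afterwards. A toy sketch under that assumption:

```python
import uuid


def force_merge_and_reseal(commit: dict) -> dict:
    """Hypothetical: merging rewrites the segment files, so the old seal ID
    no longer describes what is on disk; stamp a fresh one after the merge.
    The commit is modeled as a plain dict for illustration only."""
    merged = dict(commit)                  # don't mutate the original commit
    merged["segments"] = 1                 # merged down to a single segment
    merged["seal_id"] = uuid.uuid4().hex   # re-seal with a new marker
    return merged
```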
Proposed work items
I hope I covered all the moving parts, at least on a high level. If there are any questions feel free to ask. Once we basically agree I will move this to the issue itself.
Discussing this, we came up with a new and simpler plan which works independently of the cluster state update. The gist of it is to have a best-effort operation to sync the commit points of both primaries and replicas. This "synced flush" is guaranteed to succeed if there are no concurrent indexing operations but will fail gracefully if there are. The result is a marker (sync ID) on the Lucene commit points which allows us to shortcut phase 1 of recoveries, which will give us the desired speed-up. Since this is a best-effort approach we can trigger it whenever a shard becomes inactive, at regular longish intervals (say 30m), or at any other time (TBD).
Solution sketch (this is a shard operation):
[x] -> in feature branch https://github.com/elastic/elasticsearch/tree/feature/synced_flush
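As a rough illustration of the synced-flush idea (all names here are assumed; see the feature branch above for the real code): the operation succeeds only when nothing is in flight, and it stamps the same sync ID on every copy's commit point so a later recovery can skip phase 1.

```python
import uuid
from typing import List, Optional


def synced_flush(shard_copies: List[dict], in_flight_ops: int) -> Optional[str]:
    """Best effort: fail gracefully under concurrent indexing, otherwise
    write a shared sync ID marker into every copy's commit point
    (each commit point is modeled as a plain dict here)."""
    if in_flight_ops != 0:
        return None                    # concurrent writes: give up, retry later
    sync_id = uuid.uuid4().hex
    for commit in shard_copies:
        commit["sync_id"] = sync_id    # mark every copy's Lucene commit point
    return sync_id


def can_skip_phase1(source: dict, target: dict) -> bool:
    """Recovery shortcut: identical sync IDs mean identical documents."""
    sid = source.get("sync_id")
    return sid is not None and sid == target.get("sync_id")
```

Because a failed attempt changes nothing, the operation is safe to re-trigger opportunistically, e.g. when a shard goes inactive.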