[STORE] Move to one datapath per shard #9498

Closed
s1monw opened this Issue Jan 30, 2015 · 5 comments


@s1monw
Contributor
s1monw commented Jan 30, 2015

Today Elasticsearch has the ability to use multiple data paths,
which is useful for spreading load across several disks, both throughput- and space-wise.
Yet the way they are used is problematic, since it resembles a striped RAID where
each shard gets a dedicated directory on each of the data paths / disks. This approach has several
problems:

  • if one data path is lost, all shards on the node are lost, since each shard's files are spread across all disks
  • segment files that logically belong together are spread across disks, with unknown performance implications
  • some files historically were not write-once, so updates to those files always had to go to the same directory
  • corruption due to a lost disk might be detected very late if the affected files are already loaded into memory
  • listing the contents of a shard is not trivial since multiple FS directories are involved
  • the code is more complicated since we need to wrap Lucene's directory abstraction to make this work
  • if a user misconfigures a node, all shard data is lost, e.g. due to reordering of the paths

This issue aims to move away from this shard striping towards one data path per shard, such that all
files for an Elasticsearch shard (a Lucene index) live in the same directory and no special code is needed.

This requires some changes to how we balance disk utilization. Today we pick a directory whenever a file
needs to be written; now we need to select the directory ahead of time. While this seems trivial,
since we can just pick the least utilized disk, we need to take the expected size of the shard into account
and check whether we can allocate the shard on a given node at all. This can have implications for how we
make that decision, since today free space is accumulated across all data paths.
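To make the selection problem concrete, here is a minimal Python sketch of the up-front path choice described above (Elasticsearch itself is Java; `pick_data_path` and the injectable `free_bytes` parameter are hypothetical names, not the actual API): pick the path with the most free space, but only among paths that can hold the whole shard.

```python
import shutil

def pick_data_path(paths, expected_shard_size,
                   free_bytes=lambda p: shutil.disk_usage(p).free):
    """Pick the data path with the most free space that can hold the shard.

    `free_bytes` defaults to the real filesystem but can be injected for
    testing. Returns None when no single path can hold the whole shard.
    """
    best, best_free = None, -1
    for path in paths:
        free = free_bytes(path)
        # The shard must now fit entirely on one path, unlike the old striped
        # layout where free space was accumulated across all data paths.
        if free >= expected_shard_size and free > best_free:
            best, best_free = path, free
    return best
```

Note that returning `None` is exactly the new failure mode the paragraph above hints at: a node whose paths sum to enough space may still be unable to host the shard.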

Rather less trivial is the migration path: we need to move from multiple directories to a single directory,
likely in the background, so that the upgrade process is smooth for users. There are several options:

  • upgrade on shard allocation: copy all the data into the new dedicated directory when the shard is allocated. This makes the selection of the dedicated directory simpler since we do one shard at a time, but it might be a pretty slow upgrade process.
  • upgrade over time: select a dedicated directory on shard allocation and redirect all writes to it. Merges or recoveries will then automatically use the dedicated directory, and an upgrade can be triggered via a merge or the upgrade API. The biggest issue is that we need to keep the distributor directory around, which is way simpler now, but it would be nice to drop it. Also, picking the dedicated shard directory might be harder since we don't really know how the disks will fill up over time.
  • don't upgrade, and only move new indices to the per-shard path
  • use an upgrade tool that must run before node startup

However, I think the code changes can be done in two steps:

  • remove the old stripe RAID behavior for new indices
  • upgrade old indices

since the latter is much harder.

@kimchy
Member
kimchy commented Feb 5, 2015

+1 on this change. My suggestion for the upgrade is to do it automatically on node startup. We could scan the data directory, rebalance the shard data paths during startup, and only then make the node available. If we can use a MOVE of the files that would be great and keep things lightweight; if MOVE fails, we can resort to copying instead.

@s1monw s1monw removed the v1.5.0 label Mar 2, 2015
@s1monw
Contributor
s1monw commented Mar 2, 2015

moved out to 2.0 for now

@s1monw s1monw added the blocker label Mar 2, 2015
@joealex
joealex commented Mar 18, 2015

This is a very common scenario when you have very large clusters and multiple disks on each node; some disk is bound to fail every few days. We had a bunch of disk failures over a few days, and it took some time to figure out that the cause was shards being striped across multiple disks. When it works it is great, since the load is balanced and all the disk spindles are utilized. We had to shut down ES on the affected node, remove the disk from the ES config, and bring ES back up after some time, which forced the node to recover its data from the other nodes.

With all shard data on one disk, balancing/utilizing all disks may have to be given a good look, i.e. whether this will cause some disks to be over-utilized because an index living there receives more data. The same goes for read performance, since now all reads come from one single disk.

@joealex joealex referenced this issue Mar 18, 2015
Closed

Roadmap for 2.0 #9970

14 of 14 tasks complete
@nik9000
Contributor
nik9000 commented Mar 18, 2015

+1 on doing this.

likely in the background such that the upgrade process for users is smooth

I'm not sure how this can be nicely automated… I think the optimal process involves slowly moving away from this by moving shards off a node (a few nodes?), taking it out of the cluster, reformatting its multiple disks to proper RAID, adding it back to the cluster, and letting the cluster heal. I suppose in many respects it's similar to replacing the disks entirely. We did that about a year ago with great success, btw.

@josephglanville

Overall +1 to this change for increasing durability of shards.

However, once this change goes through, one will need to resort to either RAID of some description or more shards to make up for the loss of spindles.

In light of this change, having multiple shards allocated to the same node would actually be beneficial in terms of IO throughput, but could anyone comment on any drawbacks of this? It seems to be the most elegant way to restore the lost performance without resorting to external systems.

If the increase in shards is indeed feasible, it would also be nice, as a later optimisation, to attempt to allocate shards of the same index to different disks.

@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 7, 2015
@s1monw s1monw [STORE] Move to one data.path per shard
This commit moves away from simulating a striped RAID-0 across multiple
data paths towards using a single path per shard. Multiple data paths are still
supported, but a shard and its data are no longer striped across multiple paths / disks.
This will, for instance, prevent losing all shards if a single disk is corrupted.

Indices that already use this feature will automatically be upgraded to a single
data path based on a simple disk-space heuristic. In general there must be enough
disk space to move a single shard at any time, otherwise the upgrade will fail.

Closes #9498
342cbe4
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 13, 2015
@s1monw s1monw [STORE] Move to one data.path per shard
1c771ba
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 20, 2015
@s1monw s1monw [STORE] Move to one data.path per shard
91488c5
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 20, 2015
@s1monw s1monw [STORE] Move to one data.path per shard
3c76333
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 20, 2015
@s1monw s1monw [STORE] Move to one data.path per shard
5730c06
@s1monw s1monw closed this in #10461 Apr 20, 2015
@s1monw s1monw added a commit to s1monw/elasticsearch that referenced this issue Apr 29, 2015
@s1monw s1monw Add multi data.path to migration guide
This commit removes the obsolete settings for distributors and updates
the documentation on multiple data.path. It also adds an explanation to the
migration guide.

Relates to #9498
Closes #10770
94d8b20