Deprecate Shared Gateway #2458

Closed
kimchy opened this Issue Dec 3, 2012 · 18 comments

9 participants
@kimchy
Member

kimchy commented Dec 3, 2012

Shared gateways (shared FS storage or S3, for example) are problematic performance-wise, since they constantly need to snapshot the state of the index to a shared location and then use that as the system of record. The local gateway, on the other hand, doesn't need this and performs much better.

The main benefit of a shared gateway is that the data is actually stored in another persistent location (e.g. using ephemeral disks on AWS while still having the data on s3), but that is really abusing the shared gateway design (using it as a backup).

In the near future we will have a proper snapshot (backup)/restore API, which will be the proper way to do backups; relying on the shared gateway for that is problematic. Note, backups can still be made by "rsync"-ing the data location of each node "manually".

@kimchy kimchy closed this in 677e6ce Dec 3, 2012

kimchy added a commit that referenced this issue Dec 3, 2012

kimchy added a commit that referenced this issue Dec 3, 2012

@ejain
Contributor

ejain commented Dec 4, 2012

Is there an open issue for the snapshot/restore API yet?

@jgriswoldinfogroup

jgriswoldinfogroup commented Dec 10, 2012

Why would you deprecate this feature prior to the availability of a backup/restore API?

@fatemehmd

fatemehmd commented Dec 13, 2012

Is there any tutorial on how to configure instances with EBS now that S3 is not an option?

@ejain
Contributor

ejain commented Dec 13, 2012

On Wed, Dec 12, 2012 at 10:27 PM, Fatemeh notifications@github.com wrote:

Is there any tutorial on how to configure instances with EBS now that S3 is not an option?

deprecated != removed

I do hope the backup/restore feature is implemented before support for the S3 gateway is removed.

@karmi
Member

karmi commented Dec 21, 2012

@fatemehmd The http://www.elasticsearch.org/tutorials/2012/03/21/deploying-elasticsearch-with-chef-solo.html tutorial now walks through exactly that scenario, via the support in the Chef cookbook.

@youurayy

youurayy commented Jan 6, 2013

I never noticed this in the logs, and now my S3-gateway-configured cluster crashed after running out of JVM memory on one of the nodes.

It would be beneficial for other users to add this deprecation to the docs.

@kimchy
Member

kimchy commented Jan 6, 2013

Indeed, deprecated does not mean we are going to remove it. It will not be removed before we have the snapshot/restore API, but even before then I would suggest running in local gateway mode on EBS, for example, rather than using the s3 gateway, because of the overhead that comes with continuously snapshotting to it and treating it as the main source of truth.

@ypocat the OOM should not have been caused by the s3 gateway; it probably happened for other reasons (a common one is faceting on fields that end up abusing memory; we are working on that as well...)
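For readers unfamiliar with the setup being suggested: a minimal sketch of "local gateway mode on EBS" in elasticsearch.yml might look like the following (the data path and node thresholds are illustrative, not from the thread):

```yaml
# elasticsearch.yml (sketch; the local gateway was the default type)
gateway.type: local
path.data: /mnt/ebs0/elasticsearch/data   # data directory on the mounted EBS volume
gateway.recover_after_nodes: 2            # wait for enough nodes before starting recovery
gateway.expected_nodes: 3                 # recover immediately once all expected nodes join
```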

@truthtrap

truthtrap commented Mar 11, 2013

We are actually quite happy with the S3 gateway. We use it with all of the clusters we run. The main use is as a backup that we can restore from when the cluster hangs or dies. The advantage of this approach is that we are extremely flexible in how we work with nodes.

Working with EBS is not a solution. It would require a very complicated automated setup that manages instances with additional EBS volume(s). It is an approach we often use for things like Postgres and MongoDB. But part of the elasticsearch enthusiasm we feel is the ease of working with cluster technology.

A snapshotting feature is a good replacement. It would be really great if we could have some sort of Point in Time Restore with it, but it is not (yet) required. I would like to ask you to leave some overlap of features after you release snapshotting. We do rely on S3 when we upgrade our clusters, for example.

So, please, at least one release with snapshotting and the (deprecated) S3 gateway.

@karmi
Member

karmi commented Mar 11, 2013

Working with EBS is not a solution. It would require a very complicated automated setup that manages instances with additional EBS volume(s).

I can understand why EBS volumes are not a good option in many scenarios, from either a technical or an economic standpoint. However, I'd say that the provisioning overhead is really low. Given how good an abstraction the Fog (Ruby), jClouds (Java), and other libraries provide, I wouldn't describe it as "very complicated"...

@truthtrap

truthtrap commented Mar 11, 2013

I don't want to discuss the complexity of AWS-related issues here. But if you want to build a cluster-wide 'snapshot mechanism' with EBS that keeps the flexibility of Elasticsearch (in combination with the AWS Cloud Plugin), you are in for quite a ride.

If you just want persistence of a node, 'plain EBS' is fine. Unfortunately, that is not enough for us. We want to scale a cluster (OUT or IN) within a couple of minutes. We need to be able to rotate all instances in a cluster very easily, without worrying about the data. We have to be able to replace a non-responsive Elasticsearch node by terminating the instance. Etc.

(If you are interested in how we approach these things, you can read Resilience & Reliability on AWS. It has a dedicated chapter on Elasticsearch, just to show how incredibly impressed we are with it. Most of the work was already done.)

@ejain
Contributor

ejain commented Mar 11, 2013

I'll second that setting up EBS complicates things in a setup where nodes are added and removed frequently, especially if performance is an issue.

@kimchy
Member

kimchy commented Mar 12, 2013

Snapshotting to s3 would bring the advantages of both the local gateway and the s3 gateway. We won't remove the s3 gateway before snapshotting is in place, for at least one major version.
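For context, the snapshot/restore API that eventually shipped supports exactly this flow. A sketch of registering an S3 repository and taking a snapshot might look like the following (repository and bucket names are illustrative, and the "s3" repository type requires the AWS cloud plugin):

```shell
# Register an S3 snapshot repository (illustrative names):
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup' -d '{
  "type": "s3",
  "settings": { "bucket": "my-es-backups", "region": "us-east-1" }
}'

# Take a snapshot of all indices and wait for it to finish:
curl -XPUT 'http://localhost:9200/_snapshot/s3_backup/snapshot_1?wait_for_completion=true'
```

Unlike the s3 gateway, snapshots here are taken on a schedule you control rather than continuously, so the cluster's system of record stays local.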

@youurayy

youurayy commented Mar 13, 2013

Just to add my 2 cents: the S3 shared gateway did not prevent my cluster from crashing into an irreparable state. I had to code a utility which went through the Lucene index files on disk and recovered/reindexed the data into a freshly initialized cluster. I believe I am much better off with the local gateway and daily snapshots of my EBS RAID5 arrays.

@kimchy
Member

kimchy commented Mar 13, 2013

The idea here is that snapshot/restore with the local gateway allows you to strike the right balance between keeping up-to-date local recoverability and long-term recoverability from something like s3.

@truthtrap

truthtrap commented Mar 14, 2013

@Shay, thanks for leaving some overlap in the current s3 gateway and the new snapshotting feature :)

For us, 'local recoverability' is on the shard level. We will always plan for the loss of an instance without losing the cluster; the cluster can recover itself. We choose to treat nodes as ephemeral. With a full cluster BREAKDOWN a little bit of lag is not a problem, and for a full cluster SHUTDOWN we can manage this properly ourselves.

We are extreme fans of EBS, actually. And there is another interesting application for EBS, and that is performance. Ephemeral storage is a lot slower with most rdbms we tried, for example, so perhaps EBS is necessary in cases of severe disk access. AWS has SSD ephemeral disks, but that is still a bit above budget for most of our apps.

Another interesting feature of EBS is that you can easily have 20 smaller volumes for the same price as a big volume. Because of the nature of EBS, you increase your potential read/write throughput more or less linearly. This principle could be applied to individual indexes, or even shards, if they can be assigned to different parts of the filesystem. This would be better manageable than RAID, in case local (instance) recoverability is an issue.

groet,
jurg.

@oravecz

oravecz commented Apr 1, 2013

We have been using the S3 shared gateway as a backup in production since 2010. Our use case is perhaps a bit different from some ES users', because we use ES to store smallish amounts of data. We also deploy to Elastic Beanstalk, so instances are created and destroyed by Amazon, and snapshotting and reuse of EBS is not appropriate. Sometimes we deploy a memory-only store with ES, which can only rely on the shared gateway for any kind of cluster recovery.

I am hopeful that the S3 gateway will not go away altogether, or perhaps that it will be replaced with the snapshot-to-s3 that Shay mentioned. My question, however, is: what is the difference between the S3 gateway now and the "Snapshot to S3" feature, besides the frequency with which they sync (which is customizable for the shared gateway)?

@kimchy
Member

kimchy commented Apr 1, 2013

@oravecz effectively, a scheduled snapshot to s3 using the future snapshot API will work in a similar manner to the s3 gateway. Recovery will work a bit differently: if you lose all the cluster data (lose all instances with ephemeral drives), you will need to explicitly "call recover" on the new cluster to recover the data from s3.
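The explicit "call recover" step described above maps onto the restore endpoint of the API that eventually shipped; a sketch (repository and snapshot names are illustrative) might be:

```shell
# On a freshly provisioned cluster, restore a snapshot from a previously
# registered S3 repository (illustrative names):
curl -XPOST 'http://localhost:9200/_snapshot/s3_backup/snapshot_1/_restore?wait_for_completion=true'
```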

@thomaswitt

thomaswitt commented Oct 13, 2013

To be honest, we're not a big fan of the more EBS-centric way of running Elasticsearch.

Please do consider that nearly every major downtime at AWS had something to do with EBS (often in conjunction with the loss of data). EBS is, in my opinion, one of the most flawed services (just google "aws downtimes ebs"). Which is also not AWS's fault; we have quite some large customers who invested millions of dollars in their "unbreakable" or "fully redundant" SAN, and they ALL had downtimes ranging from a few hours to several days.

So we rely heavily on running all our Elasticsearch stuff only on local instance storage and spreading the copies across multiple nodes in multiple availability zones. The S3 gateway always seemed to be a big help in avoiding long reindexing times in case of catastrophic events.

In my opinion, it'd be a good idea to have an easy out-of-the-box solution for people who don't want to run Elasticsearch on a non-local, distributed filesystem.

@dpb587 dpb587 referenced this issue in cityindex-attic/logsearch Dec 17, 2013

Closed

Analyze/Implement Auto Scaling for Elasticsearch #270

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015

mute pushed a commit to mute/elasticsearch that referenced this issue Jul 29, 2015
