Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation on remote recovery #39483

Merged
merged 9 commits into from
Mar 5, 2019
Merged
Show file tree
Hide file tree
Changes from 4 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions docs/reference/ccr/getting-started.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -253,6 +253,11 @@ PUT /server-metrics-copy/_ccr/follow?wait_for_active_shards=1

//////////////////////////

The follower index is bootstrapped using the <<remote-recovery, remote recovery>>
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
process. The remote recovery process transfers the existing Lucene segment files
from the leader to the follower. When the remote recovery process is complete,
the index following will be initiated.
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved

Now when you index documents into your leader index, you will see these
documents replicated in the follower index. You can
inspect the status of replication using the
Expand Down
65 changes: 65 additions & 0 deletions docs/reference/ccr/remote-recovery.asciidoc
Original file line number Diff line number Diff line change
@@ -0,0 +1,65 @@
[[remote-recovery]]
=== Remote Recovery

Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
Remote recovery is the process used to build a new copy of a shard on a follower
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
node by copying data from the primary shard in the leader cluster. {es} uses this
remote recovery process to bootstrap a follower index using the data from the
leader index. This allows the follower to receive a copy of the current state of
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
the leader index, even if a complete history of changes is not available on the
leader due to Lucene segment merging.

Remote recovery is a network intensive process that transfers all of the Lucene
segment files from the leader cluster to the follower cluster. The follower
requests that a recovery session be initiated on the primary shard in the leader
cluster. The follower then requests file chunks concurrently from the leader. By
default, the the process concurrently requests `5` large `1mb` file chunks as remote
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
recovery is designed to support leader and follower clusters with high network
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
latency between them.

Information about an in-progress remote recovery can be obtained using the
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
<<cat-recovery,recovery>> api on the follower cluster. Remote recoveries are implemented
using the <<modules-snapshots,snapshot and restore>> infrastructure. This means that
on-going remote recoveries will be labelled as type `snapshot` in the recovery api.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
on-going remote recoveries will be labelled as type `snapshot` in the recovery api.
on-going remote recoveries are labelled as type `snapshot` in the recovery API.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there some way to know when the recovery process is complete (other than using the recovery API)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The recovery API is the primary. You could probably use the cat indices api to check if the index is green. You can also add a parameter to a follower request to wait until the process is completed. However, that is documented on the put follow request page.


The following setting can be used to rate-limit the data transmitted during remote
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We generally store these types of settings in the Elasticsearch Reference in pages like the ones linked here: https://www.elastic.co/guide/en/elasticsearch/reference/master/settings-xpack.html

I am happy to create that page if you agree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah that makes sense.

Copy link
Contributor

@lcawl lcawl Mar 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I've added those changes.

recoveries:

`ccr.indices.recovery.max_bytes_per_sec`::
Tim-Brooks marked this conversation as resolved.
Show resolved Hide resolved
Limits the total inbound and outbound remote recovery traffic on each node.
Since this limit applies on each node, but there may be many nodes
performing remote recoveries concurrently, the total amount of remote recovery bytes
may be much higher than this limit. If you set this limit too high then there
is a risk that ongoing remote recoveries will consume an excess of bandwidth
(or other resources) which could destabilize the cluster. This setting is used by both
the leader and follower clusters. For example if it is set to `20mb` on a leader, the
leader will only send `20mb/s` to the follower even if the follower is requesting and can
accept `60mb/s`. Defaults to `40mb`.

The following _expert_ settings can be set to manage the resources consumed by
remote recoveries:

`ccr.indices.recovery.max_concurrent_file_chunks`::
Controls the number of file chunk requests that can be sent in parallel per recovery.
As multiple remote recoveries might already running in parallel, increasing this
expert-level setting might only help in situations where remote recovery of a single shard
is not reaching the total inbound and outbound remote recovery traffic as configured by
`ccr.indices.recovery.max_bytes_per_sec`. Defaults to `5`. The maximum allowed value is
`10`.

`ccr.indices.recovery.chunk_size`::
Controls the chunk size requested by the follower during file transfer. Defaults to
`1mb`.

`ccr.indices.recovery.recovery_activity_timeout`::
Controls the timeout for recovery activity. This timeout primarily applies on the leader
cluster. The leader cluster must open resources in-memory to supply data to the follower
during the recovery process. If the leader does not receive recovery requests from the
follower for this period of time, it will close the resources. Defaults to `60 seconds`.

`ccr.indices.recovery.internal_action_timeout`::
Controls the timeout for individual network requests during the remote recovery
process. An individual action timing out can fail the recovery. Defaults to `60 seconds`.


These settings can be dynamically updated on a live cluster with the
<<cluster-update-settings,cluster-update-settings>> API.