-
Notifications
You must be signed in to change notification settings - Fork 13.9k
[hotfix][docs]Review to reduce passive voice, improve grammar and formatting #5277
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -28,12 +28,12 @@ under the License. | |
|
|
||
| ## Tasks and Operator Chains | ||
|
|
||
| For distributed execution, Flink *chains* operator subtasks together into *tasks*. Each task is executed by one thread. | ||
| For distributed execution, Flink *chains* operator subtasks together into *tasks*, with one thread executing each task. | ||
| Chaining operators together into tasks is a useful optimization: it reduces the overhead of thread-to-thread | ||
| handover and buffering, and increases overall throughput while decreasing latency. | ||
| The chaining behavior can be configured; see the [chaining docs](../dev/datastream_api.html#task-chaining-and-resource-groups) for details. | ||
| handover and buffering and increases overall throughput while decreasing latency. | ||
| You can configure the chaining behavior, read the [chaining docs](../dev/datastream_api.html#task-chaining-and-resource-groups) for details. | ||
|
|
||
| The sample dataflow in the figure below is executed with five subtasks, and hence with five parallel threads. | ||
| Five subtasks execute the sample data flow in the figure below with five parallel threads. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The community as well as other projects refer to "dataflow" extensively (also below). |
||
|
|
||
| <img src="../fig/tasks_chains.svg" alt="Operator chaining into Tasks" class="offset" width="80%" /> | ||
|
|
||
|
|
@@ -46,19 +46,23 @@ The Flink runtime consists of two types of processes: | |
| - The **JobManagers** (also called *masters*) coordinate the distributed execution. They schedule tasks, coordinate | ||
| checkpoints, coordinate recovery on failures, etc. | ||
|
|
||
| There is always at least one Job Manager. A high-availability setup will have multiple JobManagers, one of | ||
| which one is always the *leader*, and the others are *standby*. | ||
| There is always at least one Job Manager. A high-availability setup should have multiple JobManagers, one of | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "JobManager"?
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "A high-availability setup should have multiple JobManagers" is, in general, not true -- this is a detail that depends on the underlying cluster management framework. I suggest reworking as follows: There is always at least one JobManager, but some [high-availability setups]({{ site.baseurl }}/ops/jobmanager_high_availability.html) will have multiple JobManagers. |
||
| which one is the *leader*, and the others are *standby*. | ||
|
|
||
| - The **TaskManagers** (also called *workers*) execute the *tasks* (or more specifically, the subtasks) of a dataflow, | ||
| - The **TaskManagers** (also called *workers*) execute the *tasks* (or more specifically, the subtasks) of a data flow, | ||
| and buffer and exchange the data *streams*. | ||
|
|
||
| There must always be at least one TaskManager. | ||
|
|
||
| The JobManagers and TaskManagers can be started in various ways: directly on the machines as a [standalone cluster](../ops/deployment/cluster_setup.html), in | ||
| containers, or managed by resource frameworks like [YARN](../ops/deployment/yarn_setup.html) or [Mesos](../ops/deployment/mesos.html). | ||
| TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. | ||
| You can start the JobManagers and TaskManagers in three ways: | ||
|
|
||
| The **client** is not part of the runtime and program execution, but is used to prepare and send a dataflow to the JobManager. | ||
| - Directly on the machines as a [standalone cluster](../ops/deployment/cluster_setup.html). | ||
| - In containers. | ||
| - Managed by resource frameworks like [YARN](../ops/deployment/yarn_setup.html) or [Mesos](../ops/deployment/mesos.html). | ||
|
|
||
| TaskManagers connect to JobManagers, announcing themselves as available, and the JobManager assigns them work. | ||
|
|
||
| The **client** is not part of the runtime and program execution, but is used to prepare and send a data flow to the JobManager. | ||
| After that, the client can disconnect, or stay connected to receive progress reports. The client runs either as part of the | ||
| Java/Scala program that triggers the execution, or in the command line process `./bin/flink run ...`. | ||
|
|
||
|
|
@@ -69,15 +73,17 @@ Java/Scala program that triggers the execution, or in the command line process ` | |
| ## Task Slots and Resources | ||
|
|
||
| Each worker (TaskManager) is a *JVM process*, and may execute one or more subtasks in separate threads. | ||
| To control how many tasks a worker accepts, a worker has so called **task slots** (at least one). | ||
| To control how many tasks a worker accepts, a worker has **task slots** (at least one). | ||
|
|
||
| Each *task slot* represents a fixed subset of resources of the TaskManager. A TaskManager with three slots, for example, | ||
| will dedicate 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not | ||
| dedicates 1/3 of its managed memory to each slot. Slotting the resources means that a subtask will not | ||
| compete with subtasks from other jobs for managed memory, but instead has a certain amount of reserved | ||
| managed memory. Note that no CPU isolation happens here; currently slots only separate the managed memory of tasks. | ||
| managed memory. | ||
|
|
||
| <span class="label label-info">No CPU isolation happens here, slots only separate the managed memory of tasks.</span> | ||
|
|
||
| By adjusting the number of task slots, users can define how subtasks are isolated from each other. | ||
| Having one slot per TaskManager means each task group runs in a separate JVM (which can be started in a | ||
| Having one slot per TaskManager means each task group runs in a separate JVM (which you can start in a | ||
| separate container, for example). Having multiple slots | ||
| means more subtasks share the same JVM. Tasks in the same JVM share TCP connections (via multiplexing) and | ||
| heartbeat messages. They may also share data sets and data structures, thus reducing the per-task overhead. | ||
|
|
@@ -88,40 +94,36 @@ By default, Flink allows subtasks to share slots even if they are subtasks of di | |
| they are from the same job. The result is that one slot may hold an entire pipeline of the | ||
| job. Allowing this *slot sharing* has two main benefits: | ||
|
|
||
| - A Flink cluster needs exactly as many task slots as the highest parallelism used in the job. | ||
| No need to calculate how many tasks (with varying parallelism) a program contains in total. | ||
| - A Flink cluster needs as many task slots as the highest parallelism used in the job. | ||
| There's no need to calculate how many tasks (with varying parallelism) a program contains in total. | ||
|
|
||
| - It is easier to get better resource utilization. Without slot sharing, the non-intensive | ||
| *source/map()* subtasks would block as many resources as the resource intensive *window* subtasks. | ||
| *source/map()* subtasks would block as many resources as the resource-intensive *window* subtasks. | ||
| With slot sharing, increasing the base parallelism in our example from two to six yields full utilization of the | ||
| slotted resources, while making sure that the heavy subtasks are fairly distributed among the TaskManagers. | ||
| slotted resources, while making sure that the heavy subtasks are evenly distributed among the TaskManagers. | ||
|
|
||
| <img src="../fig/slot_sharing.svg" alt="TaskManagers with shared Task Slots" class="offset" width="80%" /> | ||
|
|
||
| The APIs also include a *[resource group](../dev/datastream_api.html#task-chaining-and-resource-groups)* mechanism which can be used to prevent undesirable slot sharing. | ||
| The APIs also include a *[resource group](../dev/datastream_api.html#task-chaining-and-resource-groups)* mechanism which you can use to prevent undesirable slot sharing. | ||
|
|
||
| As a rule-of-thumb, a good default number of task slots would be the number of CPU cores. | ||
| With hyper-threading, each slot then takes 2 or more hardware thread contexts. | ||
| As a rule-of-thumb, a reasonable default number of task slots would be the number of CPU cores. With hyper-threading, each slot then takes 2 or more hardware thread contexts. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Preserve the line split? I'm not sure what the second sentence is saying. Typically hyper-threads are reported as separate cores so would not each slot take a single hardware thread context? |
||
|
|
||
| {% top %} | ||
|
|
||
| ## State Backends | ||
|
|
||
| The exact data structures in which the key/values indexes are stored depends on the chosen [state backend](../ops/state/state_backends.html). One state backend | ||
| stores data in an in-memory hash map, another state backend uses [RocksDB](http://rocksdb.org) as the key/value store. | ||
| In addition to defining the data structure that holds the state, the state backends also implement the logic to | ||
| take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint. | ||
| The exact data structures which store the key/values indexes depends on the chosen [state backend](../ops/state/state_backends.html). One state backend stores data in an in-memory hash map, another state backend uses [RocksDB](http://rocksdb.org) as the key/value store. In addition to defining the data structure that holds the state, the state backends also implement the logic to take a point-in-time snapshot of the key/value state and store that snapshot as part of a checkpoint. | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Preserve the line splits? The conversion to HTML ignores single newlines. |
||
|
|
||
| <img src="../fig/checkpoints.svg" alt="checkpoints and snapshots" class="offset" width="60%" /> | ||
|
|
||
| {% top %} | ||
|
|
||
| ## Savepoints | ||
|
|
||
| Programs written in the Data Stream API can resume execution from a **savepoint**. Savepoints allow both updating your programs and your Flink cluster without losing any state. | ||
| Programs written in the Data Stream API can resume execution from a **savepoint**. Savepoints allow both updating your programs and your Flink cluster without losing any state. | ||
|
|
||
| [Savepoints](../ops/state/savepoints.html) are **manually triggered checkpoints**, which take a snapshot of the program and write it out to a state backend. They rely on the regular checkpointing mechanism for this. During execution programs are periodically snapshotted on the worker nodes and produce checkpoints. For recovery only the last completed checkpoint is needed and older checkpoints can be safely discarded as soon as a new one is completed. | ||
| [Savepoints](../ops/state/savepoints.html) are **manually triggered checkpoints**, which take a snapshot of the program and write it out to a state backend. They rely on the regular checkpointing mechanism for this. During execution, programs are periodically snapshotted on the worker nodes and produce checkpoints. You only need the last completed checkpoint for recovery, and you can safely discard older checkpoints as soon as a new one is completed. | ||
|
|
||
| Savepoints are similar to these periodic checkpoints except that they are **triggered by the user** and **don't automatically expire** when newer checkpoints are completed. Savepoints can be created from the [command line](../ops/cli.html#savepoints) or when cancelling a job via the [REST API](../monitoring/rest_api.html#cancel-job-with-savepoint). | ||
| Savepoints are similar to these periodic checkpoints except that they are **triggered by the user** and **don't automatically expire** when newer checkpoints are completed. You can create savepoints can from the [command line](../ops/cli.html#savepoints) or when canceling a job via the [REST API](../monitoring/rest_api.html#cancel-job-with-savepoint). | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. "savepoints can" -> "savepoints" |
||
|
|
||
| {% top %} | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Preserve the semi-colon?