apache · pnowojski · May 4, 2021 · Apr 29, 2021 · Apr 29, 2021
diff --git a/docs/content/docs/ops/monitoring/back_pressure.md b/docs/content/docs/ops/monitoring/back_pressure.md
@@ -37,50 +37,48 @@ If you see a **back pressure warning** (e.g. `High`) for a task, this means that
 Take a simple `Source -> Sink` job as an example. If you see a warning for `Source`, this means that `Sink` is consuming data slower than `Source` is producing. `Sink` is back pressuring the upstream operator `Source`.
 
 
-## Sampling Back Pressure
+## Task performance metrics
 
-Back pressure monitoring works by repeatedly taking back pressure samples of your running tasks. The JobManager triggers repeated calls to `Task.isBackPressured()` for the tasks of your job.
+Every parallel instance of a task (subtask) is exposing a group of three metrics:
+- `backPressureTimeMsPerSecond`, time that subtask spent being back pressured
+- `idleTimeMsPerSecond`, time that subtask spent waiting for something to process
+- `busyTimeMsPerSecond`, time that subtask was busy doing some actual work
+At any point of time these three metrics are adding up approximately to `1000ms`.
 
-{{< img src="/fig/back_pressure_sampling.png" class="img-responsive" >}}
-<!-- https://docs.google.com/drawings/d/1O5Az3Qq4fgvnISXuSf-MqBlsLDpPolNB7EQG7A3dcTk/edit?usp=sharing -->
-
-Internally, back pressure is judged based on the availability of output buffers. If there is no available buffer (at least one) for output, then it indicates that there is back pressure for the task.
-
-By default, the job manager triggers 100 samples every 50ms for each task in order to determine back pressure. The ratio you see in the web interface tells you how many of these samples were indicating back pressure, e.g. `0.01` indicates that only 1 in 100 was back pressured.
-
-- **OK**: 0 <= Ratio <= 0.10
-- **LOW**: 0.10 < Ratio <= 0.5
-- **HIGH**: 0.5 < Ratio <= 1
-
-In order to not overload the task managers with back pressure samples, the web interface refreshes samples only after 60 seconds.
-
-## Configuration
-
-You can configure the number of samples for the job manager with the following configuration keys:
-
-- `web.backpressure.refresh-interval`: Time after which available stats are deprecated and need to be refreshed (DEFAULT: 60000, 1 min).
-- `web.backpressure.num-samples`: Number of samples to take to determine back pressure (DEFAULT: 100).
-- `web.backpressure.delay-between-samples`: Delay between samples to determine back pressure (DEFAULT: 50, 50 ms).
+These metrics are being updated every couple of seconds, and the reported value represents the
+average time that subtask was back pressured (or idle or busy) during the last couple of seconds.
+Keep this in mind if your job has a varying load. For example, a subtask with a constant load of 50%
+and another subtask that is alternating every second between fully loaded and idling will both have
+the same value of `busyTimeMsPerSecond`: around `500ms`.
 
+Internally, back pressure is judged based on the availability of output buffers.
+If a task has no available output buffers, then that task is considered back pressured.
+Idleness, on the other hand, is determined by whether or not there is input available.
 
 ## Example
 
-You can find the *Back Pressure* tab next to the job overview.
+The WebUI aggregates the maximum value of the back pressure and busy metrics from all of the
+subtasks and presents those aggregated values inside the JobGraph. Besides displaying the raw
+values, tasks are also color-coded to make the investigation easier.
 
-### Sampling In Progress
+{{< img src="/fig/back_pressure_job_graph.png" class="img-responsive" >}}
 
-This means that the JobManager triggered a back pressure sample of the running tasks. With the default configuration, this takes about 5 seconds to complete.
+Idling tasks are blue, fully back pressured tasks are black, and fully busy tasks are colored red.
+All values in between are represented as shades between those three colors.
 
-Note that clicking the row, you trigger the sample for all subtasks of this operator.
+### Back Pressure Status
 
-{{< img src="/fig/back_pressure_sampling_in_progress.png" class="img-responsive" >}}
+In the *Back Pressure* tab next to the job overview you can find more detailed metrics.
 
-### Back Pressure Status
+{{< img src="/fig/back_pressure_subtasks.png" class="img-responsive" >}}
 
-If you see status **OK** for the tasks, there is no indication of back pressure. **HIGH** on the other hand means that the tasks are back pressured.
+For subtasks whose status is **OK**, there is no indication of back pressure. **HIGH**, on the
+other hand, means that a subtask is back pressured. Status is defined in the following way:
 
-{{< img src="/fig/back_pressure_sampling_ok.png" class="img-responsive" >}}
+- **OK**: 0% <= back pressured <= 10%
+- **LOW**: 10% < back pressured <= 50%
+- **HIGH**: 50% < back pressured <= 100%
 
-{{< img src="/fig/back_pressure_sampling_high.png" class="img-responsive" >}}
+Additionally, you can find the percentage of time each subtask is back pressured, idle, or busy.
 
 {{< top >}}
diff --git a/docs/content/docs/ops/state/checkpoints.md b/docs/content/docs/ops/state/checkpoints.md
@@ -177,11 +177,7 @@ $ bin/flink run -s :checkpointMetaDataPath [:runArgs]
 
 ### Unaligned checkpoints
 
-{{< hint danger >}}
-Unaligned checkpoints may produce corrupted checkpoints in 1.12.0 and 1.12.1 and we discourage use in production settings.
-{{< /hint >}}
-
-Starting with Flink 1.11, checkpoints can be unaligned. 
+Starting with Flink 1.11, checkpoints can be unaligned.
 [Unaligned checkpoints]({{< ref "docs/concepts/stateful-stream-processing" >}}#unaligned-checkpointing) contain in-flight data (i.e., data stored in
 buffers) as part of the checkpoint state, which allows checkpoint barriers to
 overtake these buffers. Thus, the checkpoint duration becomes independent of the
@@ -194,13 +190,10 @@ independent of the end-to-end latency. Be aware unaligned checkpointing
 adds to I/O to the state backends, so you shouldn't use it when the I/O to
 the state backend is actually the bottleneck during checkpointing.
 
-Note that unaligned checkpoints is a brand-new feature that currently has the
+Note that unaligned checkpointing is a new feature that currently has the
 following limitations:
 
-- You cannot rescale or change job graph with from unaligned checkpoints. You 
-  have to take a savepoint before rescaling. Savepoints are always aligned 
-  independent of the alignment setting of checkpoints.
-- Flink currently does not support concurrent unaligned checkpoints. However, 
+- Flink currently does not support concurrent unaligned checkpoints. However,
   due to the more predictable and shorter checkpointing times, concurrent 
   checkpoints might not be needed at all. However, savepoints can also not 
   happen concurrently to unaligned checkpoints, so they will take slightly 
@@ -219,10 +212,10 @@ state. To support rescaling, watermarks should be stored per key-group in a
 union-state. We most likely will implement this approach as a general solution 
 (didn't make it into Flink 1.11.0).
 
-In the upcoming release(s), Flink will address these limitations and will
-provide a fine-grained way to trigger unaligned checkpoints only for the 
-in-flight data that moves slowly with timeout mechanism. These options will
-decrease the pressure on I/O in the state backends and eventually allow
-unaligned checkpoints to become the default checkpointing. 
+After enabling unaligned checkpoints, you can also specify the alignment timeout via
+`CheckpointConfig.setAlignmentTimeout(Duration)` or `execution.checkpointing.alignment-timeout` in
+the configuration file. When activated, each checkpoint will still begin as an aligned checkpoint,
+but if the alignment time for some subtask exceeds this timeout, then the checkpoint will proceed as an
+unaligned checkpoint.
 
 {{< top >}}
diff --git a/docs/static/fig/back_pressure_job_graph.png b/docs/static/fig/back_pressure_job_graph.png
diff --git a/docs/static/fig/back_pressure_sampling.png b/docs/static/fig/back_pressure_sampling.png
diff --git a/docs/static/fig/back_pressure_sampling_high.png b/docs/static/fig/back_pressure_sampling_high.png
diff --git a/docs/static/fig/back_pressure_sampling_in_progress.png b/docs/static/fig/back_pressure_sampling_in_progress.png
diff --git a/docs/static/fig/back_pressure_sampling_ok.png b/docs/static/fig/back_pressure_sampling_ok.png
diff --git a/docs/static/fig/back_pressure_subtasks.png b/docs/static/fig/back_pressure_subtasks.png