
[FLINK-33398][runtime] Support switching from batch to stream mode for one input stream operator #23521

Open · wants to merge 14 commits into base: master from FLIP-327

Conversation

@Sxnan (Contributor) commented Oct 13, 2023

What is the purpose of the change

This PR adds support for switching from batch to stream mode to improve performance when processing backlog data with a one-input stream operator.

Brief change log

  • Introduce RecordAttributes to notify downstream operators whether the incoming records are backlog data (a hedged sketch follows this list).
  • Propagate the RecordAttributes at runtime.
  • Automatically sort the backlog data for one-input operators.
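
For illustration, here is a minimal sketch (not this PR's actual code) of how a one-input operator could consume these attributes, assuming the RecordAttributes#isBacklog flag and the processRecordAttributes hook proposed in FLIP-327; the class, helper names, and exact package locations are assumptions:

    // Sketch only: all names below are hypothetical.
    import org.apache.flink.streaming.api.operators.AbstractStreamOperator;
    import org.apache.flink.streaming.api.operators.OneInputStreamOperator;
    import org.apache.flink.streaming.runtime.streamrecord.RecordAttributes;
    import org.apache.flink.streaming.runtime.streamrecord.StreamRecord;

    public class BacklogAwareOperatorSketch<IN, OUT> extends AbstractStreamOperator<OUT>
            implements OneInputStreamOperator<IN, OUT> {

        private boolean isBacklog = false;

        @Override
        public void processRecordAttributes(RecordAttributes recordAttributes) throws Exception {
            boolean newIsBacklog = recordAttributes.isBacklog();
            if (isBacklog && !newIsBacklog) {
                // Backlog phase ended: drain whatever was buffered/sorted in "batch" mode.
                flushBufferedRecords();
            }
            isBacklog = newIsBacklog;
        }

        @Override
        public void processElement(StreamRecord<IN> element) throws Exception {
            if (isBacklog) {
                bufferForBatchStyleProcessing(element); // e.g. collect and sort by key
            } else {
                processImmediately(element); // normal per-record streaming path
            }
        }

        private void bufferForBatchStyleProcessing(StreamRecord<IN> element) { /* hypothetical */ }

        private void flushBufferedRecords() { /* hypothetical */ }

        private void processImmediately(StreamRecord<IN> element) { /* hypothetical */ }
    }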

Verifying this change

This change added tests and can be verified as follows:

  • Added integration test for keyed aggregation and keyed windowed aggregation with backlog data.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): yes
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs

@flinkbot (Collaborator) commented Oct 13, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@Sxnan Sxnan marked this pull request as ready for review October 16, 2023 03:48
@Sxnan Sxnan force-pushed the FLIP-327 branch 2 times, most recently from 36a0166 to edd6648 Compare October 16, 2023 06:37

@Sxnan (Contributor, Author) commented Oct 17, 2023

@flinkbot run azure

@yunfengzhou-hub (Contributor) left a comment

Thanks for the PR. I left some comments below.

@Sxnan Sxnan force-pushed the FLIP-327 branch 4 times, most recently from df7e9f5 to 756f929 Compare October 24, 2023 10:11

@yunfengzhou-hub (Contributor) left a comment

Thanks for the update. I left some comments below.

Besides, this PR does not cover all the changes proposed in FLIP-327. According to offline discussions, we need more PRs to support state cache and built-in multi-input operators. Therefore, it might be better to create child tickets for FLINK-33202 that cover the whole implementation plan, and assign this PR to one of these child tickets.

@Sxnan Sxnan changed the title [FLINK-33202][runtime] Support switching from batch to stream mode to improve throughput when processing backlog data [FLINK-33398][runtime] Support switching from batch to stream mode for one input stream operator Oct 30, 2023
@Sxnan Sxnan force-pushed the FLIP-327 branch 4 times, most recently from d83c464 to a4013fc Compare November 7, 2023 06:35

@Sxnan (Contributor, Author) commented Nov 7, 2023

@yunfengzhou-hub Thanks for the review! I updated the PR. Can you have another look?

@yunfengzhou-hub (Contributor) left a comment

Thanks for the update. I left some minor comments below. @xintongsong, could you please take a look at this PR?

@Sxnan (Contributor, Author) commented Nov 14, 2023

@xintongsong Could you help review this PR?

@SmirAlex (Contributor)

Hi @Sxnan! Sorry, I know I'm not a reviewer, but I happened to be testing functionality from this PR recently and found what I believe is a bug. It concerns the logic of RecordAttributesValve.
In my test there was very little backlog data at the source, and the source had parallelism 4 (actually, any value > 1 triggers the case). Because the backlog phase was very short, the time interval between sending RecordAttributes(isBacklog=true) and RecordAttributes(isBacklog=false) was also very short. In addition, due to the high parallelism, one source subtask could send RecordAttributes(isBacklog=false) even before another subtask sent RecordAttributes(isBacklog=true). As a result, a race condition occurred in RecordAttributesValve#inputRecordAttributes: backlogChannelsCnt was incremented and decremented concurrently, so it never reached numInputChannels, and no RecordAttributes was emitted from RecordAttributesValve at all.

I suggest keeping separate counters for RecordAttributes(isBacklog=true) and RecordAttributes(isBacklog=false), so that the race condition I mentioned above cannot affect the result. Something like this:

    // Count backlog and non-backlog announcements separately so that one does not
    // cancel out the other when channels switch at different times.
    if (recordAttributes.isBacklog()) {
        backlogChannelsCnt += 1;
        if (backlogChannelsCnt != numInputChannels) {
            return;
        }
        backlogChannelsCnt = 0;
    } else {
        nonBacklogChannelsCnt += 1;
        if (nonBacklogChannelsCnt != numInputChannels) {
            return;
        }
        nonBacklogChannelsCnt = 0;
    }

    // Only emit when the combined backlog status actually changes.
    if (lastOutputAttributes == null
            || lastOutputAttributes.isBacklog() != recordAttributes.isBacklog()) {
        if (lastOutputAttributes != null && !lastOutputAttributes.isBacklog()) {
            LOG.warn(
                    "Switching from non-backlog to backlog is currently not supported. Backlog status remains.");
            return;
        }
        lastOutputAttributes = recordAttributes;
        output.emitRecordAttributes(recordAttributes);
    }

WDYT? Also, it would be good to have a test for the aforementioned case.
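
For illustration, a rough sketch of such a test (names like RecordAttributesValve#inputRecordAttributes, RecordAttributesBuilder, and the CollectingDataOutput test double are assumed from this PR and may not match the final API; imports omitted):

    // Sketch only: two channels switch backlog status at different times.
    @Test
    void testChannelsSwitchBacklogStatusOutOfOrder() throws Exception {
        final RecordAttributesValve valve = new RecordAttributesValve(2);
        final CollectingDataOutput<Object> output = new CollectingDataOutput<>(); // hypothetical test double

        final RecordAttributes backlog =
                new RecordAttributesBuilder(Collections.emptyList()).setBacklog(true).build();
        final RecordAttributes nonBacklog =
                new RecordAttributesBuilder(Collections.emptyList()).setBacklog(false).build();

        valve.inputRecordAttributes(backlog, 0, output);    // channel 0 enters backlog
        valve.inputRecordAttributes(nonBacklog, 0, output); // ...and leaves it before channel 1 reports anything
        valve.inputRecordAttributes(backlog, 1, output);    // channel 1 enters backlog late
        valve.inputRecordAttributes(nonBacklog, 1, output); // now every channel reports non-backlog

        // With separate counters, a RecordAttributes(isBacklog=false) must eventually be emitted.
        assertThat(output.getEvents()).contains(nonBacklog);
    }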

@Sxnan (Contributor, Author) commented Nov 16, 2023

(Quoted @SmirAlex's comment above in full.)

Hi @SmirAlex. Thanks for trying this out! The RecordAttributesValve combines the RecordAttributes from the different input channels of the same input. The input is considered to be in the backlog state if and only if all of its input channels report backlog = true; otherwise, some non-backlog records would be treated as backlog records.

Back to your case, where there is very little backlog data and the parallelism is greater than 1: it is possible for one input channel to switch to the non-backlog state before other input channels have switched to the backlog state. When that happens, we simply process those records as if they were non-backlog data.
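
In other words (illustration only, not the valve's actual code), the combined status is the logical AND over the latest status reported by each channel:

    // The input is in backlog state only if every channel's latest RecordAttributes
    // reported isBacklog = true; a single non-backlog channel makes the input non-backlog.
    static boolean combinedIsBacklog(boolean[] latestIsBacklogPerChannel) {
        for (boolean channelIsBacklog : latestIsBacklogPerChannel) {
            if (!channelIsBacklog) {
                return false;
            }
        }
        return true;
    }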

@SmirAlex (Contributor)

Thanks for the response, @Sxnan. I got your points. BTW, the reason delivery guarantees for RecordAttributes elements to downstream operators were important to me is that I was testing the possibility of using them to properly implement a processing-time temporal join. I described my thoughts about this in the FLIP-326 discussion thread. Thus, if it is eventually decided to reuse the RecordAttributes logic to solve FLINK-19830, the problem I described will arise one way or another.

Also, I should mention that there is another workaround that guarantees delivery of RecordAttributes(isBacklog=false) (though not of isBacklog=true) while still preventing non-backlog records from being treated as backlog records. The pseudocode could look like this:

// Emit non-backlog attributes as soon as every channel has reported isBacklog=false,
// even if no backlog attributes were emitted before.
if (lastOutputAttributes == null && nonBacklogChannelsCnt == numInputChannels) {
    nonBacklogChannelsCnt = 0;
    output.emitRecordAttributes(RecordAttributes(isBacklog=false));
}

The downside is that a RecordAttributes(isBacklog=false) can be emitted without a preceding RecordAttributes(isBacklog=true), but that is probably acceptable. Otherwise, this approach worked fine for me.

@Sxnan Sxnan force-pushed the FLIP-327 branch 3 times, most recently from 91c334d to ffd611b Compare November 22, 2023 14:28
Comment on lines +81 to +83
if (newKey == null) {
currentWatermark = maxWatermarkDuringBacklog;
}

What does newKey == null mean? And why do we need to update the current watermark under that condition?

Comment on lines +399 to +401
public KeyGroupedInternalPriorityQueue<TimerHeapInternalTimer<K, N>> getEventTimeTimersQueue() {
return eventTimeTimersQueue;
}

It feels hacky to expose an internal queue from one component and use it to construct another component. If the queue is meant to be shared, I think the correct way is to create the queue outside the two components and pass it in as a constructor argument.
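
A minimal sketch of the suggested wiring, with hypothetical stand-in types (this shows the shape of the idea, not the actual Flink classes): the caller creates the shared queue once and injects it into both components, so neither component needs a getter that exposes its internals.

    import java.util.PriorityQueue;

    // Hypothetical stand-ins for the two timer-service components that share one queue.
    final class StreamTimeServiceSketch {
        private final PriorityQueue<Long> eventTimeTimers;

        StreamTimeServiceSketch(PriorityQueue<Long> eventTimeTimers) {
            this.eventTimeTimers = eventTimeTimers;
        }
    }

    final class BacklogTimeServiceSketch {
        private final PriorityQueue<Long> eventTimeTimers;

        BacklogTimeServiceSketch(PriorityQueue<Long> eventTimeTimers) {
            this.eventTimeTimers = eventTimeTimers;
        }
    }

    final class TimerServiceWiring {
        static void wire() {
            // The caller owns the shared queue and passes the same instance to both constructors.
            PriorityQueue<Long> sharedEventTimeTimers = new PriorityQueue<>();
            new StreamTimeServiceSketch(sharedEventTimeTimers);
            new BacklogTimeServiceSketch(sharedEventTimeTimers);
        }
    }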

* the timer service will be the maximum watermark during backlog processing.
*/
@Internal
public class InternalBacklogAwareTimerServiceImpl<K, N> implements InternalTimerService<N> {

TBH, I'm not a big fan of inheritance here. The problem is that, by directly accessing protected fields of the superclass from the subclasses, the contract between the classes becomes obscure. Since subclasses can see the internals of the superclass, changes to the superclass can easily break things the subclasses depend on, and such issues are very hard to avoid.

In many cases, an inheritance-based approach can be replaced by an equivalent approach based on composition, which has clearer contracts between classes (see the sketch after the list below). Admittedly, that's not always possible and sometimes gets costly.

For the classes below, I haven't spent enough time to explore whether it's possible to switch to a composition-based approach. Do you think it's feasible?

  • BacklogTimeService
  • BatchExecutionInternalTimeService
  • InternalBacklogAwareTimerServiceManagerImpl
  • InternalTimeServiceManagerImpl
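
For reference, a purely illustrative sketch of a composition-based variant, with hypothetical names and deliberately simplified backlog semantics; the point is only that the wrapper talks to the wrapped service through a public contract instead of reaching into protected fields:

    // Illustration only: delegation instead of subclassing.
    interface TimeServiceContract {
        void advanceWatermark(long watermark) throws Exception;
    }

    final class BacklogAwareTimeServiceSketch implements TimeServiceContract {
        private final TimeServiceContract delegate; // used only via its public contract
        private long maxWatermarkDuringBacklog = Long.MIN_VALUE;
        private boolean backlog = true;

        BacklogAwareTimeServiceSketch(TimeServiceContract delegate) {
            this.delegate = delegate;
        }

        @Override
        public void advanceWatermark(long watermark) throws Exception {
            if (backlog) {
                // During backlog, only remember the maximum watermark; timers fire later.
                maxWatermarkDuringBacklog = Math.max(maxWatermarkDuringBacklog, watermark);
            } else {
                delegate.advanceWatermark(watermark);
            }
        }

        void endBacklog() throws Exception {
            backlog = false;
            // Forward the maximum watermark seen during backlog exactly once.
            delegate.advanceWatermark(maxWatermarkDuringBacklog);
        }
    }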

Comment on lines +421 to +423
if (Duration.ZERO.equals(
configuration.get(
ExecutionCheckpointingOptions.CHECKPOINTING_INTERVAL_DURING_BACKLOG))) {

Same here for deciding whether to enable the mixed mode.

Comment on lines +101 to +102
override def processRecordAttributes(recordAttributes: RecordAttributes): Unit =
super.processRecordAttributes(recordAttributes)

I'm not familiar with Scala. What is this for? Shouldn't the method from the superclass be called anyway if we don't override it?

@Sxnan (Contributor, Author) commented Dec 13, 2023

After an offline discussion with @xintongsong, we will split this PR into two. This PR will focus on optimizing the one-input operator during backlog processing; the other PR will focus on propagating the RecordAttributes through the job graph.

This PR should be merged after the other one.
