Flink: change sink shuffle to use RowData as data type and statistics key type #7494
Conversation
… key type, because FlinkSink normalizes the data type to RowData before it reaches the writer. Also added custom type serializers for MapDataStatistics and DataStatisticsOrRecord.
Address some of the comments from issue #7393. @huyuanfeng2018 @hililiwei @yegangy0718 can you help review? @yegangy0718 will follow up with a separate PR on a JMH benchmark.
Factory is replaced by TypeSerializer
@@ -28,10 +29,10 @@
 * (sketching) can be used.
 */
@Internal
-interface DataStatistics<K> {
+interface DataStatistics<D extends DataStatistics, S> {
Used a generics trick (a self-referential type parameter) for strong type checking. It shouldn't matter to users, since all of these are internal classes.
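As a minimal sketch of the self-referential ("F-bounded") generic pattern behind `DataStatistics<D extends DataStatistics, S>`: the implementing type appears in its own bound, so `merge` is statically typed to the concrete statistics class rather than the raw interface. All names here are illustrative, not the Iceberg classes.

```java
import java.util.HashMap;
import java.util.Map;

// D is the concrete implementation type; S is the exposed statistics result type.
interface Statistics<D extends Statistics<D, S>, S> {
  void add(String key);  // record one occurrence of a key
  S result();            // expose the collected statistics
  void merge(D other);   // statically typed to the concrete class D
}

class MapStatistics implements Statistics<MapStatistics, Map<String, Long>> {
  private final Map<String, Long> counts = new HashMap<>();

  @Override
  public void add(String key) {
    counts.merge(key, 1L, Long::sum);
  }

  @Override
  public Map<String, Long> result() {
    return counts;
  }

  @Override
  public void merge(MapStatistics other) {
    // compiles only against another MapStatistics, not an arbitrary Statistics
    other.result().forEach((k, v) -> counts.merge(k, v, Long::sum));
  }
}
```

The payoff is that a caller holding two `MapStatistics` instances can merge them without casts, while mixing incompatible statistics implementations fails at compile time.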
@Internal
class MapDataStatisticsSerializer
    extends TypeSerializer<DataStatistics<MapDataStatistics, Map<RowData, Long>>> {
Use the base interface DataStatistics so that it can be used by DataStatisticsOrRecordSerializer.
DataStatisticsOrRecordSerializer(
TypeSerializer<DataStatistics<D, S>> statisticsSerializer,
TypeSerializer<RowData> recordSerializer)
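The constructor above composes two serializers because each stream element carries either statistics or a record, never both. A hedged plain-Java sketch of that union shape (illustrative only, not the Iceberg `DataStatisticsOrRecord` class): a serializer composed over it would write a one-byte tag and then delegate to the matching inner serializer.

```java
// Exactly one of the two fields is non-null.
final class StatsOrRecord<S, R> {
  private final S statistics; // set only for a statistics event
  private final R record;     // set only for a data record

  private StatsOrRecord(S statistics, R record) {
    this.statistics = statistics;
    this.record = record;
  }

  static <S, R> StatsOrRecord<S, R> fromStatistics(S s) {
    return new StatsOrRecord<>(s, null);
  }

  static <S, R> StatsOrRecord<S, R> fromRecord(R r) {
    return new StatsOrRecord<>(null, r);
  }

  boolean hasStatistics() {
    return statistics != null;
  }

  S statistics() {
    return statistics;
  }

  R record() {
    return record;
  }
}
```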
assertTrue(mapDataStatistics.dataStatistics().containsKey("b"));
assertEquals(2L, (long) mapDataStatistics.dataStatistics().get("a"));
assertEquals(1L, (long) mapDataStatistics.dataStatistics().get("b"));
try (OneInputStreamOperatorTestHarness< |
Need to wrap the block in a try-with-resources with the test harness; otherwise Flink doesn't know how to serialize the output type DataStatisticsOrRecord. The test harness has the proper setup(...) for the output type serializer.
@@ -42,12 +43,19 @@
 *
 * @param key generate from data by applying key selector
 */
-void add(K key);
+void add(RowData key);
We will use RowDataProjection to extract the key. PR #7493 is related.
Is it necessary to add a method here that also counts a value with the statistics? For example, data under different keys may have different sizes; that could also matter when controlling the subsequent balance, e.g. add(RowData key, V v), where v might represent the record bytes of the row for the current key. What do you think?
We are only counting records per key. Getting the bytes would require serialization or some other estimation trick. Agreed that bytes would be best, but record count is probably also good enough.
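The two options in this exchange (count per key vs. weight per key) can be sketched side by side. This is a hypothetical illustration of the suggested API, not code from the PR: the weighted overload accumulates a caller-supplied estimate such as row bytes, while the no-weight form degenerates to a record count.

```java
import java.util.HashMap;
import java.util.Map;

class WeightedKeyStatistics {
  private final Map<String, Long> weightPerKey = new HashMap<>();

  // record-count flavor: every row contributes 1
  void add(String key) {
    add(key, 1L);
  }

  // byte-weight flavor: caller supplies an estimated size for the row
  void add(String key, long weight) {
    weightPerKey.merge(key, weight, Long::sum);
  }

  long weightOf(String key) {
    return weightPerKey.getOrDefault(key, 0L);
  }
}
```

The trade-off discussed above is visible here: the weighted form needs a per-row size estimate (which usually means serializing or approximating), while the count form is free.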
@@ -40,50 +40,49 @@
 * shuffle record to improve data clustering while maintaining relative balanced traffic
 * distribution to downstream subtasks.
 */
-class DataStatisticsOperator<T, K> extends AbstractStreamOperator<DataStatisticsOrRecord<T, K>>
-    implements OneInputStreamOperator<T, DataStatisticsOrRecord<T, K>>, OperatorEventHandler {
+class DataStatisticsOperator<D extends DataStatistics<D, S>, S>
OK, I think this DataStatisticsOperator is good for collecting statistics. It seems we also need an operator to determine the partition ID, such as a PartitionIdAssignerOperator, and then pass org.apache.flink.api.java.functions.IdPartitioner for custom data distribution. I haven't seen the design of this piece in the design document; can you briefly introduce the follow-up implementation?
Yes, this is only for collecting statistics to guide the partitioner decision. We will implement a custom Flink range partitioner that splits the values into ranges (one for each writer subtask) based on the statistics.
According to my understanding, each writer will be responsible for one or more partitions, and we will distribute the data arriving from upstream to the corresponding writers, right?
@hililiwei that is correct. The custom range partitioner for the Flink DataStream will distribute the data to writer subtasks with good clustering based on the data statistics.
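To make the follow-up concrete, here is a hedged sketch of the range-partitioning idea described in this thread: given per-key record counts from the statistics, assign sorted keys to contiguous ranges so each writer subtask receives a roughly equal share of the traffic. The class name and algorithm are illustrative, not the eventual Iceberg implementation.

```java
import java.util.Map;
import java.util.TreeMap;

class SimpleRangePartitioner {
  // maps each range-start key to a subtask index; lookup uses floorEntry
  private final TreeMap<String, Integer> keyToSubtask = new TreeMap<>();

  SimpleRangePartitioner(Map<String, Long> countsPerKey, int numSubtasks) {
    long total = countsPerKey.values().stream().mapToLong(Long::longValue).sum();
    long perSubtask = Math.max(1, total / numSubtasks);
    long running = 0;
    // walk keys in sorted order, cutting a new range every ~perSubtask records
    for (Map.Entry<String, Long> e : new TreeMap<>(countsPerKey).entrySet()) {
      int subtask = (int) Math.min(numSubtasks - 1, running / perSubtask);
      keyToSubtask.put(e.getKey(), subtask);
      running += e.getValue();
    }
  }

  int partition(String key) {
    Map.Entry<String, Integer> floor = keyToSubtask.floorEntry(key);
    return floor == null ? 0 : floor.getValue();
  }
}
```

Because ranges are contiguous over the sorted key space, rows with the same (or nearby) keys land on the same writer, which is the clustering property the thread is after, while the cumulative-count cut points keep traffic balanced.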
@stevenzwu: I like that with a correct reader we can make sure that the
@pvary That is the reason why @huyuanfeng2018 was suggesting if
.../v1.17/flink/src/main/java/org/apache/iceberg/flink/sink/shuffle/DataStatisticsOperator.java (outdated; conversation resolved)
@Override
public int getLength() {
  return -1;
We only care about the amount of data, not the size of data, right?
@hililiwei not sure I understand the question here; can you elaborate? Here is the javadoc for this method from the TypeSerializer interface: "Returns: The length of the data type, or -1 for variable length data types."
Thanks @stevenzwu for the answer!
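To illustrate why getLength() returns -1 here: a map of key counts serializes to a variable number of bytes (a size prefix plus variable-length entries), so a serializer for it cannot report a fixed record length. This sketch uses plain java.io rather than Flink's DataInputView/DataOutputView, and the class name is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

class MapCountsSerde {
  static byte[] serialize(Map<String, Long> counts) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    DataOutputStream out = new DataOutputStream(bytes);
    out.writeInt(counts.size());    // entry count: payload size varies with it
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      out.writeUTF(e.getKey());     // variable-length key
      out.writeLong(e.getValue());  // fixed 8-byte count
    }
    return bytes.toByteArray();
  }

  static Map<String, Long> deserialize(byte[] data) throws IOException {
    DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
    Map<String, Long> counts = new LinkedHashMap<>();
    int size = in.readInt();
    for (int i = 0; i < size; i++) {
      counts.put(in.readUTF(), in.readLong());
    }
    return counts;
  }
}
```

Since the byte length depends on both the number of entries and the key strings, there is no constant a getLength()-style method could return, which is exactly the variable-length case the TypeSerializer javadoc describes.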
One more question, how do we plan to register the DataStatisticsOrRecordSerializer? Do we plan to define TypeInfoFactory or use env config to register it?
@yegangy0718 good question. we probably also need to implement a TypeInfo class when adding the DataStatisticsOperator. that way, Flink would know which type serializer to use. TypeInformation impl can be followed up as a separate PR. Here is the DataStream API.