[FLINK-29980] Handle partition keys directly in hive bulk format #21290

Aitozi · 2022-11-11T02:59:13Z

What is the purpose of the change

This change leverages EnrichedRowData to handle the partition-key logic in the Hive connector. After this change, the connector no longer depends on the Parquet/ORC formats to assemble the RowData with partition keys, and the format modules no longer have to care about filling in the partition keys.

At first I wanted to handle the partition keys only in the Hive connector, but I found that this cannot be done in a single PR without touching the Parquet/ORC format code. So this PR combines two commits, as you can see.

Brief change log

wrap the HiveInputFormat with FileInfoExtractorBulkFormat
decorate the HiveTableInputFormat's records with the record mapping
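
As a rough illustration of the idea behind this change, the sketch below shows how a wrapper can join per-split partition values with the fields read from the file, in the spirit of EnrichedRowData. It is plain Java with no Flink dependency; `EnrichedRowSketch` and all of its methods are hypothetical names for illustration, not the actual Flink class.

```java
import java.util.List;

/** Hypothetical, simplified stand-in for the idea behind EnrichedRowData; not Flink's class. */
final class EnrichedRowSketch {
    private final Object[] fixedRow;   // partition values, constant for one split
    private Object[] mutableRow;       // fields read from the file, replaced per record
    private final int[] indexMapping;  // >= 0: index into mutableRow; < 0: -(index + 1) into fixedRow

    private EnrichedRowSketch(Object[] fixedRow, int[] indexMapping) {
        this.fixedRow = fixedRow;
        this.indexMapping = indexMapping;
    }

    /** Builds the index mapping from the produced field order and the two field sources. */
    static EnrichedRowSketch from(
            Object[] fixedRow,
            List<String> producedFields,
            List<String> mutableFields,
            List<String> fixedFields) {
        int[] mapping = new int[producedFields.size()];
        for (int i = 0; i < mapping.length; i++) {
            String name = producedFields.get(i);
            int mutableIndex = mutableFields.indexOf(name);
            if (mutableIndex >= 0) {
                mapping[i] = mutableIndex;
            } else {
                int fixedIndex = fixedFields.indexOf(name);
                if (fixedIndex < 0) {
                    throw new IllegalArgumentException("Unknown field: " + name);
                }
                mapping[i] = -(fixedIndex + 1);
            }
        }
        return new EnrichedRowSketch(fixedRow, mapping);
    }

    /** Swaps in the next record read from the file; the partition values stay untouched. */
    EnrichedRowSketch replaceMutableRow(Object[] row) {
        this.mutableRow = row;
        return this;
    }

    int getArity() {
        return indexMapping.length;
    }

    Object getField(int pos) {
        int index = indexMapping[pos];
        return index >= 0 ? mutableRow[index] : fixedRow[-(index + 1)];
    }
}
```

With this shape, the file format only has to produce the non-partition fields, and the wrapper exposes the full produced row without copying the partition values for every record.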

Verifying this change

This change is already covered by existing tests: HiveSourceITCase and HiveTableSourceITCase

flinkbot · 2022-11-11T03:03:38Z

CI report:

8558e4d Azure: SUCCESS

Bot commands

The @flinkbot bot supports the following commands:

@flinkbot run azure: re-run the last Azure build

Aitozi · 2022-11-14T12:58:20Z

@flinkbot run azure

Aitozi · 2022-11-17T09:46:59Z

Hi @luoyuxia, could you help review this PR? Thanks.

luoyuxia · 2022-12-01T11:38:56Z

Thanks for the contribution. I'll have a look when I'm free.

luoyuxia

@Aitozi Thanks for the contribution. I left some comments. PTAL.
BTW, could you please rebase onto master?

luoyuxia · 2023-03-06T04:20:54Z

...r-files/src/main/java/org/apache/flink/connector/file/table/FileInfoExtractorBulkFormat.java


    public FileInfoExtractorBulkFormat(
-            BulkFormat<RowData, FileSourceSplit> wrapped,
+            BulkFormat<RowData, SplitT> wrapped,


Why change this?

luoyuxia · 2023-03-06T04:21:36Z

...r-files/src/main/java/org/apache/flink/connector/file/table/FileInfoExtractorBulkFormat.java

            DataType producedDataType,
            TypeInformation<RowData> producedTypeInformation,
            Map<String, FileSystemTableSource.FileInfoAccessor> metadataColumns,
            List<String> partitionColumns,
-            String defaultPartName) {
+            PartitionFieldExtractor<SplitT> partitionFieldExtractor) {


Why change this? It seems like a code refactor?

luoyuxia · 2023-03-06T04:24:09Z

...r-files/src/main/java/org/apache/flink/connector/file/table/FileInfoExtractorBulkFormat.java

        // Fill the metadata + partition columns row
+        List<FileInfoExtractor.PartitionColumn> partitionColumns =


Seems like another code refactor? Do we really need to refactor it? Is there any special reason?

luoyuxia · 2023-03-06T06:32:45Z

...ink-connector-files/src/main/java/org/apache/flink/connector/file/src/util/RecordMapper.java

+
+/** Record mapper definition. */
+@FunctionalInterface
+public interface RecordMapper<I, O> {


Why move RecordMapper here? I think it's fine to keep it in its original place.

luoyuxia · 2023-03-06T07:05:11Z

...nk-orc-nohive/src/main/java/org/apache/flink/orc/nohive/OrcNoHiveColumnarRowInputFormat.java

@@ -58,14 +56,10 @@ OrcColumnarRowInputFormat<VectorizedRowBatch, SplitT> createPartitionedFormat(
                    Configuration hadoopConfig,
                    RowType tableType,
                    List<String> partitionKeys,


Then, as the TODO says, the partitionKeys code should be pruned.

After the partitionKeys code is pruned, the name and comment of this method should change.

luoyuxia · 2023-03-06T08:23:57Z

...s/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveSourceBuilder.java

+                        inputFormat,
+                        producedType,
+                        producedTypeInfo,
+                        new HashMap<>(),


nit: Collections.emptyMap()

luoyuxia · 2023-03-06T08:27:52Z

...onnector-hive/src/main/java/org/apache/flink/connectors/hive/read/HiveMapredSplitReader.java

@@ -143,21 +139,6 @@ public HiveMapredSplitReader(

        // construct reuse row
        this.row = new GenericRowData(selectedFields.length);


selectedFields will contain partition columns; we should exclude them, since partition columns are now handled in the external wrapper.
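
A minimal sketch of the kind of filtering meant here, assuming the reader has the full field names, the selected field indices, and the partition key names at hand. `SelectedFieldsSketch` and `withoutPartitionColumns` are hypothetical names, not existing methods of HiveMapredSplitReader.

```java
import java.util.List;
import java.util.stream.IntStream;

/** Hypothetical helper for illustration only; not part of the Hive connector. */
final class SelectedFieldsSketch {

    /** Returns the selected field indices that do not refer to partition columns, in order. */
    static int[] withoutPartitionColumns(
            String[] fieldNames, int[] selectedFields, List<String> partitionKeys) {
        return IntStream.of(selectedFields)
                .filter(i -> !partitionKeys.contains(fieldNames[i]))
                .toArray();
    }
}
```

The reuse row would then be sized by the filtered array, so the reader only allocates slots for fields it actually reads from the file.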

luoyuxia · 2023-03-06T08:47:17Z

...arquet/src/test/java/org/apache/flink/formats/parquet/ParquetColumnarRowInputFormatTest.java

@@ -294,56 +286,6 @@ void testProjectionReadUnknownField(int rowGroupSize) throws IOException {
                });
    }

-    @ParameterizedTest
-    @MethodSource("parameters")
-    void testPartitionValues(int rowGroupSize) throws IOException {


After removing the now-invalid tests from the Parquet/ORC input format tests, I think we should add a test for FileInfoExtractorBulkFormat to make sure it gets the partition columns correctly.
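
To make the kind of check concrete, here is a plain-Java sketch of extracting partition values from a Hive-style path, which is the behavior such a test would need to pin down. It does not use FileInfoExtractorBulkFormat or any Flink API; `PartitionPathSketch` and `extractPartitionSpec` are hypothetical names, and the sketch assumes unescaped `key=value` path segments.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of Flink. */
final class PartitionPathSketch {
    // Hive's default directory name for a partition whose value is null.
    static final String DEFAULT_PARTITION_NAME = "__HIVE_DEFAULT_PARTITION__";

    /** Collects key=value path segments in path order; the default partition name maps to null. */
    static Map<String, String> extractPartitionSpec(String path) {
        Map<String, String> spec = new LinkedHashMap<>();
        for (String segment : path.split("/")) {
            int eq = segment.indexOf('=');
            if (eq > 0) {
                String value = segment.substring(eq + 1);
                spec.put(
                        segment.substring(0, eq),
                        DEFAULT_PARTITION_NAME.equals(value) ? null : value);
            }
        }
        return spec;
    }
}
```

A real test would presumably read a small partitioned file through the wrapped format and assert on the produced rows; the sketch only shows the partition values one would expect to see there.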

luoyuxia · 2023-03-06T08:51:57Z

...k-connector-files/src/main/java/org/apache/flink/connector/file/table/FileInfoExtractor.java

+    }
+
+    /** Info of the partition column. */
+    public static class PartitionColumn implements Serializable {


Do we really need this class?

luoyuxia · 2023-03-06T08:52:33Z

...k-connector-files/src/main/java/org/apache/flink/connector/file/table/FileInfoExtractor.java

+                        producedRowFieldNames, mutableRowFieldNames, fixedRowFieldNames);
+    }
+
+    public List<PartitionColumn> getPartitionColumns() {


Can we just return the names of the partition columns, so that we won't need PartitionColumn?

flinkbot added the component=Connectors/Hive label Nov 11, 2022

Aitozi marked this pull request as draft November 11, 2022 09:30

[FLINK-29980] Handle partition keys directly in hive bulk format

b12372e

Aitozi force-pushed the hive-partition-keys branch from 4c3b6bf to 0a9702c Compare November 14, 2022 12:22

[FLINK-25113] Cleanup from Parquet and Orc the partition key handling logic

8558e4d

Aitozi force-pushed the hive-partition-keys branch from 0a9702c to 8558e4d Compare November 14, 2022 12:49

Aitozi marked this pull request as ready for review November 14, 2022 12:58

luoyuxia reviewed Mar 6, 2023
