
[FLINK-32365][orc] Get ORC table statistics in parallel #22805

Merged (9 commits) on Jul 18, 2023

Conversation

@Baibaiwuguo (Contributor) commented Jun 16, 2023

What is the purpose of the change

Get ORC table statistics in parallel to improve the speed of collecting them.

Brief change log

Let HiveTableSource extend from SupportStatisticsReport

Verifying this change

Please make sure both new and modified tests in this PR follow the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (100MB)
  • Extended integration test for recovery after master (JobManager) failure
  • Added test that validates that TaskInfo is transferred only once across recoveries
  • Manually verified the change by running a 4 node cluster with 2 JobManagers and 4 TaskManagers, a stateful streaming program, and killing one JobManager and two TaskManagers during the execution, verifying that recovery happens correctly.

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (yes / no)
  • The public API, i.e., is any changed class annotated with @Public(Evolving): (yes / no)
  • The serializers: (yes / no / don't know)
  • The runtime per-record code paths (performance sensitive): (yes / no / don't know)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: (yes / no / don't know)
  • The S3 file system connector: (yes / no / don't know)

Documentation

  • Does this pull request introduce a new feature? (yes / no)
  • If yes, how is the feature documented? (not applicable / docs / JavaDocs / not documented)

@flinkbot (Collaborator) commented Jun 16, 2023

CI report:

Bot commands: the @flinkbot bot supports the following commands:
  • @flinkbot run azure: re-run the last Azure build

@luoyuxia (Contributor) left a comment

@Baibaiwuguo Thanks for the contribution. Overall LGTM, but I'd like to have @swuferhong give it a review.

@@ -58,11 +63,15 @@ public static TableStats getTableStatistics(
long rowCount = 0;
Map<String, ColumnStatistics> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();
ExecutorService executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
Contributor

I'm wondering whether we can make it configurable, with the default being Runtime.getRuntime().availableProcessors(), just like the configuration s3.upload.max.concurrent.uploads.
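
For illustration, a minimal sketch of what such an option could look like, assuming the ConfigOptions builder style used in HiveOptions; the key name and description here are assumptions, since they were only settled later in this thread:

import org.apache.flink.configuration.ConfigOption;
import static org.apache.flink.configuration.ConfigOptions.key;

// Sketch only: key name and wording are assumptions at this point in the discussion.
public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_STATISTICS_THREAD_NUM =
        key("table.exec.hive.read-statistics.thread-num")
                .intType()
                .defaultValue(Runtime.getRuntime().availableProcessors())
                .withDescription(
                        "The number of threads used to scan files for table statistics;"
                                + " must be greater than 0.");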

@Baibaiwuguo (Contributor Author)

@luoyuxia I also think the thread count could be configured more sensibly. I will refine this.

@Baibaiwuguo (Contributor Author)

@luoyuxia I find the code is called in multiple places. If we make it configurable, we need to change more modules and add more parameters. If we set the parameter in the Hadoop config, both ORC and Parquet can use it. Could you give me some ideas?

@swuferhong (Contributor)

Hi, did you encounter the problem of slow ORC statistics reporting while using the Hive connector? If so, I think you can add this parameter to HiveOptions as a Flink conf, and set that Flink conf into the job conf in the method HiveSourceBuilder.setFlinkConfigurationToJobConf() (the jobConf will be added into the hadoopConf in the Hive source). By doing this, you can get the parameter from the hadoopConf; if it is not present there, you can fall back to Runtime.getRuntime().availableProcessors() as the default. WDYT, @luoyuxia.
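
A minimal sketch of the reading side of that idea; the key name is assumed to match the option suggested later in this thread, and getStatisticsThreadNum is a hypothetical helper, not the PR's actual code:

import org.apache.hadoop.conf.Configuration;

// Hypothetical helper: read the thread count from the Hadoop conf,
// falling back to the number of available processors when the key is absent.
static int getStatisticsThreadNum(Configuration hadoopConf) {
    return hadoopConf.getInt(
            "table.exec.hive.read-statistics.thread-num",
            Runtime.getRuntime().availableProcessors());
}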

@luoyuxia (Contributor)

I would also recommend only supporting the configuration in HiveOptions at first.

@Baibaiwuguo (Contributor Author)

@luoyuxia I added the config option to HiveOptions. I changed both ORC and Parquet.

@Baibaiwuguo (Contributor Author)

@dongjoon-hyun @swuferhong @luoyuxia Could you help review this when you are free?

@luoyuxia (Contributor)

@Baibaiwuguo The test fails.

@luoyuxia (Contributor) left a comment

@Baibaiwuguo Thanks for updating. I left minor comments. PTAL.

@@ -134,6 +134,13 @@ public class HiveOptions {
+ " custom: use policy class to create a commit policy."
+ " Support to configure multiple policies: 'metastore,success-file'.");

public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =
Contributor

Please remember to add the doc for the newly added option.

try {
long rowCount = 0;
Map<String, ColumnStatistics> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();

ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
Contributor

Executors.newFixedThreadPool(
                        statisticsThreadNum,
                        new ExecutorThreadFactory("orc-get-table-statistic-worker"));

?

try {
Map<String, Statistics<?>> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();
ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
Contributor

Ditto.

@@ -373,13 +374,18 @@ private TableStats getMapRedInputFormatStatistics(
.toLowerCase();
List<Path> files =
inputSplits.stream().map(FileSourceSplit::path).collect(Collectors.toList());
int statisticsThreadNum = flinkConf.get(TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM);
Contributor

Check that the thread num is not less than 1.
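
A minimal sketch of the suggested guard, assuming Flink's Preconditions utility; the option name matches the diff above:

import org.apache.flink.util.Preconditions;

int statisticsThreadNum = flinkConf.get(TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM);
// Fail fast on a non-positive thread count instead of letting the thread pool creation fail later.
Preconditions.checkArgument(
        statisticsThreadNum >= 1, "The thread number for reading statistics must be at least 1.");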

@Baibaiwuguo (Contributor Author)

I changed some of the logic. The task now reads the file footers in parallel and aggregates the statistics serially.
I added the doc for the newly added option. Please review it when you are free.
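
A minimal, self-contained sketch of that read-in-parallel, merge-serially pattern; readRowCountFromFooter is a hypothetical placeholder for the real ORC/Parquet footer read, not the PR's actual helper:

import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFooterScanSketch {

    public static long totalRowCount(List<Path> files, int statisticsThreadNum) throws Exception {
        ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (Path file : files) {
                // Read each file footer in parallel.
                futures.add(executorService.submit(() -> readRowCountFromFooter(file)));
            }
            long rowCount = 0;
            for (Future<Long> future : futures) {
                // Merge the per-file results serially on the calling thread.
                rowCount += future.get();
            }
            return rowCount;
        } finally {
            executorService.shutdownNow();
        }
    }

    private static long readRowCountFromFooter(Path file) {
        return 0L; // placeholder for the actual footer read
    }
}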

@Baibaiwuguo Baibaiwuguo requested a review from luoyuxia July 6, 2023 04:06
Conflicts:
	flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java
@luoyuxia (Contributor) commented Jul 6, 2023

I'll have a look when I'm free. Thanks.

@luoyuxia (Contributor) left a comment

@Baibaiwuguo Thanks for updating. I left minor comments. PTAL.
Please remember to append a commit to address my comments.

@@ -206,6 +206,10 @@ Users can do some performance tuning by tuning the split's size with the follow
- Currently, these configurations for tuning split size only works for the Hive table stored as ORC format.
{{< /hint >}}

### Read Table Statistics
Contributor

Please don't forget to also update the Chinese doc.

Contributor

Please also specify why we may need to scan the table to get statistics:
When the table statistic is not available from Hive metastore, we will then try to get the statistic by scanning the table.

Contributor Author

done

@@ -134,6 +134,13 @@ public class HiveOptions {
+ " custom: use policy class to create a commit policy."
+ " Support to configure multiple policies: 'metastore,success-file'.");

public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =
key("table.exec.hive.read-format-statistics.thread-num")
Contributor

After rethinking it, I don't think we need to expose the word "format", which may confuse users. So I'd like to rename it to
table.exec.hive.read-statistics.thread-num.
WDYT?

}
}
}
executorService.shutdownNow();
Contributor

Please use try {} finally {} to shut down the executorService.
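
A minimal sketch of that structure, reusing the names from the surrounding diff (fragment only, not the PR's exact code):

ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
try {
    // submit the footer-reading tasks and collect the results here
} finally {
    // release the worker threads even if collecting the statistics throws
    executorService.shutdownNow();
}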

}
executorService.shutdownNow();
Contributor

Ditto.

@Baibaiwuguo (Contributor Author)

@luoyuxia Thank you for the very careful and patient review. I have addressed your comments. PTAL.

@Baibaiwuguo Baibaiwuguo requested a review from luoyuxia July 6, 2023 13:02
@luoyuxia (Contributor) left a comment

@Baibaiwuguo Thanks for updating. I left a minor comment. Should LGTM in the next iteration.

@@ -135,7 +135,7 @@ public class HiveOptions {
+ " Support to configure multiple policies: 'metastore,success-file'.");

public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =
Contributor

TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM -> TABLE_EXEC_HIVE_READ_STATISTICS_THREAD_NUM

Contributor Author

done

@@ -190,6 +190,10 @@ Flink 允许你灵活的配置并发推断策略。你可以在 `TableConfig`
- 目前上述参数仅适用于 ORC 格式的 Hive 表。
{{< /hint >}}

### 读取表统计信息

当hive metastore(如`orc`或`parquet`)中没有表的统计信息时,需要扫描表获取信息。你可以使用`table.exec.hive.read-statistics.thread-num`去配置扫描线程数。默认值是当前系统可用处理器数,你配置的值应该大于0。
Contributor

Suggestion:
当hive metastore 中没有表的统计信息时,Flink 会尝试扫描表来获取统计信息从而生成合适的执行计划。此过程可能会比较耗时,你可以使用table.exec.hive.read-statistics.thread-num去配置使用多少个线程去扫描表,默认值是当前系统可用处理器数,配置的值应该大于0。

@@ -208,7 +208,9 @@ Users can do some performance tuning by tuning the split's size with the follow

### Read Table Statistics

To obtain hive table statistics faster, When hive table format is `orc` or `parquet`. You can use `table.exec.hive.read-format-statistics.thread-num` to configure the thread number. The default value is the number of available processors in the current system and the configured value should be bigger than 0.
When the table statistic is not available from Hive metastore, such as `orc` or `parquet`. We will then try to get the statistic by scanning the table.
Contributor

Suggestion:
When the table statistic is not available from the Hive meta store, Flink will try to scan the table to get the statistic to generate a better execution plan. It may cost some time to get the statistic.

Contributor

Suggestion:
When the table statistic is not available from the Hive metastore, Flink will try to scan the table to get the statistic to generate a better execution plan. It may cost some time to get the statistic. To get it faster, you can use table.exec.hive.read-statistics.thread-num to configure how many threads to use to scan the table.
The default value is the number of available processors in the current system and the configured value should be bigger than 0.
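
For reference, a sketch of setting this option from code; the key name follows the suggestion above, and the value 8 is an arbitrary example:

import org.apache.flink.configuration.Configuration;

Configuration flinkConf = new Configuration();
// Example only: use 8 scanning threads instead of the default (number of available processors).
flinkConf.setString("table.exec.hive.read-statistics.thread-num", "8");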

Contributor Author

done

@Baibaiwuguo (Contributor Author)

@luoyuxia Thanks. I have addressed your comments.

@Baibaiwuguo Baibaiwuguo requested a review from luoyuxia July 7, 2023 07:52
@luoyuxia (Contributor) left a comment

@Baibaiwuguo Thanks for updating. LGTM.
But I would like to have @swuferhong give another final review.

@Baibaiwuguo (Contributor Author)

@luoyuxia Thanks for your review!

@Baibaiwuguo (Contributor Author)

@swuferhong Hi, thanks for your ideas. Can you help review it when you are free?

@swuferhong (Contributor) left a comment

Hi @Baibaiwuguo. LGTM +1

@luoyuxia merged commit 0f2ee60 into apache:master on Jul 18, 2023