[FLINK-32365][orc] Get ORC table statistics in parallel #22805
Conversation
@Baibaiwuguo Thanks for the contribution. Overall LGTM, but I'd like to have @swuferhong give it a review.
@@ -58,11 +63,15 @@ public static TableStats getTableStatistics(
long rowCount = 0;
Map<String, ColumnStatistics> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();
ExecutorService executorService = Executors.newFixedThreadPool(Runtime.getRuntime().availableProcessors());
I'm wondering: can we make it configurable, with a default of `Runtime.getRuntime().availableProcessors()`, just like the configuration `s3.upload.max.concurrent.uploads`?
@luoyuxia I also think the tasks can be configured more rationally. I will refine this issue.
@luoyuxia I find the code is called in multiple places. If we make it configurable, we need to change more modules and add more parameters. If we set the parameter in the Hadoop config, both ORC and Parquet can use it. Could you give me some ideas?
Hi, did you encounter the problem of slow reporting of ORC statistics while using the Hive connector? If so, I think you can add this parameter into `HiveOptions`.
I would also recommend only supporting configuration in `HiveOptions` at first.
@luoyuxia I added the configuration in HiveOptions. I changed both ORC and Parquet.
@dongjoon-hyun @swuferhong @luoyuxia Could you help review when you are free?
@Baibaiwuguo The test fails.
@baiwuchang Thanks for updating. I left minor comments. PTAL
@@ -134,6 +134,13 @@ public class HiveOptions {
+ " custom: use policy class to create a commit policy."
+ " Support to configure multiple policies: 'metastore,success-file'.");
public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM = |
Please remember to add the doc for the newly added option.
try {
long rowCount = 0;
Map<String, ColumnStatistics> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();
ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
Executors.newFixedThreadPool(
statisticsThreadNum,
new ExecutorThreadFactory("orc-get-table-statistic-worker"));
?
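For readers unfamiliar with the suggestion above: the point of passing a thread factory is that the pool's worker threads get a recognizable name in thread dumps and logs. Below is a minimal self-contained sketch using a plain `java.util.concurrent.ThreadFactory`; Flink's `ExecutorThreadFactory` provides similar naming (plus its own uncaught-exception handling), so the class here is only a hypothetical stand-in:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.atomic.AtomicInteger;

public class NamedPoolSketch {
    // Plain ThreadFactory that mimics what a named factory such as
    // Flink's ExecutorThreadFactory gives you: readable thread names.
    static ThreadFactory named(String prefix) {
        AtomicInteger counter = new AtomicInteger();
        return runnable -> {
            Thread t = new Thread(runnable, prefix + "-" + counter.getAndIncrement());
            t.setDaemon(true); // don't block JVM shutdown on statistics workers
            return t;
        };
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool =
                Executors.newFixedThreadPool(2, named("orc-get-table-statistic-worker"));
        try {
            // The first submitted task runs on a thread named "...-0".
            String name = pool.submit(() -> Thread.currentThread().getName()).get();
            System.out.println(name);
        } finally {
            pool.shutdownNow();
        }
    }
}
```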
try {
Map<String, Statistics<?>> columnStatisticsMap = new HashMap<>();
RowType producedRowType = (RowType) producedDataType.getLogicalType();
ExecutorService executorService = Executors.newFixedThreadPool(statisticsThreadNum);
ditto
@@ -373,13 +374,18 @@ private TableStats getMapRedInputFormatStatistics(
.toLowerCase();
List<Path> files =
inputSplits.stream().map(FileSourceSplit::path).collect(Collectors.toList());
int statisticsThreadNum = flinkConf.get(TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM);
Check that the thread num is not less than 1.
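The requested validation could look like the sketch below; the method name and message are illustrative, not the actual patch:

```java
public class ThreadNumCheckSketch {
    // Mirrors the reviewer's request: reject configured values below 1
    // before sizing the thread pool with them.
    static int checkThreadNum(int configured) {
        if (configured < 1) {
            throw new IllegalArgumentException(
                    "table.exec.hive.read-statistics.thread-num must be at least 1, got "
                            + configured);
        }
        return configured;
    }

    public static void main(String[] args) {
        System.out.println(checkThreadNum(4));
    }
}
```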
I changed some of the logic. The task now reads the file footers in parallel and computes the statistics from the footers serially.
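The described split (parallel footer I/O, serial combination of results) can be sketched as follows. `readFooterRowCount` is a placeholder for the real ORC footer read, and all names here are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelFooterSketch {
    // Hypothetical stand-in for reading one ORC file footer and
    // returning its row count; here it just does placeholder work.
    static long readFooterRowCount(String path) {
        return path.length();
    }

    // Read footers in parallel, then combine the per-file results serially,
    // mirroring the approach described in the comment above.
    static long totalRowCount(List<String> files, int threadNum) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threadNum);
        try {
            List<Future<Long>> futures = new ArrayList<>();
            for (String f : files) {
                futures.add(pool.submit(() -> readFooterRowCount(f))); // parallel I/O
            }
            long total = 0;
            for (Future<Long> fut : futures) {
                total += fut.get(); // serial combination of results
            }
            return total;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(totalRowCount(List.of("part-0", "part-1"), 2));
    }
}
```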
Conflicts: flink-connectors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/HiveOptions.java
I'll have a look when I'm free. Thanks.
@Baibaiwuguo Thanks for updating. I left minor comments. PTAL. Please remember to append a commit to address my comments.
@@ -206,6 +206,10 @@ Users can do some performance tuning by tuning the split's size with the follow
- Currently, these configurations for tuning split size only works for the Hive table stored as ORC format.
{{< /hint >}}

### Read Table Statistics
Please don't forget to also update the Chinese doc.
Please also specify why we may need to scan the table to get statistics:
When the table statistic is not available from Hive metastore, we will then try to get the statistic by scanning the table.
done
@@ -134,6 +134,13 @@ public class HiveOptions {
+ " custom: use policy class to create a commit policy."
+ " Support to configure multiple policies: 'metastore,success-file'.");

public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =
key("table.exec.hive.read-format-statistics.thread-num")
After rethinking it, I don't think we need to expose the word `format`, which may confuse users. So I'd like to rename it to `table.exec.hive.read-statistics.thread-num`. WDYT?
}
}
}
executorService.shutdownNow();
Please use try {} finally {} to shut down the executorService.
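The pattern being requested is the standard one: create the pool, do the work inside `try`, and shut the pool down in `finally` so it is released even when statistics gathering throws. A minimal runnable sketch (names are illustrative):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ShutdownSketch {
    // Shut the pool down on every exit path, including exceptional ones,
    // so worker threads never leak.
    static long runWithPool(int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            return pool.submit(() -> 21L + 21L).get();
        } finally {
            pool.shutdownNow(); // always reached, even if submit()/get() throws
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runWithPool(1));
    }
}
```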
}
executorService.shutdownNow();
ditto
…-statistics.thread-num`
@luoyuxia Thank you for the very careful and patient review. I have fixed your comments. PTAL.
@Baibaiwuguo Thanks for updating. I left a minor comment. Should be LGTM in the next iteration.
@@ -135,7 +135,7 @@ public class HiveOptions {
+ " Support to configure multiple policies: 'metastore,success-file'.");

public static final ConfigOption<Integer> TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM =
TABLE_EXEC_HIVE_READ_FORMAT_STATISTICS_THREAD_NUM -> TABLE_EXEC_HIVE_READ_STATISTICS_THREAD_NUM
done
@@ -190,6 +190,10 @@ Flink allows you to flexibly configure the parallelism inference policy. You can configure the following parameters in `TableConfig`
- Currently, the above parameters only apply to Hive tables stored in the ORC format.
{{< /hint >}}

### Read Table Statistics

When the table statistics are not available in the Hive metastore (e.g. for `orc` or `parquet`), the table needs to be scanned to obtain them. You can use `table.exec.hive.read-statistics.thread-num` to configure the number of scanning threads. The default is the number of available processors on the current system, and the configured value should be greater than 0.
Suggestion:
When the table statistics are not available in the Hive metastore, Flink will try to scan the table to obtain them in order to generate a better execution plan. This process can be time-consuming; you can use `table.exec.hive.read-statistics.thread-num` to configure how many threads are used to scan the table. The default is the number of available processors on the current system, and the configured value should be greater than 0.
@@ -208,7 +208,9 @@ Users can do some performance tuning by tuning the split's size with the follow

### Read Table Statistics

To obtain hive table statistics faster, When hive table format is `orc` or `parquet`. You can use `table.exec.hive.read-format-statistics.thread-num` to configure the thread number. The default value is the number of available processors in the current system and the configured value should be bigger than 0.
When the table statistic is not available from Hive metastore, such as `orc` or `parquet`. We will then try to get the statistic by scanning the table.
Suggestion:
When the table statistic is not available from the Hive meta store, Flink will try to scan the table to get the statistic to generate a better execution plan. It may cost some time to get the statistic.
@@ -208,7 +208,9 @@ Users can do some performance tuning by tuning the split's size with the follow

### Read Table Statistics

To obtain hive table statistics faster, When hive table format is `orc` or `parquet`. You can use `table.exec.hive.read-format-statistics.thread-num` to configure the thread number. The default value is the number of available processors in the current system and the configured value should be bigger than 0.
When the table statistic is not available from Hive metastore, such as `orc` or `parquet`. We will then try to get the statistic by scanning the table.
Suggestion:
When the table statistic is not available from the Hive metastore, Flink will try to scan the table to get the statistic to generate a better execution plan. It may cost some time to get the statistic. To get it faster, you can use table.exec.hive.read-statistics.thread-num
to configure how many threads to use to scan the table.
The default value is the number of available processors in the current system and the configured value should be bigger than 0.
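For reference, once the rename lands a user would set the option like any other Flink table configuration, e.g. in the SQL client (a hypothetical usage example; the value `8` is arbitrary):

```sql
SET 'table.exec.hive.read-statistics.thread-num' = '8';
```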
done
@luoyuxia Thanks. I have fixed your comments.
@Baibaiwuguo Thanks for updating. LGTM.
But I would like to have @swuferhong give another final review.
@luoyuxia thanks for your review!
@swuferhong hi, thanks for your ideas. Can you help review it when you are free?
Hi, @Baibaiwuguo . LGTM +1
What is the purpose of the change
Get ORC table statistics in parallel to improve acquisition speed.
Brief change log
Let HiveTableSource extend from SupportStatisticsReport
Verifying this change
Please make sure both new and modified tests in this PR follows the conventions defined in our code quality guide: https://flink.apache.org/contributing/code-style-and-quality-common.html#testing
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
(or)
This change is already covered by existing tests, such as (please describe tests).
(or)
This change added tests and can be verified as follows:
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): (yes / no)
- Documentation