-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FLINK-25010][Connectors/Hive] Speed up hive's createMRSplits by multi thread #17988
[FLINK-25010][Connectors/Hive] Speed up hive's createMRSplits by multi thread #17988
Conversation
Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community Automated ChecksLast check on commit 28b8d68 (Thu Dec 02 07:34:19 UTC 2021) Warnings:
Mention the bot in a comment to re-run the automated checks. Review Progress
Please see the Pull Request Review Guide for a full explanation of the review process. The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required Bot commandsThe @flinkbot bot supports the following commands:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Myracle Thanks for contribution. Looks general good to me. Just left some comments.
Also, would you like to share how long it takes to split how many hive partitions with single thread and muti-thread?
...ector-hive/src/main/java/org/apache/flink/connectors/hive/ContinuousHiveSplitEnumerator.java
Show resolved
Hide resolved
...tors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/MRSplitsGetter.java
Outdated
Show resolved
Hide resolved
...tors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/MRSplitsGetter.java
Show resolved
Hide resolved
We have 5000 partitions for testing. When the thread is 1, it takes about 341 seconds. When the thread is 3, it takes 122 seconds. More threads means less time to split. |
66b45d8
to
ed14110
Compare
...tors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/MRSplitsGetter.java
Show resolved
Hide resolved
ed14110
to
140a48b
Compare
...tors/flink-connector-hive/src/main/java/org/apache/flink/connectors/hive/MRSplitsGetter.java
Outdated
Show resolved
Hide resolved
140a48b
to
f015c1d
Compare
lgtm. @wuchong @JingsongLi @lirui-apache Do you have time to review / merge? |
Thanks @luoyuxia for the reviewing. I will have a look and merge it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
f015c1d
to
e7be851
Compare
@flinkbot run azure |
1 similar comment
@flinkbot run azure |
@@ -51,4 +51,11 @@ | |||
.withDescription( | |||
"If it is false, using flink native writer to write parquet and orc files; " | |||
+ "If it is true, using hadoop mapred record writer to write parquet and orc files."); | |||
|
|||
public static final ConfigOption<Integer> TABLE_EXEC_HIVE_PARTITION_SPLIT_THREAD_NUM = | |||
key("table.exec.hive.partition-split.thread.num") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry @Myracle, I have one more question. Is this configuration just like Hive's hive.load.dynamic.partitions.thread
? I think maybe we can call the configuration table.exec.hive.load-partition-splits.thread-num
to be closer to the actual behavior, what do you think?
Besides, could you also add this configuration to the Hive documentation ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wuchong Good suggestion! I have modified it. Can you review it again? Thank you.
e7be851
to
19a283c
Compare
@@ -163,6 +163,12 @@ following parameters in `TableConfig` (note that these parameters affect all sou | |||
<td>Integer</td> | |||
<td>Sets max infer parallelism for source operator.</td> | |||
</tr> | |||
<tr> | |||
<td><h5>table.exec.hive.load-partition-splits.thread-num</h5></td> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This config option is not belong to Source Parallelism Inference
, maybe we need to have a separate section, e.g. Load Partiton Splits
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wuchong Sorry for the wrong place. I have modified it. Please review it again. Thanks.
19a283c
to
ccc3e62
Compare
Thanks for the updating @Myracle |
What is the purpose of the change
If there are many hive partitions, creating splits will take much time, for example, up to ten minutes. This mr speeds up the process by multi threads for different partitions.
Brief change log
Verifying this change
This change is already covered by existing tests, such as HiveTableSourceITCase.
Does this pull request potentially affect one of the following parts:
@Public(Evolving)
: (no)Documentation