[FLINK-17733][FLINK-16448][hive][doc] Adjust Hive doc & Add documentation for real-time hive #12537
Conversation
@flinkbot: Thanks a lot for your contribution to the Apache Flink project. Automated checks: last check on commit 765b588 (Tue Jun 09 03:19:11 UTC 2020) ✅ no warnings. Mention the bot in a comment to re-run the automated checks.
To improve the real-time performance of Hive reading, Flink supports streaming reads of Hive tables:

- Partitioned tables: monitor the generation of new partitions and read each new partition incrementally.
We need to mention that new data added to an existing partition won't be consumed
I will explain more in the NOTE section.
<td><h5>streaming-source.enable</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Enable streaming source or not. NOTES: For a non-partitioned table, please make sure each file is put atomically into the target directory, otherwise the reader may get incomplete data.</td>
For a partitioned table, I think each partition should also be added atomically, right?
Yes, it will be modified to:
Enable streaming source or not. NOTES: Please make sure that each partition/file is written atomically, otherwise the reader may get incomplete data.
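To make the option concrete, a minimal sketch of enabling a streaming read, assuming Flink's dynamic table options (SQL hints) syntax and a hypothetical Hive table named `hive_table`:

```sql
-- Sketch: turn on streaming read for a single query via a dynamic table
-- option hint; the table name is illustrative.
SELECT *
FROM hive_table
/*+ OPTIONS('streaming-source.enable'='true') */;
```

The option could equally be set as a table property; the hint form only affects the one query.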
<td><h5>streaming-source.consume-start-offset</h5></td>
<td style="word-wrap: break-word;">1970-00-00</td>
<td>String</td>
<td>Start offset for streaming consuming. How to parse and compare offsets depends on your order. For create-time and partition-time, this should be a timestamp string. For partition-time, the partition time extractor is used to extract the time from the partition name.</td>
Mention that the supported timestamp format is yyyy-[m]m-[d]d [hh:mm:ss]?
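For illustration, a hedged sketch of what using the start offset might look like, again assuming the dynamic table options hint syntax and an illustrative table name:

```sql
-- Sketch: only consume partitions/files whose offset (here, a timestamp in
-- the yyyy-[m]m-[d]d [hh:mm:ss] form discussed above) is after the start offset.
SELECT *
FROM hive_table
/*+ OPTIONS(
    'streaming-source.enable' = 'true',
    'streaming-source.consume-start-offset' = '2020-06-01'
) */;
</imports>
```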
<td><h5>streaming-source.consume-order</h5></td>
<td style="word-wrap: break-word;">create-time</td>
<td>String</td>
<td>The consume order of streaming source, supporting create-time and partition-time. create-time compares the partition/file creation time; this is not the partition creation time in the Hive metastore, but the folder/file creation time in the filesystem. partition-time compares the time represented by the partition name. For a non-partitioned table, this value should always be 'create-time'.</td>
I think we should warn the users that the create-time here is actually the modification time in Hadoop FS. So if the partition folder somehow gets updated, e.g. changing ACL attributes, it can affect how the data is consumed.
And perhaps we should make partition-time the default and recommended value for this property?
+1 for defaulting to partition-time, but this option covers both partitioned and non-partitioned tables, which makes that a little hard.
I see. If we change default value to partition-time and user is using a non-partitioned table then the query will fail, right? If that's the case, then maybe we can go with create-time as default, but with proper warnings.
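To show what opting into partition-time would look like for a partitioned table, a sketch under the same assumptions (dynamic table options hint, illustrative table name):

```sql
-- Sketch: for a table partitioned by a date-like column, order consumption by
-- the time encoded in the partition name rather than the filesystem
-- folder/file creation (modification) time.
SELECT *
FROM hive_table
/*+ OPTIONS(
    'streaming-source.enable' = 'true',
    'streaming-source.consume-order' = 'partition-time'
) */;
```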
Note:
- Monitor strategy is scan all directories/files in location path now. If there is too more partitions, will have performance problems.
Suggested change:
- Monitor strategy is to scan all directories/files in location path now. If there are too many partitions, there will be performance problems.
+1, thanks for updating
…tion for real-time hive This closes #12537
What is the purpose of the change
Add a page documenting Hive streaming writing, reading, and joins.
Verifying this change
Does this pull request potentially affect one of the following parts:
- @Public(Evolving): no
Documentation