
[FLINK-17733][FLINK-16448][hive][doc] Adjust Hive doc & Add documentation for real-time hive #12537

Merged
merged 4 commits on Jun 11, 2020

Conversation

JingsongLi
Contributor

What is the purpose of the change

Add a page documenting Hive streaming writing, reading, and joins.

Verifying this change

(screenshots attached)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Collaborator

flinkbot commented Jun 9, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 765b588 (Tue Jun 09 03:19:11 UTC 2020)

✅ no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@JingsongLi JingsongLi changed the title [FLINK-17733][hive][doc] Add documentation for real-time hive [FLINK-17733][FLINK-16448][hive][doc] Adjust Hive doc & Add documentation for real-time hive Jun 9, 2020
@flinkbot
Collaborator

flinkbot commented Jun 9, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

docs/dev/table/hive/hive_streaming.md (outdated; resolved)

To improve the real-time performance of Hive reading, Flink supports streaming reads of Hive tables:

- For a partition table, monitor the creation of new partitions and read each new partition incrementally.
Contributor


We need to mention that new data added to an existing partition won't be consumed

Contributor Author


I will explain more in NOTE.
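To make the partition-monitoring behavior discussed above concrete, here is an illustrative Python sketch (not Flink's actual implementation; the function name `discover_new_partitions` is hypothetical). It shows why only newly appearing partitions are consumed, and why files added later to an already-consumed partition are not picked up:

```python
def discover_new_partitions(all_partitions, consumed):
    """Return partitions not yet consumed, in sorted order, and mark them consumed."""
    new = sorted(p for p in all_partitions if p not in consumed)
    consumed.update(new)
    return new

consumed = set()
# First monitor cycle: two partitions exist, both are emitted.
first = discover_new_partitions({"dt=2020-06-09", "dt=2020-06-10"}, consumed)
# Second cycle: dt=2020-06-10 received more files, and dt=2020-06-11 appeared.
# Only the genuinely new partition is emitted; the updated one is skipped.
second = discover_new_partitions({"dt=2020-06-09", "dt=2020-06-10", "dt=2020-06-11"}, consumed)
```

Because discovery is keyed on partition identity rather than partition contents, data appended to an existing partition is invisible to this scheme, which is exactly the caveat the reviewer asks to document.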

<td><h5>streaming-source.enable</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Enable streaming source or not. NOTES: For non-partition table, please make sure that each file should be put atomically into the target directory, otherwise the reader may get incomplete data.</td>
Contributor


For partitioned table, I think each partition should also be added atomically right?

Contributor Author


Yes, will be modified to:
Enable streaming source or not. NOTES: Please make sure that each partition/file is written atomically, otherwise the reader may get incomplete data.

<td><h5>streaming-source.consume-start-offset</h5></td>
<td style="word-wrap: break-word;">1970-00-00</td>
<td>String</td>
<td>Start offset for streaming consuming. How to parse and compare offsets depends on your order. For create-time and partition-time, the offset should be a timestamp string. For partition-time, the partition time extractor is used to extract the time from the partition.</td>
Contributor


Mention the supported timestamp format is yyyy-[m]m-[d]d [hh:mm:ss]?
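The `yyyy-[m]m-[d]d [hh:mm:ss]` notation the reviewer mentions means month, day, and the time-of-day part are each optional in width or presence. A small illustrative Python sketch (the helper `parse_offset` is hypothetical, not a Flink API) of accepting both shapes:

```python
from datetime import datetime

def parse_offset(s):
    """Parse a consume-start-offset string of the form 'yyyy-[m]m-[d]d [hh:mm:ss]'.

    strptime's %m and %d accept both zero-padded and unpadded fields,
    so trying the two layouts covers the documented format.
    """
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported offset format: {s!r}")
```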

<td><h5>streaming-source.consume-order</h5></td>
<td style="word-wrap: break-word;">create-time</td>
<td>String</td>
<td>The consume order of the streaming source, supporting create-time and partition-time. create-time compares the partition/file creation time; this is not the partition creation time in the Hive metastore, but the folder/file creation time in the filesystem. partition-time compares the time represented by the partition name. For non-partition tables, this value should always be 'create-time'.</td>
Contributor


I think we should warn the users that the create-time here is actually the modification time in Hadoop FS. So if the partition folder somehow gets updated, e.g. changing ACL attributes, it can affect how the data is consumed.
And perhaps we should make partition-time as the default and recommended value for this property?

Contributor Author


+1 for defaulting to partition-time, but this option has to unify partition and non-partition tables, which makes that a little hard.

Contributor


I see. If we change default value to partition-time and user is using a non-partitioned table then the query will fail, right? If that's the case, then maybe we can go with create-time as default, but with proper warnings.
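The distinction under discussion can be sketched in Python (illustrative only; `partition_time` and `order_partitions` are hypothetical stand-ins, not Flink's partition time extractor). It shows how create-time ordering, driven by filesystem modification time, can diverge from partition-time ordering when an older partition folder is touched later:

```python
from datetime import datetime

def partition_time(partition_name):
    """Extract the time encoded in a partition name like 'dt=2020-06-11'."""
    value = partition_name.split("=", 1)[1]
    return datetime.strptime(value, "%Y-%m-%d")

def order_partitions(partitions, consume_order, mtimes=None):
    """Order partitions by the chosen consume order.

    - 'partition-time': compare the timestamp parsed from the partition name.
    - 'create-time': compare filesystem modification time, which can change
      if the folder is touched (e.g. an ACL update), as the review notes.
    """
    if consume_order == "partition-time":
        return sorted(partitions, key=partition_time)
    return sorted(partitions, key=lambda p: mtimes[p])

parts = ["dt=2020-06-10", "dt=2020-06-09"]
# The older partition's folder was modified after the newer one was created.
mtimes = {"dt=2020-06-10": 100, "dt=2020-06-09": 200}
by_partition_time = order_partitions(parts, "partition-time")
by_create_time = order_partitions(parts, "create-time", mtimes)
```

Under these assumptions the two orderings disagree, which is why the thread settles on keeping create-time as the default only with a proper warning attached.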


Note:

- Monitor strategy is scan all directories/files in location path now. If there is too more partitions, will have performance problems.
Contributor


Suggested change
- Monitor strategy is scan all directories/files in location path now. If there is too more partitions, will have performance problems.
- Monitor strategy is to scan all directories/files in location path now. If there are too many partitions, there will be performance problems.

Contributor

@lirui-apache lirui-apache left a comment


+1, thanks for updating

@JingsongLi JingsongLi merged commit 582d9e4 into apache:master Jun 11, 2020
JingsongLi added a commit that referenced this pull request Jun 11, 2020
zhangjun0x01 pushed a commit to zhangjun0x01/flink that referenced this pull request Jul 8, 2020
@JingsongLi JingsongLi deleted the streamWriteDoc branch November 5, 2020 09:44