
[FLINK-17733][FLINK-16448][hive][doc] Adjust Hive doc & Add documentation for real-time hive #12537

Merged
merged 4 commits on Jun 11, 2020

Conversation

JingsongLi
Contributor

What is the purpose of the change

Add a page documenting Hive streaming writing, reading, and joins.

Verifying this change

(screenshots attached)

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): no
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? no

@flinkbot
Collaborator

flinkbot commented Jun 9, 2020

Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
to review your pull request. We will use this comment to track the progress of the review.

Automated Checks

Last check on commit 765b588 (Tue Jun 09 03:19:11 UTC 2020)

✅ no warnings

Mention the bot in a comment to re-run the automated checks.

Review Progress

  • ❓ 1. The [description] looks good.
  • ❓ 2. There is [consensus] that the contribution should go into Flink.
  • ❓ 3. Needs [attention] from.
  • ❓ 4. The change fits into the overall [architecture].
  • ❓ 5. Overall code [quality] is good.

Please see the Pull Request Review Guide for a full explanation of the review process.


The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.

Bot commands

The @flinkbot bot supports the following commands:

  • @flinkbot approve description to approve one or more aspects (aspects: description, consensus, architecture and quality)
  • @flinkbot approve all to approve all aspects
  • @flinkbot approve-until architecture to approve everything until architecture
  • @flinkbot attention @username1 [@username2 ..] to require somebody's attention
  • @flinkbot disapprove architecture to remove an approval you gave earlier

@JingsongLi JingsongLi changed the title [FLINK-17733][hive][doc] Add documentation for real-time hive [FLINK-17733][FLINK-16448][hive][doc] Adjust Hive doc & Add documentation for real-time hive Jun 9, 2020
@flinkbot
Collaborator

flinkbot commented Jun 9, 2020

CI report:

Bot commands

The @flinkbot bot supports the following commands:
  • @flinkbot run travis re-run the last Travis build
  • @flinkbot run azure re-run the last Azure build

docs/dev/table/hive/hive_streaming.md (outdated; resolved)

To improve the real-time performance of Hive reading, Flink supports streaming reads of Hive tables:

- For a partition table, monitor the creation of new partitions and read each new partition incrementally.
Contributor


We need to mention that new data added to an existing partition won't be consumed

Contributor Author


I will explain more in NOTE.
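To make the partition-monitoring behavior discussed above concrete, here is an illustrative Python sketch (not Flink's actual implementation; the function name `discover_new_partitions` is hypothetical). It shows why only newly appearing partitions are consumed, and why files added later to an already-consumed partition are not picked up:

```python
def discover_new_partitions(all_partitions, consumed):
    """Return partitions not yet consumed, in sorted order, and mark them consumed."""
    new = sorted(p for p in all_partitions if p not in consumed)
    consumed.update(new)
    return new

consumed = set()
# First monitor cycle: two partitions exist, both are emitted.
first = discover_new_partitions({"dt=2020-06-09", "dt=2020-06-10"}, consumed)
# Second cycle: dt=2020-06-10 received more files, and dt=2020-06-11 appeared.
# Only the genuinely new partition is emitted; the updated one is skipped.
second = discover_new_partitions({"dt=2020-06-09", "dt=2020-06-10", "dt=2020-06-11"}, consumed)
```

Because discovery is keyed on partition identity rather than partition contents, data appended to an existing partition is invisible to this scheme, which is exactly the caveat the reviewer asks to document.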

<td><h5>streaming-source.enable</h5></td>
<td style="word-wrap: break-word;">false</td>
<td>Boolean</td>
<td>Enable streaming source or not. NOTES: For non-partition table, please make sure that each file should be put atomically into the target directory, otherwise the reader may get incomplete data.</td>
Contributor


For partitioned table, I think each partition should also be added atomically right?

Contributor Author


Yes, will be modified to:
Enable streaming source or not. NOTES: Please make sure that each partition/file is written atomically, otherwise the reader may get incomplete data.

<td><h5>streaming-source.consume-start-offset</h5></td>
<td style="word-wrap: break-word;">1970-00-00</td>
<td>String</td>
<td>Start offset for streaming consuming. How to parse and compare offsets depends on your order. For create-time and partition-time, the offset should be a timestamp string. For partition-time, the partition time extractor is used to extract the time from the partition.</td>
Contributor


Mention the supported timestamp format is yyyy-[m]m-[d]d [hh:mm:ss]?
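The `yyyy-[m]m-[d]d [hh:mm:ss]` notation the reviewer mentions means month, day, and the time-of-day part are each optional in width or presence. A small illustrative Python sketch (the helper `parse_offset` is hypothetical, not a Flink API) of accepting both shapes:

```python
from datetime import datetime

def parse_offset(s):
    """Parse a consume-start-offset string of the form 'yyyy-[m]m-[d]d [hh:mm:ss]'.

    strptime's %m and %d accept both zero-padded and unpadded fields,
    so trying the two layouts covers the documented format.
    """
    for fmt in ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d"):
        try:
            return datetime.strptime(s, fmt)
        except ValueError:
            continue
    raise ValueError(f"unsupported offset format: {s!r}")
```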

<td><h5>streaming-source.consume-order</h5></td>
<td style="word-wrap: break-word;">create-time</td>
<td>String</td>
<td>The consume order of the streaming source, supporting create-time and partition-time. create-time compares the partition/file creation time; this is not the partition creation time in the Hive metastore, but the folder/file creation time in the filesystem. partition-time compares the time represented by the partition name. For non-partition tables, this value should always be 'create-time'.</td>
Contributor


I think we should warn the users that the create-time here is actually the modification time in Hadoop FS. So if the partition folder somehow gets updated, e.g. changing ACL attributes, it can affect how the data is consumed.
And perhaps we should make partition-time as the default and recommended value for this property?

Contributor Author


+1 for defaulting to partition-time, but this option has to unify partition and non-partition tables, which makes that a little hard.

Contributor


I see. If we change default value to partition-time and user is using a non-partitioned table then the query will fail, right? If that's the case, then maybe we can go with create-time as default, but with proper warnings.
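The distinction under discussion can be sketched in Python (illustrative only; `partition_time` and `order_partitions` are hypothetical stand-ins, not Flink's partition time extractor). It shows how create-time ordering, driven by filesystem modification time, can diverge from partition-time ordering when an older partition folder is touched later:

```python
from datetime import datetime

def partition_time(partition_name):
    """Extract the time encoded in a partition name like 'dt=2020-06-11'."""
    value = partition_name.split("=", 1)[1]
    return datetime.strptime(value, "%Y-%m-%d")

def order_partitions(partitions, consume_order, mtimes=None):
    """Order partitions by the chosen consume order.

    - 'partition-time': compare the timestamp parsed from the partition name.
    - 'create-time': compare filesystem modification time, which can change
      if the folder is touched (e.g. an ACL update), as the review notes.
    """
    if consume_order == "partition-time":
        return sorted(partitions, key=partition_time)
    return sorted(partitions, key=lambda p: mtimes[p])

parts = ["dt=2020-06-10", "dt=2020-06-09"]
# The older partition's folder was modified after the newer one was created.
mtimes = {"dt=2020-06-10": 100, "dt=2020-06-09": 200}
by_partition_time = order_partitions(parts, "partition-time")
by_create_time = order_partitions(parts, "create-time", mtimes)
```

Under these assumptions the two orderings disagree, which is why the thread settles on keeping create-time as the default only with a proper warning attached.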


Note:

- Monitor strategy is scan all directories/files in location path now. If there is too more partitions, will have performance problems.
Contributor


Suggested change
- Monitor strategy is scan all directories/files in location path now. If there is too more partitions, will have performance problems.
- Monitor strategy is to scan all directories/files in location path now. If there are too many partitions, there will be performance problems.

Contributor

@lirui-apache lirui-apache left a comment


+1, thanks for updating

@JingsongLi JingsongLi merged commit 582d9e4 into apache:master Jun 11, 2020
JingsongLi added a commit that referenced this pull request Jun 11, 2020
zhangjun0x01 pushed a commit to zhangjun0x01/flink that referenced this pull request Jul 8, 2020
@JingsongLi JingsongLi deleted the streamWriteDoc branch November 5, 2020 09:44