HIVE-26758: Allow use scratchdir for staging final job #3831

Merged
merged 2 commits into from Dec 6, 2022

@yigress yigress commented Dec 5, 2022

What changes were proposed in this pull request?

  1. Add a Hive configuration property, hive.use.scratchdir.for.staging.

  2. For native tables (non-MM, non-direct-insert, non-ACID), change the dynamic partition staging directory layout from
    <dest_path>/<static_partition>/<staging_dir>/<dynamic_partition>
    to
    <dest_path>/<staging_dir>/<static_partition>/<dynamic_partition>

  3. When hive.use.scratchdir.for.staging=true, change FileSinkOperator's dirName and DynamicContext's sourcePath from
    <dest_path>/{hive.exec.stagingdir}
    to
    <hive.exec.scratchdir>
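
The layout change in item 2 can be illustrated with a small path-building sketch. This is not Hive's code: the function `staging_path`, its parameters, and the sample table path `/warehouse/t` are hypothetical, and `.staging_dir` stands in for whatever the staging directory is actually named.

```python
from pathlib import PurePosixPath


def staging_path(dest, staging_dir, static_part, dynamic_part, new_layout):
    """Build the directory a dynamic-partition writer would target.

    Illustrative only: the name and signature are hypothetical, not Hive APIs.
    """
    base = PurePosixPath(dest)
    if new_layout:
        # <dest_path>/<staging_dir>/<static_partition>/<dynamic_partition>
        return base / staging_dir / static_part / dynamic_part
    # <dest_path>/<static_partition>/<staging_dir>/<dynamic_partition>
    return base / static_part / staging_dir / dynamic_part


old = staging_path("/warehouse/t", ".staging_dir", "year=2001", "season=spring", False)
new = staging_path("/warehouse/t", ".staging_dir", "year=2001", "season=spring", True)
print(old)  # /warehouse/t/year=2001/.staging_dir/season=spring
print(new)  # /warehouse/t/.staging_dir/year=2001/season=spring
```

In the new layout the staging directory sits directly under the destination path, so everything below it is a plain partition subtree.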

For example, for the query
insert into/overwrite table partition(year=2001, season) select...

before the change, the FileSinkOperator conf has
<table_path>/year=2001/.staging_dir/season=xxx

After the change, it has
<table_path>/.staging_dir/year=2001/season=xxx

This change allows <table_path> to be swapped for another path, such as <hive.exec.scratchdir>; the MoveTask then moves the results into <table_path>.
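
Why the new layout makes the root swappable can be sketched as follows. This is a simplified model under stated assumptions, not the MoveTask implementation: `choose_staging_root` and `final_destination` are hypothetical helpers, and the paths `/warehouse/t`, `/tmp/hive/scratch`, and the file name `000000_0` are made up for illustration.

```python
from pathlib import PurePosixPath


def choose_staging_root(table_path, staging_dir, scratch_dir, use_scratch):
    """Pick the root under which query output is staged (hypothetical helper)."""
    if use_scratch:
        # Models hive.use.scratchdir.for.staging=true: stage under the scratch dir.
        return PurePosixPath(scratch_dir)
    # Default: stage under the destination table, inside the staging dir.
    return PurePosixPath(table_path) / staging_dir


def final_destination(staged_file, staging_root, table_path):
    """Map a staged path to where a move step would place it under the table."""
    relative = PurePosixPath(staged_file).relative_to(staging_root)
    return PurePosixPath(table_path) / relative


table = "/warehouse/t"
for use_scratch in (False, True):
    root = choose_staging_root(table, ".staging_dir", "/tmp/hive/scratch", use_scratch)
    staged = root / "year=2001" / "season=spring" / "000000_0"
    print(staged, "->", final_destination(staged, root, table))
```

Because the static and dynamic partition directories both sit below the staging root, the relative path is the same for either root, and the move step only needs to know the root and the table path.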

Why are the changes needed?

In the S3 blob storage optimization, HIVE-15121 / HIVE-17620 changed the interim job path to use hive.exec.scratchdir and the final job path to use hive.exec.stagingdir. https://issues.apache.org/jira/browse/HIVE-15215 remains open on whether to use the scratch directory as the staging directory for S3.

However, for blob storage where 'rename' is slow and there is no encryption, staging query results in the scratchdir and letting the MoveTask copy them to blob storage can improve performance. This is especially true when there is a FileMerge task.
It may also help in cross-filesystem cases, where the user wants to stage query results on the local cluster filesystem and then move them to the destination filesystem.

Does this PR introduce any user-facing change?

This adds a new hive configuration.

How was this patch tested?

@sunchao sunchao left a comment

LGTM with a few nits

yigress commented Dec 5, 2022

thanks @sunchao for the review! addressed comments

sonarcloud bot commented Dec 6, 2022

Kudos, SonarCloud Quality Gate passed!

Bugs: 0 (rating A)
Vulnerabilities: 0 (rating A)
Security Hotspots: 0 (rating A)
Code Smells: 2 (rating A)

No coverage information
No duplication information

@sunchao sunchao merged commit 6de98ba into apache:master Dec 6, 2022
sunchao commented Dec 6, 2022

Merged to master, thanks @yigress !

dengzhhu653 pushed a commit to dengzhhu653/hive that referenced this pull request Dec 15, 2022
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 16, 2022
DongWei-4 pushed a commit to DongWei-4/hive that referenced this pull request Dec 29, 2022
yeahyung pushed a commit to yeahyung/hive that referenced this pull request Jul 20, 2023
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023