Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HIVE-27494: Deduplicate the task result that generated by more branch… #4479

Merged
merged 1 commit into from Jul 31, 2023

Conversation

dengzhhu653
Copy link
Member

@dengzhhu653 dengzhhu653 commented Jul 12, 2023

…es in union all

What changes were proposed in this pull request?

Why are the changes needed?

Think of an union all case inserting data to an external table(dynamic partition):
FS[1] specPath: <staging_dir>/-ext-10000/HIVE_UNION_SUBDIR_1
FS[2] specPath: <staging_dir>/-ext-10000/HIVE_UNION_SUBDIR_2

Finally, the output of FS will locate under the directory:
FS[1]: <staging_dir>/_tmp.-ext-10000/<dynamic_partition>/HIVE_UNION_SUBDIR_1
FS[2]: <staging_dir>/_tmp.-ext-10000/<dynamic_partition>/HIVE_UNION_SUBDIR_2

When the task finishes, FS[1] and FS[2] will call jobClose to move their output to <staging_dir>/-ext-10000, the MoveTask is responsible for moving it to the destination table.

int close(TezWork work, int rc, DAGClient dagClient) {
try {
List<BaseWork> ws = work.getAllWork();
for (BaseWork w: ws) {
if (w instanceof MergeJoinWork) {
w = ((MergeJoinWork) w).getMainWork();
}
for (Operator<?> op: w.getAllOperators()) {
op.jobClose(conf, rc == 0);
}
}

Before the FS[1] moves the data, it will rename <staging_dir>/_tmp.-ext-10000 to <staging_dir>/_tmp.-ext-10000.moved,
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1453-L1456

FS[1] will only deduplicate the result under the _tmp.-ext-10000.moved/<dynamic_partition>/HIVE_UNION_SUBDIR_1, and add the valid output file to filesKept, ignore the result under HIVE_UNION_SUBDIR_2 or more...
When it turns to FS[2] to close the operator, as the temp output directory has already been renamed to _tmp.-ext-10000.moved, so the FS[2] will do nothing.

So this will cause task result duplication, and in some cases, the result will lose as the filesKept only keeps the result of only one branch.

To solve this, one idea is making every FS have his own private temp output directory. For example:
FS[1]: <staging_dir>/-ext-10000/_tmp.HIVE_UNION_SUBDIR_1/<dynamic_partition>/HIVE_UNION_SUBDIR_1
FS[2]: <staging_dir>/-ext-10000/_tmp.HIVE_UNION_SUBDIR_2/<dynamic_partition>/HIVE_UNION_SUBDIR_2

When FS closes, each FS operator only takes care of his own output, and moves the temp output to <staging_dir>/-ext-10000 as it does before, and this method will bring some performance gain in case renaming to _tmp.-ext-10000.moved is not allowed:
https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/Utilities.java#L1461-L1462

Does this PR introduce any user-facing change?

No

Is the change a dependency upgrade?

No

How was this patch tested?

Added UT

@sonarcloud
Copy link

sonarcloud bot commented Jul 12, 2023

Kudos, SonarCloud Quality Gate passed!    Quality Gate passed

Bug A 0 Bugs
Vulnerability A 0 Vulnerabilities
Security Hotspot A 0 Security Hotspots
Code Smell A 7 Code Smells

No Coverage information No Coverage information
No Duplication information No Duplication information

@dengzhhu653
Copy link
Member Author

Hi @abstractdog, @deniskuzZ, @kasakrisz cloud you please take a look if have secs? thanks in advance!

Copy link
Member

@deniskuzZ deniskuzZ left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but let me take a bit more closer look

Copy link
Contributor

@kasakrisz kasakrisz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@dengzhhu653 dengzhhu653 merged commit 419301a into apache:master Jul 31, 2023
5 checks passed
tarak271 pushed a commit to tarak271/hive-1 that referenced this pull request Dec 19, 2023
…es in union all (Zhihua Deng, reviewed by Denys Kuzmenko, Krisztian Kasa, Stamatis Zampetakis)

Closes apache#4479
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants