-
Notifications
You must be signed in to change notification settings - Fork 8.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HADOOP-18582. skip unnecessary cleanup logic in distcp #5251
Conversation
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxWk , thank you for the patch. I entered one comment. Can you also please confirm if your Apache JIRA ID is 10000kang? If so, I can get HADOOP-18582 assigned to you.
@@ -152,6 +152,15 @@ public void abortJob(JobContext jobContext, | |||
} | |||
|
|||
private void cleanupTempFiles(JobContext context) { | |||
final boolean directWrite = context.getConfiguration().getBoolean( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an existing call to Configuration conf = context.getConfiguration();
below. Can you please lift that up and then reuse it for the conf.getBoolean(...)
calls that you need to add?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
yes,that's me |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits and you need to fix the checkstyle warnings reported
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5251/2/artifact/out/results-checkstyle-hadoop-tools_hadoop-distcp.txt
Rest changes LGTM
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
Outdated
Show resolved
Hide resolved
021e8f2
to
1d7dcd5
Compare
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
production code looks good, just test tuning. we always want meaninful messages on failure, and ContractTestUtils in hadoop-common does a lot for asserting about filesystem state
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
Outdated
Show resolved
Hide resolved
hadoop-tools/hadoop-distcp/src/test/java/org/apache/hadoop/tools/mapred/TestCopyCommitter.java
Outdated
Show resolved
Hide resolved
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
@MaxWk , thank you for the contribution and for incorporating code review feedback.
I won't commit yet, in case Steve wants to approve too. (I see Ayush has already approved.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There has been no additional feedback. I'll plan on committing this one later this week.
Co-authored-by: 万康 <mingge@xiaohongshu.com> Reviewed-by: Steve Loughran <stevel@apache.org> Signed-off-by: Ayush Saxena <ayushsaxena@apache.org> Signed-off-by: Chris Nauroth <cnauroth@apache.org> (cherry picked from commit 3b7b79b)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have committed this to trunk and branch-3.3. (This code doesn't compile when applied to branch-3.2, so I couldn't commit it there.)
@MaxWk , thank you for the contribution. @steveloughran and @ayushtkn , thank you for code reviewing.
nice. |
final boolean append = conf.getBoolean( | ||
DistCpOptionSwitch.APPEND.getConfigLabel(), false); | ||
final boolean useTempTarget = !append && !directWrite; | ||
if (!useTempTarget) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewing this, I think the cleanup should only be skipped on direct not append.
why so, append will choose between overwrite and append depending on the state of the destination file -a decision made case by case. when doing distcp with -append, new files or those whose dest checksum doesn't match that of the source for the same #of bytes will use a temp file unless -direct is set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't that same check as here in RetriableFileCopyCommand
final boolean toAppend = action == FileAction.APPEND;
final boolean useTempTarget = !toAppend && !directWrite;
Path targetPath = useTempTarget ? getTempFile(target, context) : target;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's doing it file by file: some files will be appended to, but new ones, ones with checksum mismatch are overwritten by first writing to temp, then rename, as usual
any -append job which added files can create temp paths
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanx for the explanation. Will give some time to the original author, In case no one volunteers for few days, I will fix this.
Going to have to do a followup here; we still need to do cleanup on -append. anyone want to take that up? |
Have removed it for append in #5409 |
Description of PR
reduce execution time when distcp in direct or append mode.https://issues.apache.org/jira/browse/HADOOP-18582
How was this patch tested?
For code changes:
LICENSE
,LICENSE-binary
,NOTICE-binary
files?