-
Notifications
You must be signed in to change notification settings - Fork 9.2k
HADOOP-18723. Add detail logs if distcp checksum mismatch #5603
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
🎊 +1 overall
This message was automatically generated. |
ayushtkn
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Checkstyle has complains
|
@ayushtkn Thank you for the review. Updated the PR, PTAL. |
|
🎊 +1 overall
This message was automatically generated. |
| LOG.error("Checksum not equal. Source checksum: {}, target checksum: {}", | ||
| sourceChecksum, targetChecksum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Error seems to much for this case, info should be more than enough for these cases, and I don't think we need to put the checksum in the log, rather put the sourcePath and targetPath, or at worst both path & checksum
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log checksums at debug maybe
anyway, doesn't a checksum mismatch mean "source file needs copying again". so really the thing to log is not the mismatch but why the copy is taking place
steveloughran
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
commented. it is way too noisy for any cloud upload right now.
would be good here for you to run the s3 and abfs distcp contract tests and look at their output, or do a distcp from hdfs to one of these stores and see what the logs look like.
I am with @ayushtkn here: we must save ERROR log messages to where distcp actually fails
| LOG.error("Checksum not equal. Source checksum: {}, target checksum: {}", | ||
| sourceChecksum, targetChecksum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
log checksums at debug maybe
anyway, doesn't a checksum mismatch mean "source file needs copying again". so really the thing to log is not the mismatch but why the copy is taking place
| // comparison that took place and return not compatible. | ||
| // else if matched, return compatible with the matched result. | ||
| if (sourceChecksum == null || targetChecksum == null) { | ||
| LOG.error("Checksum incompatible. Source checksum: {}, target checksum: {}", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is not unusual against object stores, as all stores which don't have hdfs-+compatible checksums disable them so that distcp to cloud stores don't blow up. If you logged at error then there'd be an entry for every single copy.
propose: use a LogExactlyOnce at info to day source or target fs doesn't support checksums and that they should use -skipCrc
| fs, new Path(sourceBase + srcFilename), null, | ||
| fs, new Path(targetBase + srcFilename), | ||
| sourceCurrStatus.getLen())); | ||
| assertTrue(log.getOutput().contains("Checksum not equal")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asserts to use Assert's assertThat(log.getOutput).contains(...) for better message
|
The issue we met was not "source file needs copying again", but source file uploaded with a different checksum type. Because it's distcp, so the temp file was removed after the exception. We need to first skip the crc check, then run "hadoop fs -checksum" on the source file and target file to find out the root cause of this one. The initial idea of this addition of log was to help others with the same issue to save to process of redundant works and just check the log for mismatch reasons. |
|
💔 -1 overall
This message was automatically generated. |
|
🎊 +1 overall
This message was automatically generated. |
|
Well, I'm afraid your specific problem does not match Dee why do use cases of uploading to stores without checksums. Now, I would I've been happier if distcp's -skipCrc option was required to copy data from an FS with checksums to one without, but it is not and to add it now would break so many people's workflows. So what do we do here? maybe: create counters of why files were copied, specifically
Then after a job you can see why files were copied from the host where the job was launched. Then if you want to know why there were issues such as checksums and modtimes, you can log out to debug. Obviously, this will be something to add to the distcp documentation. Now: big warning. I am personally scared of distCp. It is a critical workflow tool and even use programmatically, yet it is surprisingly brittle. It is a running joke that's the last person two add any code to the module gets to field or support calls until someone else comes along. Thank you for volunteering! This also explains why we will be very reluctant/strict about taking on changes. Don't take it personally is as hey everyone gets that same grilling here. |
Totally understand. I have removed the error log if checksum is null for sourcePath and targetPath, so a FS without checksum won't be affected. I have also changed the error log to INFO level if both checksum exists but not equal, or a DEBUG level will be more preferred? Personally I think INFO level can save us more time since we don't need to reconfig and restart the job. |
| LOG.info("Checksum not equal. Source checksum: {}, target checksum: {}", | ||
| sourceChecksum, targetChecksum); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you aren't putting the paths for which checksum mismatch happened, if there are thousands of file being copied and bunch of them log this.
How would you figure out whose checksum didn't match
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our failed jobs, the mismatch source and target path was printed by https://github.com/apache/hadoop/blob/trunk/hadoop-tools/hadoop-distcp/src/main/java/org/apache/hadoop/tools/util/DistCpUtils.java#L633.
Thanks for pointing that out. The paths have also been added to DistCpUtils.
| fs, new Path(sourceBase + srcFilename), null, | ||
| fs, new Path(targetBase + srcFilename), | ||
| sourceCurrStatus.getLen())); | ||
| assertThat(log.getOutput(), containsString("Checksum not equal")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wrong import & assertThat.
This import
import static org.assertj.core.api.Assertions.assertThat;
and it works like this
assertThat(log.getOutput()).contains("Checksum not equal");
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copied the hamcrest version from "org.apache.hadoop.conf.TestReconfiguration".
But changed to the Assertions.assertThat one.
|
🎊 +1 overall
This message was automatically generated. |
|
We're closing this stale PR because it has been open for 100 days with no activity. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. |
Description of PR
We encountered some errors of mismatch checksum during Distcp jobs. It took us some time to figure out that checksum type is different.
Adding error logs shall help us to figure out such problems faster.
How was this patch tested?
Add unit test.
For code changes:
LICENSE,LICENSE-binary,NOTICE-binaryfiles?