[ADAM-1635] Eliminate passing FASTQ splittable status via config. #1636

Merged

Member

@fnothaft fnothaft commented Jul 26, 2017

Resolves #1635. Instead of passing whether a FASTQ file is splittable via the Hadoop configuration, this checks whether the compression codec is splittable, which is more reliable. In the case of a .gz file, the BGZFEnhancedGZipCodec handles this edge case by checking the stream type; coupled with our explicit check of the stream when picking splits, this ensures that we don't try to create an invalid GZIP split. Additionally, I identified and fixed an error in the old FASTQ code, which did a seek on the uncompressed input stream to backtrack if it saw a line of quality scores that began with @ while identifying the position of the first valid record in a split. Instead, we now check for two successive lines that start with @, which indicates that the first line contains quality scores and the second line contains the read name.
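
The two-successive-lines check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the class `FastqRecordStart` and the method `firstRecordStart` are hypothetical names, and the sketch assumes unwrapped four-line FASTQ records already held in memory as a list of lines rather than read from a Hadoop input stream.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, for illustration only; not part of ADAM.
class FastqRecordStart {

    /**
     * Returns the index of the first line that begins a FASTQ record, or -1
     * if no line starts with '@'. Assumes four-line (unwrapped) records, so
     * only read-name lines and quality lines can start with '@'.
     */
    static int firstRecordStart(List<String> lines) {
        for (int i = 0; i < lines.size(); i++) {
            if (!lines.get(i).startsWith("@")) {
                continue; // sequence line or '+' separator line
            }
            // Two successive '@' lines: the first is a quality line, the
            // second is the read name. A read name is followed by a sequence
            // line, which never starts with '@'.
            boolean nextStartsWithAt =
                i + 1 < lines.size() && lines.get(i + 1).startsWith("@");
            return nextStartsWithAt ? i + 1 : i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // The split begins on a quality line that happens to start with '@'.
        List<String> lines =
            Arrays.asList("@IIIIIII", "@read2", "ACGTACGT", "+", "IIIIIIII");
        System.out.println(firstRecordStart(lines)); // prints 1
    }
}
```

Because the decision only needs a look at the following line, no backward seek on the stream is required.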

@fnothaft fnothaft added this to the 0.23.0 milestone Jul 26, 2017

@coveralls coveralls commented Jul 26, 2017

Coverage Status

Coverage remained the same at 83.961% when pulling e64119b on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.


@AmplabJenkins AmplabJenkins commented Jul 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2281/
Test PASSed.

// a contract where it will put the file's splittable status into the hadoop
// configuration object.
isSplittable = conf.getBoolean(FastqInputFormat.FILE_SPLITTABLE, false);
// if our codec is splittable, we can (tentatively) say that

heuermh Jul 26, 2017
Member

Yer editor accidentally used tabs for some of these lines

fnothaft Jul 26, 2017
Author Member

Thanks for catching this; I was editing this patch on a different computer from my usual and I was wondering why the diff looked weird.

reader = new LineReader(stream);
} else {
// see above note about
// SplittableCompressionCodec.createInputStream needing the stream
// to be at offset 0
stream.seek(0);

heuermh Jul 26, 2017
Member

the comment above this line can be removed

fnothaft Jul 26, 2017
Author Member

I think this is still useful info to keep around, but I'll update the comment to better reflect the changed code.

Member Author

@fnothaft fnothaft commented Jul 26, 2017

Pushed a commit addressing reviewer comments.

BTW @heuermh do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.


@coveralls coveralls commented Jul 26, 2017

Coverage Status

Coverage remained the same at 83.961% when pulling a78b510 on fnothaft:issues/1635-no-splittable-fastq-config into 7449b14 on bigdatagenomics:master.


@AmplabJenkins AmplabJenkins commented Jul 26, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2284/
Test PASSed.

Member

@heuermh heuermh commented Jul 26, 2017

> do you think it would be worthwhile to add something to our CI that would flag any tabs in our source and fail the build? I would've missed those if you hadn't caught them.

We have a linter that runs on the Scala source; this made it through because it was a Java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

@heuermh heuermh merged commit c8a2202 into bigdatagenomics:master Jul 26, 2017
3 checks passed
codacy/pr Good work! A positive pull request.
coverage/coveralls Coverage remained the same at 83.961%
default Merged build finished.
Member

@heuermh heuermh commented Jul 26, 2017

Thank you, @fnothaft

@fnothaft fnothaft deleted the fnothaft:issues/1635-no-splittable-fastq-config branch Jul 26, 2017
Member

@heuermh heuermh commented Jul 26, 2017

Sorry, wrong button, I should've squashed.

Member Author

@fnothaft fnothaft commented Jul 26, 2017

> We have a linter that runs on the Scala source; this made it through because it was a Java source file. I don't think we can put a CI check on the whole repo because some of our test resources require tab characters.

I mean, sure, but we could do something like:

find adam-*/src -name "*.java" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.R" -exec ./scripts/failIfHasTabs.sh {} \;
find adam-*/src -name "*.py" -exec ./scripts/failIfHasTabs.sh {} \;
Member

@heuermh heuermh commented Jul 26, 2017

+1, add *.pom, *.sh
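
For illustration, the per-extension tab check proposed above might look something like the sketch below. The script `failIfHasTabs.sh` is not shown in this thread, so this is only a guess at the intended behavior, written in Java to match the source file under review; the class `TabCheck` and its methods are hypothetical names. Restricting the scan to a list of file suffixes is what keeps test resources that need tab characters out of the check.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, for illustration only; not part of ADAM's build.
class TabCheck {

    /** True if the file contains at least one tab byte (0x09). */
    static boolean hasTab(Path file) {
        try {
            for (byte b : Files.readAllBytes(file)) {
                if (b == '\t') {
                    return true;
                }
            }
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if the file name ends with one of the given suffixes. */
    static boolean hasCheckedSuffix(Path file, List<String> suffixes) {
        String name = file.getFileName().toString();
        return suffixes.stream().anyMatch(name::endsWith);
    }

    /** Walks root and returns, in sorted order, matching files that contain tabs. */
    static List<Path> filesWithTabs(Path root, List<String> suffixes) throws IOException {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths
                .filter(Files::isRegularFile)
                .filter(p -> hasCheckedSuffix(p, suffixes))
                .filter(TabCheck::hasTab)
                .sorted()
                .collect(Collectors.toList());
        }
    }

    public static void main(String[] args) throws IOException {
        // Suffixes taken from the suggestions in this thread.
        List<String> suffixes = Arrays.asList(".java", ".R", ".py", ".pom", ".sh");
        Path root = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> offenders = filesWithTabs(root, suffixes);
        offenders.forEach(p -> System.err.println("tab character found in " + p));
        if (!offenders.isEmpty()) {
            System.exit(1); // non-zero exit status so a CI step would fail
        }
    }
}
```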

@heuermh heuermh added this to Completed in Release 0.23.0 Jan 4, 2018