[ADAM-384] Adds import from FASTQ. #385

Merged
merged 1 commit into from Oct 1, 2014

@fnothaft
Member

fnothaft commented Sep 17, 2014

Resolves #384. Adds:

  • Load from "single ended" FASTQ
  • Load from interleaved FASTQ
  • Load from non-interleaved paired end (two file) FASTQ

Loading from single-ended FASTQ and from interleaved FASTQ is handled seamlessly by the ADAMContext.adamLoad method. Since paired-end (but non-interleaved) data requires two file paths, I haven't added it to adamLoad; thus, it sits on its own.

I wrote our own FASTQ input format instead of using the one in Hadoop-BAM; theirs is only compatible with Hadoop 1, performs unnecessary parsing, and doesn't seem to pick splits correctly anyways. The SingleFastqInputFormat is almost a direct copy of the InterleavedFastqInputFormat, which we've tested pretty well on clusters.
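
As a rough illustration of the three load modes listed above, here is a minimal sketch in plain Scala. It is not the input format code from this PR: `FastqSketch`, `FastqRead`, and all of the method names are hypothetical, it works on in-memory lines rather than Hadoop splits, and it assumes single-line sequences (four lines per record).

```scala
// Illustrative sketch only: every name here is hypothetical, not ADAM's API.
object FastqSketch {
  final case class FastqRead(name: String, sequence: String, quality: String)

  // Parse one 4-line record: "@name", sequence, "+[name]", quality.
  def parseRecord(lines: Seq[String]): FastqRead = {
    require(lines.length == 4, s"expected 4 lines, got ${lines.length}")
    require(lines(0).startsWith("@"), "header line must start with '@'")
    require(lines(2).startsWith("+"), "separator line must start with '+'")
    require(lines(1).length == lines(3).length, "sequence/quality length mismatch")
    FastqRead(lines(0).drop(1), lines(1), lines(3))
  }

  // Single-ended: every 4 lines form one read.
  def parseSingle(lines: Seq[String]): List[FastqRead] =
    lines.grouped(4).map(parseRecord).toList

  // Interleaved: every 8 lines form a pair (first of pair, then second of pair).
  def parseInterleaved(lines: Seq[String]): List[(FastqRead, FastqRead)] =
    lines.grouped(8).map { chunk =>
      require(chunk.length == 8, s"expected 8 lines per pair, got ${chunk.length}")
      (parseRecord(chunk.take(4)), parseRecord(chunk.drop(4)))
    }.toList

  // Two-file paired end: reads are matched up by position across the two files,
  // which is why this mode needs two inputs instead of one.
  def parsePaired(first: Seq[String], second: Seq[String]): List[(FastqRead, FastqRead)] = {
    val a = parseSingle(first)
    val b = parseSingle(second)
    require(a.length == b.length, "the two files must contain the same number of reads")
    a.zip(b)
  }
}
```

The two-input signature of `parsePaired` mirrors why the non-interleaved paired case does not fit behind a single-path load method.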

* quality string?
*
* For now I'm going to assume single-line sequences. This works for our sequencing
* application. We'll see if someone complains in other applications.

@tdanford

tdanford Oct 1, 2014

Contributor

The possibility of multi-line FASTQ sequences is a well-known bug in the format definition, for precisely the reason you point out here. +1 on your call to "ignore, and revisit if someone complains."

@fnothaft

fnothaft Oct 1, 2014

Member

Agreed, bug/"feature". Alas, the curse of text based file formats.

Anyways, we'll need to fix that (sooner rather than later); I chose to not tackle it now, because Hadoop-BAM hadn't tackled it either.
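
For context on why multi-line records are awkward: in the common Phred+33 encoding, `@` and `+` are both valid quality characters, so a quality line can look like a header or a separator. One possible way to handle this (a sketch only, not what this PR implements) is to read sequence lines up to the `+` separator and then read quality lines until as many quality characters as bases have been collected. The names `MultiLineFastqSketch`, `FastqRead`, and `parse` are hypothetical, and the sketch assumes non-empty sequences and well-formed, untruncated input.

```scala
// Illustrative sketch only: hypothetical names, not part of ADAM or Hadoop-BAM.
object MultiLineFastqSketch {
  final case class FastqRead(name: String, sequence: String, quality: String)

  def parse(lines: Iterator[String]): List[FastqRead] = {
    val out = List.newBuilder[FastqRead]
    while (lines.hasNext) {
      val header = lines.next()
      require(header.startsWith("@"), s"expected '@' header, got: $header")

      // Sequence lines continue until the '+' separator line.
      val seq = new StringBuilder
      var line = lines.next()
      while (!line.startsWith("+")) {
        seq ++= line
        line = lines.next()
      }

      // Quality lines continue until we have one quality character per base.
      // Counting characters (rather than looking for '@') avoids mistaking a
      // quality line that starts with '@' for the next record's header.
      val qual = new StringBuilder
      while (qual.length < seq.length) qual ++= lines.next()
      require(qual.length == seq.length, "quality string longer than sequence")

      out += FastqRead(header.drop(1), seq.toString, qual.toString)
    }
    out.result()
  }
}
```

Note that this length-counting approach needs to read a record from its start, which is part of what makes picking split boundaries hard for a distributed input format.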

class FastqRecordConverter extends Serializable with Logging {
def convertPair(element: (Void, Text)): Iterable[AlignmentRecord] = {

@tdanford

tdanford Oct 1, 2014

Contributor

Out of curiosity, what's the reasoning behind returning an Iterable[AlignmentRecord] rather than (AlignmentRecord, AlignmentRecord)?

@fnothaft

fnothaft Oct 1, 2014

Member

It's been a while, but I believe the reasoning was that an Iterable can be flatMapped.
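
To make the flatMap point concrete, here is a small sketch using plain Scala collections (Spark's `RDD.flatMap` likewise expects a function that returns a collection). `FlatMapSketch`, `Read`, and the method names are hypothetical stand-ins, not the `FastqRecordConverter` code under review.

```scala
// Illustrative sketch only: hypothetical stand-ins for the converter in this PR.
object FlatMapSketch {
  final case class Read(name: String, firstOfPair: Boolean)

  // Returning an Iterable lets callers pass the converter straight to flatMap.
  def convertPairAsIterable(name: String): Iterable[Read] =
    List(Read(name, firstOfPair = true), Read(name, firstOfPair = false))

  // Returning a tuple encodes "exactly two" in the type, but a tuple is not a
  // collection, so it has to be unpacked before it can be flattened.
  def convertPairAsTuple(name: String): (Read, Read) =
    (Read(name, firstOfPair = true), Read(name, firstOfPair = false))

  def flattenIterable(names: Seq[String]): Seq[Read] =
    names.flatMap(convertPairAsIterable)

  def flattenTuple(names: Seq[String]): Seq[Read] =
    names.map(convertPairAsTuple).flatMap { case (a, b) => Seq(a, b) }
}
```

Both versions yield the same flat sequence of reads; the tuple form simply needs one extra unpacking step at each call site.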

@tdanford

tdanford Oct 1, 2014

Contributor

See my comments inline; this PR also needs a rebase off the latest master, @fnothaft. Otherwise, it's looking pretty good to me!

@fnothaft

fnothaft Oct 1, 2014

Member

Changes made and code is rebased. Thanks for the review, @tdanford !

@AmplabJenkins

AmplabJenkins Oct 1, 2014

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/270/

massie added a commit that referenced this pull request Oct 1, 2014

Merge pull request #385 from fnothaft/fastq-input
[ADAM-384] Adds import from FASTQ.

@massie massie merged commit 99e6f2d into bigdatagenomics:master Oct 1, 2014

1 check passed

default Merged build finished.
@massie

massie Oct 1, 2014

Member

Thanks, Frank!
