Support loading paired fastqs #831

Closed

wants to merge 8 commits into master

Conversation

5 participants
@ryan-williams
Member

ryan-williams commented Sep 24, 2015

fixes #771

@AmplabJenkins

AmplabJenkins commented Sep 24, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/925/

@AmplabJenkins

AmplabJenkins commented Sep 24, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/928/

@AmplabJenkins

AmplabJenkins commented Sep 24, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/930/

@fnothaft
Member

fnothaft commented Sep 25, 2015

Overall LGTM! I've dropped a few comments inline RE: style nits and semantics.

@ryan-williams
Member

ryan-williams commented Sep 25, 2015

Per my first comment response, I think I want to split out the filename->readGroup stuff. It's a little dirty and I don't know that I actually need it.

Also, I think having a way for transform to convert two FASTQ files into one ADAM file is sensible and missing from this PR, so I'm going to add that and respond to your other comments. Thanks for the review; let me know if you have opinions on any of that plan.

@fnothaft
Member

fnothaft commented Sep 25, 2015

That plan sounds great here; I would agree that 2 FASTQ --> 1 ADAM is definitely useful.

@ryan-williams
Member

ryan-williams commented Sep 25, 2015

OK, much simpler PR now, and the functionality is exposed via transform.

@fnothaft
Member

fnothaft commented Sep 25, 2015

Jenkins, test this please.

@AmplabJenkins

AmplabJenkins commented Sep 25, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/936/

@ryan-williams
Member

ryan-williams commented Sep 30, 2015

I've added some more functionality here, pretty well delineated into individual commits if you want to browse them that way. This all addresses issues I encountered transforming the Illumina Platinum FASTQs to ADAM format:

  • set recordGroupName via cmdline arg -record_group
  • merge a set of alignments with another via cmdline arg -concat <filename>
  • control validation stringency via cmdline arg -stringency (replaces -strict_BQSR)

With all of this, I can run a command like the following, which seems to me like the minimum functionality necessary to get the data we need out of paired .fastqs and into output .adams:

bin/adam-submit -- transform foo_1.fastq -paired_fastq foo_2.fastq -concat foo_unpaired.fastq -record_group foo foo.adam

Split out onto more lines, for clarity:

bin/adam-submit --           \
  transform                  \
  foo_1.fastq                \
  -paired_fastq foo_2.fastq  \
  -concat foo_unpaired.fastq \
  -record_group foo          \
  foo.adam

This takes a _1.fastq file, a _2.fastq file, and a corresponding _unpaired.fastq file (present in the 50x-coverage Platinum data) and makes one .adam file / RDD[AlignmentRecord] out of them, setting the recordGroup name in the process.

I've run variations of this over the ~14TB of Platinum data several times, and passing -stringency SILENT as well is a huge time-saver, since it skips two .cache.counts.

I'm interested in people's thoughts on this; there are several bits of work herein, but together they comprise support for loading paired-FASTQ data in the minimum sane way, from my experimenting.
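The pairing-plus-merge behavior described above can be sketched in miniature, outside Spark. This is a hypothetical illustration only (ADAM's real code path builds an RDD[AlignmentRecord]; the function names and record shape below are invented), showing what -paired_fastq and -concat accomplish conceptually: match first- and second-of-pair reads by name, then append the unpaired reads, tagging everything with one record-group name.

```python
def parse_fastq(lines):
    """Yield (read_name, sequence) from FASTQ text (4 lines per record)."""
    for i in range(0, len(lines), 4):
        # "@r1/1" -> name "r1"; the /1 or /2 suffix marks the pair member.
        yield lines[i].lstrip("@").split("/")[0], lines[i + 1]

def load_paired(fastq1, fastq2, unpaired, record_group):
    reads1 = list(parse_fastq(fastq1))
    reads2 = list(parse_fastq(fastq2))
    # The validation step: both paired files must have the same record count.
    assert len(reads1) == len(reads2), "paired FASTQs differ in length"
    records = []
    for (n1, s1), (n2, s2) in zip(reads1, reads2):
        assert n1 == n2, "read names out of sync"
        records.append({"name": n1, "seq": s1, "mate": s2,
                        "paired": True, "recordGroup": record_group})
    # The -concat step: union in the unpaired reads.
    for name, seq in parse_fastq(unpaired):
        records.append({"name": name, "seq": seq, "mate": None,
                        "paired": False, "recordGroup": record_group})
    return records
```

For example, feeding in one paired read (foo_1/foo_2) plus one unpaired read yields two records, all carrying the same record group.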

@fnothaft
Member

fnothaft commented Oct 1, 2015

OOC, why is the last commit (make flagstat safe...) rolled into this PR? All the others look OK, except possibly for 2df67c3, but I'll comment on that one inline.

Reviewing now.

@Args4jOption(required = false, name = "-record_group", usage = "Set converted FASTQs' record-group names to this value; if empty-string is passed, use the basename of the input file, minus the extension.")
var fastqRecordGroup: String = null
@Args4jOption(required = false, name = "-concat", usage = "Concatenate this file with <INPUT> and write the result to <OUTPUT>")
var concatFilename: String = null

@fnothaft

fnothaft Oct 1, 2015
Member

As I was commenting on the main page of the PR, I don't get the use case for this. If you want to consolidate multiple files, why wouldn't you just use this command with the glob syntax to select multiple input files?

@ryan-williams

ryan-williams Oct 2, 2015
Member

Well, in the paired-FASTQ case (cf. the example cmdline on the main thread), I don't think using a glob as the <INPUT> argument helps me: in the presence of the -paired_fastq flag, the <INPUT> is interpreted as a FASTQ with the first-of-pair reads; the -concat flag is used to add some unpaired reads into the same .adam as the paired reads.

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

In any case, allowing for an explicit way to merge two files together in transform, which the -concat flag does here, seems worth doing, e.g. for cases where a glob is too coarse a mechanism.

I do have some desire to make an ADAMCommand that will merge two or more RDD[AlignmentRecord] files, since in the 200x Illumina Platinum data there are many FASTQs from each individual and I want to merge them into per-individual FASTQs, but I'd open a separate issue/PR for that.
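The distinction at issue here, glob-as-union versus pairwise zip, can be illustrated with plain Python lists standing in for RDDs (a hypothetical sketch; Spark's actual operators are RDD.union and RDD.zip):

```python
# Two "RDDs" of read names: first-of-pair and second-of-pair.
first_of_pair  = ["r1/1", "r2/1", "r3/1"]
second_of_pair = ["r1/2", "r2/2", "r3/2"]
unpaired       = ["r4"]

# What a glob over all three files gives: one flat union, with no notion
# of which record pairs with which.
union = first_of_pair + unpaired + second_of_pair

# What paired-FASTQ loading needs: element-wise pairing, which only makes
# sense over exactly the two paired files, in matching order.
pairs = list(zip(first_of_pair, second_of_pair))
```

With the unpaired reads mixed in via a glob, any attempt to zip would pair the wrong records; hence the separate -paired_fastq and -concat inputs.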

@fnothaft

fnothaft Oct 2, 2015
Member

I think the paired-FASTQ case makes a lot of sense, but to me, that is a special case (hence the inclusion of the separate -paired_fastq flag and code path).

The glob syntax would result in trying to zip up one RDD that is [first-of-pair and unpaired reads] with a second that is [second-of-pair] reads.

Glob should give the union, not the zip. For me, I would see the merge command that you described as duplicative of globbing. In the case you're describing, running adam-submit merge RG1.adam RG2.adam RG3.adam ... merged.adam; adam-submit flagstat merged.adam and adam-submit flagstat RG*.adam (RG1,2,3,...,n.adam are the converted FASTQs, one per read group) should yield equivalent results. Am I oversimplifying?

@ryan-williams

ryan-williams Oct 2, 2015
Member

The glob syntax would result in trying to zip up one RDD … with a second

Glob should give the union, not the zip.

Yea, but in the paired-FASTQ case I need something more like a zip (in the validation code path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use case here.

For me, I would see the merge command that you described as duplicative of globbing.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me. I think it's good to support globbing, but we should also expose the general capability that it represents (merging RDD[AR]s) more explicitly.

One semi-related thought is that maybe implying "concatenation" is not ideal, since what is really happening is a "union"? I could make this flag -merge instead of -concat if you like.

In fact, one possible approach I'd considered for exposing a more general "merge" capability was to have a -merge flag that takes a comma-delimited list of filenames to merge together (since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

Thanks, let me know your thoughts on all of this.

@fnothaft

fnothaft Oct 2, 2015
Member

Yea, but in the paired-fastq case I need something more like a zip (in the validation code-path; really I just need to verify the two files' counts), not a union; my point was that, given that globbing does a union, it is not suitable for my use-case here.

Oh, yeah, agreed there.

Only if the files being merged are all the files in a directory, right? What if I want to merge two specific files from a directory that has 10 similarly-named files in it?

I don't think that restriction exists... Let me test this out.

The main point I'm trying to make is: we're exposing a logical "merge" capability but hiding it in the "glob" code path (which also restricts the kinds of filenames that can be merged), which feels indirect/confusing/obfuscated to me; I think it's good to support globbing but we should also expose the general capability that it represents (merging RDD[AR]s) more explicitly.

My feeling here is that the "Hadoopey" way to "merge" files is with the glob syntax.

(since args4j and the larger Scala ecosystem in 2015 have no solution for parsing a variable number of arguments, afaik).

LOL, why would anyone ever want to parse a variable number of args, right?

@ryan-williams

ryan-williams Oct 2, 2015
Member

I don't think that restriction exists... Let me test this out.

Scenario: a directory contains the files:

foo1.adam
foo2.adam
foo3.adam
foo4.adam

I want to merge foo1.adam and foo3.adam. How do I do that with a glob?

@fnothaft

fnothaft Oct 2, 2015

Member

Here's how. So, I've got the NCI-60 cell line VCFs on my cluster:

[fnothaft@amp-bdg-master adam]$ hdfs dfs -ls nci-60
Found 62 items
-rw-r--r--   3 fnothaft bdg    3134256 2015-08-14 12:56 nci-60/786-0.vcf
-rw-r--r--   3 fnothaft bdg    2978900 2015-08-14 12:56 nci-60/A498.vcf
-rw-r--r--   3 fnothaft bdg    2820609 2015-08-14 12:56 nci-60/A549_ATCC.vcf
-rw-r--r--   3 fnothaft bdg    2542611 2015-08-14 12:56 nci-60/ACHN.vcf
-rw-r--r--   3 fnothaft bdg    2271851 2015-08-14 12:56 nci-60/BT-549.vcf
-rw-r--r--   3 fnothaft bdg    2479103 2015-08-14 12:56 nci-60/CAKI-1.vcf
-rw-r--r--   3 fnothaft bdg    3587049 2015-08-14 12:56 nci-60/CCRF-CEM.vcf
-rw-r--r--   3 fnothaft bdg    3167485 2015-08-14 12:56 nci-60/COLO-205.vcf
-rw-r--r--   3 fnothaft bdg    2340965 2015-08-14 12:56 nci-60/DU145.vcf
-rw-r--r--   3 fnothaft bdg    3060940 2015-08-14 12:57 nci-60/EKVX.vcf
-rw-r--r--   3 fnothaft bdg    4637113 2015-08-14 12:57 nci-60/HCC2998.vcf
-rw-r--r--   3 fnothaft bdg    3639758 2015-08-14 12:57 nci-60/HCT-116.vcf
-rw-r--r--   3 fnothaft bdg    4228608 2015-08-14 12:57 nci-60/HCT-15.vcf
-rw-r--r--   3 fnothaft bdg    3154541 2015-08-14 12:57 nci-60/HL-60.vcf
-rw-r--r--   3 fnothaft bdg    3053736 2015-08-14 12:57 nci-60/HOP-62.vcf
-rw-r--r--   3 fnothaft bdg    2630902 2015-08-14 12:57 nci-60/HOP-92.vcf
-rw-r--r--   3 fnothaft bdg    3421877 2015-08-14 12:57 nci-60/HT29.vcf
-rw-r--r--   3 fnothaft bdg    3074905 2015-08-14 12:57 nci-60/Hs_578T.vcf
-rw-r--r--   3 fnothaft bdg    2062796 2015-08-14 12:57 nci-60/IGR-OV1.vcf
-rw-r--r--   3 fnothaft bdg    3097878 2015-08-14 12:57 nci-60/K562.vcf
-rw-r--r--   3 fnothaft bdg    3549820 2015-08-14 12:57 nci-60/KM12.vcf
-rw-r--r--   3 fnothaft bdg    2990354 2015-08-14 12:58 nci-60/LOX_IMVI.vcf
-rw-r--r--   3 fnothaft bdg    3209673 2015-08-14 12:58 nci-60/M14.vcf
-rw-r--r--   3 fnothaft bdg    3130806 2015-08-14 12:58 nci-60/MALME-3M.vcf
-rw-r--r--   3 fnothaft bdg    3111573 2015-08-14 12:58 nci-60/MCF7.vcf
-rw-r--r--   3 fnothaft bdg    2585546 2015-08-14 12:58 nci-60/MDA-MB-231.vcf
-rw-r--r--   3 fnothaft bdg    3207907 2015-08-14 12:58 nci-60/MDA-MB-435.vcf
-rw-r--r--   3 fnothaft bdg    2815280 2015-08-14 12:58 nci-60/MDA-MB-468.vcf
-rw-r--r--   3 fnothaft bdg    3227869 2015-08-14 12:58 nci-60/MDA-N.vcf
-rw-r--r--   3 fnothaft bdg    3859048 2015-08-14 12:58 nci-60/MOLT-4.vcf
-rw-r--r--   3 fnothaft bdg    2201289 2015-08-14 12:58 nci-60/NCI-ADR-RES.vcf
-rw-r--r--   3 fnothaft bdg    2501709 2015-08-14 12:58 nci-60/NCI-H226.vcf
-rw-r--r--   3 fnothaft bdg    2824803 2015-08-14 12:58 nci-60/NCI-H23.vcf
-rw-r--r--   3 fnothaft bdg    2393886 2015-08-14 12:59 nci-60/NCI-H322M.vcf
-rw-r--r--   3 fnothaft bdg    2995505 2015-08-14 12:59 nci-60/NCI-H460.vcf
-rw-r--r--   3 fnothaft bdg    2643773 2015-08-14 12:59 nci-60/NCI-H522.vcf
-rw-r--r--   3 fnothaft bdg    3048289 2015-08-14 12:59 nci-60/OVCAR-3.vcf
-rw-r--r--   3 fnothaft bdg    2758191 2015-08-14 12:59 nci-60/OVCAR-4.vcf
-rw-r--r--   3 fnothaft bdg    3022511 2015-08-14 12:59 nci-60/OVCAR-5.vcf
-rw-r--r--   3 fnothaft bdg    2156678 2015-08-14 12:59 nci-60/OVCAR-8.vcf
-rw-r--r--   3 fnothaft bdg    2665990 2015-08-14 12:59 nci-60/PC-3.vcf
-rw-r--r--   3 fnothaft bdg    3514785 2015-08-14 12:59 nci-60/RPMI-8226.vcf
-rw-r--r--   3 fnothaft bdg    2960508 2015-08-14 12:59 nci-60/RXF-393.vcf
-rw-r--r--   3 fnothaft bdg    2803561 2015-08-14 12:59 nci-60/SF-268.vcf
-rw-r--r--   3 fnothaft bdg    2349250 2015-08-14 12:59 nci-60/SF-295.vcf
-rw-r--r--   3 fnothaft bdg    2986553 2015-08-14 13:00 nci-60/SF-539.vcf
-rw-r--r--   3 fnothaft bdg    3068189 2015-08-14 13:00 nci-60/SK-MEL-2.vcf
-rw-r--r--   3 fnothaft bdg    2873485 2015-08-14 13:00 nci-60/SK-MEL-28.vcf
-rw-r--r--   3 fnothaft bdg    2482513 2015-08-14 13:00 nci-60/SK-MEL-5.vcf
-rw-r--r--   3 fnothaft bdg    2613323 2015-08-14 13:00 nci-60/SK-OV-3.vcf
-rw-r--r--   3 fnothaft bdg    2375594 2015-08-14 13:00 nci-60/SN12C.vcf
-rw-r--r--   3 fnothaft bdg    2616109 2015-08-14 13:00 nci-60/SNB-19.vcf
-rw-r--r--   3 fnothaft bdg    2789099 2015-08-14 13:00 nci-60/SNB-75.vcf
-rw-r--r--   3 fnothaft bdg    3194543 2015-08-14 13:00 nci-60/SR.vcf
-rw-r--r--   3 fnothaft bdg    2120777 2015-08-14 13:00 nci-60/SW-620.vcf
-rw-r--r--   3 fnothaft bdg    2923266 2015-08-14 13:00 nci-60/T-47D.vcf
-rw-r--r--   3 fnothaft bdg    2886430 2015-08-14 13:00 nci-60/TK-10.vcf
-rw-r--r--   3 fnothaft bdg    3838582 2015-08-14 13:01 nci-60/U251.vcf
-rw-r--r--   3 fnothaft bdg    2478462 2015-08-14 13:01 nci-60/UACC-257.vcf
-rw-r--r--   3 fnothaft bdg    3186749 2015-08-14 13:01 nci-60/UACC-62.vcf
-rw-r--r--   3 fnothaft bdg    3085119 2015-08-14 13:01 nci-60/UO-31.vcf
...

If I want to convert just OVCAR-3 and OVCAR-8, I do:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "nci-60/OVCAR-{3,8}.vcf" nci-60/OVCAR-38.adam

Of course, use caution to ensure that the shell doesn't expand your glob (hence the quotes above).
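Hadoop's glob syntax supports {a,b} alternation, which lets a pattern name specific files rather than a whole directory. A minimal single-level brace expansion (a hypothetical sketch of the mechanism, not Hadoop's implementation) shows which paths a pattern like nci-60/OVCAR-{3,8}.vcf selects:

```python
import re

def expand_braces(pattern):
    """Expand {a,b,...} alternation in a glob pattern, Hadoop-style."""
    m = re.search(r"\{([^{}]*)\}", pattern)
    if not m:
        return [pattern]  # no braces left: the pattern is a literal path
    head, tail = pattern[:m.start()], pattern[m.end():]
    results = []
    for alt in m.group(1).split(","):
        # Recurse so multiple brace groups in one pattern also expand.
        results.extend(expand_braces(head + alt + tail))
    return results
```

So the pattern in the example above expands to exactly the two OVCAR files, regardless of what else sits in the directory.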


@fnothaft

fnothaft Oct 2, 2015

Member

You can also do this across directories. Let's say I also wanted to load the chr20 VCF from 1000G, which is located here:

[fnothaft@amp-bdg-master data]$ hdfs dfs -ls /data/1kg-genotypes/
Found 24 items
-rw-r--r--   3 fnothaft bdg  34999123163 2015-09-11 20:23 /data/1kg-genotypes/0181860441fce741b1c34c27de0c7d95.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr15.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  85579386265 2015-09-11 21:38 /data/1kg-genotypes/0bdeb99215dfda2a7d2fe3c7fffbb8b1.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr3.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  75066370280 2015-09-11 21:25 /data/1kg-genotypes/2b67e1d968d9d90acfbd07a9a56d8985.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  56592766853 2015-09-11 21:05 /data/1kg-genotypes/386a681a3f413e426fcfd94a8f28d1ff.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr12.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  78364336068 2015-09-11 21:30 /data/1kg-genotypes/3cba98c57413edffa54335382723afbd.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr5.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  67654231596 2015-09-11 21:21 /data/1kg-genotypes/3ce9ec00721599910a8559f7a9b4b6b0.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr8.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  42522617578 2015-09-11 20:49 /data/1kg-genotypes/4332e865880a0499d32378cbdbad291a.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr13.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  33723323691 2015-09-11 20:22 /data/1kg-genotypes/5119cbb69a595dbc5fc33f729c534b8d.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr18.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  16064789455 2015-09-11 21:03 /data/1kg-genotypes/5c7ecbcb8396de30f00c389bfeeb2180.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr21.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  25236539814 2015-09-11 20:16 /data/1kg-genotypes/5d2fbed137fabddb7db1978088f2cb07.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr19.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  37491381512 2015-09-11 20:31 /data/1kg-genotypes/6dcc8433a2e0af7411a88408bd783686.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr16.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  15304146564 2015-09-11 21:12 /data/1kg-genotypes/727230f6a9d4dcd581d00c67dce2ac41.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr22.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  38957455789 2015-09-11 20:33 /data/1kg-genotypes/7cb1c0c0b7f0e744b9515f2793d3473b.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr14.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  93086828627 2015-09-11 21:38 /data/1kg-genotypes/83904d42bf1aeb708ee3dcb92457cd34.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr1.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg 102433657522 2015-09-11 21:41 /data/1kg-genotypes/89a49cd0c6da043ee67783a2e656089e.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr2.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  39585660568 2015-09-11 20:35 /data/1kg-genotypes/8ec5d5d5e2bafbcb34834a7350abaa7c.ftp___ftp-trace.
ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chrX.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58670776704 2015-09-11 21:11 /data/1kg-genotypes/9f4fc764f6252797b0cee1ebc2d2fa0c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr11.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg            0 2015-09-11 21:11 /data/1kg-genotypes/_SUCCESS
-rw-r--r--   3 fnothaft bdg  68602158250 2015-09-11 21:23 /data/1kg-genotypes/c076f1c67ed5105c787b27ca82cacc19.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr7.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  26476036862 2015-09-11 20:01 /data/1kg-genotypes/d47964c28806a97a1f656c7802237e5c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr20.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  51166771761 2015-09-11 20:57 /data/1kg-genotypes/deecfc73a58a1270467835b94c01adaf.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr9.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  58300480951 2015-09-11 21:10 /data/1kg-genotypes/e88431af8b2e8623836d85664fadf97c.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr10.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  84767980059 2015-09-11 21:34 /data/1kg-genotypes/f282e6bd9e5f733f12224cc0a572cb39.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf
-rw-r--r--   3 fnothaft bdg  32386267135 2015-09-11 20:23 /data/1kg-genotypes/f2b4e316e8228a4827442a43857b8dca.ftp___ftp-trace.ncbi.nih.gov_1000genomes_ftp_release_20110521_ALL.

Then I run:

./bin/adam-submit ${ADAM_OPTS} -- vcf2adam "{/data/1kg-genotypes,nci-60}/*{OVCAR-3,OVCAR-8,chr20}*.vcf" nci-60/OVCAR-38_and_1kg_chr20.adam

I am not proud of that last glob, but it gets the job done.

@ryan-williams

ryan-williams Oct 2, 2015

Member

Ah, right, we just pass the <INPUT> arg to Hadoop and let it expand things, so we don't just get * expansion but {} and a few other things, thanks, I knew that at one point but forgot.

In any case, I don't think making that the only way to access merging-rdd functionality is ideal; relying on hadoop's globbing, making sure your shell doesn't expand it for you first, remembering subtle differences between hadoop's glob expansion and most shells', etc. make it a little tricky. Plus, there's the use here where I actually need to distinguish between the unpaired file that I am merging and the file in the <INPUT> argument.

So tl;dr: good that we can do this with "globs" but I still feel like exposing it explicitly makes sense.
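For anyone who hasn't hit the difference before: Hadoop expands `{a,b}` alternation itself (via `FileSystem#globStatus`), which is why the pattern must be quoted so the shell doesn't expand it first. A toy Python sketch of just the alternation step (illustrative only, not Hadoop's actual globbing code):

```python
# Toy brace expansion: "{a,b}c" -> ["ac", "bc"].
# Illustrative only; Hadoop's real globbing lives in
# org.apache.hadoop.fs.GlobPattern / FileSystem#globStatus.
def expand_braces(pattern):
    open_i = pattern.find("{")
    if open_i < 0:
        return [pattern]
    close_i = pattern.find("}", open_i)  # assumes no nested braces
    pre, rest = pattern[:open_i], pattern[close_i + 1:]
    alternatives = pattern[open_i + 1:close_i].split(",")
    return [expanded
            for alt in alternatives
            for expanded in expand_braces(pre + alt + rest)]
```

Applied to the vcf2adam pattern above, the two brace groups multiply into 2 x 3 = 6 concrete patterns; the remaining `*` wildcards are then matched against the filesystem.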

@fnothaft

fnothaft Oct 2, 2015

Member

SGTM.

@@ -121,14 +130,11 @@ class Transform(protected val args: TransformArgs) extends BDGSparkCommand[Trans
if (args.recalibrateBaseQualities) {
  log.info("Recalibrating base qualities")
  val knownSnps: SnpTable = createKnownSnpsTable(sc)
  val stringency = if (args.strictBQSR) {

@fnothaft

fnothaft Oct 1, 2015

Member

I'm not sure how I feel about collapsing import stringency and BQSR stringency down into a single flag. The two aren't logically tied together.

Is your thought here that the two stringencies (aside: what is the plural of stringency? I think that I am doing English wrong here...) are independent? I.e., you will only set the stringency flag on import if you are reading from FASTQ, which is unaligned data. If you're running BQSR, your data must be aligned, therefore the function of the stringency flag is context dependent?

Let me know if what I said wasn't clear...

@ryan-williams

ryan-williams Oct 2, 2015

Member

Your question makes sense. I guess the short answer is that yes, I think the two notions of stringency are "independent" and unlikely to both need to be specified (let alone to different values) in the same transform run.

I'd go a step further though and say that it's worth formalizing the idea that, in the context of any one transform command, there might be a level of strictness of sanity/validation-checking that we should expose to the user via a cmdline flag. On the other hand, though, exposing multiple flags related to this basic idea feels redundant / like over-doing it, to me.

@fnothaft

fnothaft Oct 2, 2015

Member

That sounds reasonable, but I got kinda lost here:

I'd go a step further though and say that it's worth formalizing the idea that, in the context of any one transform command, there might be a level of strictness of sanity/validation-checking that we should expose to the user via a cmdline flag.

Just to check my understanding, are you saying that it'd be reasonable for each subpart of transform (e.g., BQSR, indel realignment, load from FASTQ, load from BAM, mark dups, etc) to have a validation strictness flag, but that we should only expose a single flag? If so, I think that makes sense in this case, and is probably not a bad default policy in general.

@ryan-williams

ryan-williams Oct 2, 2015

Member

Yea, across many things that transform does, it's reasonable to let the user say "be {strict,lenient,silent} with any applicable validation checks", and that should happen via one flag that is not explicitly namespaced to any one task that the user may already separately be telling transform to do (e.g. BQSR).
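As a sketch of that single-flag idea (hypothetical names, in Python for illustration; ADAM's real flag maps onto htsjdk's `ValidationStringency` levels):

```python
# Hypothetical sketch: one validation-stringency flag shared by every
# validating stage of transform (FASTQ import, BQSR, ...), mirroring the
# STRICT / LENIENT / SILENT levels of htsjdk's ValidationStringency.
from enum import Enum

class Stringency(Enum):
    STRICT = "strict"    # fail fast on any malformed record
    LENIENT = "lenient"  # warn and keep going
    SILENT = "silent"    # keep going without warning

def parse_stringency(flag):
    try:
        return Stringency(flag.lower())
    except ValueError:
        raise ValueError("unknown stringency: %r" % flag)
```

Each stage that validates anything would consult the one parsed value rather than its own dedicated flag.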

@fnothaft

fnothaft Oct 2, 2015

Member

Yeah, that sounds reasonable.

@@ -580,7 +632,10 @@ class ADAMContext(val sc: SparkContext) extends Serializable with Logging {
 def loadAlignments(
   filePath: String,
-  projection: Option[Schema] = None): RDD[AlignmentRecord] = LoadAlignmentRecords.time {
+  projection: Option[Schema] = None,
+  filePath2Opt: Option[String] = None,

@fnothaft

fnothaft Oct 1, 2015

Member

As a minor aside, we should throw an IAE if someone uses filePath2Opt and we detect a file that isn't a FASTQ.
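To make "throw an IAE" concrete, such a check might look like the following sketch (Python for illustration; the helper name and suffix list are hypothetical, not ADAM's actual detection code):

```python
# Hypothetical extension-based check; raises an IllegalArgumentException-style
# error when a second path is supplied but the file doesn't look like FASTQ.
FASTQ_SUFFIXES = (".fastq", ".fq", ".fastq.gz", ".fq.gz")

def require_fastq(path):
    if not path.lower().endswith(FASTQ_SUFFIXES):
        raise ValueError("expected a FASTQ file alongside a second path, got: %s" % path)
```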

@ryan-williams

ryan-williams Oct 2, 2015

Member

What kind of detecting do you think we should do? Inference based on filename extension? I'm not sure I like the sound of that but will defer to you

@fnothaft

fnothaft Oct 2, 2015

Member

I'm thinking of a really small change; I'll shoot you a PR, because it feels somewhat cumbersome to type out not in code.

@fnothaft

fnothaft Oct 2, 2015

Member

Just shot ryan-williams#1 your way. LMK your thoughts!

@fnothaft

fnothaft Oct 2, 2015

Member

We will resolve this in #841.

b2i(p.getReadPaired && p.getReadMapped && !p.getMateMapped),
b2i(mateMappedToDiffChromosome),
b2i(mateMappedToDiffChromosome && p.getMapq >= 5),
b2i(b(p.getReadMapped)),

@fnothaft

fnothaft Oct 1, 2015

Member

These changes look reasonable but I think we should pull these changes out to another PR.

@ryan-williams

ryan-williams Oct 2, 2015

Member

Yea, the genesis of this was that I was running flagstat on my paired-fastq-converted .adams and it was NPEing on several of these fields which were left unset or set to null by FastqRecordConverter.convert.

So given that my paired-fastq-converted .adams were unable to pass a basic read-counting verification step post-conversion, I thought this functionality should be included in this PR.

lmk if you still think it should be in a separate PR, I can go either way.
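The null-safety being discussed boils down to tolerating unset boolean fields before counting; a minimal sketch of the `b`/`b2i` helpers from the excerpt above (Python for illustration, with `None` standing in for a null `java.lang.Boolean`; not the PR's exact code):

```python
# b: null-safe unboxing -- Avro boolean fields are java.lang.Boolean and may
# be unset (None here) on unaligned, FASTQ-derived reads; treat unset as False.
def b(x):
    return x is not None and bool(x)

# b2i: boolean -> the 0/1 increments that flagstat accumulates.
def b2i(x):
    return 1 if x else 0
```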

@fnothaft

fnothaft Oct 2, 2015

Member

Ah, I see. Thanks for the additional context. Flagstat is meant to be run on aligned data, not unaligned data, which explains some of the oddness. I don't think it is a bad idea to sanitize the fields though, esp. if the performance impact is small. Any chance you could do a quick head to head on an aligned BAM? Given the reasoning, I'd be OK with including the commit here, but let's not squash it down on merge.

@ryan-williams

ryan-williams Oct 2, 2015

Member

OK, will h2h on a BAM and let you know how it goes

@ryan-williams

ryan-williams Oct 2, 2015

Member

Maybe trivial but I validated it on bqsr1.sam, which has a few types of reads represented:

1000 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 primary duplicates
0 + 0 primary duplicates - both read and mate mapped
0 + 0 primary duplicates - only read mapped
0 + 0 primary duplicates - cross chromosome
0 + 0 secondary duplicates
0 + 0 secondary duplicates - both read and mate mapped
0 + 0 secondary duplicates - only read mapped
0 + 0 secondary duplicates - cross chromosome
992 + 0 mapped (99.20%:0.00%)
1000 + 0 paired in sequencing
498 + 0 read1
502 + 0 read2
980 + 0 properly paired (98.00%:0.00%)
984 + 0 with itself and mate mapped
8 + 0 singletons (0.80%:0.00%)
4 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

lmk if you want me to find a trickier example to test on

@fnothaft

fnothaft Oct 2, 2015

Member

SGTM. I think this was the last issue pending on this PR? Are we good to merge from your end?

@ryan-williams

ryan-williams Oct 2, 2015

Member

yea I think this is good to go from my end, thanks for the reviews!

@fnothaft

fnothaft Oct 2, 2015

Member

np! thanks for the good changes. I will squash and merge.

ryan-williams added some commits Sep 30, 2015

"-concat" flag to merge two AlignmentRecord RDDs
    useful for merging paired-fastq files and a corresponding "unpaired" reads file
make flagstat safe for some null fields
    previously crashed on fastq reads
@AmplabJenkins


AmplabJenkins Oct 2, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/950/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list e95c488f38b45d75bd7ef94a97c1f11fb856bc27 # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft


fnothaft Oct 2, 2015

Member

Jenkins, test this please.

@AmplabJenkins


AmplabJenkins Oct 2, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/951/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision bde7cee058a6d19eaeea23d399733bec10ac6b1f (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f bde7cee058a6d19eaeea23d399733bec10ac6b1f
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.


@fnothaft


fnothaft Oct 2, 2015

Member

Jenkins, test this please.

@AmplabJenkins


AmplabJenkins Oct 2, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/952/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains 06b4230 # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision 06b4230 (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 06b4230
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

AmplabJenkins commented Oct 2, 2015

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/952/

Build result: FAILURE

GitHub pull request #831 of commit 47ebc12 automatically merged.
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
[EnvInject] - Loading node environment variables.
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/
 > git rev-parse origin/pr/831/merge^{commit} # timeout=10
 > git branch -a --contains 06b4230 # timeout=10
 > git rev-parse remotes/origin/pr/831/merge^{commit} # timeout=10
Checking out Revision 06b4230 (origin/pr/831/merge)
 > git config core.sparsecheckout # timeout=10
 > git checkout -f 06b4230
 > git rev-list bde7cee058a6d19eaeea23d399733bec10ac6b1f # timeout=10
Triggering ADAM-prb ? 2.6.0,2.11,1.4.1,centos
Triggering ADAM-prb ? 2.6.0,2.10,1.4.1,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft


fnothaft commented Oct 2, 2015

Jenkins, test this please.

@fnothaft fnothaft added this to the 0.18.0 milestone Oct 2, 2015

@AmplabJenkins


AmplabJenkins commented Oct 2, 2015

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/953/
Test PASSed.

@fnothaft


fnothaft commented Oct 2, 2015

Thanks @ryan-williams ! I've merged this as 4c9654e and 9db4108.

@fnothaft fnothaft closed this Oct 2, 2015

case ValidationStringency.SILENT =>
}
reads1 ++ reads2


@allenday

allenday Jan 4, 2016

Contributor

If I understand correctly, this puts the _1 reads first, followed by the _2 reads.

If so, is this really what's wanted? Doesn't it make more sense to interleave?

Ideally they would be preserved as pairs in a single record, so that if we need to process the pair as a pair, or round-trip the .adam back out to .fastq (what I'm doing now), it's trivial to do. As it is now (again, if I understand the ++ operator correctly), read pairs are broken up.

How about reads1.zip(reads2)?
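The difference between the two operators can be seen with plain Scala collections (a toy sketch, not ADAM's actual RDD code; the strings standing in for reads are made up, and on RDDs `++` behaves like `union`):

```scala
// Toy model of the two approaches, with strings standing in for reads.
object PairingDemo {
  def main(args: Array[String]): Unit = {
    val reads1 = Seq("readA/1", "readB/1", "readC/1")
    val reads2 = Seq("readA/2", "readB/2", "readC/2")

    // ++ (RDD union): all the _1 reads, then all the _2 reads --
    // mates end up far apart in the combined collection.
    val unioned = reads1 ++ reads2
    println(unioned.mkString(", "))

    // zip: element-wise pairing -- keeps mates together, but on RDDs
    // it requires both sides to have identical partitioning and
    // per-partition element counts.
    val zipped = reads1.zip(reads2)
    println(zipped.mkString(", "))
  }
}
```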


@fnothaft

fnothaft Jan 5, 2016

Member

If so, is this really what's wanted? Doesn't it make more sense to interleave?

How about reads1.zip(reads2)

That's a fair point, but I generally try to write any Spark transformations without assuming the relative order of records (or more precisely, I only make assumptions about the relative order of records after sorting).

Ideally they will be preserved as pairs in a single record so that if we need to process the pair as a pair or round trip the .adam back out to .fastq (what I'm doing now) it's trivial to do. As is now, again if I understand correctly ++ operator, read pairs are broken up.

We've added the Fragment schema recently to cover this use case. However, I don't think we've connected the FASTQ loading code to load FASTQ data as Fragments.
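An order-independent way to recover pairs, in the spirit of grouping reads into fragments, is to key each read by its name and group, rather than relying on relative record order. A minimal sketch with plain collections (the `fragmentName` helper and the `/1`-`/2` naming convention are assumptions for illustration, not ADAM's actual API):

```scala
// Sketch: group mates by fragment name instead of relying on order.
object FragmentDemo {
  // Hypothetical helper: "readA/1" -> "readA". ADAM's real code
  // handles read-name suffixes more carefully than this.
  def fragmentName(readName: String): String =
    readName.stripSuffix("/1").stripSuffix("/2")

  def main(args: Array[String]): Unit = {
    // Deliberately out of order: grouping still pairs the mates,
    // which is the property an RDD groupBy would give you.
    val reads = Seq("readB/1", "readA/1", "readA/2", "readB/2")
    val fragments = reads.groupBy(fragmentName)
    fragments.foreach { case (name, mates) =>
      println(s"$name: ${mates.sorted.mkString(", ")}")
    }
  }
}
```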

@shibuvp


shibuvp commented May 29, 2017

I'm trying to convert FASTQ files into ADAM, but it is not working.
My command:
./adam-submit -- transform hdfs://master.hdp:8020/Sample/NewFile1.fastq -paired_fastq hdfs://master.hdp:8020/Sample/NewFile2.fastq -record_group hdfs://master.hdp:8020/Sample/adamresult001.adam
