
Fix read groups mapping and add Y as base type #48

Merged
merged 1 commit into bigdatagenomics:master on Jan 22, 2014

Conversation

arahuja
Contributor

@arahuja arahuja commented Jan 15, 2014

The current readGroups map is broken: individual workers each develop their own recordGroupId -> ID mapping, which leads to incorrect aggregation.

Example:

Worker 1:
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 1
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 2

Worker 2:

2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062635 --> 0
2014-01-14 17:29:13 INFO QualByRG:33 - QUALBRG : RG : SRR062641 --> 1
2014-01-14 17:29:14 INFO QualByRG:33 - QUALBRG : RG : SRR062634 --> 2

There was a note that the .distinct().collect() operation was slow, but we saw that it finished in ~30 seconds (on HG00096, a 16 GB file, with 96 worker tasks) and generates a correct readGroup map.

Also added Y as a base that can appear in MD tags, for example: MD:Z:96Y3, MD:Z:27Y72, MD:Z:25Y74.
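
For illustration, a minimal sketch (Scala/Spark) of the difference, assuming the rdd: RDD[ADAMRecord] from QualByRG. The lazy per-worker population shown for the broken variant is an assumption about how the old map ended up diverging; the fixed variant matches the diff below.

```scala
// Broken (assumed): the map starts empty on every worker and indices are
// assigned in arrival order, so the String -> Int mapping differs per worker.
var readGroups = Map[String, Int]()
def readGroupIndex(rg: String): Int = {
  if (!readGroups.contains(rg)) {
    readGroups += (rg -> readGroups.size)
  }
  readGroups(rg)
}

// Fixed: build one deterministic mapping on the driver before recalibration.
val globalReadGroups: Map[String, Int] =
  rdd.map(_.getRecordGroupId.toString)
    .distinct()
    .collect()
    .sorted
    .zipWithIndex
    .toMap
```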

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/31/

@@ -22,13 +22,9 @@ import org.apache.spark.rdd.RDD
// this class is required, not just standard. Baked in to recalibration.
class QualByRG(rdd: RDD[ADAMRecord]) extends Serializable {
// need to get the unique read groups todo --- this is surprisingly slow
//val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap
var readGroups = Map[String, Int]()
val readGroups = rdd.map(_.getRecordGroupId.toString).distinct().collect().sorted.zipWithIndex.toMap

I was curious if you need to rerun the sort. Doesn't distinct already sort the collection?

Member

For readability, can we split this into several lines? This line is a bit long.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/32/

val readGroups = rdd.map(_.getRecordGroupId.toString)
.distinct()
.collect()
.sorted.zipWithIndex.toMap
Member

Is the sort necessary?

Contributor Author

No not particularly, at least not here. It was there before so I left it.

It would allow us to build the same map every time, so if the RecalTable was ever saved and we wanted to recreate the readgroups from it, having them in sorted order would allow us to do that.

Member

Taking ~30sec on a small BAM file is a bit worrisome since that implies it will take >10 minutes on a whole genome.

Maybe a better solution here is to:

  1. Rename RecordGroupId to RecordGroupName
  2. Have the RecordGroupId be an Int instead of the String type
  3. Use the BAM header to map the RecordGroupName:String -> RecordGroupId:Int
  4. Write the new RecordGroupId:Int for each ADAMRecord (RLE will make this very inexpensive in terms of storage)

With these changes, there will be no need to generate the intermediate map because the RecordGroupId:Int would be available for all records.
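
A rough sketch of step 3, assuming the Picard API (net.sf.samtools) that SAMRecordConverter already uses; the helper name is hypothetical:

```scala
import net.sf.samtools.SAMFileHeader
import scala.collection.JavaConversions._

// Hypothetical helper: build a deterministic name -> id mapping once from the
// BAM header's @RG lines, instead of scanning every record in the RDD.
def readGroupIdsFromHeader(header: SAMFileHeader): Map[String, Int] =
  header.getReadGroups
    .map(_.getReadGroupId)   // read group name from the header, e.g. "SRR062634"
    .sorted                  // sort so every conversion assigns the same ids
    .zipWithIndex
    .toMap
```

The converter could then write the Int id onto each ADAMRecord as it converts the corresponding reads.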

Contributor

Matt, this is starting to feel (to me, and to Carl) a little like it might be isomorphic to the problem we think we've solved in the SequenceDictionary work.

Is there any chance that we should be looking at sharing some code, or at least a general approach, to that situation? Just wondering out loud here, that's a real (not rhetorical) question.

Contributor Author

@tdanford Yes, I think it is, but much more lightweight. I put together something that mirrors a piece of what you did there, though not as expansive, since we are already pushing all the record group information separately into the ADAM file. I'll commit it shortly; my question (and what I'm testing out) is whether changing the type of recordGroupId breaks already-created ADAM files.

Contributor

Another case I'd like us to keep in mind is the "fragmented BAM" case -- I'm not sure if your changes here, or the ones you're about to commit, are able to handle it or not but I thought I'd spell it out just in case.

I suspect most of us are used to dealing with "unitary" BAMs, where all the reads (and read groups) for a single sample are in a single BAM. Maybe I'm wrong on that assumption?

But one of the things Carl and I are starting to understand is that a lot of "custom" processing pipelines, which we (at least, I) would like to interact with using ADAM, are producing fragmented BAMs where different BAMs hold different read group data (and even cases where the BAMs are sharded from completely shuffled data, and there may be read groups that are scattered across multiple BAMs). For the pipeline comparison and validation work we're doing (again, work that still needs to show up here soon as a pull request), we pay attention to whether (e.g.) the Sequence Dictionaries are the same across the fragments and we use the SequenceDictionary code to unify them when they're not, as we build RDD[ADAMRecords] that are really unions of multiple underlying ADAM directories.

Just wanted to put that out here, because if we're looking at building a similar structure for read groups, we might need to think about this situation too. Just want to be sure that we're not implicitly assuming that (say) read groups with the same name will have the same ID across different ADAM files, etc.

Contributor Author

Would have to think about it more, but I believe the current setup would work (as it creates the readGroup Id mapping on the fly from the given ADAM records), while the new change would not, since it fixes a mapping based on what's in the header, and different ADAM records could have overlapping ids from different read groups.

But my question is: right now this is being used in BQSR to optimize the way we store counts per read group. Is BQSR a well-defined operation across multiple BAMs/ADAM directories?

Contributor

Yeah, I don't see why it wouldn't be. You run BQSR over a single ADAM directory (we need a different name for this, maybe "ADAM RDD" or "ADAM set"?) and you get a particular model and a particular recalibration of the QSes.

Now, if you arbitrarily split the BAM N ways and then recombine the parts before running BQSR, I don't see why that should affect the recalibration.

I don't have any reason to think that your solution here doesn't work; I was just wondering out loud. It seems like one of those requirements that Carl and I have been coming to grips with, and I wasn't sure whether it had already been part of everyone else's thinking or whether it deserved a space to be made more explicit.

Member

Changing the type of a schema field will definitely break backward compatibility.

I gave @tdanford misinformation about schema compatibility before. To keep the schema compatible, we need to make sure the same field is not defined with different types or with incompatible repetition (meaning the field needs to stay at the same nesting level).

Since ADAM is still such a new project, we should feel free to break things to make them better for the long term. Once we mature and are used more broadly in production, we won't have that luxury.

I agree that we're starting to see a theme emerging here about how to best (de-)normalize the data. On the ADAM mailing list, I sent the following:

We could address this limitation by creating a separate sequence dictionary file during the conversion from BAM to ADAM. I think that Timothy and Carl at GenomeBridge would have a more informed idea of how to manage sequence dictionaries than I do. As far as I see it, we can: (1) keep records fully de-normalized, (2) add a sequence dictionary file separate from the de-normalized records with its own schema, (3) switch to normalized records with a separate sequence dictionary, or (4) have a separate dictionary and allow users to decide whether to normalize records. Which approach do you guys like best?

But it was buried deep inside a thread.

My preference is for (1) right now but I'm always happy to be told I'm wrong. Even if we keep everything de-normalized, we'll still be the same size as BAM. I don't think that compression is as important as correctness, performance and ease-of-use. If compression becomes an issue, it would be easy to add an ADAM "archive" format later that normalizes all data and uses maximum compression for the most compact size.

Re: performance. Having a unique "RecordGroupId" will make the Mark Duplicates code much faster (not that it's very slow now). Instead of doing a groupBy on Strings, we can do a groupBy on Ints. Any other transforms that group by record group will also be much faster (e.g. BQSR).
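
For example (illustrative only; the field names assume the proposed schema where records carry both a String recordGroupName and an Int recordGroupId):

```scala
// Grouping by the numeric id avoids hashing and comparing full strings per record.
val byName = rdd.groupBy(_.getRecordGroupName.toString)  // String keys
val byId   = rdd.groupBy(_.getRecordGroupId.toInt)       // Int keys hash and compare faster
```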

Member

In other words, the order of fields in the schema isn't important. The name, type, and nesting of each field is.

@arahuja
Contributor Author

arahuja commented Jan 16, 2014

I checked in the commit just to show what I was planning, but this does not address Timothy's point about multiple fragmented ADAM sets, since each would have its own read group mapping and those would be incompatible. Right now BQSR depends on sequential ids from 0 to the number of read groups, but we can change that as well.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/39/

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/40/

@massie
Member

massie commented Jan 16, 2014

Assuming that @tdanford agrees with this approach, I think this is ready to merge. Really clean commit @arahuja, I love it.

One small thing: it appears that this will not merge cleanly. Is it possible for you to rebase this commit onto the latest master? If you don't feel like doing it, I'm happy to do it for you.

@@ -126,7 +126,8 @@ class SAMRecordConverter extends Serializable {
case None =>
}
recordGroup.getId
builder.setRecordGroupId(recordGroup.getReadGroupId)
builder.setRecordGroupId(readGroups(recordGroup.getReadGroupId))
.setRecordGroupName(recordGroup.getReadGroupId)
Member

I think this isn't correct. Shouldn't this be

.setRecordGroupName(recordGroup.getReadGroupName)

Member

You might want to do a "git grep setRecordGroupName" on the ADAM repo to make sure that this is correct elsewhere too.

Contributor Author

There is no .getRecordGroupName; maybe I'm not understanding the available fields, but recordGroup.getRecordGroupId currently returns things of the form SRR062634. Is there something that should be referred to as the name? Or should we leave Id as is and have a new field that is ... something named better?

Member

My mistake. I thought that was being pulled from an ADAMRecord (which has a recordGroupName).

This does raise a question about naming. In general, an ID tends to imply an Integer type whereas a NAME tends to imply a string. I think it's better to be consistent with the other fields in the ADAM schema (e.g. referenceName/referenceId, mateReferenceName/mateReferenceId, etc). It's just something we'll have to remember moving forward.

Sorry for the comment noise.

Member

Meaning: let's keep things as they are. *Id is the Int, *Name is the String, and we understand that the BAM specification isn't as consistent on this.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/42/

@massie
Member

massie commented Jan 21, 2014

This pull request looks good to me, but I want to have a second pair of eyes on it. @tdanford? @carlyeks? This is somewhat related to your sequence dictionary work, so I want to make sure it looks OK to you.

@arahuja
Contributor Author

arahuja commented Jan 21, 2014

@tdanford had some comments earlier which I can't seem to find now after the rebase, but I moved around the constructor with the Picard dependency and pulled the mdtag change out into another pull request.

On the issue of fragmented ADAM processing: this doesn't address that. Right now BQSR is the only use case for readGroupId (the int version), but, for example, BQSR over multiple ADAM directories would not give the correct result. This needs to be addressed by either (1) saving more information about the mapping up front (not sure exactly what this would look like) or (2) probably reworking BQSR a bit.

@tdanford
Contributor

Arun, I think they were over here: #48 (comment)

Matt, Carl and I are reviewing now...

@massie
Member

massie commented Jan 21, 2014

Having the readGroupId (the int version) is also very helpful for the Mark Duplicates code. Any code that needs to group by record group will benefit from your work, Arun.

@tdanford
Contributor

Carl and I are cool with this request and think (as long as it's rebased off of master and the tests are passing) it's ready to merge.

I additionally have two requests --

  1. I think that we should focus on supporting fragmented BAMs for this kind of code sooner rather than later, and
  2. Carl and I are still working on a pull request which supports merging multiple ADAM files into a single RDD[ADAMRecord] with one, coherent SequenceDictionary. I suspect that we'll have to write similar code for RecordGroupDictionary and, when we do, I think it'll become apparent that there's a lot of logic shared between the two. We'll want to compare and probably pull out common code as a single parent class that both can inherit.

Do those points make sense? If they do, someone should make them into issues so that we can keep track of them in the future...

@arahuja
Contributor Author

arahuja commented Jan 21, 2014

Makes sense, though (1) might be many issues. Curious to know whether there are other areas you have seen lacking support for fragmented BAMs?

Agree on (2) as well, happy to refactor around that once you have the changes in.

Looking at the diff I seem to have inadvertently botched MdTag though ... let me fix that.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/45/

@tdanford
Contributor

Can you rebase off the latest master, Arun?

@arahuja
Contributor Author

arahuja commented Jan 21, 2014

Sure, just did, unless you are seeing issues?

@massie
Member

massie commented Jan 22, 2014

Jenkins, test this please.

@AmplabJenkins

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/46/

massie added a commit that referenced this pull request Jan 22, 2014
Parse read groups and create mapping from bam/sam header
@massie massie merged commit 23446dd into bigdatagenomics:master Jan 22, 2014
@massie
Member

massie commented Jan 22, 2014

Thanks, Arun!

Completely agree with you @tdanford that we'll likely have code overlap between the SequenceDictionary and RecordGroup work moving forward. Not sure, though, what exact issues we should create to track it. I (and I'm sure others) will definitely be keeping our eyes out for it.

@tdanford
Contributor

@massie I guess I don't know how to turn my (1) into an issue -- you're right, we'll just have to keep an eye on it -- but I took a shot at (2). Let me know what you think.

@arahuja arahuja deleted the readgroupsmapping branch February 23, 2014 03:04