Don't flatten optional SAM tags into a string #240

Closed
iskandr opened this Issue May 9, 2014 · 22 comments

Comments

Projects
None yet
6 participants
@iskandr

iskandr commented May 9, 2014

When converting from a SAM record to an ADAM record, we flatten the already parsed collection of tags into a string. This incurs a (seemingly significant) memory/performance overhead and leaves the tag info in a less accessible form. It would be preferably to simply store the tags in a Map[String,String]. I think this is also a simpler solution to #37, since we don't need to figure out which tags deserve "top-level" inclusion ahead of time, just present them in a uniform manner.

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford May 29, 2014

Contributor

Well, we've already moved a number of the technically "optional" fields in the SAM/BAM format into the ADAMRecord itself -- e.g. PU, LB, RG, MD -- for reasons probably having mostly to do with performance. So I think there's still a discussion to be had here.

Personally, I think the fields that should most obviously be made "top-level" are those which will be repeated, identically, many times across the ADAMRecords, such as RG, PU, and LB. These'll be the values which Parquet will do the best with, compression-wise...

Contributor

tdanford commented May 29, 2014

Well, we've already moved a number of the technically "optional" fields in the SAM/BAM format into the ADAMRecord itself -- e.g. PU, LB, RG, MD -- for reasons probably having mostly to do with performance. So I think there's still a discussion to be had here.

Personally, I think the fields that should most obviously be made "top-level" are those which will be repeated, identically, many times across the ADAMRecords, such as RG, PU, and LB. These'll be the values which Parquet will do the best with, compression-wise...

@karenfeng

This comment has been minimized.

Show comment
Hide comment
@karenfeng

karenfeng Jun 18, 2014

Contributor

Do people still think this is important? Is the Map[String,String] approach something that we want, due to incompatibility with Shark?

Contributor

karenfeng commented Jun 18, 2014

Do people still think this is important? Is the Map[String,String] approach something that we want, due to incompatibility with Shark?

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 18, 2014

Member

I would prefer the Map[String, String] approach @iskandr suggested over the current approach.

Member

fnothaft commented Jun 18, 2014

I would prefer the Map[String, String] approach @iskandr suggested over the current approach.

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Jun 18, 2014

Member

So we're ok moving from a flat to nested schema for the ADAMRecord? I'm ok with that if everyone else is.

Member

massie commented Jun 18, 2014

So we're ok moving from a flat to nested schema for the ADAMRecord? I'm ok with that if everyone else is.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jun 18, 2014

Member

@massie The ADAMRecord schema is already nested via the inclusion of ADAMContig.

Member

fnothaft commented Jun 18, 2014

@massie The ADAMRecord schema is already nested via the inclusion of ADAMContig.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Jan 11, 2015

Contributor

Want me to take this one?

Contributor

calvertj commented Jan 11, 2015

Want me to take this one?

@karenfeng

This comment has been minimized.

Show comment
Hide comment
@karenfeng

karenfeng Jan 11, 2015

Contributor

Yes, thanks.

On Saturday, January 10, 2015, calvertj notifications@github.com wrote:

Want me to take this one?


Reply to this email directly or view it on GitHub
#240 (comment)
.

Contributor

karenfeng commented Jan 11, 2015

Yes, thanks.

On Saturday, January 10, 2015, calvertj notifications@github.com wrote:

Want me to take this one?


Reply to this email directly or view it on GitHub
#240 (comment)
.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jan 11, 2015

Member

+1, that'd be wonderful.

Member

fnothaft commented Jan 11, 2015

+1, that'd be wonderful.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Jan 11, 2015

Contributor

Cool, working on it.

Contributor

calvertj commented Jan 11, 2015

Cool, working on it.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Jan 23, 2015

Contributor

Sorry for the delay, will work on it this weekend.

Contributor

calvertj commented Jan 23, 2015

Sorry for the delay, will work on it this weekend.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Feb 28, 2015

Contributor

I'm done shoveling, back in progress.

Contributor

calvertj commented Feb 28, 2015

I'm done shoveling, back in progress.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Mar 1, 2015

Contributor

Hey guys, (@tdanford, @fnothaft)

From my (limited) understanding of the SAM Format specification ( http://samtools.github.io/hts-specs/SAMv1.pdf ) there can be multiple lines with same tag, does that seem correct to you? A Map[String,String] could have collisions, I am guessing you don't want to throw an exception if there are collisions?

There are plenty of options:
Map[String,String] - after grouping and concatenating like tag values
Map[String,List[String]]
List[(String, String)]
...

Also are you sure you don't want to keep the type information that was so nicely parse out?

I am looking at this bit of code in org.bdgenomics.adam.converters.SAMRecordConverter:

I assume these are the tags I'm looking for.

Thanks and sorry for the delay,

J

Contributor

calvertj commented Mar 1, 2015

Hey guys, (@tdanford, @fnothaft)

From my (limited) understanding of the SAM Format specification ( http://samtools.github.io/hts-specs/SAMv1.pdf ) there can be multiple lines with same tag, does that seem correct to you? A Map[String,String] could have collisions, I am guessing you don't want to throw an exception if there are collisions?

There are plenty of options:
Map[String,String] - after grouping and concatenating like tag values
Map[String,List[String]]
List[(String, String)]
...

Also are you sure you don't want to keep the type information that was so nicely parse out?

I am looking at this bit of code in org.bdgenomics.adam.converters.SAMRecordConverter:

I assume these are the tags I'm looking for.

Thanks and sorry for the delay,

J

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Mar 1, 2015

Contributor

I wrote some version of this code back in the day, I remember thinking, "we should probably be keeping the type information around." It's required at least insofar as we will ever need to go back to BAM/SAM format.

Also, J, I'm confused by what your question is about tag-uniqueness. A TAG (e.g. "RG" or "MD" etc) can definitely be repeated across lines -- for example, every record for every alignment of a read from a single lane of an Illumina sequencer will typically have an RG (= "Read Group") tag and the same value for that tag.

On the other hand, the spec says: "Each TAG can only appear once in one alignment line." -- so I interpret that as saying that we won't ever find two "RG" tags (for instance) on the same line.

Contributor

tdanford commented Mar 1, 2015

I wrote some version of this code back in the day, I remember thinking, "we should probably be keeping the type information around." It's required at least insofar as we will ever need to go back to BAM/SAM format.

Also, J, I'm confused by what your question is about tag-uniqueness. A TAG (e.g. "RG" or "MD" etc) can definitely be repeated across lines -- for example, every record for every alignment of a read from a single lane of an Illumina sequencer will typically have an RG (= "Read Group") tag and the same value for that tag.

On the other hand, the spec says: "Each TAG can only appear once in one alignment line." -- so I interpret that as saying that we won't ever find two "RG" tags (for instance) on the same line.

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Mar 1, 2015

Contributor

@tdanford

oops, I was thinking the SAMRecord was an iterator over the file, not a line. I guess I should not watch house of cards while looking at code.

As for what to store it as, are you sure you don't just want to keep the List[Attribute]?

case class Attribute(tag: String, tagType: TagType.Value, value: Any) {

Finally should I change the attributes in the format or create a samTags or something?
https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L218

Thanks,

J

Contributor

calvertj commented Mar 1, 2015

@tdanford

oops, I was thinking the SAMRecord was an iterator over the file, not a line. I guess I should not watch house of cards while looking at code.

As for what to store it as, are you sure you don't just want to keep the List[Attribute]?

case class Attribute(tag: String, tagType: TagType.Value, value: Any) {

Finally should I change the attributes in the format or create a samTags or something?
https://github.com/bigdatagenomics/bdg-formats/blob/master/src/main/resources/avro/bdg.avdl#L218

Thanks,

J

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Mar 1, 2015

Contributor

Jason, I'm confused by what you mean, when you write

As for what to store it as, are you sure you don't just want to keep the List[Attribute]?
Contributor

tdanford commented Mar 1, 2015

Jason, I'm confused by what you mean, when you write

As for what to store it as, are you sure you don't just want to keep the List[Attribute]?
@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Mar 2, 2015

Contributor

Currently the List[Attribute] is turned into a tab delimited string:

At least I think I am looking at the right bit of code. Is this the "tags" I am looking to update?
I was wondering why not just keep the List[Attribute] instead of a Map[String,String]? Or maybe a Map[String,Attribute], that is all.

[Edit], you don't have to give me an answer to the "why not", just let me know what you wish to do.

J

Contributor

calvertj commented Mar 2, 2015

Currently the List[Attribute] is turned into a tab delimited string:

At least I think I am looking at the right bit of code. Is this the "tags" I am looking to update?
I was wondering why not just keep the List[Attribute] instead of a Map[String,String]? Or maybe a Map[String,Attribute], that is all.

[Edit], you don't have to give me an answer to the "why not", just let me know what you wish to do.

J

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Mar 4, 2015

Contributor

Well, I am just going to use a Map[String,String] as we won't need a custom record in Avro. Path of least resistance. So is something like this ok?
("tagName -> "tagType:TagValue") ?

Contributor

calvertj commented Mar 4, 2015

Well, I am just going to use a Map[String,String] as we won't need a custom record in Avro. Path of least resistance. So is something like this ok?
("tagName -> "tagType:TagValue") ?

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Mar 4, 2015

Contributor

Jason, I think it sounds fine -- but we'll probably need to see the code :-)

Why don't you whip something up, and we'll review it in detail there?

Contributor

tdanford commented Mar 4, 2015

Jason, I think it sounds fine -- but we'll probably need to see the code :-)

Why don't you whip something up, and we'll review it in detail there?

@calvertj

This comment has been minimized.

Show comment
Hide comment
@calvertj

calvertj Mar 4, 2015

Contributor

Yea, less talk, more action.

Contributor

calvertj commented Mar 4, 2015

Yea, less talk, more action.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 4, 2015

Member

("tagName -> "tagType:TagValue") ?

SGTM!

Member

fnothaft commented Mar 4, 2015

("tagName -> "tagType:TagValue") ?

SGTM!

@tdanford

This comment has been minimized.

Show comment
Hide comment
@tdanford

tdanford Mar 4, 2015

Contributor

Yeah, that sounds like the right direction -- but again, eager to see code :-)

Contributor

tdanford commented Mar 4, 2015

Yeah, that sounds like the right direction -- but again, eager to see code :-)

@fnothaft fnothaft added the wontfix label Jul 20, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 20, 2016

Member

Closing as won't fix. See #1080 for most recent discussion.

Member

fnothaft commented Jul 20, 2016

Closing as won't fix. See #1080 for most recent discussion.

@fnothaft fnothaft closed this Jul 20, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment