New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore if SeqDict data can be factored out more aggressively #983

Closed
jpdna opened this Issue Mar 27, 2016 · 7 comments

Comments

Projects
None yet
3 participants
@jpdna
Member

jpdna commented Mar 27, 2016

Hoping to get feedback/discussion on below:

There remains a substantial chunk of repeating Contig/SequenceRecord metadata that we are carrying around within AlignmentRecord RDDs that I am curious if we can factor out more aggressively in the same way that we did for the RecordGroupMetadata.

Both RecordGroup and SequneceDictionary data have been factored out into separate dictionary files that we write alongside the main ADAM AlignmentRecord parquet files. For the RecordGroup only the recordGroupName remains in the AligimentRecord. I understand this to have been done to reduce the size of the AlignmentRecord which gets created in memory and goes through shuffle - as we lose the Parquet dictionary encoding once we read it into an RDD.

The situation looks different though for SequenceDictionary and how Contig and mateContig continue to exist in their entirety in AlignmentRecord.
Both contig and mateContig continue to exist in every AlignmentRecord each with their redundant fields that can be pretty large - especially the url.

  name: String,
    length: Long,
    url: Option[String],
    md5: Option[String],
    refseq: Option[String],
    genbank: Option[String],
    assembly: Option[String],
    species: Option[String]

, when these instead could be represented just as well in AlignmentRecord with only by a sequenceRecordName which links out to the SequenceDictionary. Or do I misunderstand?

I'm curious to know if there is a reason why we are not normalizing this Contig and mateContig data in the same way was we do for the RecordGroup.

Also curious why the RecordGroupMetadata gets an entry in bdg.avdl
while SequenceRecord for the SequeneDictionary does not.

@jpdna jpdna self-assigned this Mar 27, 2016

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Mar 29, 2016

Member

when these instead could be represented just as well in AlignmentRecord with only by a sequenceRecordName which links out to the SequenceDictionary.

I would +1 that change. (And, that same change for all other records, e.g., Genotype, Variant, etc).

I'm curious to know if there is a reason why we are not normalizing this Contig and mateContig data in the same way was we do for the RecordGroup.

I don't remember entirely why I prioritized the RecordGroup data over the Contig data, but I think it was something along the lines of only having the time to change one, and the RecordGroup data being "heavier".

Also curious why the RecordGroupMetadata gets an entry in bdg.avdl
while SequenceRecord for the SequeneDictionary does not.

I think Contig and SequenceRecord are pretty similar.

Member

fnothaft commented Mar 29, 2016

when these instead could be represented just as well in AlignmentRecord with only by a sequenceRecordName which links out to the SequenceDictionary.

I would +1 that change. (And, that same change for all other records, e.g., Genotype, Variant, etc).

I'm curious to know if there is a reason why we are not normalizing this Contig and mateContig data in the same way was we do for the RecordGroup.

I don't remember entirely why I prioritized the RecordGroup data over the Contig data, but I think it was something along the lines of only having the time to change one, and the RecordGroup data being "heavier".

Also curious why the RecordGroupMetadata gets an entry in bdg.avdl
while SequenceRecord for the SequeneDictionary does not.

I think Contig and SequenceRecord are pretty similar.

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Mar 29, 2016

Member

Super - I will work on a PR for doing that.

Member

jpdna commented Mar 29, 2016

Super - I will work on a PR for doing that.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Mar 31, 2016

Member

I'm still not entirely happy with the separate SequenceDictionary/RecordGroupDictionary stuff in our API; for some context expand the comments in closed PR #906.

A lot of our API is based on RDD[SomeADAMType] + implict methods whereas the methods that use SequenceDictionary/RecordGroupDictionary return AlignmentRDD which is close to but not exactly the same thing. Or more precisely, it is the same thing but sometimes the Scala compiler can't figure it out.

Member

heuermh commented Mar 31, 2016

I'm still not entirely happy with the separate SequenceDictionary/RecordGroupDictionary stuff in our API; for some context expand the comments in closed PR #906.

A lot of our API is based on RDD[SomeADAMType] + implict methods whereas the methods that use SequenceDictionary/RecordGroupDictionary return AlignmentRDD which is close to but not exactly the same thing. Or more precisely, it is the same thing but sometimes the Scala compiler can't figure it out.

@heuermh

This comment has been minimized.

Show comment
Hide comment
@heuermh

heuermh Apr 7, 2016

Member

@jpdna close this via #988, or do you think there are remaining issues?

Member

heuermh commented Apr 7, 2016

@jpdna close this via #988, or do you think there are remaining issues?

@jpdna

This comment has been minimized.

Show comment
Hide comment
@jpdna

jpdna Apr 7, 2016

Member

There are potential remaining issues that contig could also be potentially factored out in same way from Variant and Feature

Member

jpdna commented Apr 7, 2016

There are potential remaining issues that contig could also be potentially factored out in same way from Variant and Feature

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Apr 7, 2016

Member

+1 to keep this open, for the same reason that @jpdna suggested.

Member

fnothaft commented Apr 7, 2016

+1 to keep this open, for the same reason that @jpdna suggested.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Jul 6, 2016

Member

Closed by 8abb47a and #1015.

Member

fnothaft commented Jul 6, 2016

Closed by 8abb47a and #1015.

@fnothaft fnothaft closed this Jul 6, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment