added predicate option to loadCoverage #1156

Closed
wants to merge 1 commit into
from

Conversation

Projects
None yet
4 participants
@akmorrow13
Contributor

akmorrow13 commented Sep 8, 2016

No description provided.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 8, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1474/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1474/
Test PASSed.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 9, 2016

Member

Ah, so this is a philosophical decision articulated many moons in the past by @laserson, but we don't have the predicates on the "automagical" functions (i.e., loadX, where X is not a specific file format). IIRC, the reasoning here is that the way the predicate is executed is hard to reason about without knowing the underlying file format, so thus it isn't good practice to "move" this underneath the user. We used to have another function that would apply a predicate to raw Avro, but we don't have this now, which also means that the records loaded from this function would be different, since the predicate would be applied to a Parquet file, but not to coverage loaded from (e.g.) BED.

What's the use case here? I know you had one in mind, so I'm wondering what's the best way to move this forward.

Member

fnothaft commented Sep 9, 2016

Ah, so this is a philosophical decision articulated many moons in the past by @laserson, but we don't have the predicates on the "automagical" functions (i.e., loadX, where X is not a specific file format). IIRC, the reasoning here is that the way the predicate is executed is hard to reason about without knowing the underlying file format, so thus it isn't good practice to "move" this underneath the user. We used to have another function that would apply a predicate to raw Avro, but we don't have this now, which also means that the records loaded from this function would be different, since the predicate would be applied to a Parquet file, but not to coverage loaded from (e.g.) BED.

What's the use case here? I know you had one in mind, so I'm wondering what's the best way to move this forward.

@Georgehe4

This comment has been minimized.

Show comment
Hide comment
@Georgehe4

Georgehe4 Sep 10, 2016

Contributor

The use case here is to enable loading in custom coverage files in Mango, since users might only be interested in viewing coverage information.

Contributor

Georgehe4 commented Sep 10, 2016

The use case here is to enable loading in custom coverage files in Mango, since users might only be interested in viewing coverage information.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 11, 2016

Member

The use case here is to enable loading in custom coverage files in Mango, since users might only be interested in viewing coverage information.

Those are just stored as Parquet Feature files, right? If so, then I'd do loadParquetFeatures with the predicate. I'm guessing that the goal is to have a region based predicate? If so, perhaps we should refactor the signature of the changed method (to take a region to filter on), and then add a filterByOverlappingReferenceRegion in the non-Parquet fork of the code (https://github.com/bigdatagenomics/adam/pull/1156/files#diff-d36ea7d0742decd0b040a73a96af06e9R972). How does that sound?

Member

fnothaft commented Sep 11, 2016

The use case here is to enable loading in custom coverage files in Mango, since users might only be interested in viewing coverage information.

Those are just stored as Parquet Feature files, right? If so, then I'd do loadParquetFeatures with the predicate. I'm guessing that the goal is to have a region based predicate? If so, perhaps we should refactor the signature of the changed method (to take a region to filter on), and then add a filterByOverlappingReferenceRegion in the non-Parquet fork of the code (https://github.com/bigdatagenomics/adam/pull/1156/files#diff-d36ea7d0742decd0b040a73a96af06e9R972). How does that sound?

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Sep 12, 2016

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1482/
Test PASSed.

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1482/
Test PASSed.

@akmorrow13

This comment has been minimized.

Show comment
Hide comment
@akmorrow13

akmorrow13 Sep 13, 2016

Contributor

I added a loadParquetCoverage and loadCoverage, the first which accepts a predicate. This is how loadFeatures is implemented.

Contributor

akmorrow13 commented Sep 13, 2016

I added a loadParquetCoverage and loadCoverage, the first which accepts a predicate. This is how loadFeatures is implemented.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 13, 2016

Member

Thanks for updating this @akmorrow13! This LGTM. I will leave open for a day for any other review comments and will merge if there are no objections.

Member

fnothaft commented Sep 13, 2016

Thanks for updating this @akmorrow13! This LGTM. I will leave open for a day for any other review comments and will merge if there are no objections.

@fnothaft

This comment has been minimized.

Show comment
Hide comment
@fnothaft

fnothaft Sep 15, 2016

Member

Thanks @akmorrow13! I've merged this as 5e2853e.

Member

fnothaft commented Sep 15, 2016

Thanks @akmorrow13! I've merged this as 5e2853e.

@fnothaft fnothaft closed this Sep 15, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment