New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Predicate to filter conversion #234

Merged
merged 1 commit into from May 5, 2014

Conversation

Projects
None yet
3 participants
@arahuja
Contributor

arahuja commented Apr 29, 2014

This PR is for issue #62

ADAMPredicate derives from UnboundRecordFilter and can be used to set ParquetInputFormat.setUnboundRecordFilter. It also has an apply method to filter an existing RDD. This will allow to use predicates on parquet files for predicate pushdown but also on an already loaded RDD (if we load from BAM/SAM file, or use the same filters after some processing (removing duplicates after mark_duplicates before proceeding to the other read-prep stages))

I added a few examples - HighQualityReadPredicate, UniqueMappedRead and GenotypeRecordPASSPredicate.

Also, in ADAMRecordConditions and ADAMGenotypeConditions are utility predicates which can be composed using AND and OR to create a new predicate. We can also specify non-equality predicates easily as well.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Apr 29, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/310/

@@ -48,7 +48,7 @@ class PileupAggregator(protected val args: PileupAggregatorArgs)
val companion = PileupAggregator
def run(sc: SparkContext, job: Job) {
val pileups: RDD[ADAMPileup] = sc.adamLoad(args.readInput, predicate = Some(classOf[LocusPredicate]))

This comment has been minimized.

@arahuja

arahuja Apr 29, 2014

Contributor

Not sure why this was using LocusPredicate before?

@arahuja

arahuja Apr 29, 2014

Contributor

Not sure why this was using LocusPredicate before?

This comment has been minimized.

@massie

massie Apr 30, 2014

Member

That predicate is necessary. Pileups are created from mapped reads only.

@massie

massie Apr 30, 2014

Member

That predicate is necessary. Pileups are created from mapped reads only.

This comment has been minimized.

@massie

massie Apr 30, 2014

Member

Maybe we should put a comment to keep others from tripping up on this too?

@massie

massie Apr 30, 2014

Member

Maybe we should put a comment to keep others from tripping up on this too?

This comment has been minimized.

@arahuja

arahuja Apr 30, 2014

Contributor

Sure - but for this is after pileup creation right? Also the fields that LocusPredicate will check against are not defined for an ADAMPileup. I was going to substitute the MappedReadPredicate, but wasn't able to actually because of the typing as that is only applicable on ADAMRecord

@arahuja

arahuja Apr 30, 2014

Contributor

Sure - but for this is after pileup creation right? Also the fields that LocusPredicate will check against are not defined for an ADAMPileup. I was going to substitute the MappedReadPredicate, but wasn't able to actually because of the typing as that is only applicable on ADAMRecord

This comment has been minimized.

@massie

massie May 5, 2014

Member

You're right, Arun. This is after pileup creation so the predicate isn't needed.

@massie

massie May 5, 2014

Member

You're right, Arun. This is after pileup creation so the predicate isn't needed.

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins commented Apr 29, 2014

Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/311/

@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie Apr 30, 2014

Member

Please run scalariform, e.g. mvn org.scalariform:scalariform-maven-plugin:format

Member

massie commented Apr 30, 2014

Please run scalariform, e.g. mvn org.scalariform:scalariform-maven-plugin:format

@AmplabJenkins

This comment has been minimized.

Show comment
Hide comment
@AmplabJenkins

AmplabJenkins Apr 30, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/312/

AmplabJenkins commented Apr 30, 2014

All automated tests passed.
Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/ADAM-prb/312/

massie added a commit that referenced this pull request May 5, 2014

Merge pull request #234 from hammerlab/predicate-bam
Predicate to filter conversion

@massie massie merged commit 78bc6c1 into bigdatagenomics:master May 5, 2014

1 check passed

default Merged build finished.
Details
@massie

This comment has been minimized.

Show comment
Hide comment
@massie

massie May 5, 2014

Member

Thanks, Arun! I really liked seeing all the tests you put in this pull request. 👍

Member

massie commented May 5, 2014

Thanks, Arun! I really liked seeing all the tests you put in this pull request. 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment