[ADAM-1978] Add additional filter by convenience methods. #1983

Merged
fnothaft merged 2 commits into bigdatagenomics:master from heuermh:filter-by on Jun 27, 2018

Member

@heuermh heuermh commented Apr 13, 2018

Fixes #1978.

Thank you to @antonkulaga, the methods here are inspired by those in
https://github.com/antonkulaga/adam-playground


@coveralls coveralls commented Apr 13, 2018

Coverage Status

Coverage increased (+0.1%) to 79.292% when pulling d708474 on heuermh:filter-by into 429680e on bigdatagenomics:master.


@AmplabJenkins AmplabJenkins commented Apr 13, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2751/

Member Author

@heuermh heuermh commented Apr 18, 2018

@fnothaft Milestone as 0.24.1?

@heuermh heuermh requested a review from fnothaft Apr 18, 2018
Member

@fnothaft fnothaft left a comment

I like this general direction! Would love to see this extend out to the other datatypes (and Python and R).

@heuermh heuermh force-pushed the heuermh:filter-by branch from 62fb35a to c2bc183 Apr 24, 2018

@AmplabJenkins AmplabJenkins commented Apr 24, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2760/

@heuermh heuermh force-pushed the heuermh:filter-by branch from c2bc183 to daf9f1a May 21, 2018

@AmplabJenkins AmplabJenkins commented May 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2767/

@heuermh heuermh force-pushed the heuermh:filter-by branch from daf9f1a to a705965 May 21, 2018

@AmplabJenkins AmplabJenkins commented May 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2768/

@heuermh heuermh added this to the 0.24.1 milestone Jun 6, 2018
Member

@fnothaft fnothaft left a comment

LGTM in general, but I think the quantitative methods should clarify how the filter is applied. I.e., filterQualityGreaterThan or filterReadDepthLessThan etc.

transformDataset(dataset => dataset.filter(dataset.col("featureType").eqNullSafe(featureType)))
}

override def filterToGene(geneId: String): FeatureRDD = {

fnothaft Jun 22, 2018
Member

I would do filterGenes, etc.

heuermh Jun 22, 2018
Author Member

E.g.?

def filterToGenes(geneIds: Seq[String]): FeatureRDD

heuermh Jun 23, 2018
Author Member

Added in most recent commit

transformDataset(dataset => dataset.filter(dataset.col("exonId").eqNullSafe(exonId)))
}

override def filterByScore(minimumScore: Double): FeatureRDD = {

fnothaft Jun 22, 2018
Member

filterScoreGreaterThan?

heuermh Jun 22, 2018
Author Member

No, I want to be careful not to go overboard and include too many methods. I'd like to only filter out/remove numeric values strictly less than a minimum value, consistently.

}

/**
* Filter this FeatureRDD by gene.

fnothaft Jun 22, 2018
Member

by gene -> to features that are genes?

heuermh Jun 22, 2018
Author Member

No, that would be filterByFeatureType("gene") or filterByFeatureType("SO:0000704") (Sequence Ontology term, for GFF3)
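
To make the distinction above concrete, here is a minimal sketch over plain Scala collections rather than ADAM's Spark-backed `FeatureRDD`. `SimpleFeature`, its two fields, and the gene identifiers are hypothetical stand-ins for illustration; only the method names `filterByFeatureType` and `filterToGene` come from this thread, and their real implementations may differ.

```scala
object FeatureFilterSketch {
  // Hypothetical, simplified record; ADAM's Feature schema has many more fields.
  final case class SimpleFeature(featureType: String, geneId: Option[String])

  // Keep only features whose type matches, e.g. "gene" or "SO:0000704".
  def filterByFeatureType(features: Seq[SimpleFeature], featureType: String): Seq[SimpleFeature] =
    features.filter(_.featureType == featureType)

  // Keep only features annotated with the given gene identifier, whatever their type.
  def filterToGene(features: Seq[SimpleFeature], geneId: String): Seq[SimpleFeature] =
    features.filter(_.geneId.contains(geneId))

  // Made-up sample data: two gene features and one exon that belongs to the first gene.
  val features: Seq[SimpleFeature] = Seq(
    SimpleFeature("gene", Some("ENSG01")),
    SimpleFeature("exon", Some("ENSG01")),
    SimpleFeature("gene", Some("ENSG02")))
}
```

With this sample data, filtering by feature type `"gene"` keeps both gene records, while filtering to gene `"ENSG01"` keeps the gene and its exon.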

@@ -280,6 +280,42 @@ case class DatasetBoundAlignmentRecordRDD private[rdd] (
newProcessingSteps: Seq[ProcessingStep]): AlignmentRecordRDD = {
copy(processingSteps = newProcessingSteps)
}

override def filterByMapq(minimumMapq: Int): AlignmentRecordRDD = {

fnothaft Jun 22, 2018
Member

filterMapqGreaterThan

Member Author

@heuermh heuermh commented Jun 22, 2018

I tried to be careful with method names.

Methods that filter out/remove matching records are named filterXxx

def filterUnalignedReads(): AlignmentRecordRDD

Methods that filter out/remove records that do not match are named filterToXxx(s).

def filterToParent(parentId: String): FeatureRDD
def filterToPrimaryAlignments(): AlignmentRecordRDD

Methods that filter on numeric values are named filterByXxx and the method parameter is consistently minimumXxx

def filterByScore(minimumScore: Double): FeatureRDD
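
The three naming rules above could be sketched as follows, using a plain in-memory collection instead of `AlignmentRecordRDD`. `SimpleRead` and `SimpleReads` are hypothetical types invented for this illustration, and dropping reads with no mapping quality in `filterByMapq` is an assumption, not something this thread specifies.

```scala
object FilterNamingSketch {
  // Hypothetical, simplified read; not ADAM's AlignmentRecord schema.
  final case class SimpleRead(name: String, mapped: Boolean, primary: Boolean, mapq: Option[Int])

  final case class SimpleReads(reads: Seq[SimpleRead]) {
    // filterXxx: remove the records that match (here, unaligned reads).
    def filterUnalignedReads(): SimpleReads =
      SimpleReads(reads.filter(_.mapped))

    // filterToXxx: remove the records that do not match (keep primary alignments only).
    def filterToPrimaryAlignments(): SimpleReads =
      SimpleReads(reads.filter(_.primary))

    // filterByXxx: remove records whose value is strictly less than the minimum.
    // Assumption: reads without a mapping quality are removed as well.
    def filterByMapq(minimumMapq: Int): SimpleReads =
      SimpleReads(reads.filter(_.mapq.exists(_ >= minimumMapq)))
  }
}
```

Because each method returns a new `SimpleReads`, calls chain naturally, e.g. `reads.filterUnalignedReads().filterByMapq(30)`.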

@AmplabJenkins AmplabJenkins commented Jun 23, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2775/

@fnothaft fnothaft merged commit 559ea13 into bigdatagenomics:master Jun 27, 2018
2 checks passed
Codacy/PR Quality Review: Up to standards. A positive pull request.
default: Merged build finished.
Member

@fnothaft fnothaft commented Jun 27, 2018

Merged! Thanks @heuermh!

@heuermh heuermh deleted the heuermh:filter-by branch Sep 4, 2018