[ADAM-1978] Add additional filter by convenience methods. #1983

Merged
fnothaft merged 2 commits into bigdatagenomics:master from heuermh:filter-by on Jun 27, 2018

heuermh commented Apr 13, 2018

Fixes #1978.

Thank you to @antonkulaga; the methods here are inspired by those in
https://github.com/antonkulaga/adam-playground

coveralls commented Apr 13, 2018

Coverage Status

Coverage increased (+0.1%) to 79.292% when pulling d708474 on heuermh:filter-by into 429680e on bigdatagenomics:master.

AmplabJenkins commented Apr 13, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2751/

heuermh commented Apr 18, 2018

@fnothaft Milestone as 0.24.1?

heuermh requested a review from fnothaft on Apr 18, 2018

fnothaft left a comment

I like this general direction! Would love to see this extend out to the other datatypes (and Python and R).

heuermh force-pushed the heuermh:filter-by branch from 62fb35a to c2bc183 on Apr 24, 2018

AmplabJenkins commented Apr 24, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2760/

heuermh force-pushed the heuermh:filter-by branch from c2bc183 to daf9f1a on May 21, 2018

AmplabJenkins commented May 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2767/

heuermh force-pushed the heuermh:filter-by branch from daf9f1a to a705965 on May 21, 2018

AmplabJenkins commented May 21, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2768/

heuermh added this to the 0.24.1 milestone on Jun 6, 2018

fnothaft left a comment

LGTM in general, but I think the quantitative methods should clarify how the filter is applied, e.g., filterQualityGreaterThan, filterReadDepthLessThan, etc.

transformDataset(dataset => dataset.filter(dataset.col("featureType").eqNullSafe(featureType)))
}

override def filterToGene(geneId: String): FeatureRDD = {

fnothaft Jun 22, 2018

I would do filterGenes, etc.

heuermh Jun 22, 2018

E.g.?

def filterToGenes(geneIds: Seq[String]): FeatureRDD

heuermh Jun 23, 2018

Added in most recent commit

transformDataset(dataset => dataset.filter(dataset.col("exonId").eqNullSafe(exonId)))
}

override def filterByScore(minimumScore: Double): FeatureRDD = {

fnothaft Jun 22, 2018

filterScoreGreaterThan?

heuermh Jun 22, 2018

No, I want to be careful not to go overboard and include too many methods. I'd like to only filter out/remove numeric values strictly less than a minimum value, consistently.

}

/**
* Filter this FeatureRDD by gene.

fnothaft Jun 22, 2018

by gene -> to features that are genes?

heuermh Jun 22, 2018

No, that would be filterByFeatureType("gene") or filterByFeatureType("SO:0000704") (Sequence Ontology term, for GFF3)
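
To make the distinction in this exchange concrete, here is a minimal sketch that uses plain Scala collections instead of ADAM's Spark-backed FeatureRDD. The `Feature` case class, its `Option` fields, the `Features` wrapper, and the sample data are hypothetical stand-ins. Only the method names filterToGene, filterToGenes, and filterByFeatureType come from the discussion, and the null-safe comparison in the diff (eqNullSafe) is approximated here with `Option.contains`.

```scala
// Hypothetical, simplified model; not ADAM's actual Feature or FeatureRDD.
final case class Feature(
  name: String,
  featureType: Option[String],
  geneId: Option[String])

final case class Features(records: Seq[Feature]) {
  // Keep features annotated with the given gene identifier, whatever their type.
  def filterToGene(geneId: String): Features =
    Features(records.filter(_.geneId.contains(geneId)))

  // Keep features annotated with any of the given gene identifiers.
  def filterToGenes(geneIds: Seq[String]): Features =
    Features(records.filter(_.geneId.exists(id => geneIds.contains(id))))

  // Keep features whose type matches, e.g. "gene" or "SO:0000704".
  def filterByFeatureType(featureType: String): Features =
    Features(records.filter(_.featureType.contains(featureType)))
}

object FeatureFilterSketch {
  // Made-up sample data for illustration only.
  val features: Features = Features(Seq(
    Feature("f1", Some("gene"), Some("G1")),
    Feature("f2", Some("exon"), Some("G1")),
    Feature("f3", Some("gene"), Some("G2")),
    Feature("f4", None, None)))

  def main(args: Array[String]): Unit = {
    println(features.filterToGene("G1").records.map(_.name))
    println(features.filterByFeatureType("gene").records.map(_.name))
  }
}
```

With this sample data, filterToGene("G1") keeps f1 and f2 (a gene and an exon that share a gene identifier), whereas filterByFeatureType("gene") keeps f1 and f3. Records with missing values never match either filter in this sketch.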

@@ -280,6 +280,42 @@ case class DatasetBoundAlignmentRecordRDD private[rdd] (
newProcessingSteps: Seq[ProcessingStep]): AlignmentRecordRDD = {
copy(processingSteps = newProcessingSteps)
}

override def filterByMapq(minimumMapq: Int): AlignmentRecordRDD = {

fnothaft Jun 22, 2018

filterMapqGreaterThan

heuermh commented Jun 22, 2018

I tried to be careful with method names.

Methods that filter out/remove matching records are named filterXxx.

def filterUnalignedReads(): AlignmentRecordRDD

Methods that filter out/remove records that do not match are named filterToXxx(s).

def filterToParent(parentId: String): FeatureRDD
def filterToPrimaryAlignments(): AlignmentRecordRDD

Methods that filter on numeric values are named filterByXxx, and the method parameter is consistently minimumXxx.

def filterByScore(minimumScore: Double): FeatureRDD
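
The three naming rules can be sketched on plain Scala collections. This is an illustration under stated assumptions, not the code merged in this pull request: `Read`, `Reads`, and the sample data are hypothetical stand-ins for AlignmentRecord and AlignmentRecordRDD, and dropping records that have no mapping quality is an assumption of the sketch. The method names and the "remove values strictly less than the minimum" behaviour are taken from the comments in this thread.

```scala
// Hypothetical, simplified model; not ADAM's actual AlignmentRecord or AlignmentRecordRDD.
final case class Read(
  name: String,
  mapped: Boolean,
  primary: Boolean,
  mapq: Option[Int])

final case class Reads(records: Seq[Read]) {
  // filterXxx: remove the records that match (here, unaligned reads).
  def filterUnalignedReads(): Reads =
    Reads(records.filter(_.mapped))

  // filterToXxx: remove the records that do not match (keep primary alignments).
  def filterToPrimaryAlignments(): Reads =
    Reads(records.filter(r => r.mapped && r.primary))

  // filterByXxx: remove records whose value is strictly less than the minimum.
  // Assumption: records without a value are removed as well.
  def filterByMapq(minimumMapq: Int): Reads =
    Reads(records.filter(_.mapq.exists(_ >= minimumMapq)))
}

object FilterNamingSketch {
  // Made-up sample data for illustration only.
  val reads: Reads = Reads(Seq(
    Read("r1", mapped = true, primary = true, mapq = Some(60)),
    Read("r2", mapped = true, primary = false, mapq = Some(30)),
    Read("r3", mapped = true, primary = true, mapq = Some(10)),
    Read("r4", mapped = false, primary = false, mapq = None)))

  def main(args: Array[String]): Unit = {
    println(reads.filterUnalignedReads().records.map(_.name))
    println(reads.filterToPrimaryAlignments().records.map(_.name))
    println(reads.filterByMapq(30).records.map(_.name))
  }
}
```

Note that filterByMapq(30) keeps r2, whose mapping quality equals the minimum, because only values strictly below the minimum are removed.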

AmplabJenkins commented Jun 23, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2775/

fnothaft merged commit 559ea13 into bigdatagenomics:master on Jun 27, 2018

2 checks passed

Codacy/PR Quality Review Up to standards. A positive pull request.
default Merged build finished.

fnothaft commented Jun 27, 2018

Merged! Thanks @heuermh!

heuermh deleted the heuermh:filter-by branch on Sep 4, 2018
