Add pipe API in and out formatters for Features #1378

Merged
fnothaft merged 1 commit into bigdatagenomics:master from heuermh:feature-formatters Mar 14, 2017

@heuermh
Member

heuermh commented Jan 27, 2017

Work in progress, opened pull request for review.

Fixes #1374

.collect
.toVector)
// create sequence records based on largest end coordinate
val featuresByContigName = rdd.keyBy(_.getContigName)

@heuermh

heuermh Jan 27, 2017

Author Member

I'm not sure this is a good idea, but it (partially) solves the problem of trying to partition a sequence of length 1L in GenomicRDD.pipe.
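
For readers following along, the idea in the code comment above (one sequence record per contig, sized by the largest end coordinate seen) can be sketched with plain Scala collections. This is only an illustration under stated assumptions: `Feature`, `SequenceRecord`, and `InferSequenceRecords` here are hypothetical stand-ins, not the ADAM classes, and a `Seq` stands in for the RDD.

```scala
// Hypothetical stand-ins for illustration; these are NOT the ADAM types.
final case class Feature(contigName: String, start: Long, end: Long)
final case class SequenceRecord(name: String, length: Long)

object InferSequenceRecords {
  // Group features by contig name and use the largest end coordinate
  // as the inferred (lower-bound) length of that contig.
  def infer(features: Seq[Feature]): Vector[SequenceRecord] =
    features
      .groupBy(_.contigName)
      .map { case (name, fs) => SequenceRecord(name, fs.map(_.end).max) }
      .toVector
      .sortBy(_.name)
}
```

The inferred length is only a lower bound on the true contig length, since no feature is guaranteed to reach the end of the contig.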

@AmplabJenkins

AmplabJenkins commented Jan 27, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1763/

Build result: FAILURE

[...truncated 57 lines...]
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:579)
	at sun.security.ssl.SSLSocketImpl.connect(SSLSocketImpl.java:618)
	at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
	at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
	at sun.net.www.protocol.https.HttpsClient.<init>(HttpsClient.java:275)
	at sun.net.www.protocol.https.HttpsClient.New(HttpsClient.java:371)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.getNewHttpClient(AbstractDelegateHttpsURLConnection.java:191)
	at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:932)
	at sun.net.www.protocol.https.AbstractDelegateHttpsURLConnection.connect(AbstractDelegateHttpsURLConnection.java:177)
	at sun.net.www.protocol.https.HttpsURLConnectionImpl.connect(HttpsURLConnectionImpl.java:153)
	at com.tikal.hudson.plugins.notification.Protocol$3.send(Protocol.java:99)
	at com.tikal.hudson.plugins.notification.Phase.handle(Phase.java:45)
	at com.tikal.hudson.plugins.notification.JobListener.onCompleted(JobListener.java:36)
	at hudson.model.listeners.RunListener.fireCompleted(RunListener.java:201)
	at hudson.model.Run.execute(Run.java:1783)
	at hudson.matrix.MatrixBuild.run(MatrixBuild.java:306)
	at hudson.model.ResourceController.execute(ResourceController.java:98)
	at hudson.model.Executor.run(Executor.java:410)
Failed to notify endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8' - java.net.SocketTimeoutException: connect timed out
Test FAILed.

@heuermh
Member Author

heuermh commented Jan 30, 2017

I'm not sure why the counts don't add up for GTF and GFF3 when they are fine for BED and narrowPeak:

- don't lose any features when piping as GTF format *** FAILED ***
  95 did not equal 114 (FeatureRDDSuite.scala:759)

- don't lose any features when piping as GFF3 format *** FAILED ***
  195 did not equal 199 (FeatureRDDSuite.scala:772)

It appears that we are duplicating features when partitioning. Is this reasonable? Should the unit tests do a distinct after piping back in?

scala> val gtf = sc.loadGtf("src/test/resources/Homo_sapiens.GRCh37.75.trun100.gtf")
gtf: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[28] at flatMap at ADAMContext.scala:1186,SequenceDictionary{
1->36081})

scala> gtf.rdd.count
res2: Long = 95

scala> val pipedRdd: FeatureRDD = gtf.pipe("tee /dev/null") 
pipedRdd: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[11] at mapPartitionsWithIndex at GenomicRDD.scala:336,SequenceDictionary{
1->36081})

scala> pipedRdd.rdd.count
res0: Long = 129

scala> pipedRdd.rdd.distinct.count
res3: Long = 95

scala> val gff3 = sc.loadGff3("src/test/resources/dvl1.200.gff3")
gff3: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[21] at flatMap at ADAMContext.scala:1166,SequenceDictionary{
1->1363541})

scala> gff3.rdd.count
res1: Long = 195

scala> gff3.rdd.distinct.count
res3: Long = 181

scala> val pipedRdd: FeatureRDD = gff3.pipe("tee /dev/null") 
pipedRdd: org.bdgenomics.adam.rdd.feature.FeatureRDD =
FeatureRDD(MapPartitionsRDD[16] at mapPartitionsWithIndex at GenomicRDD.scala:336,SequenceDictionary{
1->1363541})

scala> pipedRdd.rdd.count
res0: Long = 199

scala> pipedRdd.rdd.distinct.count
res1: Long = 181
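
One way such duplication could arise, sketched with plain Scala collections: if a feature is copied into every partition bin it overlaps, the raw count grows while the distinct count stays the same. This is a toy model under stated assumptions, not a description of what GenomicRDD.pipe does; `Feature`, `PartitionDuplication`, and the fixed-width binning are all hypothetical.

```scala
// Hypothetical stand-in for illustration; NOT the ADAM Feature type.
final case class Feature(contigName: String, start: Long, end: Long)

object PartitionDuplication {
  // Indices of the fixed-width bins that the half-open interval
  // [start, end) overlaps. Assumes end > start and binSize > 0.
  def binsFor(f: Feature, binSize: Long): Seq[Long] =
    (f.start / binSize) to ((f.end - 1) / binSize)

  // Copy each feature into every bin it overlaps, so a feature that
  // spans a bin boundary appears once per overlapped bin.
  def partitionWithCopies(features: Seq[Feature], binSize: Long): Seq[(Long, Feature)] =
    features.flatMap(f => binsFor(f, binSize).map(bin => (bin, f)))
}
```

Under this model a test that compares raw counts fails whenever any feature spans a boundary, while a test that compares distinct records before and after piping still passes.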
@AmplabJenkins

AmplabJenkins commented Jan 30, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1765/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 06893683001603d0decd286abb1ceca7fc021d73 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 06893683001603d0decd286abb1ceca7fc021d73 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 06893683001603d0decd286abb1ceca7fc021d73
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

AmplabJenkins commented Jan 30, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1766/

Build result: FAILURE

[...truncated 3 lines...]
Building remotely on amp-jenkins-worker-05 (centos spark-test) in workspace /home/jenkins/workspace/ADAM-prb
Wiping out workspace first.
Cloning the remote Git repository
Cloning repository https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git -c core.askpass=true fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 6b78a58 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 6b78a58 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 6b78a5814a64fc2414df00e87df31003ceca4b8c
First time build. Skipping changelog.
Triggering ADAM-prb » 2.6.0,2.11,1.5.2,centos
Triggering ADAM-prb » 2.6.0,2.10,1.5.2,centos
Touchstone configurations resulted in FAILURE, so aborting...
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

implicit val tFormatter = BEDInFormatter
implicit val uFormatter = new BEDOutFormatter

val pipedRdd: FeatureRDD = frdd.pipe("tee /dev/null")

@devin-petersohn

devin-petersohn Feb 2, 2017

Member

Why do we use tee /dev/null in the test?

@fnothaft

fnothaft Feb 2, 2017

Member

It was the easiest one-liner I could think of that would pipe standard in to standard out unmodified without creating any artifacts.
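
To make the identity-pipe idea concrete, here is a minimal sketch that formats records as BED-like lines, pipes them through `tee /dev/null` with `scala.sys.process`, and parses the output back. It assumes a Unix-like system with `tee` on the PATH. `BedFeature`, `toBed`, `fromBed`, and `TeeRoundTrip` are hypothetical names for illustration, not the ADAM in/out formatters.

```scala
import java.io.ByteArrayInputStream
import java.nio.charset.StandardCharsets.UTF_8
import scala.sys.process._

// Hypothetical stand-in for illustration; NOT the ADAM Feature type.
final case class BedFeature(contigName: String, start: Long, end: Long)

object TeeRoundTrip {
  // Plays the role of an "in formatter": record -> one tab-separated line.
  def toBed(f: BedFeature): String = s"${f.contigName}\t${f.start}\t${f.end}"

  // Plays the role of an "out formatter": one tab-separated line -> record.
  def fromBed(line: String): BedFeature = {
    val fields = line.split("\t")
    BedFeature(fields(0), fields(1).toLong, fields(2).toLong)
  }

  // Write the records to the command's standard input, capture its
  // standard output, and parse the lines back into records.
  def pipeThroughTee(features: Seq[BedFeature]): List[BedFeature] = {
    val input = features.map(toBed).mkString("", "\n", "\n")
    val stdin = new ByteArrayInputStream(input.getBytes(UTF_8))
    // tee copies stdin to stdout and to /dev/null, leaving no artifacts.
    val stdout = (Seq("tee", "/dev/null") #< stdin).!!
    stdout.split("\n").filter(_.nonEmpty).map(fromBed).toList
  }
}
```

Because the command leaves the stream unmodified, any difference between the input and output records points at the formatters or the piping machinery rather than at the external tool.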

@heuermh
Member Author

heuermh commented Mar 7, 2017

Rebased to pull in #1411; two unit tests still fail, as discussed above.

@AmplabJenkins

AmplabJenkins commented Mar 7, 2017

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1843/

Build result: FAILURE

[...truncated 16 lines...]
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1378/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a --contains 67905d5 # timeout=10
 > /home/jenkins/git2/bin/git rev-parse remotes/origin/pr/1378/merge^{commit} # timeout=10
Checking out Revision 67905d5 (origin/pr/1378/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 67905d5772d8457a83894e099926d7b5b45987b5
First time build. Skipping changelog.
Triggering ADAM-prb » 2.3.0,2.11,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.3.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.11,2.0.0,centos
Triggering ADAM-prb » 2.6.0,2.10,2.0.0,centos
Triggering ADAM-prb » 2.3.0,2.10,1.6.1,centos
Triggering ADAM-prb » 2.6.0,2.11,1.6.1,centos
ADAM-prb » 2.3.0,2.11,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.10,2.0.0,centos completed with result FAILURE
ADAM-prb » 2.3.0,2.10,1.6.1,centos completed with result FAILURE
ADAM-prb » 2.6.0,2.11,1.6.1,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh heuermh modified the milestone: 0.23.0 Mar 8, 2017
@heuermh heuermh added this to Triage in Release 0.23.0 Mar 8, 2017
@coveralls

coveralls commented Mar 10, 2017

Coverage increased (+0.2%) to 76.61% when pulling 193bab4 on heuermh:feature-formatters into 07c1982 on bigdatagenomics:master.

@AmplabJenkins

AmplabJenkins commented Mar 10, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1852/
Test PASSed.

Member

fnothaft left a comment

LGTM!

@fnothaft
Member

fnothaft commented Mar 14, 2017

@heuermh I just realized that I approved this but forgot to merge it. Is this good to go from your side? If yes, here is what I propose:

  • Let's merge #1422
  • Let's update this PR to add docs to the Pipe API docs noting that these new formatters exist
  • Then, let's merge this PR.

How's that sound on your end?

@heuermh
Member Author

heuermh commented Mar 14, 2017

Sounds good. I'll push a doc commit and squash after #1422 is merged.

@heuermh heuermh force-pushed the heuermh:feature-formatters branch from 193bab4 to 707567c Mar 14, 2017
@coveralls

coveralls commented Mar 14, 2017

Coverage increased (+0.3%) to 76.659% when pulling 707567c on heuermh:feature-formatters into 1cae769 on bigdatagenomics:master.

@AmplabJenkins

AmplabJenkins commented Mar 14, 2017

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/1863/
Test PASSed.

@fnothaft fnothaft merged commit b8477dc into bigdatagenomics:master Mar 14, 2017
2 checks passed
coverage/coveralls Coverage increased (+0.3%) to 76.659%
default Merged build finished.
@fnothaft
Member

fnothaft commented Mar 14, 2017

Merged! Thanks @heuermh!

@heuermh heuermh deleted the heuermh:feature-formatters branch Mar 14, 2017
@heuermh heuermh modified the milestones: 0.22.0, 0.23.0 Mar 14, 2017
@heuermh heuermh moved this from Triage to Completed in Release 0.23.0 Mar 14, 2017