New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ADAM-1883] Python and R caching #1885

Closed
wants to merge 1 commit into
base: master
from

Conversation

5 participants
@akmorrow13
Contributor

akmorrow13 commented Jan 24, 2018

No description provided.

@coveralls

This comment has been minimized.

coveralls commented Jan 24, 2018

Coverage Status

Coverage decreased (-0.05%) to 82.728% when pulling b11d7c2 on akmorrow13:python-caching into c5b4ebc on bigdatagenomics:master.

1 similar comment
@coveralls

This comment has been minimized.

coveralls commented Jan 24, 2018

Coverage Status

Coverage decreased (-0.05%) to 82.728% when pulling b11d7c2 on akmorrow13:python-caching into c5b4ebc on bigdatagenomics:master.

@coveralls

This comment has been minimized.

coveralls commented Jan 24, 2018

Coverage Status

Coverage decreased (-0.2%) to 82.561% when pulling 129f617 on akmorrow13:python-caching into eb3528c on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 24, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2591/
Test PASSed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 24, 2018

Resolves #1883

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 24, 2018

This is tested and works like so:

reads = ac.loadAlignments(readsPath, stringency=LENIENT)
reads.__jvmRdd.cache()
reads.toDF().count() # caching
reads.toDF().count() # is cached, will be fast
reads._jvmRdd.unpersist()

@heuermh

This comment has been minimized.

Member

heuermh commented Jan 24, 2018

I'm not sure I'm comfortable reviewing Python stuff yet. Is it reasonable to expose reads.__jvmRdd. to the caller like this? And isn't toDF().rdd contradictory?

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 24, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2594/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 1c40800 # timeout=10Checking out Revision 1c40800 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 1c40800b3b7b8e34bbc320651ff03f1dcc7c1e12First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 24, 2018

@heuermh

  1. _jvmRDD has already been exposed, this PR does not expose that.
  2. I have added no code calling .toDF().rdd, so I am not sure what problem there is there. This is the most straightforward way to access the pyspark RDD currently. We could add a better method for accessing this but that is a different problem then the one we are addressing here.
@heuermh

This comment has been minimized.

Member

heuermh commented Jan 24, 2018

re 2 I'm asking is the toDF() necessary in your comment above? Seems odd to convert to data frame and then back to rdd.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 24, 2018

I agree @heuermh, but this is currently the only way to access a pyspark rdd from our API. The other rdd option is ._jvmRdd, but this is not a pyspark RDD, just an inaccessible Java Member.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 24, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2595/
Test PASSed.

@heuermh

This comment has been minimized.

Member

heuermh commented Jan 24, 2018

I see, thanks

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 24, 2018

@heuermh I updated the count example to exclude .rdd so it isn't so confusing

@fnothaft

fnothaft requested changes Jan 25, 2018 edited

Thanks for doing this @akmorrow13! Two things:

  1. Can you add a persist(sl: StorageLevel) method to GenomicRDD?
  2. Can you add the persist/cache/unpersist methods on the Python (and R) GenomicRDD classes? cache and unpersist should be trivial but persist will take a tiny amount more finesse.

To wit:

_jvmRDD has already been exposed, this PR does not expose that.

_ is Python convention for "protected", so don't expect users to call that. Instead, to GenomicRDD add:

  def cache(self):

     self._jvmRdd.cache()
@@ -137,6 +137,26 @@ trait GenomicRDD[T, U <: GenomicRDD[T, U]] extends Logging {
*/
def replaceSequences(newSequences: SequenceDictionary): U
/**
* Caches underlying RDD in memory. This interfaced is used to

This comment has been minimized.

@fnothaft

fnothaft Jan 25, 2018

Member

Drop comment about R/Python, this method is useful independent of that.

/**
* Unpersists underlying RDD from memory. This interfaced is used to
* access caching functionality from the python and R APIs.

This comment has been minimized.

@fnothaft

fnothaft Jan 25, 2018

Member

Drop comment about R/Python, this method is useful independent of that.

* @return type of RDD that was cached
*/
def cache() = {
rdd.cache()

This comment has been minimized.

@heuermh

heuermh Jan 25, 2018

Member

Asking out of ignorance, does caching make any sense for GenomicDataset.dataset?

This comment has been minimized.

@fnothaft

fnothaft Jan 25, 2018

Member

Ah, yes... These should probably be overridden by the DatasetBound...RDD classes.

This comment has been minimized.

@akmorrow13

akmorrow13 Jan 25, 2018

Contributor

@fnothaft @heuermh I am confused what this means. Do you want me to implement this function in every DatasetBound...RDD function? There seems to be a lot of them, while this approach just requires 1 implementation.

This comment has been minimized.

@fnothaft

fnothaft Feb 3, 2018

Member

Hi @akmorrow13! Sorry for the slow reply here. Yes, that is what I want, but the simpler way to do this is to have RDDBoundGenomicRDD and DatasetBoundGenomicRDD traits that both implement the cache, persist, and unpersist methods. Then RDDBoundAlignmentRecordRDD would extend RDDBoundGenomicRDD, etc. We should do this for saveAsParquet as well.

Also, my preference is that unpersist returns Unit (no return value) and that cache/persist return either U or Unit.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 25, 2018

I added a persist function, but I have absolutely no idea to handle passing parameters from python to scala. I tried following the example for stringency, like this:

self._jvmRdd.persist(self.sc._jvm.org.apache.spark.storage.StorageLevel.apply(sl_dict["useDisk"], sl_dict["useMemory"],
                             sl_dict["deserialized"], sl_dict["replication"]))

but that didn't seem to work

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 25, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2599/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 25, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2600/
Test PASSed.

def cache(self):
self._jvmRdd.cache()
def persist(self, sl):

This comment has been minimized.

@fnothaft

fnothaft Jan 25, 2018

Member
  1. sl should be of type StorageLevel
  2. Use the Java StorageLevels.create method
  3. Voila:
self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.StorageLevels.create(sl.useDisk, sl.useMemory, sl.useOffHeap, sl.useDeserialized, sl.replication)

Py4j lines Python types up with Java types, so you need to line Python bool up with Java boolean, not Scala.Boolean, Python int up with Java int, not Scala.Integer, etc.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 25, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2601/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 25, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2603/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 52c468a # timeout=10Checking out Revision 52c468a (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 52c468a > /home/jenkins/git2/bin/git rev-list 92d412d # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 25, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2604/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 52c468a # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 26, 2018

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 26, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2605/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 2a5c7d4 # timeout=10Checking out Revision 2a5c7d4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 2a5c7d4 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 26, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2606/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 01bd9f7 # timeout=10Checking out Revision 01bd9f7 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 01bd9f7 > /home/jenkins/git2/bin/git rev-list 2a5c7d4 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 26, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2607/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b38a1c5 # timeout=10Checking out Revision b38a1c5 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b38a1c5 > /home/jenkins/git2/bin/git rev-list 01bd9f7 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Jan 26, 2018

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Jan 26, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2608/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains b38a1c5 # timeout=10Checking out Revision b38a1c5 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f b38a1c5 > /home/jenkins/git2/bin/git rev-list b38a1c5 # timeout=10Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result SUCCESSADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 6, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2646/
Test PASSed.

*/
def unpersist(): U = {
rdd.unpersist()
replaceRdd(rdd)

This comment has been minimized.

@akmorrow13

akmorrow13 Feb 6, 2018

Contributor

@fnothaft it returns U but idk if there is a better way to return the new rdd

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer replaceRdd(rdd.unpersist())

* @return U cached GenomicRDD
*/
def cache(): U = {
rdd.cache()

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer replaceRdd(rdd.cache())

*/
def unpersist(): U = {
rdd.unpersist()
replaceRdd(rdd)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer replaceRdd(rdd.unpersist())

*/
def persist(sl: StorageLevel): U = {
rdd.persist(sl)
replaceRdd(rdd)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer replaceRdd(rdd.persist(sl))

* A trait describing a GenomicDataset that also supports the Spark SQL APIs.
*/
trait DatasetBoundGenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicDataset[T, U, V] {
val uTag: TypeTag[U]

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

shouldn't need val uTag: TypeTag[U] as that comes in from GenomicDataset[T, U, V]

*/
override def cache(): V = {
dataset.cache()
transformDataset(dataset => dataset)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer transformDataset(_.cache())

*/
override def persist(sl: StorageLevel): V = {
dataset.persist(sl)
transformDataset(dataset => dataset)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer transformDataset(_.persist(sl))

*/
override def unpersist(): V = {
dataset.unpersist()
transformDataset(dataset => dataset)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

prefer transformDataset(_.unpersist())

Caches underlying RDD in memory.
"""
self._jvmRdd.cache()

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

:param sl new StorageLevel
"""
self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

Unpersists underlying RDD from memory.
"""
self._jvmRdd.unpersist()

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

setMethod("cache",
signature(ardd = "GenomicRDD"),
function(ardd) {
sparkR.callJMethod(ardd@jrdd, "cache")

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

sl = "character"),
function(ardd, sl) {
storageLevel <- sparkR.callJStatic("org.apache.spark.storage.StorageLevel", "fromString", sl)
sparkR.callJMethod(ardd@jrdd, "persist", storageLevel)

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

setMethod("unpersist",
signature(ardd = "GenomicRDD"),
function(ardd) {
sparkR.callJMethod(ardd@jrdd, "unpersist")

This comment has been minimized.

@fnothaft

fnothaft Feb 6, 2018

Member

Similarly, these should replace the RDD.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2648/
Test PASSed.

@akmorrow13 akmorrow13 added this to the 0.24.0 milestone Feb 7, 2018

@fnothaft

Just a couple of stylistic nits. Looks good otherwise; thanks @akmorrow13!

JavaSaveArgs,
SAMHeaderWriter
}
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, JavaSaveArgs, SAMHeaderWriter }

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

Nit: Please split import across multiple lines.

@@ -30,7 +30,7 @@ import org.bdgenomics.adam.models.{
SequenceDictionary
}
import org.bdgenomics.adam.rdd.ADAMContext._
import org.bdgenomics.adam.rdd.{ AvroRecordGroupGenomicRDD, JavaSaveArgs }
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroRecordGroupGenomicRDD, JavaSaveArgs }

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

Please split long import line across multiple lines.

AvroGenomicRDD,
VCFHeaderUtils
}
import org.bdgenomics.adam.rdd.{ DatasetBoundGenomicDataset, AvroGenomicRDD, VCFHeaderUtils }

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

Nit: please split long import across multiple lines.

:param sl new StorageLevel
"""
return self._replaceRdd(self._jvmRdd.persist(self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

This line is really long. Can you split it out as:

jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk,
  sl.useMemory,
  sl.useOffHeap,
  sl.deserialized,
  sl.replication)

return self._replaceRdd(self._jvmRdd.persist(jsl))

Indented parameters should align with the opening delimiter.

@@ -201,3 +202,33 @@ def test_filterByOverlappingRegions(self):
filtered = reads.filterByOverlappingRegions(querys)
self.assertEquals(filtered.toDF().count(), 2)
def test_caching(self):

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

Nit: 2 lines of space before each new function definition.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2650/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2651/
Test PASSed.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2652/
Test PASSed.

@heuermh

Looks good, some minor doc fixes, and a question about GenomicDatset.

}
/**
* Persists underlying RDD in memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.

}
/**
* Unpersists underlying RDD from memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.

/**
* A trait describing a GenomicDataset that also supports the Spark SQL APIs.
*/
trait DatasetBoundGenomicDataset[T, U <: Product, V <: GenomicDataset[T, U, V]] extends GenomicDataset[T, U, V] {

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

@fnothaft I think you may have suggested this, doesn't GenomicDataset already imply the RDD has been bound to a dataset? Couldn't these be moved there?

I'm not very sure what GenomicDataset is for, it isn't used everywhere I would've thought
https://github.com/bigdatagenomics/adam/search?utf8=%E2%9C%93&q=GenomicDataset&type=

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

No, GenomicDataset is any GenomicRDD where the underlying genomic data can be viewed as both an RDD or a Dataset, while the base GenomicRDD trait only provides RDDs. This distinction will go away soon.

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

ParquetUnbound/RDDBound/DatasetBound implies how the data is currently physically represented. Parquet unbound means that the data hasn't been materialized into memory yet and that the data is represented using a Parquet file, while RDD/DatasetBound means that the current (as of the last transformation) representation is an RDD/Dataset.

Since cache impacts the physical materialization of the data, we want the new DatasetBoundGenomicDataset abstract class to override cache, and not the GenomicDataset abstract class.

}
/**
* Persists underlying dataset in memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Persists underlying dataset in memory. → Persists underlying dataset in memory or disk.

* Persists underlying dataset in memory.
*
* @param sl new StorageLevel
* @return Unit

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

@return Unit → @return V cached GenomicDataset

}
/**
* Unpersists underlying dataset from memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Unpersists underlying dataset from memory. → Unpersists underlying dataset from memory or disk.

/**
* Unpersists underlying dataset from memory.
*
* @return Unit

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

@return Unit → @return V uncached GenomicDataset

This comment has been minimized.

@fnothaft

fnothaft Feb 7, 2018

Member

unpersisted, not uncached

def persist(self, sl):
"""
Persists underlying RDD in memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Persists underlying RDD in memory. → Persists underlying RDD in memory or disk.

def unpersist(self):
"""
Unpersists underlying RDD from memory.

This comment has been minimized.

@heuermh

heuermh Feb 7, 2018

Member

Unpersists underlying RDD from memory. → Unpersists underlying RDD from memory or disk.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 8, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2654/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list e719c88 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Feb 8, 2018

Jenkins, retest this please.

1 similar comment
@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Feb 8, 2018

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 8, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2655/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains c71ccf4 # timeout=10Checking out Revision c71ccf4 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f c71ccf4 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

:return: Returns a new, persisted RDD.
"""
jsl = self.sc._jvm.org.apache.spark.api.java.StorageLevels.create(sl.useDisk, \

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2018

Member

Do not use \ to break lines, and please indent to the opening delimiter. I.e., every sl should be at the same character position on the line.

return self._replaceRdd(self._jvmRdd.persist(jsl))
def unpersist(self):

This comment has been minimized.

@fnothaft

fnothaft Feb 9, 2018

Member

Two spaces above.

@fnothaft

This comment has been minimized.

Member

fnothaft commented Feb 9, 2018

From looking at Jenkins, needs a ./scripts/format-sources.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 9, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2656/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1885/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 89131e6 # timeout=10Checking out Revision 89131e6 (origin/pr/1885/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 89131e6 > /home/jenkins/git2/bin/git rev-list c71ccf4 # timeout=10Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Feb 9, 2018

Jenkins, retest this please.

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 9, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2659/
Test PASSed.

@heuermh heuermh added this to Triage in Release 0.24.0 Feb 10, 2018

@akmorrow13

This comment has been minimized.

Contributor

akmorrow13 commented Feb 14, 2018

I think this is all set @fnothaft let me know if there is anything else that is needed

@akmorrow13 akmorrow13 moved this from Triage to Completed in Release 0.24.0 Feb 14, 2018

@AmplabJenkins

This comment has been minimized.

AmplabJenkins commented Feb 14, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2667/
Test PASSed.

@fnothaft

This comment has been minimized.

Member

fnothaft commented Feb 14, 2018

Thanks @akmorrow13! Merged manually as 6051321.

@fnothaft fnothaft closed this Feb 14, 2018

@heuermh heuermh changed the title from Python and R caching to [ADAM-1883] Python and R caching Feb 14, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment