Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hive partitioned(v4) rebased #1864

Closed
wants to merge 21 commits into from

Conversation

Projects
5 participants
@jpdna
Copy link
Member

jpdna commented Jan 6, 2018

The "hive style" partitioned PR rebased against current ADAM master.

It will only compile against Spark2, hence POMs have been moved to Spark 2.
Is there a specific strategy wrt to handelling that for Spark2 with a profile you suggest? model after how the R code works?

Please let me know, especially if you have comments/changes wrt to the partitioning related code.

todo:

  1. its looks like there is some cleanup related to adam-cli/src/test/resources/artificial.fa that I need to look at
  2. a few small formatting issues in ADAMContext - will fix so that this PR does not modify where its arbitrary

Paschall and others added some commits Jul 21, 2017

Paschall Paschall
add hive style partitioning for contigName
Add partitioning by 1 megabase bin under each chromosome

moved pom back to scala 2_10 and spark 1

added hive partitioning for all types

add loading hive paritions support to all types

cleanup hive paritions code

individual ParttionedParquet save and load function, removed reference to Hive, fixed partitioned flag marker file to work with HDFS

updated tests, fixed boolean error in partition flag test

fixed typo

Address review comments

Addressed review comments, whitespace and misc.

fixed pom

allow paritioned alignment load by ReferenceRegion

move to spark2 and scala 2.11 as otherwise will not compile
Paschall Paschall
@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 6, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2552/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1864/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 94c10b3 # timeout=10Checking out Revision 94c10b3 (origin/pr/1864/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f 94c10b30f9baf8b53d5b9558c6ceb0202c40a831First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@heuermh

This comment has been minimized.

Copy link
Member

heuermh commented Jan 6, 2018

Is there a specific strategy wrt to handelling that for Spark2 with a profile you suggest?

Just wait until #1861 lands.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jan 17, 2018

Coverage Status

Coverage increased (+0.03%) to 82.847% when pulling 6e59b8d on jpdna:hive_partitioned_v4 into 4223f56 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 17, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2572/
Test PASSed.

@@ -216,7 +216,7 @@ case class ParquetUnboundAlignmentRecordRDD private[rdd] (
sc.loadParquet(parquetFilename)
}

lazy val dataset = {
lazy val dataset: Dataset[AlignmentRecordProduct] = {

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

Is this needed?

@@ -1 +0,0 @@
../../../../adam-core/src/test/resources/artificial.fa

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

Why were there changes to this file?

@@ -211,7 +211,7 @@
<dependency>
<groupId>org.scala-lang</groupId>
<artifactId>scala-library</artifactId>
<scope>provided</scope>
<scope>compile</scope>

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

What is this change for?

* @param pathName The path name to load alignment records from.
* Globs/directories are supported.
* @param regions
* @return

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

Descriptions of param and return types

*
* @param pathName
* @param regions
* @return

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

Description of param and return types

* Test if Parquet files are partitioned
*
* @param filePath
* @return

This comment has been minimized.

Copy link
@akmorrow13

akmorrow13 Jan 17, 2018

Contributor

param and return type description

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 17, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2573/

Build result: FAILURE

[...truncated 7 lines...] > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git --version # timeout=10 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/:refs/remotes/origin/ # timeout=15 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/:refs/remotes/origin/ # timeout=10 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10Fetching upstream changes from https://github.com/bigdatagenomics/adam.git > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/:refs/remotes/origin/pr/ # timeout=15 > /home/jenkins/git2/bin/git rev-parse origin/pr/1864/merge^{commit} # timeout=10 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains f526989 # timeout=10Checking out Revision f526989 (origin/pr/1864/merge) > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10 > /home/jenkins/git2/bin/git checkout -f f5269892a2bdbbc1c1de9244ccc6ec7b416fa3f7First time build. Skipping changelog.Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.10,2.2.1,centosTriggering ADAM-prb ? 2.7.3,2.10,2.2.1,centosTriggering ADAM-prb ? 2.6.2,2.11,2.2.1,centosADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILUREADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURENotifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jan 17, 2018

Coverage Status

Coverage increased (+0.03%) to 82.847% when pulling 4442736 on jpdna:hive_partitioned_v4 into 4223f56 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 17, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2574/
Test PASSed.

Paschall Paschall
@coveralls

This comment has been minimized.

Copy link

coveralls commented Jan 17, 2018

Coverage Status

Coverage decreased (-0.2%) to 82.6% when pulling 1a75ed5 on jpdna:hive_partitioned_v4 into 4223f56 on bigdatagenomics:master.

@jpdna

This comment has been minimized.

Copy link
Member Author

jpdna commented Jan 17, 2018

Question about rebase/history:
So - after having some troubles resolving conflicts during rebasing, I ended up using the "resolve" tool on github to merge master into my branch.
This seems to have worked in terms of the final product is what I want with only intended changes on top of Master.

HOWEVER - I don't like the way that the history has worked out with, with minor incremental changes which I would have liked to have squashed interleaved with the commits on Master.

@fnothaft @heuermh or others can you advise me on if there is a need - and what is the best way, to ideally go from where this PR branch is now, to a cleaner history?

Here is what I am considering: Should I make a new branch from master, compare it to this PR branch, then move the intended changes over manually to that new branch, to consolidate all the changes into a single commit on top of current Master?

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 17, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2575/
Test PASSed.

Paschall added some commits Jan 19, 2018

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jan 19, 2018

Coverage Status

Coverage decreased (-0.2%) to 82.6% when pulling 1aefa72 on jpdna:hive_partitioned_v4 into 4223f56 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 19, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2576/
Test PASSed.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Jan 19, 2018

Coverage Status

Coverage decreased (-0.2%) to 82.6% when pulling 1aefa72 on jpdna:hive_partitioned_v4 into 4223f56 on bigdatagenomics:master.

@AmplabJenkins

This comment has been minimized.

Copy link

AmplabJenkins commented Jan 19, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2577/
Test PASSed.

@jpdna

This comment has been minimized.

Copy link
Member Author

jpdna commented Jan 19, 2018

Closed in favor a clean single commit for these changes at: #1878
The comments on this issue I believe have been addressed.

@jpdna jpdna closed this Jan 19, 2018

@heuermh heuermh added this to the 0.24.0 milestone Jan 19, 2018

@heuermh heuermh added this to Completed in Release 0.24.0 Feb 10, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.