[ADAM-651] Implement Hive-style partitioning by genomic range of Parquet backed datasets #1922

Closed. Wants to merge 3 commits into base: master.

Conversation

5 participants
@jpdna
Member

jpdna commented Feb 25, 2018

Replaces #1911

Implements Hive-style partitioning by genomic range of Parquet-backed datasets, improving the latency of region filtering for genomic-coordinate-based data types.
Refactored so that the filterByOverlappingRegions function is pulled up to the DatasetBoundGenomicDataset trait.
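The core idea of the PR can be sketched in a few lines: each record is written into a directory keyed by the bin its start position falls into, so a range query only touches the matching directories. The names below (positionBin, partitionDir) are illustrative stand-ins, not ADAM's actual API; the 1,000,000 bp default bin size comes from this PR.

```scala
// Hive-style partitioning sketch: a record lands in a directory named after
// its genomic bin. The bin is the start position divided by the bin size, so
// with a 1,000,000 bp bin size, positions 0..999999 map to bin 0, and so on.
// Names here are illustrative, not ADAM's API.
def positionBin(start: Long, binSize: Int): Long = start / binSize

def partitionDir(contig: String, start: Long, binSize: Int): String =
  s"contigName=$contig/positionBin=${positionBin(start, binSize)}"
```

A query for, say, chr1:1,500,000-2,500,000 then only needs to read the directories for bins 1 and 2 rather than scanning the whole dataset.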

@jpdna


Member

jpdna commented Feb 25, 2018

I believe that all comments from #1911 have been addressed.
I made a new PR here because rebase was messy, and I didn't want to lose the comments on commits in the earlier PR.

@akmorrow13 @heuermh @fnothaft please review.

@coveralls


coveralls commented Feb 25, 2018

Coverage Status

Coverage increased (+0.06%) to 82.643% when pulling e9e85a2 on jpdna:hive_partitioned_v11 into 26f0608 on bigdatagenomics:master.


@coveralls


coveralls commented Feb 25, 2018

Coverage Status

Coverage decreased (-0.05%) to 82.529% when pulling 221a6e2 on jpdna:hive_partitioned_v11 into 26f0608 on bigdatagenomics:master.

@AmplabJenkins


AmplabJenkins commented Feb 25, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2679/

Build result: FAILURE

[...truncated 7 lines...]
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1922/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains a7c47ba # timeout=10
Checking out Revision a7c47ba (origin/pr/1922/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f a7c47ba458a354bc62d7b2a7de4e2b654c44b991
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centos
Triggering ADAM-prb ? 2.6.2,2.11,2.2.1,centos
Triggering ADAM-prb ? 2.7.3,2.10,2.2.1,centos
Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centos
ADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result FAILURE
ADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result FAILURE
ADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result FAILURE
ADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result FAILURE
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@jpdna


Member

jpdna commented Feb 25, 2018

I just double checked that e9e85a2 passes tests for me locally on amp-bdg-master; I don't know why it is failing on Jenkins above.

@jpdna jpdna referenced this pull request Feb 25, 2018

Closed

Use Hive-style partitioning #370

hive-style partitioning
fixed typo

updated scaladoc

change var partitioning info in DatasetBoundGenomicDataset to val and use override in constructor

whitespace
@AmplabJenkins


AmplabJenkins commented Feb 26, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2680/
Test PASSed.

@jpdna


Member

jpdna commented Feb 28, 2018

pinging for comment here

@jpdna


Member

jpdna commented Mar 2, 2018

Any action items for @jpdna here?

* @param pathName The path name to load alignment records from.
* Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param optQueryBinNumLookback Number of partitions to lookback to find beginning of an overlapping


@heuermh

heuermh Mar 5, 2018

Member

optQueryBinNumLookback → optLookbackPartitions ?
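The lookback parameter under discussion exists because a record can start in an earlier bin than the query's start position yet still overlap the queried range. A minimal sketch of which bins a query must scan (illustrative only; ADAM's actual filter pushes this down to the Parquet read):

```scala
// Records starting up to `lookback` bins before the query's first bin may
// still extend into the queried range, so those bins are scanned as well.
// Names are hypothetical stand-ins for the behavior the parameter controls.
def binsToScan(queryStart: Long, queryEnd: Long,
               binSize: Int, lookback: Int): Range.Inclusive = {
  val firstBin = math.max(0L, queryStart / binSize - lookback).toInt
  val lastBin = (queryEnd / binSize).toInt
  firstBin to lastBin
}
```

With a 1,000,000 bp bin size and a lookback of 1, a query starting at 1,500,000 also scans bin 0 in case an alignment beginning there runs past the bin boundary.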


@fnothaft
val partitionedBinSize = getPartitionedBinSize(pathName)
val reads = loadParquetAlignments(pathName)
val datasetBoundAlignmentRecordRDD = if (regions.nonEmpty) {


@heuermh

heuermh Mar 5, 2018

Member

Could this be rewritten as

val alignments = DatasetBoundAlignmentRecordRDD(reads.dataset,
  reads.sequences,
  reads.recordGroups,
  reads.processingSteps,
  true,
  Some(partitionedBinSize),
  optLookbackPartitions
)

if (regions.nonEmpty) alignments.filterByOverlappingRegions(regions) else alignments


@fnothaft
*/
private def getPartitionedBinSize(pathName: String): Int = {
val partitionSizes = getFsAndFilesWithFilter(pathName, new FileFilter("_isPartitionedByStartPos")).map(f => {


@heuermh

heuermh Mar 5, 2018

Member

_isPartitionedByStartPos → _partitionedByStartPos

* If a glob is used, all directories within the glob must be partitioned, and must have been saved
* using the same partitioned bin size. Behavior is undefined if this requirement is not met.
*/
private def getPartitionedBinSize(pathName: String): Int = {


@heuermh

heuermh Mar 5, 2018

Member

getPartitionedBinSize → getPartitionBinSize

val partitionSizes = getFsAndFilesWithFilter(pathName, new FileFilter("_isPartitionedByStartPos")).map(f => {
val is = f.getFileSystem(sc.hadoopConfiguration).open(f)
val partitionSize = is.readInt
is.close()


@heuermh

heuermh Mar 5, 2018

Member

Might want to wrap this IO block in try-with-resources. I hacked up TryWith.scala in another repo, and was thinking of bringing it over for #1719.


@fnothaft

fnothaft Mar 5, 2018

Member

I'm fine with this code as is.

@heuermh perhaps you could move TryWith into bdg-utils and we could modify this as part of the push towards #1719 in 0.25.0?
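The TryWith helper discussed here is the usual loan pattern: open a resource, run a function over it, and close it whether or not the function throws. A minimal sketch, with the signature assumed from the name rather than taken from the actual TryWith.scala:

```scala
import scala.util.{Try, Success, Failure}

// Loan pattern sketch: the resource is always closed, and a failure during
// close is surfaced unless the body already failed. The real TryWith.scala
// referenced in the comments above may differ.
def tryWith[R <: java.io.Closeable, A](resource: => R)(f: R => A): Try[A] =
  Try(resource).flatMap { r =>
    val result = Try(f(r))
    Try(r.close()) match {
      case Failure(e) => result.flatMap(_ => Failure(e))
      case Success(_) => result
    }
  }
```

Applied to the block above, the open/readInt/close sequence would become a single tryWith(...)(_.readInt()) expression.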


@heuermh
getPartitionedBinSize(pathName)
true
} catch {
case e: FileNotFoundException => false


@heuermh

heuermh Mar 5, 2018

Member

Try/catch is probably not the most appropriate here; one could get clever with Success/Failure. I'm not that clever, so I'll have to think a bit to come up with a good suggestion.


@fnothaft

fnothaft Mar 5, 2018

Member

I think this is OK for now, but agree that we should revisit it later.
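The Success/Failure idea floated above could look like the following sketch, where readBinSize is a hypothetical stand-in for the flag-file lookup that getPartitionedBinSize performs:

```scala
import scala.util.Try

// Instead of a try/catch that returns a Boolean, wrap the flag-file read in
// Try and treat any failure (e.g. FileNotFoundException when the
// _isPartitionedByStartPos file is absent) as "not partitioned".
def isPartitioned(readBinSize: String => Int)(pathName: String): Boolean =
  Try(readBinSize(pathName)).isSuccess
```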

* @param filePath Path to save the file at.
*/
private def writePartitionedParquetFlag(filePath: String, partitionSize: Int): Unit = {
val path = new Path(filePath, "_isPartitionedByStartPos")


@heuermh

heuermh Mar 5, 2018

Member

_isPartitionedByStartPos → _partitionedByStartPos

val fs: FileSystem = path.getFileSystem(rdd.context.hadoopConfiguration)
val f = fs.create(path)
f.writeInt(partitionSize)
f.close()
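Outside of Hadoop, the flag protocol above boils down to an int roundtrip: write the bin size into a marker file on save, read it back on load. A local-filesystem sketch using java.io stand-ins for the Hadoop FileSystem calls:

```scala
import java.io.{DataInputStream, DataOutputStream, File, FileInputStream, FileOutputStream}

// Mirrors writePartitionedParquetFlag: record the partition bin size in a
// flag file inside the dataset directory.
def writeFlag(dir: File, binSize: Int): File = {
  val flag = new File(dir, "_isPartitionedByStartPos")
  val out = new DataOutputStream(new FileOutputStream(flag))
  try out.writeInt(binSize) finally out.close()
  flag
}

// Mirrors the getPartitionedBinSize read side: recover the bin size the
// dataset was saved with.
def readFlag(flag: File): Int = {
  val in = new DataInputStream(new FileInputStream(flag))
  try in.readInt() finally in.close()
}
```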


@heuermh

heuermh Mar 5, 2018

Member

Might want to wrap this IO block in try-with-resources, see above.


@fnothaft

fnothaft Mar 5, 2018

Member

Let's go with this for now and defer changing the block until #1719.

sequences: SequenceDictionary) extends NucleotideContigFragmentRDD
sequences: SequenceDictionary,
override val isPartitioned: Boolean = true,
override val optPartitionedBinSize: Option[Int] = Some(1000000),


@heuermh

heuermh Mar 5, 2018

Member

optPartitionedBinSize → optPartitionBinSize

sequences: SequenceDictionary,
override val isPartitioned: Boolean = true,
override val optPartitionedBinSize: Option[Int] = Some(1000000),
override val optQueryLookbackNum: Option[Int] = Some(1)) extends NucleotideContigFragmentRDD


@heuermh

heuermh Mar 5, 2018

Member

optQueryLookbackNum → optLookbackPartitions

@@ -131,6 +131,8 @@ class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with Parqu
var storageLevel: String = "MEMORY_ONLY"
@Args4jOption(required = false, name = "-disable_pg", usage = "Disable writing a new @PG line.")
var disableProcessingStep = false
@Args4jOption(required = false, name = "-save_as_dataset", usage = "EXPERIMENTAL: Use the provided bin size in base pairs to save the data partitioned by genomic range bins using Hive-style partitioning.")


@heuermh

heuermh Mar 5, 2018

Member

How about two args, -partition_by_start_pos Boolean flag and -partition_bin_size Int with a default value?
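The two-option split suggested here could be sketched as plain argument handling (a hypothetical parse, not the actual Args4j declarations the PR later adds):

```scala
// Hypothetical handling of the two suggested options: a boolean flag to turn
// partitioning on, and a bin size with a sensible default, instead of one
// option carrying both meanings.
case class PartitionArgs(partitionByStartPos: Boolean = false,
                         partitionBinSize: Int = 1000000)

def parsePartitionArgs(args: Seq[String]): PartitionArgs =
  args.foldLeft((PartitionArgs(), false)) {
    case ((acc, expectSize), arg) =>
      if (expectSize) (acc.copy(partitionBinSize = arg.toInt), false)
      else arg match {
        case "-partition_by_start_pos" => (acc.copy(partitionByStartPos = true), false)
        case "-partition_bin_size"     => (acc, true)
        case _                         => (acc, false)
      }
  }._1
```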


@fnothaft

@heuermh heuermh requested a review from fnothaft Mar 5, 2018

@fnothaft fnothaft added this to the 0.24.0 milestone Mar 5, 2018

@fnothaft

Few small changes, but very close!

@@ -131,6 +131,8 @@ class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with Parqu
var storageLevel: String = "MEMORY_ONLY"
@Args4jOption(required = false, name = "-disable_pg", usage = "Disable writing a new @PG line.")
var disableProcessingStep = false
@Args4jOption(required = false, name = "-save_as_dataset", usage = "EXPERIMENTAL: Use the provided bin size in base pairs to save the data partitioned by genomic range bins using Hive-style partitioning.")


@fnothaft
maybeSort(maybeCoalesce(genotypes)).saveAsParquet(args)
if (args.partitionedBinSize > 0) {
if (genotypes.sequences.isEmpty) {
log.warn("This dataset is not aligned and therefore will not benefit from being saved as a partitioned dataset")


@fnothaft

fnothaft Mar 5, 2018

Member

Nit: Genotypes are aligned by definition, no? I don't have a strong preference as to what we do here, but this warning just reads a bit odd.

* @param pathName The path name to load alignment records from.
* Globs/directories are supported.
* @param regions Optional list of genomic regions to load.
* @param optQueryBinNumLookback Number of partitions to lookback to find beginning of an overlapping


@fnothaft
val partitionedBinSize = getPartitionedBinSize(pathName)
val reads = loadParquetAlignments(pathName)
val datasetBoundAlignmentRecordRDD = if (regions.nonEmpty) {


@fnothaft
@@ -3051,4 +3242,44 @@ class ADAMContext(@transient val sc: SparkContext) extends Serializable with Log
loadParquetFragments(pathName, optPredicate = optPredicate, optProjection = optProjection)
}
}
/**
* Return integer size of partitions if the specified path of Parquet + Avro files is partitioned.


@fnothaft

fnothaft Mar 5, 2018

Member

Suggest clarifying from:

Return integer size of partitions if the specified path of Parquet + Avro files is partitioned.

to:

Return length of partitions in base pairs if file is partitioned.

Otherwise, it is unclear as to whether this returns the number of partitions or the size of a single partition (and in the latter case, what the "size" means).


@jpdna

jpdna Mar 6, 2018

Member

Done

val partitionSizes = getFsAndFilesWithFilter(pathName, new FileFilter("_isPartitionedByStartPos")).map(f => {
val is = f.getFileSystem(sc.hadoopConfiguration).open(f)
val partitionSize = is.readInt
is.close()


@fnothaft

fnothaft Mar 5, 2018

Member

I'm fine with this code as is.

@heuermh perhaps you could move TryWith into bdg-utils and we could modify this as part of the push towards #1719 in 0.25.0?

getPartitionedBinSize(pathName)
true
} catch {
case e: FileNotFoundException => false


@fnothaft

fnothaft Mar 5, 2018

Member

I think this is OK for now, but agree that we should revisit it later.

val fs: FileSystem = path.getFileSystem(rdd.context.hadoopConfiguration)
val f = fs.create(path)
f.writeInt(partitionSize)
f.close()


@fnothaft

fnothaft Mar 5, 2018

Member

Let's go with this for now and defer changing the block until #1719.

@AmplabJenkins


AmplabJenkins commented Mar 6, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2697/
Test PASSed.

@fnothaft

Looks great! Thanks @jpdna! The one thing I just realized we're missing is user-facing docs (as opposed to API docs). Do you think you'd be able to write 1-2 paragraphs about that for the ADAMContext and GenomicRDD docs?

@@ -131,6 +131,10 @@ class TransformAlignmentsArgs extends Args4jBase with ADAMSaveAnyArgs with Parqu
var storageLevel: String = "MEMORY_ONLY"
@Args4jOption(required = false, name = "-disable_pg", usage = "Disable writing a new @PG line.")
var disableProcessingStep = false
@Args4jOption(required = false, name = "-partition_by_start_pos", usage = "EXPERIMENTAL: Save the data partitioned by genomic range bins based on start pos using Hive-style partitioning.")


@heuermh

heuermh Mar 6, 2018

Member

I would vote for dropping EXPERIMENTAL from these, one could argue that everything in ADAM is experimental until we hit version 1.0 and start semantic versioning (#338, #123).

@@ -74,7 +76,7 @@ private[rdd] object GenomicRDD {
* @see pipe
*
* @param cmd Command to replace references in.
* @param files List of paths to files.
* @param files List of paths to files.f


@heuermh

heuermh Mar 6, 2018

Member

small typo

@heuermh heuermh changed the title from hive-style partitioning to [ADAM-651] Implement Hive-style partitioning by genomic range of Parquet backed datasets Mar 7, 2018

@AmplabJenkins


AmplabJenkins commented Mar 7, 2018

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2709/

Build result: ABORTED

[...truncated 7 lines...]
 > /home/jenkins/git2/bin/git init /home/jenkins/workspace/ADAM-prb # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git --version # timeout=10
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/heads/*:refs/remotes/origin/* # timeout=15
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
 > /home/jenkins/git2/bin/git config --add remote.origin.fetch +refs/heads/*:refs/remotes/origin/* # timeout=10
 > /home/jenkins/git2/bin/git config remote.origin.url https://github.com/bigdatagenomics/adam.git # timeout=10
Fetching upstream changes from https://github.com/bigdatagenomics/adam.git
 > /home/jenkins/git2/bin/git fetch --tags --progress https://github.com/bigdatagenomics/adam.git +refs/pull/*:refs/remotes/origin/pr/* # timeout=15
 > /home/jenkins/git2/bin/git rev-parse origin/pr/1922/merge^{commit} # timeout=10
 > /home/jenkins/git2/bin/git branch -a -v --no-abbrev --contains 281c7f9 # timeout=10
Checking out Revision 281c7f9 (origin/pr/1922/merge)
 > /home/jenkins/git2/bin/git config core.sparsecheckout # timeout=10
 > /home/jenkins/git2/bin/git checkout -f 281c7f96a48edec25e520266772c200def947a09
First time build. Skipping changelog.
Triggering ADAM-prb ? 2.6.2,2.10,2.2.1,centos
Triggering ADAM-prb ? 2.6.2,2.11,2.2.1,centos
Triggering ADAM-prb ? 2.7.3,2.10,2.2.1,centos
Triggering ADAM-prb ? 2.7.3,2.11,2.2.1,centos
ADAM-prb ? 2.6.2,2.10,2.2.1,centos completed with result SUCCESS
ADAM-prb ? 2.6.2,2.11,2.2.1,centos completed with result SUCCESS
ADAM-prb ? 2.7.3,2.10,2.2.1,centos completed with result SUCCESS
ADAM-prb ? 2.7.3,2.11,2.2.1,centos completed with result ABORTED
Notifying endpoint 'HTTP:https://webhooks.gitter.im/e/ac8bb6e9f53357bc8aa8'
Test FAILed.

@fnothaft


Member

fnothaft commented Mar 7, 2018

Jenkins, retest this please.

@AmplabJenkins


AmplabJenkins commented Mar 7, 2018

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/ADAM-prb/2712/
Test PASSed.

@heuermh


Member

heuermh commented Mar 14, 2018

Close in favor of #1948.

@heuermh heuermh closed this Mar 14, 2018
