
Parallel segmentation #19

Closed
ilarischeinin opened this issue Jan 19, 2016 · 10 comments

@ilarischeinin
Member

segmentBins() currently runs serially and can be very slow. I'm working on a parallel implementation (using the parallel package) that should give a nice speedup.

It's in branch "parallel-segmentation" of my fork.

@HenrikBengtsson
Collaborator

I was just about to suggest a similar thing. I highly recommend that you consider PSCBS::segmentByCBS(), which has supported automagic parallel processing since PSCBS 0.60.0 (2015-11-17). Then you don't have to do anything but change from using DNAcopy::segment() to PSCBS::segmentByCBS() (which uses the former internally).
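As a hedged sketch (the inputs y, chromosome, and x below are hypothetical stand-ins for whatever QDNAseq extracts per sample; the argument names follow the PSCBS documentation), the swap amounts to something like:

```r
library("PSCBS")

# Hypothetical per-sample inputs: copy-number log-ratios (y) plus
# chromosome and genomic position for each bin
fit <- segmentByCBS(y, chromosome = chromosome, x = x)
```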

PSCBS uses the future package as the backend, and that allows users to use whatever parallel backend they want with a single change in settings, e.g.

future::plan("multicore")

On Windows, where R does not support multicore processing, you can use multiple background R sessions instead:

future::plan("multisession")

It also allows you to run things on a cluster, etc. I'm finally going all in everywhere with futures; cf. www.aroma-project.org/howtos/parallel_processing/

If you want to roll your own, I still highly recommend that you look at the future package. Really.

@HenrikBengtsson
Collaborator

Ah... I see from the commit notes above that you've already used parallel::mclapply(). With the future package, you can replace the two parallel::mclapply() and lapply() calls with a single flapply():

library("future")

flapply <- function(x, FUN, ...) {
  res <- list()
  # Create one future per element; how they are resolved is decided by plan()
  for (ii in seq_along(x)) res[[ii]] <- future(FUN(x[[ii]], ...))
  names(res) <- names(x)
  values(res)  # block until all futures are resolved and collect the results
}

If a user uses plan("multicore") or plan("multisession"), the above will automatically detect how many cores are available/assigned (see ?future::availableCores). Again, another user may use plan("cluster", cluster=cl) where cl <- makeCluster(...), and so on. The code works the same regardless of backend.

This way you don't have to hard-code multicore-only processing (cf. Windows users or cluster users). You can also get rid of explicit mc.cores arguments, which I think belong to the internals and are not something that should be exposed in every single function.
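For illustration, a hedged usage sketch: the same flapply() call works unchanged whatever plan the user picks (samples and segmentOneSample() are hypothetical placeholders here):

```r
library("future")
plan("multisession")  # or plan("multicore"), or plan("cluster", cluster = cl)

# Hypothetical: apply a per-sample segmentation worker across all samples;
# only the plan() line above changes between backends
fits <- flapply(samples, segmentOneSample)
```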

@HenrikBengtsson
Collaborator

FYI, in the next release of future, instructions will only need to mention:

plan(multiprocess)

which will use multicore processing if supported, otherwise multisession.

@ilarischeinin
Member Author

You should've been here earlier! I just looked at PSCBS::segmentByCBS() and I think it could've been worth it even just for allowing you to use segment medians instead of means.

Anyway, as you saw, I implemented it via parallel and forking. When one wants to add support for Windows (or clusters), going with your suggestion could make sense.

@HenrikBengtsson
Collaborator

Have a look at ilarischeinin#1 so you see what it takes.

@ilarischeinin
Member Author

@daoud-sie, can you take a look at the discussion over there? Since you're the package maintainer, I think it's your call which option to take:

  1. parallel, part of base R so no external dependencies, current code (in my PR) adds parallelization for OS X and Linux, but Windows support would need some more work
  2. Henrik's future, which would add a dependency, but in addition to OS X and Linux, would add support for Windows, and also bigger clusters, for absolutely free

If you say 2, I'll merge Henrik's PR, which will then automatically include the changes in my PR.

@HenrikBengtsson
Collaborator

Just to reiterate: the future package is really lightweight by design, easy and quick to install everywhere, and it will remain so. Although I'm biased, by using the future package the code will be cleaner and easier to maintain, with much less if-this-then-that-otherwise-this coding.

I'd also like to point out that the future package will also support full control of how nested futures are evaluated. For instance, in the QDNAseq case you can imagine processing each sample on a separate machine and then each chromosome in a separate process. The syntax for controlling this would be something like plan(list(cluster, multiprocess)). I've got an issue set up for discussing this at HenrikBengtsson/future#37
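A minimal sketch of that nested setup, assuming the proposed syntax lands as described (the sample/chromosome split is illustrative, not QDNAseq code):

```r
library("future")
# Outer level: one future per sample, dispatched to separate cluster machines.
# Inner level: one future per chromosome, run as local processes on each machine.
plan(list(cluster, multiprocess))
```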

@HenrikBengtsson
Collaborator

FYI, future 0.12.0 is now on CRAN. Regardless of OS, everyone can now use future::plan("multiprocess") for parallel processing, cf. http://www.aroma-project.org/howtos/parallel_processing/

@ilarischeinin
Member Author

Thanks!

@HenrikBengtsson
Collaborator

Another update: If you have access to a cluster or similar, you can use the future.BatchJobs package (now public on GitHub) to automatically do the segmentation on the cluster:

library("future.BatchJobs")
plan(batchjobs)

This requires regular BatchJobs configuration, which is quite straightforward.
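For example, a minimal ~/.BatchJobs.R for a Torque/PBS cluster might look like this (the template file name is an assumption; see the BatchJobs documentation for your scheduler):

```r
# ~/.BatchJobs.R -- picked up automatically by BatchJobs at load time
cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")
```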

If you have an ad-hoc cluster (ssh w/ key-pair login but no fancy Slurm/PBS scheduler) you can use what's already available in the future package, e.g.

library("future")
cl <- parallel::makeCluster(c("machine2", "machine5", "machine6", "machine6", "machine9"))
plan(cluster, cluster=cl)

daoud-sie pushed a commit that referenced this issue Apr 11, 2017
* master: (351 commits)
  Fix noisePlot() for paired end data
  Bump R version number dependency (to what IRanges already requires)
  Add option to specify random seeds
  Bump development version number to 1.7.3
  Make package future optional
  Update vignette to use BiocStyle
  Add base package imports to fix Travis NOTEs
  Fix travis package installs
  Update NEWS, fix #18
  Update NEWS, close #20
  Move calculation of expected variance to its own function
  Smarter handling of user-provided cutoff values
  Grammar
  Deprecating argument 'ncpus' [#19]
  Fix newline in verbose messages
  Using futures for parallel processing [#19]
  Update NEWS. Close #19
  Add parallel loess correction estimation
  Add homozygous deletions and amplifications to cutoff calling
  Implement parallel segmentation also when using smoothing
  ...

From: Daoud Sie <daoud@Daouds-MacBook-Air.local>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/QDNAseq@113827 bc3139a8-67e5-0310-9ffc-ced21a209358
HenrikBengtsson pushed a commit that referenced this issue Aug 31, 2019