
Parallel segmentation #19

Closed
ilarischeinin opened this issue Jan 19, 2016 · 10 comments

@ilarischeinin
Member

segmentBins() currently runs serially and can be very slow. I'm working on a parallel implementation (using the parallel package) that should give a nice speedup.

It's in branch "parallel-segmentation" of my fork.

@HenrikBengtsson
Collaborator

I was just about to suggest a similar thing. I highly recommend that you consider PSCBS::segmentByCBS(), which has supported automagic parallel processing since PSCBS 0.60.0 (2015-11-17). Then you don't have to do anything but change from using DNAcopy::segment() to PSCBS::segmentByCBS() (which uses the former internally).
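As a hedged sketch (the inputs y, chromosome, and x below are hypothetical stand-ins for whatever QDNAseq extracts per sample; the argument names follow the PSCBS documentation), the swap amounts to something like:

```r
library("PSCBS")

# Hypothetical per-sample inputs: copy-number log-ratios (y) plus
# chromosome and genomic position for each bin
fit <- segmentByCBS(y, chromosome = chromosome, x = x)
```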

PSCBS uses the future package as the backend, and that allows users to use whatever parallel backend they want with a single change in settings, e.g.

future::plan("multicore")

On Windows, where R does not support multicore processing, you can use multiple background R sessions instead:

future::plan("multisession")

It also allows you to run things on a cluster, etc. I'm finally going all in everywhere with futures; cf. www.aroma-project.org/howtos/parallel_processing/

If you want to roll your own, I still highly recommend that you look at the future package. Really.

@HenrikBengtsson
Collaborator

Ah... I see from the commit notes above that you've already used parallel::mclapply(). With the future package, you can replace the two parallel::mclapply() and lapply() calls with a single flapply():

library("future")

flapply <- function(x, FUN, ...) {
  res <- list()
  # Create one future per element; how they are resolved is decided by plan()
  for (ii in seq_along(x)) res[[ii]] <- future(FUN(x[[ii]], ...))
  names(res) <- names(x)
  values(res)  # block until all futures are resolved and collect the results
}

If a user uses plan("multicore") or plan("multisession"), the above will automatically detect how many cores are available/assigned (see ?future::availableCores). Again, another user may use plan("cluster", cluster=cl) where cl <- makeCluster(...), and so on. The code works the same regardless of backend.

This way you don't have to hard-code multicore-only processing (cf. Windows users or cluster users). You can also get rid of explicit mc.cores arguments, which I think belong to the internals and are not something that should be exposed in every single function.
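For illustration, a hedged usage sketch: the same flapply() call works unchanged whatever plan the user picks (samples and segmentOneSample() are hypothetical placeholders here):

```r
library("future")
plan("multisession")  # or plan("multicore"), or plan("cluster", cluster = cl)

# Hypothetical: apply a per-sample segmentation worker across all samples;
# only the plan() line above changes between backends
fits <- flapply(samples, segmentOneSample)
```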

@HenrikBengtsson
Collaborator

FYI, in the next release of future, instructions will only need to mention:

plan(multiprocess)

which will use multicore processing if supported, otherwise multisession.

@ilarischeinin
Member Author

You should've been here earlier! I just looked at PSCBS::segmentByCBS() and I think it could've been worth it even just for allowing you to use segment medians instead of means.

Anyway, as you saw, I implemented it via parallel and forking. When one wants to add support for Windows (or clusters), going with your suggestion could make sense.

@HenrikBengtsson
Collaborator

Have a look at ilarischeinin#1 so you see what it takes.

@ilarischeinin
Member Author

@daoud-sie, can you take a look at the discussion over there? Since you're the package maintainer, I think it's your call which option to take:

  1. parallel, part of base R so no external dependencies, current code (in my PR) adds parallelization for OS X and Linux, but Windows support would need some more work
  2. Henrik's future, which would add a dependency, but in addition to OS X and Linux, would add support for Windows, and also bigger clusters, for absolutely free

If you say 2, I'll merge Henrik's PR, which will then automatically include the changes in my PR.

@HenrikBengtsson
Collaborator

Just to reiterate: the future package is really lightweight by design, easy and quick to install everywhere, and it will remain so. Although I'm biased, by using the future package the code will be cleaner and easier to maintain, with much less if-this-then-that-otherwise-this coding.

I'd also like to point out that the future package will also support full control of how nested futures are evaluated. For instance, in the QDNAseq case you can imagine processing each sample on a separate machine and then each chromosome in a separate process. The syntax for controlling this would be something like plan(list(cluster, multiprocess)). I've got an issue set up for discussing this at HenrikBengtsson/future#37
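A minimal sketch of that nested setup, assuming the proposed syntax lands as described (the sample/chromosome split is illustrative, not QDNAseq code):

```r
library("future")
# Outer level: one future per sample, dispatched to separate cluster machines.
# Inner level: one future per chromosome, run as local processes on each machine.
plan(list(cluster, multiprocess))
```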

@HenrikBengtsson
Collaborator

FYI, future 0.12.0 is now on CRAN. Regardless of OS, everyone can now use future::plan("multiprocess") for parallel processing, cf. http://www.aroma-project.org/howtos/parallel_processing/

@ilarischeinin
Member Author

Thanks!

@HenrikBengtsson
Collaborator

Another update: If you have access to a cluster or similar, you can use the future.BatchJobs package (now public on GitHub) to automatically do the segmentation on the cluster:

library("future.BatchJobs")
plan(batchjobs)

This requires regular BatchJobs configuration, which is quite straightforward.
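For example, a minimal ~/.BatchJobs.R for a Torque/PBS cluster might look like this (the template file name is an assumption; see the BatchJobs documentation for your scheduler):

```r
# ~/.BatchJobs.R -- picked up automatically by BatchJobs at load time
cluster.functions <- makeClusterFunctionsTorque("torque.tmpl")
```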

If you have an ad-hoc cluster (ssh w/ key-pair login but no fancy Slurm/PBS scheduler) you can use what's already available in the future package, e.g.

library("future")
cl <- parallel::makeCluster(c("machine2", "machine5", "machine6", "machine6", "machine9"))
plan(cluster, cluster=cl)

daoud-sie pushed a commit that referenced this issue Apr 11, 2017
* master: (351 commits)
  Fix noisePlot() for paired end data
  Bump R version number dependency (to what IRanges already requires)
  Add option to specify random seeds
  Bump development version number to 1.7.3
  Make package future optional
  Update vignette to use BiocStyle
  Add base package imports to fix Travis NOTEs
  Fix travis package installs
  Update NEWS, fix #18
  Update NEWS, close #20
  Move calculation of expected variance to its own function
  Smarter handling of user-provided cutoff values
  Grammar
  Deprecating argument 'ncpus' [#19]
  Fix newline in verbose messages
  Using futures for parallel processing [#19]
  Update NEWS. Close #19
  Add parallel loess correction estimation
  Add homozygous deletions and amplifications to cutoff calling
  Implement parallel segmentation also when using smoothing
  ...

From: Daoud Sie <daoud@Daouds-MacBook-Air.local>

git-svn-id: https://hedgehog.fhcrc.org/bioconductor/trunk/madman/Rpacks/QDNAseq@113827 bc3139a8-67e5-0310-9ffc-ced21a209358
HenrikBengtsson pushed a commit that referenced this issue Aug 31, 2019