How to minimise retransmission of objects when I use SSH connections? #346

Open
seonghobae opened this issue Sep 2, 2019 · 1 comment
Labels
Backend API (Part of the Future API that only backend package developers rely on), enhancement, feature request, feature/sticky-globals, globals

Comments

@seonghobae

Hello,
I’m using the future.apply functions over SSH connections with a cluster plan.
However, future_lapply() retransmits the data frame and other objects on every mapping call, even though I use the same datasets and objects and only change the estimation parameters to find the optimal condition.

Pseudo code is here.

data <- mirt::Science
nFactors <- 1:4
future::plan('cluster', workers = paste0('s', 1:2))  # two SSH workers: s1, s2
future.apply::future_lapply(X = nFactors, FUN = function(X, data) {
  mirt::mirt(data = data, model = X)
}, data = data)

After running this code, watch the traffic status: the data appears to be retransmitted on every call, even though I don’t change any of the data used for the parameter estimation.

How can I reduce this data retransmission? It makes HPC computing hard on some VPS providers, which apply QoS limits to every one of my calculations.

Best,
Seongho

@HenrikBengtsson
Collaborator

HenrikBengtsson commented Nov 2, 2019

Short answer: The future framework does not really support "lifetime" global variables, which stems from the design decision that futures are meant to be independent of each other.

Long answer: However, one could imagine that parallel backends, such as PSOCK clusters, that serve multiple futures could hold "lifetime" globals. We can actually already do things such as:

library(future)
cl <- future::makeClusterPSOCK(2)
data <- data.frame(a = 1:3, b = 4:6)
parallel::clusterExport(cl, "data")
plan(cluster, workers = cl, persistent = TRUE)

to export the object data to each parallel worker upfront. So far so good. However, if we were to just do:

y0 <- lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
})

y1 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
})

stopifnot(identical(y1, y0))

we would still export data in each iteration (overwriting the copy we exported manually). To avoid this, one can tell the future framework to ignore data even when it finds it to be a global variable:

y2 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
  sum(data[[name]])
}, future.globals = structure(TRUE, ignore = "data"))

stopifnot(identical(y2, y0))

WARNING: Now, the above is really hacky and should not be used in production. It has two main problems, both of which go against the philosophy of how futures should be used:

  1. It relies on persistent = TRUE for the cluster backend. I don't recommend using that because it is unsafe and risks breaking reproducibility.

  2. The use of future.globals = structure(TRUE, ignore = "data") relies on the data object already existing on the parallel worker. If you change to, say, plan(future.callr::callr), your code will break. So, that is also not recommended; ignore should really only be used to ignore false-positive globals and not the way it is used here.
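A pattern worth noting (this is an addition, not from the original discussion): if the workers can all read a shared path, one can sidestep exporting data entirely by loading it inside FUN, so only a short path string is transmitted per future. The sketch below uses a local tempfile() and plan(multisession) so it is self-contained; with SSH workers you would substitute a path on a filesystem visible to the workers, which is an assumption about your setup.

```r
library(future)
library(future.apply)

plan(multisession, workers = 2)      # local workers share the filesystem

data <- data.frame(a = 1:3, b = 4:6)
path <- tempfile(fileext = ".rds")   # hypothetical stand-in for a shared location
saveRDS(data, path)

y3 <- future_lapply(colnames(data), FUN = function(name, path) {
  d <- readRDS(path)                 # loaded on the worker; only 'path' is exported
  sum(d[[name]])
}, path = path)
```

Each future still reads the file, but from local (or shared) disk rather than being serialized over the SSH connection, and nothing here depends on persistent = TRUE.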

So, the take-home message is that, unfortunately, the future framework does not support what you're asking for as it stands now. However, it might be that we could introduce the concept of "lifetime globals", or "worker globals": objects that one sets up once and, whenever a regular global variable to be exported matches an existing "worker global", the export is skipped. That said, I doubt such a feature will be implemented in the future framework anytime soon.
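To make the "worker globals" idea concrete, here is one way it is sometimes emulated today (an illustration added here, equally hacky, and subject to the same caveats as above): look the object up dynamically with get(), which the static globals scan does not recognize as a reference to data, so nothing is re-exported and FUN finds the copy that clusterExport() placed in each worker's global environment.

```r
library(future)

cl <- makeClusterPSOCK(2)
data <- data.frame(a = 1:3, b = 4:6)
parallel::clusterExport(cl, "data")   # export once, up front
plan(cluster, workers = cl, persistent = TRUE)

## get("data") defeats the static globals scan, so 'data' is not detected
## as a global and is not re-exported per future.  Same warnings as above:
## this relies on persistent = TRUE and on the backend-specific assumption
## that 'data' already exists on every worker.
y4 <- future.apply::future_lapply(colnames(data), FUN = function(name) {
  d <- get("data", envir = globalenv())
  sum(d[[name]])
})

parallel::stopCluster(cl)
```

Like the future.globals = structure(TRUE, ignore = "data") trick, this breaks the moment you switch to a backend where the worker does not already hold data, e.g. plan(future.callr::callr).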

@HenrikBengtsson HenrikBengtsson transferred this issue from futureverse/future.apply Nov 2, 2019
@HenrikBengtsson HenrikBengtsson added Backend API Part of the Future API that only backend package developers rely on enhancement feature request labels Nov 2, 2019