Predicting on tiled rasters, in parallel #93
---
Just eliminating the obvious, but are you running this line:
...before doing your calculations? That line deletes all the VMs you started. :) It's meant to go after you've done all your stuff.
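For reference, the shutdown belongs after the results come back - a sketch using the object names from the script later in this thread:

```r
## run the parallel job first...
result <- future_lapply(o, my_single_function)

## ...and only then delete the VMs
lapply(vms, FUN = gce_vm_stop)
```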
---
Always good to eliminate the obvious first! But no, I run that after I should be done with the VMs. The code below that line is just me running the function locally to ensure it does in fact work.
---
Ok, moving on then :) Are you sure the installation of the package on the VMs actually worked? I would suggest it's better to install it via a custom Docker image.
---
I agree. I have struggled a bit understanding the Docker code. If I want to install raster and future, would my Dockerfile look like this? Modified from https://cloudyr.github.io/googleComputeEngineR/articles/docker.html#dockerfiles
Or like this:
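For what it's worth, a minimal sketch of such a Dockerfile might look like the following (the choice of base image is an assumption; `install2.r` ships with the rocker images):

```dockerfile
FROM rocker/r-base

## install the R packages the cluster workers need from CRAN
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    raster \
    future
```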
---
The first is, I think, easier to maintain, and takes advantage of the prebuilt base image. An example from here:

```dockerfile
FROM rocker/tidyverse
MAINTAINER Mark Edmondson (r@sunholo.com)

# install R package dependencies
# only needed if the R packages in the second RUN need them
RUN apt-get update && apt-get install -y \
    ## here you would add any unix dependencies needed by the R packages below
    && apt-get clean \
    && rm -rf /var/lib/apt/lists/ \
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds

## Install packages from CRAN
RUN install2.r --error \
    -r 'http://cran.rstudio.com' \
    googleAuthR \
    googleComputeEngineR \
    googleAnalyticsR \
    searchConsoleR \
    googleCloudStorageR \
    bigQueryR \
    ## install GitHub packages
    && installGithub.r MarkEdmondson1234/youtubeAnalyticsR \
       MarkEdmondson1234/googleID \
    ## clean up
    && rm -rf /tmp/downloaded_packages/ /tmp/*.rds
```

But you may not need any extra dependencies, so I would first launch just one Docker container with your script to test if it works ok. Once it works on one, you can add the complication of multiple VMs.
|
OK, so I have edited the Docker code below. It is not clear to me whether I can just call this as a text file, or if I need to upload it to Docker. I first tried to upload to the Google Container Registry as per the instructions here (https://cloudyr.github.io/googleComputeEngineR/articles/docker.html), where it appeared I could make a trigger, upload my Docker code and then link to this file. But I had trouble setting up the trigger; the mirror never seemed to complete between Google Cloud and GitHub. I tried a second approach: logged onto Docker and created a repository, but it wasn't clear how to upload the code below. So in short, I appear not to be succeeding in my Docker endeavour.
---
The link between the Dockerfile and GitHub is the build trigger setting; I think you are just missing that step - https://cloud.google.com/container-builder/docs/running-builds/automate-builds
---
I have set up what I think is my docker image (here), which I set up in the Google Container Registry. I added the Dockerfile as shown here. But I am still not getting the raster package loaded onto my VMs.
---
Ah, it strikes me that although you installed the raster package on the image, your function isn't explicitly loading it - specify the function using the package namespace (e.g. raster::predict). The function you send up you can think of as only existing in the Docker image's environment, not the script you are sending it from. So your function will also need to send up the data it's computing upon:
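Reconstructed from the working script later in the thread, the shape of what's being suggested:

```r
## reference raster via its namespace: the function runs inside the
## Docker image's environment, not your local session
my_single_function <- function(x, r_split, mod){
  raster::predict(r_split[[x]], mod)
}

## pass r_split and mod as arguments so they are shipped to the workers
save3 <- future_lapply(1:2, my_single_function, r_split = r_split, mod = mod)
```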
---
Thanks for your dedicated assistance and patience, Mark! The VM still seems to be having issues finding the "raster" package, so it must be an issue with my docker image (here)? This is just a text file I uploaded to GitHub then linked with Container Registry as shown above. Can I check from googleComputeEngineR whether raster has been loaded?
---
Your build trigger is probably not working correctly - looking at your Docker repo, the Docker file needs to be called exactly "Dockerfile".
---
I feel like I am very close, just not quite there. Edited the name to "Dockerfile" here.
---
I just forked the Dockerfile and am building it so I can reproduce and hopefully fix it.
---
Found that although we create the VMs with your image, we're not using it in the plan cluster, so it is still defaulting back to the base image. The fix:

```r
plan(cluster, workers = as.cluster(
  vms,
  docker_image = gce_tag_container("raster_docker", project = "LambEcoResearch"),
  rscript = c("docker", "run", "--net=host",
              gce_tag_container("raster_docker", project = "LambEcoResearch"),
              "Rscript")))
```
---
oooo, good catch! I assume that I edited your above script correctly to represent my Dockerfile?
It's throwing a capitalization error now - did you not get that error?
---
Yes, that is a case of updating your build trigger name. I'm just working on it now, so I'll hopefully post a working example by end of today. In Container Registry you can look at your build history to see if the builds succeeded ok.
---
OK, will check back in tomorrow. Thank you! I've changed all names in Container Registry to lower case. The only two upper-case strings are my project name "LambEcoResearch" and the name of "Dockerfile" from GitHub. When I try to run the trigger from within the Container Registry, it fails, but I'm not sure whether that is expected or not. Once again, really appreciate your help on this.
---
This is working up to some error about connections, but I have verified that raster is installed on the cluster:

```r
# Test to split up a raster and predict
library(raster)
library(googleComputeEngineR)
library(future)
library(SpaDES.tools)

gce_global_project("xxxx")

## create raster
row <- 8
col <- 8
r <- raster(nrows = row, ncols = col, xmn = 0, xmx = row, ymn = 0, ymx = col,
            vals = c(1:(row * col)))
plot(r)

## split
r_split <- splitRaster(r, nx = 2, ny = 2)

## create model
df <- data.frame(y = c(1:10), layer = c(1:5, 7, 6, 8:10))
mod <- glm(y ~ layer, data = df)

## auto auth to GCE via environment file arguments
## create VM names
vm_names <- paste0("my-server", 1:2)

## make sure VMs won't get shut off
preemptible <- list(preemptible = FALSE)

my_image <- gce_tag_container("your-docker-name", project = "xxxx")

## start up VMs with R base on them (can also customise via Dockerfiles
## using gce_vm_template instead)
vms <- lapply(vm_names, gce_vm,
              predefined_type = "n1-standard-1",
              template = "r-base",
              scheduling = preemptible,
              dynamic_image = my_image)

## add any ssh details, username etc.
vms <- lapply(vms, gce_ssh_setup)

## once all launched, add to cluster
plan(cluster, workers = as.cluster(
  vms,
  docker_image = my_image,
  rscript = c("docker", "run", "--net=host",
              my_image,
              "Rscript")))

## the action you want to perform via the cluster
my_single_function <- function(x, r_split, mod){
  raster::predict(r_split[[x]], mod)
}

## parallel
save3 <- future_lapply(1:2, my_single_function, r_split = r_split, mod = mod)
#> Error: cannot open the connection
```
---
Hmmm, I set up my build trigger config the same as yours, but I am getting capitalization errors. Haven't made it to the connection issues yet.
|
---
Ok, the source of my connection error is that the raster package writes files to the working directory (in .grd files), so the data needs to be pulled into memory before being sent to the workers. I'm not familiar with the package so I'm not sure the best way to do that, but you can see the file locations in the test r_split object:

```r
r_split
[[1]]
class       : RasterLayer
dimensions  : 4, 4, 16 (nrow, ncol, ncell)
resolution  : 1, 1 (x, y)
extent      : 0, 4, 0, 4 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
data source : /Users/mark/dev/R/xxx/layer/layer_tile1.grd
names       : layer
values      : 33, 60 (min, max)

[[2]]
class       : RasterLayer
dimensions  : 4, 4, 16 (nrow, ncol, ncell)
resolution  : 1, 1 (x, y)
extent      : 0, 4, 4, 8 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
data source : /Users/mark/dev/R/xxx/layer/layer_tile2.grd
names       : layer
values      : 1, 28 (min, max)

[[3]]
class       : RasterLayer
dimensions  : 4, 4, 16 (nrow, ncol, ncell)
resolution  : 1, 1 (x, y)
extent      : 4, 8, 0, 4 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
data source : /Users/mark/dev/R/xxxx/layer/layer_tile3.grd
names       : layer
values      : 37, 64 (min, max)

[[4]]
class       : RasterLayer
dimensions  : 4, 4, 16 (nrow, ncol, ncell)
resolution  : 1, 1 (x, y)
extent      : 4, 8, 4, 8 (xmin, xmax, ymin, ymax)
coord. ref. : +proj=longlat +datum=WGS84 +ellps=WGS84 +towgs84=0,0,0
data source : /Users/mark/dev/R/xxx/layer/layer_tile4.grd
names       : layer
values      : 5, 32 (min, max)
```
---
You can pull each of those RasterLayer objects in the list out of the working directory and into memory using readAll().
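i.e., something like this, using the r_split list from the printout above:

```r
## readAll() forces each tile's values from the .grd file on disk
## into memory, so the objects can be serialised to the workers
o <- lapply(r_split, raster::readAll)
```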
|
---
Thanks for that, got it working! The full working script:

```r
# Test to split up a raster and predict
library(raster)
library(googleComputeEngineR)
library(future)
library(SpaDES.tools)

gce_global_project("xxxx")

## create raster
row <- 8
col <- 8
r <- raster(nrows = row, ncols = col, xmn = 0, xmx = row, ymn = 0, ymx = col,
            vals = c(1:(row * col)))
plot(r)

## split
r_split <- splitRaster(r, nx = 2, ny = 2)

## create model
df <- data.frame(y = c(1:10), layer = c(1:5, 7, 6, 8:10))
mod <- glm(y ~ layer, data = df)

## auto auth to GCE via environment file arguments
## create VM names
vm_names <- paste0("my-server", 1:2)

## made via build triggers: a custom Docker image with raster installed
my_image <- gce_tag_container("raster", project = "xxxx")

## start up VMs with the custom Dockerfile
vms <- lapply(vm_names, gce_vm,
              predefined_type = "n1-standard-1",
              template = "r-base",
              dynamic_image = my_image)

## add any ssh details, username etc.
vms <- lapply(vms, gce_ssh_setup)

## once all launched, add to cluster with the custom Dockerfile
## use plan(sequential) for local testing
plan(cluster, workers = as.cluster(
  vms,
  docker_image = my_image,
  rscript = c("docker", "run", "--net=host",
              my_image,
              "Rscript")))

## make the vector of stuff to send to the nodes
o <- lapply(r_split, readAll)

## the action you want to perform on the elements in the cluster
my_single_function <- function(x){
  raster::predict(x, mod)
}

## parallel - working!
result <- future_lapply(o, my_single_function)

## tidy up
lapply(vms, FUN = gce_vm_stop)
```
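To finish the original goal of reassembling the predicted tiles into one raster, raster's merge() should do it - a sketch, not run here:

```r
## stitch the predicted tiles back into a single RasterLayer;
## do.call passes each element of `result` as a separate argument
predicted <- do.call(raster::merge, result)
```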
---
If you don't mind, I'll add this example to the documentation?
---
Thank you, Mark! Very exciting! Please do add it to the documentation. To ensure it works for others, we may want to figure out my capitalization error first, as I still can't get past it.
|
---
What's the project ID on the Google console home screen? I'm surprised if it allows uppercase in that; perhaps you are using the project name instead?
---
Yep, you need the project ID.
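i.e., it is the lowercase ID that goes into the project setting - a sketch with a placeholder ID:

```r
## use the project ID shown on the console home screen,
## not the (possibly uppercase) display name
googleComputeEngineR::gce_global_project("my-project-id")
```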
---
Very good!! It seems I need to allow permission somehow.
|
---
That will happen if your VMs are in a different project than your Docker images, which may be the case if your project IDs were different.
---
Oh, that's an old bug, but I thought it was fixed in the GitHub version. If not, sadly the easiest fix is to make a new project without numbers in the ID.
---
I'm not 100% positive it will work if you start the VMs using gcloud, since the R launcher customises the VM - one of those customisations is to allow auth with all cloud services, so that may be it.
---
I made and authenticated a new project with name and ID = lambspatialgrid. However, I still get a similar error:
|
---
You have to be so close! Is the raster build trigger on the same project too? Just to check, I have put the raster image I built in the public directory, which should have no authentication problems, so it is available via:

```r
googleComputeEngineR::gce_tag_container("raster", project = "gcer-public")
#> [1] "gcr.io/gcer-public/raster"
```

But part of the power is having your own private images, so this is a workaround.
---
This is working!! Thank you!!
---
Great, thanks for your patience in getting it up and running. |
---
My goal is to speed up model predictions on large RasterLayers. I had hoped I could break a RasterLayer into many tiles, predict the model on each of these smaller tiles in parallel using googleComputeEngineR, then reassemble.
However, I seem to be stumbling. Below I provide a simplified, reproducible example. It seems that the function fails to find ".filetype" after loading the raster package. I can run the function with no issues locally, as shown at the end of the script.