
gcs_load_all() fails to create/load to subdirectories #112

Open
jasonmhoule opened this issue Nov 20, 2019 · 6 comments

Comments

@jasonmhoule
Contributor

I'm trying to load from a GCS bucket on startup (using .First as shared here), and directories and subdirectories are not being loaded.

(Assuming a global bucket/auth/etc.; I can make a more formal reprex if needed, but I believe this will fail across the board):

library(googleCloudStorageR)

dir.create("testdir")
file.create("testdir/testfile.test")

gcs_save_all()

file.remove("testdir/testfile.test")
file.remove("testdir")

gcs_load_all()

I think the offending line is:

mapply(function(x, name){

I believe the recursive argument to file.copy() only takes effect when copying a directory, not when copying file-by-file as the mapply() call does, so the sub-directories are never created.
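
For reference, a minimal base-R sketch of the behaviour I mean (the paths here are made up):

# recursive = TRUE only takes effect when copying a whole directory
# into an existing directory:
dir.create("src/sub", recursive = TRUE)
file.create("src/sub/a.rds")
dir.create("dest")

file.copy("src", "dest", recursive = TRUE)   # creates dest/src/sub/a.rds

# Copying file-by-file, as the mapply() call does, the missing
# sub-directory is never created, so the copy fails and returns FALSE:
file.copy("src/sub/a.rds", "elsewhere/sub/a.rds")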

jasonmhoule added a commit to jasonmhoule/googleCloudStorageR that referenced this issue Nov 20, 2019
@jasonmhoule
Contributor Author

I don't think this has fixed the problem. A couple of issues I'm seeing:

  1. gcs_load_all() depends on the side effect of unzip(), which I moved into the if(list == TRUE) branch. It will need to come out of there. (There may be a better way to get this information using the zip package than how it's currently implemented.)
  2. When updating to zip::zipr(), it looks like gcs_save_all() wasn't updated for the new behaviour of zipr(). Namely, the new function treats every file listed as a top-level entry, so passing the output of list.files() makes everything top-level. It should actually be easier to simply pass the directory to zipr(), which will then recursively build out the file paths so they are there when needed for unzip() (see the sketch after this list).
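
A rough sketch of the difference I mean (the file and zip names are made up, and the listings are approximate):

library(zip)

dir.create("testdir")
file.create("testdir/testfile.test")

# Passing individual files, as the list.files() output would, flattens everything:
zipr("flat.zip", list.files("testdir", full.names = TRUE))
zip_list("flat.zip")$filename
#> "testfile.test"

# Passing the directory itself keeps the relative structure:
zipr("nested.zip", "testdir")
zip_list("nested.zip")$filename
#> "testdir/" "testdir/testfile.test"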

@MarkEdmondson1234
Collaborator

OK, it'll be good to get a reprex that reproduces the bad behaviour for the tests; then I'll look at a solution I used in another package for tar'ing files and see if that applies.

@jasonmhoule
Contributor Author

jasonmhoule commented Dec 13, 2019

Here's a reprex:

I created a VM instance on GCE using the docker.io/rocker/tidyverse image and storage-full scope:

gcloud beta compute --project=<proj_id> instances create-with-container <instance_name> --zone=us-central1-a --machine-type=n1-standard-1 --subnet=default --network-tier=PREMIUM --maintenance-policy=MIGRATE --service-account=652580964228-compute@developer.gserviceaccount.com --scopes=storage-full,logging-write,monitoring-write,service-control,service-management,trace --tags=http-server --image=cos-stable-78-12499-59-0 --image-project=cos-cloud --boot-disk-size=10GB --boot-disk-type=pd-standard --boot-disk-device-name=instance-1 --container-image=docker.io/rocker/tidyverse --container-restart-policy=always --container-privileged --container-stdin --container-tty --container-env=USER=username,PASSWORD=password --labels=container-vm=cos-stable-78-12499-59-0 --metadata=google-logging-enabled=true

On startup I install the googleCloudStorageR package and create a YAML file pointing to an existing bucket to use for autosave/load:

install.packages("googleCloudStorageR", repos = "http://cloud.r-project.org")

----- _gcssave.yaml
# The GCS bucket to save/load R workspace from
bucket: <bucket_name>

Then I run:

library(googleCloudStorageR)

dir.create("testdir")
dir.create("testdir/testdir2")
one <- c(1)
two <- c(1, 2)
three <- c(1, 2, 3)
saveRDS(one, "one.rds")
saveRDS(two, "testdir/two.rds")
saveRDS(three, "testdir/testdir2/three.rds")

gcs_last()

unlink("testdir", recursive = T)
file.remove("one.rds")

gcs_first()

For this, I would expect the same set of files and directories that were set up and then deleted to show up again (in the working directory, where they were created and saved from) after running gcs_first(). In fact, nothing new is created in my working directory.
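
A quick way to check, assuming the reprex above (ignoring _gcssave.yaml and anything else already in the directory):

list.files(recursive = TRUE)
#> expected: "one.rds" "testdir/testdir2/three.rds" "testdir/two.rds"
#> observed: none of these reappear after gcs_first()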

@jasonmhoule
Contributor Author

The above is actually related to gcs_first() and not gcs_load_all() (though the former relies on the latter). It's worth noting that if instead of the last line I run:

gcs_global_bucket("<bucket_name>")
gcs_load_all()

This also doesn't create anything new in my working directory.

I can try to start looking into it but thought I would at least post a reprex for starters.

@MarkEdmondson1234
Collaborator

Great thanks, will take a look

@jasonmhoule
Contributor Author

jasonmhoule commented Dec 20, 2019

The pull request above reverts to using zip::zip() for gcs_save_all(). The following is from the documentation for zip::zip():

Relative paths
The difference between zipr and zip is how they handle the relative paths of the input files.

For zip (and zip_append), the root of the archive is supposed to be the current working directory. The paths of the files are fully kept in the archive. Absolute paths are also kept. Note that this might result in non-portable archives: some zip tools do not handle zip archives that contain absolute file names, or file names that start with ../ or ./. This behavior is kept for compatibility, and we suggest that you use zipr and zipr_append for new code.

E.g. for the following directory structure:

foo
|-bar
|--file1
|-bar2
|--file2
foo2
|-file3

Assuming the current working directory is foo, the following zip entries are created by zip:

zip("x.zip", c("bar/file1", "bar2", "../foo2"))
zip_list("x.zip")$filename
#> bar/file1
#> bar2
#> bar2/file2
#> ../foo2
#> ../foo2/file3

For zipr (and zipr_append), each specified file or directory in files is created as a top-level entry in the zip archive. We suggest that you use zipr and zipr_append for new code, as they don't create non-portable archives. For the same directory structure, these zip entries are created:

zipr("x.zip", c("bar/file1", "bar2", "../foo2"))
zip_list("x.zip")$filename
#> file1
#> bar2
#> bar2/file2
#> foo2
#> foo2/file3

Since gcs_save_all() uses a list of relative paths, we should not have to worry about issues with zip and absolute paths (although the docs note that "Absolute paths are also kept").

The problem with using zipr is that if we provide relative paths to that function, the files all become top-level and we lose the directory structure (which is where this issue started). We need to provide the full list of relative paths to preserve the use of pattern to select/deselect files in _gcssave.yaml, which I assume is important functionality for memory management. The alternative would be to provide top-level paths to zipr and let it include the directory structure by recursion, but in that case I don't know how we avoid breaking pattern.
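
As a sketch of the first approach (this is not the actual gcs_save_all() implementation, and "mypattern" is a made-up value standing in for the yaml pattern): filter the full list of relative paths by pattern and hand that list to zip::zip(), which keeps the paths as given, so unzip() can recreate the directories.

# Build the full list of relative paths, apply the pattern filter,
# and let zip::zip() keep the paths (and hence the structure) as-is:
files <- list.files(recursive = TRUE)
files <- files[grepl("mypattern", files)]
zip::zip("backup.zip", files)
zip::zip_list("backup.zip")$filename   # relative paths are preserved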
