Load .RDS files directly into environment gcs_get_object? #146

Closed
samuel-marsh opened this issue Apr 22, 2021 · 15 comments

Comments

@samuel-marsh

Hi,

This might be a naive question and I might be missing something, but I am wondering whether there is a way to load a file saved as .RDS from a GCP bucket directly into the local R environment without saving it to disk first.

I have been trying this with objects created by the single-cell analysis package Seurat, which creates S4 class objects (for more information on the Seurat object format, see https://github.com/mojaveazure/seurat-object and https://github.com/satijalab/seurat/wiki).

When I run:

obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS")

It loads into the environment as a "Raw" object that is then unreadable by Seurat. If I add saveToDisk = "obj.RDS" and then read the file into R with readRDS (or the wrapper read_rds), it works just fine and is readable by Seurat.

I am wondering whether there is an additional parameter that I am missing that would allow this, or, if not, whether this is a feature that could be added.

Thanks!
Sam

@MarkEdmondson1234
Collaborator

Yes you can supply a custom parse function to load the object directly into R. You would want something like readRDS().

All the downloads write to disk at least temporarily so it's not more efficient, but a lot more convenient:)

@samuel-marsh
Author

Hi Mark,

Thanks for the quick response. This must be the part I'm not quite understanding, because when I run:

obj <- gcs_get_object(object_name = "gs://bucket_name/obj.RDS", parseFunction = readRDS())

I get an error that the parsing failed.

Thanks!
Sam
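
For readers following along, one thing worth separating out is the difference between passing a function and passing a call. The minimal sketch below uses a hypothetical stand-in, fetch_and_parse(), which is not part of googleCloudStorageR and only mimics the "hand the payload to a parser" convention; rawToChar is used as a trivial parser. Note that in the real package the parser receives the HTTP response rather than a file path, so this sketch only illustrates the calling convention and does not by itself make readRDS work as a parseFunction.

```r
# Hypothetical stand-in for a downloader that hands its payload to a parser.
# It is NOT part of googleCloudStorageR; it only mimics the calling convention.
fetch_and_parse <- function(payload, parseFunction) {
  parseFunction(payload)
}

payload <- charToRaw("hello from the bucket")

# Passing the function object works: the downloader calls it on the payload.
ok <- fetch_and_parse(payload, parseFunction = rawToChar)

# Writing rawToChar() supplies a call with no arguments instead of a function,
# so an error is raised as soon as the argument is evaluated.
bad <- tryCatch(
  fetch_and_parse(payload, parseFunction = rawToChar()),
  error = function(e) "error"
)
```

Here ok holds the decoded text, while bad holds the placeholder string returned by the error handler.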

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Apr 23, 2021

Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:

my_parse <- function(obj) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")

I will look into whether this can be improved :)

@MarkEdmondson1234
Collaborator

MarkEdmondson1234 commented Apr 23, 2021

Rich Fergie found the right functions for parsing RDS without needing to save to disk for you: https://twitter.com/RichardFergie/status/1385531335423447040

f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}
gcs_get_object("obj.rds", parseFunction = f)
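
The in-memory part of that function can be tried without any bucket access. The sketch below is only an illustration using base R: it writes a small object with saveRDS(), reads the file back as raw bytes (standing in for the downloaded content), and parses those bytes through the same gzcon(rawConnection(...)) chain. The helper name parse_rds_bytes is made up for this example.

```r
# Build some RDS bytes locally; these stand in for a downloaded response body.
x <- list(a = 1:3, b = "text")
tmp <- tempfile(fileext = ".rds")
saveRDS(x, tmp)  # gzip-compressed by default
bytes <- readBin(tmp, what = "raw", n = file.size(tmp))
unlink(tmp)

# Hypothetical helper: parse RDS content held in a raw vector.
parse_rds_bytes <- function(bytes) {
  # rawConnection exposes the bytes as a connection, gzcon decompresses them
  con <- gzcon(rawConnection(bytes))
  on.exit(close(con))
  readRDS(con)
}

restored <- parse_rds_bytes(bytes)
```

After running this, restored should be identical to x, with no second file written for the parsing step.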

MarkEdmondson1234 added a commit that referenced this issue Apr 23, 2021
@MarkEdmondson1234
Collaborator

I added the function as a helper as it looked useful, so for the GitHub version you can use:

gcs_get_object("obj.rds", parseFunction = gcs_parse_rds)

See ?gcs_parse_rds

@samuel-marsh
Author

Hey Mark,

Really appreciate your help on this! Unfortunately I'm still getting errors when I try it myself, although the errors differ depending on whether I use the GitHub version or the CRAN version.

Using the GitHub master branch, running the code below results in the following error:

test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = gcs_parse_rds)
i Downloading exp17_micro.RDS
Error: Problem parsing the object with supplied parseFunction.
x Downloading exp17_micro.RDS ... failed

If I revert to the CRAN version and use the custom parse function from the global environment, I get the following error messages:

f <- function(obj) {
  readRDS(gzcon(rawConnection(httr::content(obj))))
}

test <- gcs_get_object(object_name = "gs://bucket_name/exp17.RDS", parseFunction = gcs_parse_rds)
Downloaded exp17_micro.RDS
Error in readRDS(gzcon(rawConnection(httr::content(obj)))) : 
  too large a block specified
Error in gcs_get_object(object_name = "gs://stevens_data_marsh/exp17_micro.RDS",  : 
  Problem parsing the object with supplied parseFunction.

For reference the RDS object that I'm testing this with is 2.4GB.

Also including sessionInfo below for reference in case it's helpful!

Thanks again so much for all your help on this and quick response!!
Sam

> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.3

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] beepr_1.3                 Seurat_3.2.3             
 [3] forcats_0.5.0             stringr_1.4.0            
 [5] dplyr_1.0.5               purrr_0.3.4              
 [7] readr_1.3.1               tidyr_1.1.0              
 [9] tibble_3.0.1              ggplot2_3.3.0            
[11] tidyverse_1.3.0           googleCloudStorageR_0.6.0

loaded via a namespace (and not attached):
  [1] Rtsne_0.15            colorspace_1.4-1      deldir_0.1-28        
  [4] ellipsis_0.3.1        ggridges_0.5.2        fs_1.4.1             
  [7] spatstat.data_1.4-3   rstudioapi_0.11       leiden_0.3.3         
 [10] listenv_0.8.0         remotes_2.1.1         audio_0.1-7          
 [13] ggrepel_0.8.2         lubridate_1.7.8       xml2_1.3.2           
 [16] codetools_0.2-16      splines_3.6.1         polyclip_1.10-0      
 [19] jsonlite_1.6.1        packrat_0.5.0         broom_0.5.6          
 [22] ica_1.0-2             cluster_2.1.0         dbplyr_1.4.3         
 [25] png_0.1-7             uwot_0.1.10           sctransform_0.3.1    
 [28] shiny_1.4.0.2         compiler_3.6.1        httr_1.4.1           
 [31] backports_1.1.7       lazyeval_0.2.2        assertthat_0.2.1     
 [34] Matrix_1.2-18         fastmap_1.0.1         gargle_1.1.0         
 [37] cli_2.4.0             later_1.0.0           htmltools_0.5.1.1    
 [40] tools_3.6.1           rsvd_1.0.3            igraph_1.2.5         
 [43] gtable_0.3.0          glue_1.4.1            reshape2_1.4.4       
 [46] RANN_2.6.1            rappdirs_0.3.1        spatstat_1.64-1      
 [49] Rcpp_1.0.6            scattermore_0.7       cellranger_1.1.0     
 [52] vctrs_0.3.6           nlme_3.1-148          lmtest_0.9-37        
 [55] globals_0.14.0        rvest_0.3.5           mime_0.9             
 [58] miniUI_0.1.1.1        lifecycle_1.0.0       irlba_2.3.3          
 [61] goftest_1.2-2         future_1.21.0         googleAuthR_1.3.1    
 [64] MASS_7.3-51.6         zoo_1.8-8             scales_1.1.1         
 [67] spatstat.utils_1.17-0 hms_0.5.3             promises_1.1.0       
 [70] parallel_3.6.1        RColorBrewer_1.1-2    yaml_2.2.1           
 [73] curl_4.3              gridExtra_2.3         memoise_1.1.0        
 [76] reticulate_1.15       pbapply_1.4-2         rpart_4.1-15         
 [79] stringi_1.4.6         zip_2.0.4             rlang_0.4.10         
 [82] pkgconfig_2.0.3       matrixStats_0.56.0    lattice_0.20-41      
 [85] tensor_1.5            ROCR_1.0-11           patchwork_1.0.0      
 [88] htmlwidgets_1.5.1     cowplot_1.0.0         tidyselect_1.1.0     
 [91] parallelly_1.21.0     RcppAnnoy_0.0.18      plyr_1.8.6           
 [94] magrittr_1.5          R6_2.4.1              generics_0.0.2       
 [97] DBI_1.1.0             mgcv_1.8-31           pillar_1.4.4         
[100] haven_2.3.0           withr_2.2.0           fitdistrplus_1.1-1   
[103] abind_1.4-5           survival_3.1-12       future.apply_1.5.0   
[106] modelr_0.1.8          crayon_1.3.4          KernSmooth_2.23-17   
[109] plotly_4.9.2.1        grid_3.6.1            readxl_1.3.1         
[112] data.table_1.12.8     reprex_0.3.0          digest_0.6.25        
[115] xtable_1.8-4          httpuv_1.5.2          openssl_1.4.1        
[118] munsell_0.5.0         viridisLite_0.3.0     askpass_1.1  

@MarkEdmondson1234
Collaborator

OK cool, it seems your RDS is a special case compared to mine ;) May I ask whether the RDS files you are using are "old", in that they were created before R 3.5? The format type changed in that release, so I am just trying to eliminate it as a cause.
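
One way to check this on a local copy of the file, offered as a sketch rather than a diagnosis, is base R's infoRDS() (available from R 3.5.0), which reports the serialization version a file was written with. The two temporary files below are made up for the example and written with explicit versions so the difference is visible.

```r
x <- data.frame(id = 1:3)

# Write the same object in both serialization versions (illustrative files).
old_file <- tempfile(fileext = ".rds")
new_file <- tempfile(fileext = ".rds")
saveRDS(x, old_file, version = 2)
saveRDS(x, new_file, version = 3)

# infoRDS() inspects the header without loading the whole object.
old_version <- infoRDS(old_file)$version
new_version <- infoRDS(new_file)$version

unlink(c(old_file, new_file))
```

Running infoRDS() on the downloaded exp17 file in the same way would show which version it uses.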

@MarkEdmondson1234
Collaborator

Could you also issue traceback() after your error to see which function is triggering it?

@MarkEdmondson1234
Collaborator

And I guess writing to disk should work ok?

my_parse <- function(obj) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")

It may be that 2.4GB is just too big for R to decompress.

@LukasWallrich

FYI: for me, this works with a 10.2GB .RDS file that is saved without compression (with readr::write_rds). So the file size per se, at least, is not the issue. Thanks for implementing this very convenient parser function!

@MarkEdmondson1234
Collaborator

Thanks @LukasWallrich, good to know. I think @samuel-marsh's RDS file must then have something unique about it. If it is downloaded locally, debugging where readRDS(gzcon(rawConnection(httr::content(obj)))) goes wrong would be a start.
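
A rough sketch of that kind of local debugging is below. It assumes the file is already on disk and simply runs the chain one step at a time, recording which step raised the error. The helper name diagnose_rds and the two small test files are invented for this illustration, and it has only been thought through for small files, not multi-gigabyte ones.

```r
# Hypothetical helper: run each step of the in-memory parse separately and
# report the step at which an error occurs.
diagnose_rds <- function(path) {
  stage <- "read bytes"
  con <- NULL
  on.exit(if (!is.null(con)) close(con), add = TRUE)
  tryCatch({
    bytes <- readBin(path, what = "raw", n = file.size(path))
    stage <- "rawConnection"
    con <- rawConnection(bytes)
    stage <- "gzcon"
    con <- gzcon(con)
    stage <- "readRDS"
    obj <- readRDS(con)
    list(ok = TRUE, stage = "done", class = class(obj))
  }, error = function(e) {
    list(ok = FALSE, stage = stage, message = conditionMessage(e))
  })
}

# A valid RDS file and a deliberately invalid one, for comparison.
good <- tempfile(fileext = ".rds")
saveRDS(1:3, good)
bad <- tempfile(fileext = ".rds")
writeBin(charToRaw("not an rds file"), bad)

good_result <- diagnose_rds(good)
bad_result <- diagnose_rds(bad)
unlink(c(good, bad))
```

The stage and message fields of the result, together with traceback(), should narrow down where a problematic file fails.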

@aldomann

aldomann commented Dec 6, 2021

Sorry I thought this would be simpler but actually the raw RDS response is harder to deal with than I thought. The best I can come up with is a wrapper to saveToDisk then load it which will do what I thought it should do:

my_parse <- function(obj) {
  tmp <- tempfile(fileext = ".rds")
  on.exit(unlink(tmp))
  suppressMessages(gcs_get_object(obj, saveToDisk = tmp))
  readRDS(tmp)
}
obj <- my_parse("gs://bucket_name/obj.RDS")

I will look into whether this can be improved :)

Somewhat unrelated: this strategy also works for parsing UTF-16LE CSV files, which I haven't managed to do by just using read.csv(x, fileEncoding = "UTF-16LE") as the parseFunction.
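
As a rough local illustration of the temp-file approach for UTF-16LE CSV content, the sketch below builds UTF-16LE bytes with iconv() as a stand-in for the downloaded object, writes them to a temporary file, and lets read.csv() handle the encoding. The helper name parse_utf16_csv and the sample data are made up, and the conversion relies on the platform's iconv supporting UTF-16LE.

```r
# Stand-in for downloaded content: a small CSV encoded as UTF-16LE bytes.
csv_text <- "name,value\nalpha,1\nbeta,2\n"
bytes <- iconv(csv_text, from = "UTF-8", to = "UTF-16LE", toRaw = TRUE)[[1]]

# Hypothetical helper: write the bytes to a temp file, then read with encoding.
parse_utf16_csv <- function(bytes) {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  writeBin(bytes, tmp)
  read.csv(tmp, fileEncoding = "UTF-16LE", stringsAsFactors = FALSE)
}

df <- parse_utf16_csv(bytes)
```

With the real package, the same idea would be gcs_get_object(obj, saveToDisk = tmp) followed by read.csv(tmp, fileEncoding = "UTF-16LE"), mirroring my_parse above.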

@MarkEdmondson1234
Collaborator

I forgot to mention here that gce_parse_rds() is now in the dev version via this commit d912d0c

If there are other useful parsing functions I'd be glad to put them in.

@lifedeathandtech

@MarkEdmondson1234 - I think you might have meant to type gcs_parse_rds().

Thank you so much for your contributions! googleCloudStorageR and googleCloudRunner are incredibly useful tools.

@MarkEdmondson1234
Collaborator

Ah yes, that is it: gcs_ vs gce_. It gets confusing sometimes working on the packages at the same time ;) Glad they are helpful!


5 participants