Merge pull request #44 from ddotta/ddotta/issue40
Add user_na argument in table_to_parquet function
ddotta committed May 25, 2023
2 parents 84d4d57 + 933aadc commit d147121
Showing 4 changed files with 37 additions and 16 deletions.
35 changes: 26 additions & 9 deletions R/table_to_parquet.R
@@ -18,7 +18,7 @@
#'
#' To avoid overloading R's RAM, the conversion can be done by chunk. One of the arguments `max_memory` or `max_rows` must then be used.
#' This is very useful for huge tables and for computers with little RAM because the conversion is then done
- #' with less memory consumption. For more information, see [here](https://ddotta.github.io/parquetize/articles/aa-conversions.html).
+ #' with less memory consumption. For more information, see \href{https://ddotta.github.io/parquetize/articles/aa-conversions.html}{here}.
#'
#' @param path_to_file String that indicates the path to the input file (don't forget the extension).
#' @param path_to_parquet String that indicates the path to the directory where the parquet files will be stored.
@@ -37,6 +37,8 @@
#' @param encoding String that indicates the character encoding for the input file.
#' @param compression compression algorithm. Default "snappy".
#' @param compression_level compression level. Meaning depends on compression algorithm.
+ #' @param user_na If `TRUE`, variables with user-defined missing values will be read
+ #' into [haven::labelled_spss()] objects. If `FALSE` (the default), user-defined missing values will be converted to `NA`.
#' @param ... Additional format-specific arguments, see \href{https://arrow.apache.org/docs/r/reference/write_parquet.html}{arrow::write_parquet()}
#' and \href{https://arrow.apache.org/docs/r/reference/write_dataset.html}{arrow::write_dataset()} for more information.
#'
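To give a concrete feel for the new argument, here is a minimal usage sketch (not part of the diff); it reuses the `iris.sav` example file shipped with haven and an arbitrary output path:

```r
library(parquetize)

# Keep user-defined missing values as haven::labelled_spss() columns
# instead of converting them to NA (the default behaviour).
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  user_na = TRUE
)
```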
@@ -82,7 +84,7 @@
#' table_to_parquet(
#' path_to_file = system.file("examples","iris.sav", package = "haven"),
#' path_to_parquet = tempfile(),
- #' max_memory = 5 / 1024,
+ #' max_memory = 5 / 1024
#' )
#'
#' # Reading SAS file by chunk of 50 lines with encoding
@@ -141,6 +143,7 @@ table_to_parquet <- function(
chunk_memory_sample_lines = 10000,
compression = "snappy",
compression_level = NULL,
+ user_na = FALSE,
...
) {
if (!missing(by_chunk)) {
@@ -201,17 +204,31 @@


# Closure to create read data
- closure_read_method <- function(encoding, columns) {
+ closure_read_method <- function(encoding, columns, user_na) {
method <- get_haven_read_function_for_file(path_to_file)
function(path, n_max = Inf, skip = 0L) {
- method(path,
- n_max = n_max,
- skip = skip,
- encoding = encoding,
- col_select = if (identical(columns,"all")) everything() else all_of(columns))
+
+ ext <- tools::file_ext(path_to_file)
+
+ if (ext != "sav") {
+ method(path,
+ n_max = n_max,
+ skip = skip,
+ encoding = encoding,
+ col_select = if (identical(columns,"all")) everything() else all_of(columns))
+
+ } else if (ext == "sav") {
+ method(path,
+ n_max = n_max,
+ skip = skip,
+ encoding = encoding,
+ col_select = if (identical(columns,"all")) everything() else all_of(columns),
+ user_na = user_na)
+ }
}
}
- read_method <- closure_read_method(encoding = encoding, columns = columns)

+ read_method <- closure_read_method(encoding = encoding, columns = columns, user_na = user_na)

if (by_chunk) {
ds <- write_parquet_by_chunk(
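A side note on the dispatch above: `user_na` is only forwarded when the input file is a `.sav`, because `haven::read_sav()` accepts a `user_na` argument while the SAS and Stata readers do not. A small standalone sketch of what the switch changes at the haven level, with a hypothetical file path:

```r
library(haven)

# Hypothetical SPSS file that declares user-defined missing values
# (e.g. -99 coded as "refused"); replace with a real .sav path.
sav_file <- "survey.sav"

# Default behaviour: user-defined missing values are converted to NA on read.
x <- read_sav(sav_file, user_na = FALSE)

# With user_na = TRUE, affected columns come back as haven::labelled_spss()
# vectors, preserving the original codes and their na_values/na_range spec.
y <- read_sav(sav_file, user_na = TRUE)
```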
2 changes: 1 addition & 1 deletion R/utilities.R
@@ -81,7 +81,7 @@ is_zip <- function(path) {
#'
#' @param ds a dataset/parquet file
#'
- #' @return a tibble with 3 columns :
+ #' @return a tibble with 2 columns :
#'
#' * the column name (string)
#' * the arrow type (string)
8 changes: 6 additions & 2 deletions man/table_to_parquet.Rd

Some generated files are not rendered by default.

8 changes: 4 additions & 4 deletions vignettes/aa-conversions.Rmd
@@ -22,7 +22,7 @@ library(parquetize)

## With `table_to_parquet()`

- For **huge input files in SAS, SPSS and Stata formats**, the parquetize package allows you to perform a clever conversion by using `chunk_memory_size` or `chunk_size` in the [`table_to_parquet()`](https://ddotta.github.io/parquetize/reference/table_to_parquet.html) function.
+ For **huge input files in SAS, SPSS and Stata formats**, the parquetize package allows you to perform a clever conversion by using `max_memory` or `max_rows` in the [`table_to_parquet()`](https://ddotta.github.io/parquetize/reference/table_to_parquet.html) function.
The native behavior of this function (and all other functions in the package) is to load the entire table to be converted into R and then write it to disk (in a single file or a partitioned directory).

When handling very large files, a frequent risk is that the R session aborts because it cannot load the entire database into memory.
@@ -43,7 +43,7 @@ Here are examples from the documentation using the iris table. There are two ways
### Splitting data by memory consumption

`table_to_parquet` can guess the number of lines to put in a file based on the
- memory consumption with the argument `chunk_memory_size` expressed in Mb.
+ memory consumption with the argument `max_memory` expressed in Mb.

Here we cut the 150 rows into chunks of roughly 5 Kb when a file is loaded as a
tibble.
@@ -58,8 +58,8 @@ table_to_parquet(
)
```
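The hunk above only shows the tail of that call; per the sentence introducing these examples, it is the iris-based example from the function documentation shown earlier in this diff, roughly:

```r
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  max_memory = 5 / 1024
)
```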

- In real life, you should use a `chunk_memory_size` in the Gb range, for example
- with a SAS file of 50 000 000 lines and using chunk_memory_size of 5000 Mb :
+ In real life, you should use a `max_memory` in the Gb range, for example
+ with a SAS file of 50 000 000 lines and using `max_memory` of 5000 Mb :


```{r real-memory-example, eval=FALSE}
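# A sketch of such a call, assuming a hypothetical path to a large SAS file:
table_to_parquet(
  path_to_file = "path/to/myhugefile.sas7bdat",
  path_to_parquet = tempfile(),
  max_memory = 5000
)
```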
