Merge pull request #44 from ddotta/ddotta/issue40
Add user_na argument in table_to_parquet function
ddotta committed May 25, 2023
2 parents 84d4d57 + 933aadc commit d147121
Showing 4 changed files with 37 additions and 16 deletions.
35 changes: 26 additions & 9 deletions R/table_to_parquet.R
@@ -18,7 +18,7 @@
#'
#' To avoid overloading R's RAM, the conversion can be done by chunk. One of the arguments `max_memory` or `max_rows` must then be used.
#' This is very useful for huge tables and for computers with little RAM because the conversion is then done
- #' with less memory consumption. For more information, see [here](https://ddotta.github.io/parquetize/articles/aa-conversions.html).
+ #' with less memory consumption. For more information, see \href{https://ddotta.github.io/parquetize/articles/aa-conversions.html}{here}.
#'
#' @param path_to_file String that indicates the path to the input file (don't forget the extension).
#' @param path_to_parquet String that indicates the path to the directory where the parquet files will be stored.
@@ -37,6 +37,8 @@
#' @param encoding String that indicates the character encoding for the input file.
#' @param compression compression algorithm. Default "snappy".
#' @param compression_level compression level. Meaning depends on compression algorithm.
+ #' @param user_na If `TRUE`, variables with user-defined missing values will be read
+ #' into [haven::labelled_spss()] objects. If `FALSE` (the default), user-defined missing values will be converted to `NA`.
#' @param ... Additional format-specific arguments, see \href{https://arrow.apache.org/docs/r/reference/write_parquet.html}{arrow::write_parquet()}
#' and \href{https://arrow.apache.org/docs/r/reference/write_dataset.html}{arrow::write_dataset()} for more information.
#'
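To give a concrete feel for the new argument, here is a minimal usage sketch (not part of the diff); it reuses the `iris.sav` example file shipped with haven and an arbitrary output path:

```r
library(parquetize)

# Keep user-defined missing values as haven::labelled_spss() columns
# instead of converting them to NA (the default behaviour).
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  user_na = TRUE
)
```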
@@ -82,7 +84,7 @@
#' table_to_parquet(
#' path_to_file = system.file("examples","iris.sav", package = "haven"),
#' path_to_parquet = tempfile(),
- #' max_memory = 5 / 1024,
+ #' max_memory = 5 / 1024
#' )
#'
#' # Reading SAS file by chunk of 50 lines with encoding
@@ -141,6 +143,7 @@ table_to_parquet <- function(
chunk_memory_sample_lines = 10000,
compression = "snappy",
compression_level = NULL,
+ user_na = FALSE,
...
) {
if (!missing(by_chunk)) {
@@ -201,17 +204,31 @@


# Closure to create read data
- closure_read_method <- function(encoding, columns) {
+ closure_read_method <- function(encoding, columns, user_na) {
method <- get_haven_read_function_for_file(path_to_file)
function(path, n_max = Inf, skip = 0L) {
- method(path,
- n_max = n_max,
- skip = skip,
- encoding = encoding,
- col_select = if (identical(columns,"all")) everything() else all_of(columns))
+
+ ext <- tools::file_ext(path_to_file)
+
+ if (ext != "sav") {
+ method(path,
+ n_max = n_max,
+ skip = skip,
+ encoding = encoding,
+ col_select = if (identical(columns,"all")) everything() else all_of(columns))
+
+ } else if (ext == "sav") {
+ method(path,
+ n_max = n_max,
+ skip = skip,
+ encoding = encoding,
+ col_select = if (identical(columns,"all")) everything() else all_of(columns),
+ user_na = user_na)
+ }
}
}
- read_method <- closure_read_method(encoding = encoding, columns = columns)

+ read_method <- closure_read_method(encoding = encoding, columns = columns, user_na = user_na)

if (by_chunk) {
ds <- write_parquet_by_chunk(
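A side note on the dispatch above: `user_na` is only forwarded when the input file is a `.sav`, because `haven::read_sav()` accepts a `user_na` argument while the SAS and Stata readers do not. A small standalone sketch of what the switch changes at the haven level, with a hypothetical file path:

```r
library(haven)

# Hypothetical SPSS file that declares user-defined missing values
# (e.g. -99 coded as "refused"); replace with a real .sav path.
sav_file <- "survey.sav"

# Default behaviour: user-defined missing values are converted to NA on read.
x <- read_sav(sav_file, user_na = FALSE)

# With user_na = TRUE, affected columns come back as haven::labelled_spss()
# vectors, preserving the original codes and their na_values/na_range spec.
y <- read_sav(sav_file, user_na = TRUE)
```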
2 changes: 1 addition & 1 deletion R/utilities.R
@@ -81,7 +81,7 @@ is_zip <- function(path) {
#'
#' @param ds a dataset/parquet file
#'
- #' @return a tibble with 3 columns :
+ #' @return a tibble with 2 columns :
#'
#' * the column name (string)
#' * the arrow type (string)
8 changes: 6 additions & 2 deletions man/table_to_parquet.Rd

Some generated files are not rendered by default.

8 changes: 4 additions & 4 deletions vignettes/aa-conversions.Rmd
@@ -22,7 +22,7 @@ library(parquetize)

## With `table_to_parquet()`

- For **huge input files in SAS, SPSS and Stata formats**, the parquetize package allows you to perform a clever conversion by using `chunk_memory_size` or `chunk_size` in the [`table_to_parquet()`](https://ddotta.github.io/parquetize/reference/table_to_parquet.html) function.
+ For **huge input files in SAS, SPSS and Stata formats**, the parquetize package allows you to perform a clever conversion by using `max_memory` or `max_rows` in the [`table_to_parquet()`](https://ddotta.github.io/parquetize/reference/table_to_parquet.html) function.
The native behavior of this function (and all other functions in the package) is to load the entire table to be converted into R and then write it to disk (in a single file or a partitioned directory).

When handling very large files, a frequent risk is that the R session aborts because it cannot load the entire database into memory.
@@ -43,7 +43,7 @@ Here are examples from the documentation using the iris table. There are two ways
### Splitting data by memory consumption

`table_to_parquet` can guess the number of lines to put in a file based on the
- memory consumption with the argument `chunk_memory_size` expressed in Mb.
+ memory consumption with the argument `max_memory` expressed in Mb.

Here we cut the 150 rows into chunks of roughly 5 Kb when a file is loaded as a
tibble.
@@ -58,8 +58,8 @@ table_to_parquet(
)
```
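The hunk above only shows the tail of that call; per the sentence introducing these examples, it is the iris-based example from the function documentation shown earlier in this diff, roughly:

```r
table_to_parquet(
  path_to_file = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempfile(),
  max_memory = 5 / 1024
)
```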

- In real life, you should use a `chunk_memory_size` in the Gb range, for example
- with a SAS file of 50 000 000 lines and using chunk_memory_size of 5000 Mb :
+ In real life, you should use a `max_memory` in the Gb range, for example
+ with a SAS file of 50 000 000 lines and using `max_memory` of 5000 Mb :


```{r real-memory-example, eval=FALSE}
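# A sketch of such a call, assuming a hypothetical path to a large SAS file:
table_to_parquet(
  path_to_file = "path/to/myhugefile.sas7bdat",
  path_to_parquet = tempfile(),
  max_memory = 5000
)
```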
