
Releases: ddotta/parquetize

v0.5.7

04 Mar 13:39

This release includes:

  • bugfix by @leungi: remove single quotes in SQL statements that generated incorrect SQL syntax for connections of type Microsoft SQL Server #45
  • {parquetize} now requires a minimum version (2.4.0) of the {haven} dependency to ensure that conversions from SAS files compressed in BINARY mode are performed correctly #46
  • csv_to_parquet now has a read_delim_args argument, allowing additional arguments to be passed to read_delim (added by @nikostr); see the sketch after this list
  • table_to_parquet can now convert files with uppercase extensions (.SAS7BDAT, .SAV, .DTA)
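
A minimal sketch of how read_delim_args could be used. The file path and the delimiter are illustrative assumptions, and the named list is assumed to be forwarded as-is to read_delim:

csv_to_parquet(
  path_to_file = "my_data.csv",                   # illustrative path, not shipped with the package
  path_to_parquet = tempdir(),
  read_delim_args = list(delim = ";", skip = 1)   # arguments forwarded to read_delim
)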

What's Changed

  • Add user_na argument in table_to_parquet function by @ddotta in #44 (see the sketch after this list)
  • fix: remove single quotes in SQL statement by @leungi in #45
  • Specify minimal version for haven by @ddotta in #47
  • Improves documentation for csv_to_parquet() for txt files by @ddotta in #48
  • Adds argument read_delim_args to csv_to_parquet by @nikostr in #49
  • table_to_parquet() can now convert files with uppercase extensions by @ddotta in #54
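
As a rough illustration of the user_na argument: a sketch, assuming it is forwarded to haven's SPSS reader so that user-defined missing values are kept instead of being converted to NA.

table_to_parquet(
  path_to_table = system.file("examples", "iris.sav", package = "haven"),
  path_to_parquet = tempdir(),
  user_na = TRUE   # keep user-defined missing values as labelled values
)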

New Contributors

  • @leungi and @nikostr made their first contributions in this release

Full Changelog: v0.5.6.1...v0.5.7

v0.5.6.1

10 May 13:32

This release includes:

fst_to_parquet function

  • a new fst_to_parquet function that converts an fst file to parquet format; see the sketch below
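
A minimal sketch of a call, assuming fst_to_parquet follows the same argument naming as the other conversion functions and that an fst file exists at the illustrative path:

fst_to_parquet(
  path_to_file = "my_data.fst",   # illustrative path to an existing fst file
  path_to_parquet = tempdir()
)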

Other

  • Rely more on @inheritParams to simplify the documentation of function arguments #38. This leads to some argument renaming (e.g. path_to_csv -> path_to_file...)
  • Arguments compression and compression_level are now passed to the write_parquet_at_once and write_parquet_by_chunk functions and are available in the main conversion functions of parquetize #36
  • Group @importFrom directives in a single file to ease their maintenance #37
  • Work on download_extract tests #43

Full Changelog: v0.5.6...v0.5.6.1

parquetize 0.5.6

21 Apr 11:50
8701f35

This release includes:

Possibility to use an RDBMS as a source

You can convert to parquet the result of any query you want on any DBI-compatible RDBMS:

dbi_connection <- DBI::dbConnect(RSQLite::SQLite(),
  system.file("extdata","iris.sqlite",package = "parquetize"))
  
# Reading iris table from local sqlite database
# and conversion to one parquet file :
dbi_to_parquet(
  conn = dbi_connection,
  sql_query = "SELECT * FROM iris",
  path_to_parquet = tempdir(),
  parquetname = "iris"
)

You can find more information in the dbi_to_parquet documentation.

check_parquet function

  • a new check_parquet function that checks whether a dataset/file is valid and returns its columns and arrow types; see the sketch below
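
For illustration, a sketch of a call on the iris parquet file shipped in inst/extdata; the exact file name is an assumption:

check_parquet(system.file("extdata", "iris.parquet", package = "parquetize"))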

Deprecations

Two arguments are deprecated to avoid confusion with arrow concepts and to keep naming consistent; see the sketch after this list:

  • chunk_size is replaced by max_rows (chunk size is an arrow concept).
  • chunk_memory_size is replaced by max_memory for consistency.
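
A minimal sketch using the new argument names, reusing the example file shipped with {haven}; the values are illustrative:

table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  max_rows = 50   # replaces the deprecated chunk_size
)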

Other

  • refactoring: extract the logic to write parquet files by chunk or at once into write_parquet_by_chunk and write_parquet_at_once
  • a big refactoring of tests: all _to_parquet output files are formally validated (readable as parquet, number of lines, partitions, number of files)
  • use cli_abort instead of cli_alert_danger followed by stop("") everywhere
  • some minor changes
  • bugfix: table_to_parquet did not select columns as expected
  • bugfix: skip_if_offline tests with download

Thanks a lot 🙏 to @nbc for these new improvements 🚀

Full Changelog: v0.5.5...v0.5.6

parquetize 0.5.5

28 Mar 13:09

This release includes:

A very important new contributor to parquetize!

Thanks to these numerous contributions, @nbc is now officially one of the project's authors!

Deprecation of three arguments

After a big refactoring, three arguments are deprecated:

  • by_chunk: table_to_parquet will automatically chunk if you use one of chunk_memory_size or chunk_size.
  • csv_as_a_zip: csv_to_parquet will detect if the file is a zip from its extension.
  • url_to_csv: use path_to_csv instead, csv_to_parquet will detect if the file is remote from the file path.

They will raise a deprecation warning for the moment.

Chunking by memory size

The possibility to chunk parquet output by memory size with table_to_parquet():
table_to_parquet() takes a chunk_memory_size argument to convert an input
file into parquet files of roughly chunk_memory_size Mb each, measured when
the data are loaded in memory.

Argument by_chunk is deprecated (see above).

Example of use of the argument chunk_memory_size:

table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  chunk_memory_size = 5000 # this will create files of around 5 Gb when their data are loaded in memory
)

Passing arguments like compression to write_parquet when chunking

Users can now pass arguments to write_parquet() when chunking, through the
ellipsis. This can be used, for example, to pass compression and
compression_level.

Example:

table_to_parquet(
  path_to_table = system.file("examples","iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  compression = "zstd",
  compression_level = 10,
  chunk_memory_size = 5000
)

A new function download_extract

This function is added to download a file and, if needed, extract it from its zip archive.

file_path <- download_extract(
  "https://www.nomisweb.co.uk/output/census/2021/census2021-ts007.zip",
  filename_in_zip = "census2021-ts007-ctry.csv"
)
csv_to_parquet(
  file_path,
  path_to_parquet = tempdir()
)

Other

Under the hood, this release hardens the tests.

parquetize 0.5.4

13 Mar 10:36

This release includes the correction made by @nchuche to chunking #21 🎉

parquetize 0.5.3

20 Feb 15:36

This release includes:

  • Added column selection to the table_to_parquet() and csv_to_parquet() functions #20; see the sketch after this list
  • The example files in parquet format of the iris table have been migrated to the inst/extdata directory.
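
A minimal sketch of column selection, assuming the argument is named columns and takes a character vector of column names to keep:

table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  columns = c("Species", "Sepal_Length")   # only these columns end up in the parquet file
)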

parquetize 0.5.2

31 Jan 13:16

This release fixes the behaviour of the table_to_parquet() function when the argument by_chunk is TRUE.

parquetize 0.5.1

23 Jan 12:06

This release removes the duckdb_to_parquet() function on the advice of Brian Ripley from CRAN,
because DuckDB's storage format is not yet stable. The storage format will be stabilized when version 1.0 is released.

parquetize 0.5.0

17 Jan 08:59

This release corresponds to the one available on CRAN (0.5.0) 🎉

parquetize 0.4.0

12 Dec 21:12

This release includes an important feature:

The table_to_parquet() function can now convert tables to parquet format with less memory consumption.
This is useful for huge tables and for computers with little RAM (#15).
A vignette has been written about it.

  • Removal of the nb_rows argument in the table_to_parquet() function
  • It is replaced by the new arguments by_chunk, chunk_size and skip (see the documentation and the sketch after this list)
  • Progress bars and alerts are now managed with the {cli} package
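
A minimal sketch of a chunked conversion with these arguments (later deprecated in 0.5.5); the argument names are assumed from this release note and the values are illustrative:

table_to_parquet(
  path_to_table = system.file("examples", "iris.sas7bdat", package = "haven"),
  path_to_parquet = tempdir(),
  by_chunk = TRUE,
  chunk_size = 50,   # number of rows written per chunk
  skip = 0           # number of rows to skip at the start of the input file
)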