add kwarg chunksize for default data partitioning for write #400

svilupp · 2023-03-12T15:46:39Z

This PR proposes automated partitioning of the provided tables when writing. It follows from my findings when benchmarking against PyArrow.

Nowadays, most machines are multithreaded and Arrow.write() provides multithreaded writing for partitioned data. However, a user must explicitly partition their data.
Unfortunately, most users do not realize that both their write and subsequent read operations will not be multithreaded without such partitioning (there is an issue to improve the docs).

This PR defaults to partitioning data if it's larger than 64K rows (should be beneficial on most systems) to enable better Arrow.jl performance on both read and write.

Implementation:

- the new kwarg is called chunksize (maps to PyArrow and should be broadly understood)
- uses a default chunksize of 64000 rows, as per PyArrow.write_feather
- allows users to opt out by providing chunksize=nothing
- ~~partitioning is done via Iterators.partition(Tables.rows(tbl),chunksize) for all Tables.jl-compatible sources (checks Tables.istable)~~ changed to Iterators.partition(tbl,chunksize) to avoid the missingness type getting lost (e.g., for DataFrames)
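The chunking behavior listed above can be sketched with Base alone. This is a minimal illustration, not the PR's code: `partition_rows` is a hypothetical helper, and a plain vector of NamedTuples stands in for a Tables.jl row source. Note that Base's `Iterators.partition` rejects a partition length below 1, so the sketch requires `chunksize > 0`.

```julia
# Hypothetical helper (not part of Arrow.jl): split a vector of rows into chunks.
function partition_rows(rows::AbstractVector; chunksize::Union{Nothing,Integer}=64_000)
    # chunksize=nothing opts out: the whole input becomes a single partition
    chunksize === nothing && return [rows]
    chunksize > 0 || throw(ArgumentError("chunksize must be > 0"))
    # Iterators.partition yields views of at most `chunksize` rows each
    return collect(Iterators.partition(rows, chunksize))
end

rows = [(id = i, value = i / 2) for i in 1:150_000]

chunks = partition_rows(rows)                      # default of 64_000 rows
println(length.(chunks))                           # sizes of the three chunks

single = partition_rows(rows; chunksize=nothing)   # opt out
println(length(single))
```

With 150,000 rows and the default, this gives two full chunks of 64,000 rows and a final chunk of 22,000.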

Some resources:

- PyArrow write_feather docs
- DataFrames.jl introduces an overload for Iterators.partition in the 1.5 release
- Arrow.jl author's blog post on partitioning (sidenote: I really enjoy your posts - please write more :-) )


svilupp · 2023-03-13T08:59:22Z

I've changed the condition for automatic partitioning to be Tables.rowaccess()=true as well, to prevent accepting some columntables without row iterators.

In addition, I've changed Iterators.partition(Tables.rows(tbl),chunksize) to Iterators.partition(tbl,chunksize) to avoid missingness type getting lost (eg, for DataFrames)

svilupp · 2023-03-13T09:26:20Z

> In addition, I've changed Iterators.partition(Tables.rows(tbl),chunksize) to Iterators.partition(tbl,chunksize) to avoid missingness type getting lost (eg, for DataFrames)

Okay, I was wrong. I misunderstood what rowaccess requirements are -- Iterators.partition() still needs to be defined separately.

I've moved back to Tables.rows to ensure we get rows out.

I'm not sure what the best solution is here.
By far the simplest option would be to pass the schema down - because we have access to it before Tables.columns is called in the arrow construction (that's how we lose the schema, because we "materialize" the chunk as is)

EDIT:

The second simplest option would be to add compat entry for DataFrames>1.5.0 and use the Iterators.partition directly (with some safety check that it indeed chunked rows, not columns... if some unknown type defines it over columns)
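To make the "lost missingness" problem concrete, here is a Base-only analogy. It assumes the loss comes from rebuilding each chunk's column from its values, so a chunk that happens to contain no `missing` gets a narrower element type than the source column. The helper name `materialized_eltypes` is hypothetical, and `map(identity, chunk)` stands in for whatever materialization the writer performs.

```julia
# Source column that allows missing values.
col = Union{Missing,Int}[1, 2, missing, 4]

# Hypothetical helper: element type of each chunk after it is rebuilt from its values.
function materialized_eltypes(column::AbstractVector, chunksize::Integer)
    # map(identity, chunk) builds a new array whose eltype is inferred from the
    # values actually present, not from the declared eltype of `column`
    return [eltype(map(identity, chunk)) for chunk in Iterators.partition(column, chunksize)]
end

types = materialized_eltypes(col, 2)
println(eltype(col))   # declared type of the whole column
println(types)         # per-chunk types after materialization
```

The first chunk holds only integers, so its rebuilt column no longer admits `missing`, while the second chunk keeps the union type. Passing the original schema down, as suggested above, would avoid relying on the per-chunk values.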

svilupp · 2023-03-13T09:45:26Z

Added compat for DataFrames via Extras

baumgold · 2023-03-13T23:28:04Z

> By far the simplest option would be to pass the schema down - because we have access to it before Tables.columns is called in the arrow construction (that's how we lose the schema, because we "materialize" the chunk as is)

We could allow users to optionally provide the Schema in the Base.open constructor of the Writer object. If a user makes use of this, then we should validate that the actual schema of each chunk matches the expected schema.
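A rough sketch of what such per-chunk validation could look like, using only Base. Everything here is hypothetical: `ExpectedSchema`, `schema_of`, and `validate_chunk` are illustrative names, and a NamedTuple of vectors stands in for a column table. The check accepts a chunk whose column type is a subtype of the expected one (for example `Int` where `Union{Missing,Int}` is expected); a strict equality check would be an equally valid design.

```julia
# Hypothetical stand-in for a table schema: column names and element types.
struct ExpectedSchema
    names::Vector{Symbol}
    types::Vector{Type}
end

# Derive the schema of a column table given as a NamedTuple of vectors.
schema_of(tbl::NamedTuple) =
    ExpectedSchema(collect(keys(tbl)), Type[eltype(c) for c in values(tbl)])

# Throw if the chunk's columns do not fit the expected schema; return the chunk otherwise.
function validate_chunk(expected::ExpectedSchema, chunk::NamedTuple)
    actual = schema_of(chunk)
    actual.names == expected.names ||
        throw(ArgumentError("expected columns $(expected.names), got $(actual.names)"))
    for (name, want, got) in zip(expected.names, expected.types, actual.types)
        # a narrower chunk type (e.g. Int for Union{Missing,Int}) is accepted
        got <: want ||
            throw(ArgumentError("column $name has eltype $got, expected a subtype of $want"))
    end
    return chunk
end

expected = ExpectedSchema([:a, :b], Type[Union{Missing,Int}, String])
ok_chunk = validate_chunk(expected, (a = [1, 2], b = ["x", "y"]))
```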

baumgold · 2023-03-13T23:17:59Z

src/write.jl

 """
 function write end

 write(io_or_file; kw...) = x -> write(io_or_file, x; kw...)

-function write(file_path, tbl; kwargs...)
+function write(file_path, tbl; chunksize::Union{Nothing,Integer}=64000, kwargs...)


I think chunksize should move to be a new field in Writer with default kwarg value set in the Base.open constructor on L170. This would eliminate the code duplication.
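The suggestion above can be illustrated with a small stand-in type. This is only a sketch of the design idea: `SketchWriter`, `open_writer`, and `write_rows` are hypothetical names and do not mirror the actual `Arrow.Writer` fields or constructor. The point is that the default value appears once, in the constructor, and every write path reads it from the writer.

```julia
# Hypothetical stand-in for a writer type; not Arrow.Writer.
struct SketchWriter{T<:IO}
    io::T
    chunksize::Union{Nothing,Int}
end

# The default lives in exactly one place: the constructor.
open_writer(io::IO; chunksize::Union{Nothing,Integer}=64_000) =
    SketchWriter(io, chunksize === nothing ? nothing : Int(chunksize))

# Writing reads the chunk size from the writer instead of taking its own default.
function write_rows(w::SketchWriter, rows::AbstractVector)
    parts = w.chunksize === nothing ? (rows,) : Iterators.partition(rows, w.chunksize)
    for part in parts
        println(w.io, length(part))  # placeholder for writing one record batch
    end
    return w
end

# Convenience entry point: forwards keywords to the constructor, no duplicated default.
write_rows(io::IO, rows::AbstractVector; kw...) = write_rows(open_writer(io; kw...), rows)

buf = IOBuffer()
write_rows(buf, collect(1:10); chunksize=4)
batch_sizes = String(take!(buf))
print(batch_sizes)
```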

baumgold · 2023-03-13T23:24:32Z

src/write.jl

+    if !isnothing(chunksize) && Tables.istable(tbl) && Tables.rowaccess(tbl)
+        @assert chunksize >= 0 "chunksize must be >= 0"
+        if hasmethod(Iterators.partition,(typeof(tbl),))
+            tbl_source = Iterators.partition(tbl, chunksize)


Can we use Iterators.partition from Base rather than DataFrames to prevent adding one more dependency?

https://docs.julialang.org/en/v1/base/iterators/#Base.Iterators.partition
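One caveat with the Base method is that it partitions whatever the object iterates over. The sketch below uses a NamedTuple of vectors as a stand-in for a column table (an assumption for illustration; real table types may iterate differently): iterating it yields columns, so Base's `Iterators.partition` groups columns rather than rows. Partitioning an explicit vector of rows behaves as intended.

```julia
# Stand-in for a column table: iterating a NamedTuple yields its values (the columns).
coltable = (a = [1, 2, 3, 4, 5], b = [10, 20, 30, 40, 50])

# Base partitions the iteration order, so this groups the two columns into one chunk.
by_columns = collect(Iterators.partition(coltable, 2))

# Build explicit rows first, then partition: chunks of at most 2 rows each.
rowtable = [(a = a, b = b) for (a, b) in zip(coltable.a, coltable.b)]
by_rows = collect(Iterators.partition(rowtable, 2))

println(length(by_columns))   # number of column groups
println(length.(by_rows))     # rows per chunk
```

A check along these lines (does the total length of the chunks equal the row count?) is one way to guard against an unknown type that defines partitioning over columns.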

add kwarg chunksize for default partitioning

6497967

This was referenced Mar 12, 2023

[Docs] Improve documentation around partitions/multithreading #401

Open

Benchmark of Arrow.jl vs Pyarrow (/Polars) #393

Open

add Iterators.partition for DataFrameRows JuliaData/DataFrames.jl#3299

Merged

require rowaccess and avoid Tables.rows (we can lose types in cases c…

2fdf4ba

…ases)

change back to Iterators.partition(Tables.rows(df),chunksize)

025cc60

add DataFrames compat

586887f

baumgold reviewed Mar 13, 2023
