ARROW-9658: [Python] Python bindings for dataset writing #7921
Conversation
This is looking great, thanks for doing this!
python/pyarrow/_dataset.pyx (outdated)
# data is list of batches
for batch in data:
    c_batches.push_back((<RecordBatch> batch).sp_batch)

c_fragment = shared_ptr[CFragment](
    new CInMemoryFragment(c_batches, _true.unwrap()))
c_fragment_vector.push_back(c_fragment)

with nogil:
    check_status(
        CFileSystemDataset.Write(
            c_schema,
            c_format,
            c_filesystem,
            c_base_dir,
            c_partitioning,
            c_context,
            MakeVectorIterator(c_fragment_vector)))
Currently we only parallelize over written fragments, so since this only creates a single written fragment, use_threads is effectively ignored. I'd recommend creating a fragment wrapping each batch; note that this will result in one or more files per input record batch, however.
Suggested change:

# data is list of batches
for batch in data:
    c_batches.push_back((<RecordBatch> batch).sp_batch)
    c_fragment = shared_ptr[CFragment](
        new CInMemoryFragment(c_batches, _true.unwrap()))
    c_batches.clear()
    c_fragment_vector.push_back(c_fragment)

with nogil:
    check_status(
        CFileSystemDataset.Write(
            c_schema,
            c_format,
            c_filesystem,
            c_base_dir,
            c_partitioning,
            c_context,
            MakeVectorIterator(move(c_fragment_vector))))
Yes, this is one of the aspects I wanted to ask about: indeed, right now, by creating a single Fragment, everything gets written to a single file. If you have a big table, you might want to write a file per batch. But on the other hand, that might also result in a lot of small files (certainly if you also split on some partition columns). E.g. in Parquet, you typically have files with multiple row groups (which are somewhat comparable to multiple record batches).
So we should maybe have some configurability?
@bkietz So I updated this to enable the use case of writing multiple fragments, but by explicit choice of the user: passing a single Table/RecordBatch writes a single fragment (even if the Table has multiple batches), while passing a list of Tables/RecordBatches writes them as multiple fragments.
So if the user has a Table with multiple batches and wants to write this as multiple files instead of a single file (in case there is no partitioning), they can do write_dataset(table.to_batches())
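For illustration, a minimal sketch of how this could look from the user's side (the exact write_dataset signature, the "ipc" format string, and the output paths are assumptions based on this PR, not a definitive API):

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Sketch only: assumes the write_dataset() signature introduced in this PR.
table = pa.table({"year": [2020, 2020, 2021], "value": [1.0, 2.0, 3.0]})
part = ds.partitioning(pa.schema([("year", pa.int64())]))

# A single Table is written as a single fragment.
ds.write_dataset(table, "out_single", format="ipc", partitioning=part)

# A list of RecordBatches is written as one fragment per batch,
# so each batch can end up in its own file.
ds.write_dataset(table.to_batches(), "out_per_batch", format="ipc", partitioning=part)
```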
Great, this provides a minimal handle on write parallelism; this will be necessary until the C++ implementation can be made more robustly parallel than "one thread per written fragment".
Co-authored-by: Benjamin Kietzman <bengilgit@gmail.com>
Going to merge this, so I can add Parquet support in a follow-up PR.
Do you think it is possible to add support for repartitioning datasets? I am facing some issues with many small files, just due to the frequency at which I need to download data, which is compounded by the partitions. I asked this on Jira as well, but:
I would then want to read the data and repartition the files using a callback function, so the dozens of files in partition ("date", "==", "2020-09-15") would become 2020-09-15.parquet, consolidated as a single file to keep things tidy. I know I can do this with Spark, but it would be nice to have a native pyarrow method.
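For what it's worth, a rough sketch of how such a consolidation could be expressed with the dataset API once Parquet writing is supported (the data_root layout, the date partition column, and the output path here are hypothetical):

```python
import os

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Hypothetical layout: many small Parquet files under data_root/date=YYYY-MM-DD/
dataset = ds.dataset("data_root", format="parquet", partitioning="hive")

# Read back all files for one date and rewrite them as a single consolidated file.
table = dataset.to_table(filter=ds.field("date") == "2020-09-15")
os.makedirs("consolidated", exist_ok=True)
pq.write_table(table, "consolidated/2020-09-15.parquet")
```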
Closes apache#7921 from jorisvandenbossche/ARROW-9658-dataset-writing Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com> Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>