ARROW-9658: [Python] Python bindings for dataset writing #7921

Conversation

jorisvandenbossche (Member)

No description provided.

jorisvandenbossche force-pushed the ARROW-9658-dataset-writing branch 2 times, most recently from eadb9fd to 5fe91e7 on August 10, 2020 14:23
jorisvandenbossche changed the title from "ARROW-9658: [Python] Initial Python bindings for dataset writing" to "ARROW-9658: [Python] Python bindings for dataset writing" on Aug 10, 2020
bkietz (Member) left a comment:

This is looking great, thanks for doing this!

python/pyarrow/_dataset.pyx: outdated review comment (resolved)
python/pyarrow/includes/libarrow_dataset.pxd: outdated review comment (resolved)
Comment on lines 2068 to 2085
# data is list of batches
for batch in data:
    c_batches.push_back((<RecordBatch> batch).sp_batch)

c_fragment = shared_ptr[CFragment](
    new CInMemoryFragment(c_batches, _true.unwrap()))
c_fragment_vector.push_back(c_fragment)

with nogil:
    check_status(
        CFileSystemDataset.Write(
            c_schema,
            c_format,
            c_filesystem,
            c_base_dir,
            c_partitioning,
            c_context,
            MakeVectorIterator(c_fragment_vector)

bkietz (Member):

Currently we only parallelize over written fragments, so since this only creates a single written fragment, use_threads is effectively ignored. I'd recommend creating a fragment wrapping each batch. This will result in one or more files per input record batch, however.

Suggested change

Original:

    # data is list of batches
    for batch in data:
        c_batches.push_back((<RecordBatch> batch).sp_batch)
    c_fragment = shared_ptr[CFragment](
        new CInMemoryFragment(c_batches, _true.unwrap()))
    c_fragment_vector.push_back(c_fragment)
    with nogil:
        check_status(
            CFileSystemDataset.Write(
                c_schema,
                c_format,
                c_filesystem,
                c_base_dir,
                c_partitioning,
                c_context,
                MakeVectorIterator(c_fragment_vector)

Suggested (one fragment per batch):

    # data is list of batches
    for batch in data:
        c_batches.push_back((<RecordBatch> batch).sp_batch)
        c_fragment = shared_ptr[CFragment](
            new CInMemoryFragment(c_batches, _true.unwrap()))
        c_batches.clear()
        c_fragment_vector.push_back(c_fragment)
    with nogil:
        check_status(
            CFileSystemDataset.Write(
                c_schema,
                c_format,
                c_filesystem,
                c_base_dir,
                c_partitioning,
                c_context,
                MakeVectorIterator(move(c_fragment_vector))

jorisvandenbossche (Member, Author):

Yes, this is one of the aspects I wanted to ask about: indeed, right now, by creating a single Fragment, everything gets written to a single file. If you have a big table, you might want to write a file per batch. But on the other hand, that might also result in a lot of small files (certainly if you also split on some partition columns)? E.g. in Parquet, you typically have files with multiple row groups (which might be somewhat comparable to multiple record batches).

So we should maybe have some configurability?

jorisvandenbossche (Member, Author):

@bkietz So I updated this to enable the use case of writing multiple fragments, but as an explicit choice by the user: passing a single Table/RecordBatch writes a single fragment (even if the Table has multiple batches), while passing a list of Tables/RecordBatches writes multiple fragments.

So if the user has a Table with multiple batches and wants to write this as multiple files instead of a single file (in case there is no partitioning), they can do write_dataset(table.to_batches()).
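
For illustration, a minimal sketch of that usage, assuming the write_dataset signature that eventually landed in pyarrow.dataset (output paths are illustrative, and the feather format is used only because Parquet support is left for the follow-up PR):

import pyarrow as pa
import pyarrow.dataset as ds

table = pa.table({"year": [2019, 2020, 2020], "value": [1.0, 2.0, 3.0]})

# Passing the Table directly: it is written as a single fragment,
# i.e. a single file (per partition directory, if any).
ds.write_dataset(table, "out_single_file", format="feather")

# Passing a list of record batches: each batch becomes its own fragment,
# i.e. potentially several files, which also gives the writer something
# to parallelize over.
ds.write_dataset(table.to_batches(), "out_file_per_batch", format="feather")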

bkietz (Member):

Great, this provides a minimal handle on write parallelism; this will be necessary until the C++ implementation can be made more robustly parallel than "one thread per written fragment".

python/pyarrow/dataset.py: outdated review comments (resolved)

jorisvandenbossche (Member, Author):

Going to merge this, so I can add Parquet support in a follow-up PR.

jorisvandenbossche deleted the ARROW-9658-dataset-writing branch on September 1, 2020 11:43

ldacey commented Sep 16, 2020

Do you think it is possible to add support for repartitioning datasets? I am facing some issues with many small files, simply due to the frequency at which I need to download data, which is compounded by the partitions.

I asked this on Jira as well, but:

  1. I download data every 30 minutes from a source using UUID parquet filenames (each file just contains new or updated records since the last retrieval so I could not think of a good callback function name). This is 48 parquet files per day.
  2. The data is then partitioned based on the created_date which creates even more files (some can be quite small)
  3. When I query the dataset, I need to read in a lot of small files.

I would then want to read the data and repartition the files using a callback function, so that the dozens of files in the partition ("date", "==", "2020-09-15") would become 2020-09-15.parquet, consolidated into a single file to keep things tidy. I know I can do this with Spark, but it would be nice to have a native pyarrow method.
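
A minimal sketch of one way to do that consolidation with the existing reading APIs (outside of this PR), assuming a hive-partitioned Parquet dataset under base_dir with a date partition column; paths and column names are illustrative:

import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Discover the existing partitioned dataset of many small files.
dataset = ds.dataset("base_dir", format="parquet", partitioning="hive")

# Read all the small files belonging to one partition into a single table ...
table = dataset.to_table(filter=ds.field("date") == "2020-09-15")

# ... and rewrite them as one consolidated file.
pq.write_table(table, "2020-09-15.parquet")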

emkornfield pushed a commit to emkornfield/arrow that referenced this pull request Oct 16, 2020
Closes apache#7921 from jorisvandenbossche/ARROW-9658-dataset-writing

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Closes apache#7921 from jorisvandenbossche/ARROW-9658-dataset-writing

Authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>