[Python] write_dataset how to add and update data #14834

phpsxg · 2022-12-05T03:08:35Z

Describe the usage question you have. Please include as many useful details as possible.

First, I save a Parquet dataset containing 5 rows:

import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds

dataset_name = 'test_update'
df = pd.DataFrame({'one': [-1, 3, 2.5, 2.5, 2.5],
                   'two': ['foo', 'bar', 'baz', 'foo', 'foo'],
                   'three': [True, False, True, False, False]},
                  )
table = pa.Table.from_pandas(df)
ds.write_dataset(table, dataset_name,
    existing_data_behavior='overwrite_or_ignore',
    format="parquet")

Then I want to add two new rows, for a total of 7. The new data is as follows:

df = pd.DataFrame({'one': [1, 2],
                   'two': ['foo-insert1','foo-insert2'],
                   'three': [True, False]},
                  )

table = pa.Table.from_pandas(df)
ds.write_dataset(table, dataset_name,
    # existing_data_behavior='delete_matching',
    existing_data_behavior='overwrite_or_ignore',
    format="parquet")

But this overwrites the original data, leaving only two rows. How can I add new data on top of the existing data?
I have another question: how do I update data based on a condition? For example:

Set three to False in the row where one=-1 and two=foo.

python=3.10
pyarrow=10.0.0

Component(s)

Parquet, Python

phpsxg · 2022-12-05T07:44:03Z

We want to export database data to corresponding Parquet files and then serve reads directly from those files, which requires a corresponding update operation.

phpsxg · 2022-12-05T08:24:38Z

Does pyarrow support append-insert and update?

assignUser · 2022-12-05T12:30:49Z

No, you cannot update or insert into an existing Parquet file, because Parquet files are immutable. This restriction is inherent to Parquet, not pyarrow. (The spec theoretically supports appending, but no library supports it; details.)

So to update an existing Parquet file, you have to read the existing data into memory, add the new data, and write the result to disk as a new file (with the same name). You can use partitioning to append new data to a multi-file Parquet dataset by adding new files or overwriting only small partitions. See the pyarrow docs for existing_data_behavior:

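‘overwrite_or_ignore’ will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.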

‘delete_matching’ is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely.

I have opened apache/arrow-cookbook#278 to add an example of this to the Python cookbook.

I am not quite sure I understand your second question.

phpsxg added the Type: usage Issue is a user question label Dec 5, 2022

AlenkaF added the Component: Python label Dec 5, 2022

AlenkaF changed the title ~~write_dataset how to add and update data~~ [Python] write_dataset how to add and update data Dec 5, 2022

assignUser mentioned this issue Dec 5, 2022

ds.write_dataset how to implement new data? #14837

Closed

rok added Component: Parquet Component: Python and removed Component: Python labels Jan 8, 2023
