
Python PyArrow Dataset Writer #542

Closed
wjones127 opened this issue Jan 12, 2022 · 9 comments
Labels
enhancement New feature or request

Comments

@wjones127 (Collaborator) commented Jan 12, 2022

Description

We have a PyArrow Dataset reader that works for Delta tables. Looking through the writer side, I think we might have enough functionality to create one.

Here are my rough notes on how that might work:

  • Use pyarrow.dataset.write_dataset to write the parquet files.
    • basename_template could be set to a UUID, guaranteeing file uniqueness.
    • existing_data_behavior could be set to overwrite_or_ignore. (Not great behavior if there's ever a UUID collision, though. Might make a ticket to give a better option in PyArrow.)
    • file_visitor will be set to a callback that appends each filename and its metadata to a list. The metadata contains the file statistics.
  • Take parameters that determine what kind of transaction this is. I think initial support should be for Append and Overwrite, including support for creating a new table. But we need to leave room in the API for update, delete, and merge.
    • Is there any standard for CommitInfo in delta-rs?
  • Use create_transaction to create the transaction, using the file path and stats retrieved earlier.
  • Use try_commit_transaction to commit it.

There are probably some protocol details I'm overlooking, so I'd welcome any guidance. A rough sketch of the write flow is below.
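Here is a rough sketch of the PyArrow side of that flow. The write_dataset arguments are real PyArrow parameters; the hand-off to create_transaction / try_commit_transaction at the end is only a placeholder, since those Python bindings don't exist yet:

import uuid
import pyarrow as pa
import pyarrow.dataset as ds

def write_parquet_files(data: pa.Table, table_uri: str):
    """Write Parquet files for a commit and collect (path, metadata) pairs."""
    added_files = []

    def visit_file(written_file):
        # written_file.path is the file path relative to base_dir;
        # written_file.metadata is the Parquet FileMetaData, which holds
        # the column statistics needed for the Add actions.
        added_files.append((written_file.path, written_file.metadata))

    ds.write_dataset(
        data,
        base_dir=table_uri,
        format="parquet",
        # A UUID in the template keeps file names unique across writes.
        basename_template=f"part-{uuid.uuid4()}-{{i}}.parquet",
        existing_data_behavior="overwrite_or_ignore",
        file_visitor=visit_file,
    )
    return added_files

# The (path, metadata) pairs would then be turned into Add actions and passed
# to create_transaction / try_commit_transaction on the Rust side.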

@wjones127 (Collaborator, Author)

API Draft

Standalone function for writing, to allow for idempotent create or append/overwrite:

def write_deltalake(
    table: Union[str, DeltaTable],
    data,
    mode: Literal['append', 'overwrite'] = 'append',
    backend: str = 'pyarrow'
):
    pass

I'm thinking the backend parameter would function similarly to the engine parameter in the Parquet functions in Pandas (example). For now PyArrow might be the only backend, but I foresee we could also support a DataFusion-based one.
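As a sketch, the dispatch could be as simple as the following (the _write_pyarrow helper name is made up):

def write_deltalake(table, data, mode='append', backend='pyarrow'):
    if backend == 'pyarrow':
        _write_pyarrow(table, data, mode)  # hypothetical PyArrow-based writer
    else:
        raise NotImplementedError(f"Backend '{backend}' is not supported yet")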

Add methods to DeltaTable for operations.

class DeltaTable:
    ...
    def write(self, data, mode: Literal['append', 'overwrite'] = 'append', backend: str = 'pyarrow'):
        write_deltalake(self, data, mode, backend)

    def delete_where(self, where_expr, backend: str = 'pyarrow'):
        '''Delete rows matching the expression'''
        pass

    def update(self, where_expr, set_values: Dict[str, Any], backend: str = 'pyarrow'):
        '''Modify values in rows matching the expression'''
        pass

I'll leave the signature for merge for later; it likely involves a builder.

Draft Usage Docs

For overwrites and appends, use write_deltalake(). If the table does not
already exist, it will be created. The data parameter will accept a Pandas
DataFrame, a PyArrow Table, or an iterator of PyArrow Record Batches.

from deltalake.writer import write_deltalake
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3]})
write_deltalake('path/to/table', df)
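A PyArrow Table or an iterator of record batches should work the same way, for example:

import pyarrow as pa

tbl = pa.table({'x': [1, 2, 3]})
write_deltalake('path/to/table', tbl)

# An iterator of RecordBatches would also be accepted (a schema argument
# may end up being required for this case):
write_deltalake('path/to/table', iter(tbl.to_batches(max_chunksize=1000)))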

By default, writes append to the table. To overwrite, pass in mode='overwrite':

write_deltalake('path/to/table', df, mode='overwrite')

If you have a DeltaTable object, you can also call the DeltaTable.write()
method:

DeltaTable('path/to/table').write(df, mode='overwrite')

To delete rows based on an expression, use DeltaTable.delete():

import pyarrow.dataset as ds
DeltaTable('path/to/table').delete(ds.field('x') == 2)

To update a subset of rows with new values, use DeltaTable.update():

import pyarrow.dataset as ds

# Increment y where x = 2
DeltaTable('path/to/table').update(
    where_expr=ds.field('x') == 2,
    set_values={
        'y': ds.field('y') + 1
    }
)

@GraemeCliffe-inspirato

I'm not sure what the convention is, but it might be a good idea to have overwrite be the default argument for mode of .write()

@wjones127 (Collaborator, Author)

I'm not sure what the convention is, but it might be a good idea to have overwrite be the default argument for mode of .write()

The default in PySpark (which I think most users will be coming from) is to error if any data already exists.

https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrameWriter.saveAsTable.html?highlight=saveastable#pyspark.sql.DataFrameWriter.saveAsTable

That makes sense for the standalone function write_deltalake(), but maybe not as much for the method on DeltaTable, since in that case I think you always have some data there. Maybe there should be DeltaTable.append() and DeltaTable.overwrite() methods, rather than a DeltaTable.write()?
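For illustration, that alternative could look something like:

class DeltaTable:
    ...
    def append(self, data, backend: str = 'pyarrow'):
        write_deltalake(self, data, mode='append', backend=backend)

    def overwrite(self, data, backend: str = 'pyarrow'):
        write_deltalake(self, data, mode='overwrite', backend=backend)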

@GraemeCliffe-inspirato

@wjones127 I'm interested in helping support this but haven't contributed to the project before. Are any of these reasonable for a first time contributor?

@wjones127 (Collaborator, Author) commented Apr 7, 2022

@GraemeCliffe-inspirato One good first issue might be the delta.appendOnly part of #575. The other part is more complicated, but we'd happily take a PR for just that first piece.

#576 would be a good one if you want to get more familiar with the Rust part of the project.

@WarSame (Contributor) commented Apr 22, 2022

@wjones127 I have submitted a small PR for the first part of #575. I'm interested in learning more about the invariants part of #575.

@wjones127 (Collaborator, Author)

@WarSame RE: invariants, see my comment in #575

@k-ai0 commented Sep 25, 2022

Is the functionality of "table creation" still a WIP? I know that the grid shows transactions are not yet up and running.

Note that "./test_deltalake_table" does not exist on the filesystem for the below code example:

import deltalake
import pandas
import numpy as np
df = pandas.DataFrame(np.random.uniform(0,1, (40,3)))
df.columns = ['X','Y','Z']
deltalake.writer.write_deltalake('./test_deltalake_table',df)

yields

PyDeltaTableError: Failed to read delta log object: Generic DeltaObjectStore error: No such file or directory (os error 2)


referencing the feature grid from the README (screenshot: delta_lake_grid).

I know the grid shows that "write transactions" are not yet enabled. I'm posting to check whether this includes the initial creation of the deltalake folder/file structure too. It seems like it does, but just checking.

@wjones127 (Collaborator, Author)

Is the functionality of "table creation" still a WIP?

No, that part should work now.

Could you create a new issue for the error you are showing? Make sure to provide the version of deltalake you are using.
