[Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of records #22407

asfimport · 2019-07-22T17:39:11Z

I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns.

I'm including some code I've written and have been using to handle arrow conversions to and from ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of roughly 6x.

import pyarrow as pa


def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: list of dictionaries
    :param names: list of column names (ignored when a schema is given)
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        # Build each column with the type declared in the schema; missing keys become None
        for column in schema.names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            # Neither schema nor names given: fall back to the keys of the first record
            names = list(pylist[0].keys()) if pylist else []
        # Let pyarrow infer each column type; missing keys become None
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns]) if column == '_index' else pydict[column][row] for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names} for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe, type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns the set of unique value tuples for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns]) for row in range(arrow_table.num_rows)])
    return index_set
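
The pivot that from_pylist() and to_pylist() perform can be illustrated without pyarrow. The following is only a sketch in plain Python: the helper names rows_to_columns and columns_to_rows are made up for this example and are not part of pyarrow. It shows how a key missing from a record turns into None, as in the functions above.

```python
def rows_to_columns(rows, names):
    """Pivot a list of row dicts into a dict of column lists.

    A key that is missing from a row becomes None, mirroring the
    `v[column] if column in v else None` expression in from_pylist().
    """
    return {name: [row.get(name) for row in rows] for name in names}


def columns_to_rows(columns):
    """Pivot a dict of equally long column lists back into row dicts."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(num_rows)]


rows = [{"id": 1, "name": "a"}, {"id": 2}]  # second record has no "name"
columns = rows_to_columns(rows, ["id", "name"])
round_tripped = columns_to_rows(columns)
print(columns)
print(round_tripped)
```

Note that the round trip is not lossless here: the missing key comes back as an explicit None.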

Here are my benchmarks comparing the pandas to arrow to python route against pandas.to_dict():

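import time

import pyarrow as pa

# benchmark pandas conversion to python objects
# (panda_df1 and panda_df4 are existing pandas DataFrames of 1 and 4 million rows)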
print('**benchmark 1 million rows**')
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

print('**benchmark 4 million rows**')
start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python:: " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

**benchmark 1 million rows**
pandas to python: 13.204811334609985
pandas to arrow to python: 2.00173282623291
**benchmark 4 million rows**
pandas to python:: 51.655067682266235
pandas to arrow to python: 8.562284231185913
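
As a quick check on the "roughly 6x" figure, the ratios can be computed directly from the timings printed above. This is just arithmetic on those four numbers; nothing is re-measured.

```python
# Timings in seconds, copied from the benchmark output above.
timings = {
    "1 million rows": {"pandas": 13.204811334609985, "arrow": 2.00173282623291},
    "4 million rows": {"pandas": 51.655067682266235, "arrow": 8.562284231185913},
}

# Speedup = time of pandas.to_dict() divided by time of the arrow route.
speedups = {label: t["pandas"] / t["arrow"] for label, t in timings.items()}

for label, speedup in speedups.items():
    print(f"{label}: {speedup:.1f}x")
# 1 million rows: 6.6x
# 4 million rows: 6.0x
```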

Reporter: David Lee / @davlee1972
Assignee: Alenka Frim / @AlenkaF

Related issues:

[Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist() (duplicates)
[Python] Allow creating Table from Python dict (relates to)

PRs and other links:

GitHub Pull Request #12010

Note: This issue was originally created as ARROW-6001. Please see the migration documentation for further details.

asfimport · 2019-07-22T17:43:18Z

Wes McKinney / @wesm:
Hm, so we have Table.from_pydict

https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L1021

I think Table.from_arrays could be improved to accept other Python sequences. Can we split up this ticket into the different improvements you're proposing (and clarify how what you're saying is different from the existing Table.from_pydict function)?

asfimport · 2019-07-22T17:46:07Z

David Lee / @davlee1972:
Current implementation

asfimport · 2019-07-22T18:56:40Z

Antoine Pitrou / @pitrou:
Also cc @jorisvandenbossche

asfimport · 2019-07-23T15:37:16Z

David Lee / @davlee1972:

Table.from_pydict in 0.14.1 looks fine. The code I originally reviewed iterated through the ordered dictionary keys instead of the schema field names.

Here are some test samples for to_pylist() and from_pylist():

test_schema = pa.schema([
    pa.field('id', pa.int16()),
    pa.field('struct_test', pa.list_(pa.struct([
        pa.field('child_id', pa.int16()),
        pa.field('child_name', pa.string()),
    ]))),
    pa.field('list_test', pa.list_(pa.int16())),
])
test_data = [
    {'id': 1, 'struct_test': [{'child_id': 11, 'child_name': '_11'}, {'child_id': 12, 'child_name': '_12'}], 'list_test': [1, 2, 3]},
    {'id': 2, 'struct_test': [{'child_id': 21, 'child_name': '_21'}], 'list_test': [4, 5]},
]
test_tbl = from_pylist(test_data, schema=test_schema)
test_list = to_pylist(test_tbl)
test_tbl
test_list

asfimport · 2019-07-31T07:47:18Z

Joris Van den Bossche / @jorisvandenbossche:
See also ARROW-4032 for similar discussion (I closed that one to not have duplicate issues).

Since we now have to_pydict / from_pydict, I am repurposing this issue to focus on the to/from list of dicts use case.

asfimport · 2019-07-31T07:58:16Z

Joris Van den Bossche / @jorisvandenbossche:
I think the functionality to convert to / from a list of dicts (a "list of records") is something nice to have in pyarrow. The question is then where to fit it in or how to call the new method.

> I think Table.from_arrays could be improved to accept other Python sequences

I personally would not add such functionality to from_arrays, which is working column-wise (the arrays you pass make up the columns of the resulting Table). That's a well defined scope, and I would keep functionality to convert row-wise input data in a separate function.

For from_pydict, it is similar: that function also currently works column-wise.

So I think new methods such as from_pylist / to_pylist are the better approach.
The only thing I am not fully sure about is the name "pylist", as that name does not directly reflect that it is a list of rows as dicts (it could also be a list of column-wise arrays). In pandas, this is basically called from_records, but "records" could also be confusing in the arrow context given that we have RecordBatches (although the method to convert a list of those is already called from_batches).
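
To make the naming concern concrete, here is a small sketch in plain Python (no pyarrow calls; the helper is_row_wise is hypothetical and exists only for this illustration). The same two-row table can be written as a list in either orientation, so "pylist" alone does not say which one is meant.

```python
from collections.abc import Mapping

# Column-wise: one inner list per column, with the names passed separately.
column_names = ["id", "name"]
columns_as_list = [[1, 2], ["a", "b"]]

# Row-wise: one dict per row, the "list of records" discussed in this issue.
rows_as_list = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]


def is_row_wise(data):
    """Hypothetical check: True when every element is a mapping (a row dict)."""
    return len(data) > 0 and all(isinstance(item, Mapping) for item in data)


print(is_row_wise(columns_as_list))  # False
print(is_row_wise(rows_as_list))     # True
```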

asfimport · 2020-09-24T14:02:12Z

Antoine Pitrou / @pitrou:
I think calling this from_pylist/to_pylist is fine. I would expect it to mean "a list of individual rows".

However, a question remains: does to_pylist return a list of tuples or a list of dicts?
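
For reference, the two candidate return shapes look like this when built from a column dictionary of the kind to_pydict() returns. This is a plain-Python sketch with made-up sample data, not a proposal for the actual implementation.

```python
# Column-oriented data, the shape that Table.to_pydict() produces.
columns = {"id": [1, 2], "name": ["a", "b"]}

# Option 1: list of tuples. Compact, but the column names are lost and
# the meaning of each position depends on the column order.
rows_as_tuples = list(zip(*columns.values()))

# Option 2: list of dicts. Self-describing, at the cost of repeating
# the keys in every row.
rows_as_dicts = [dict(zip(columns, values)) for values in rows_as_tuples]

print(rows_as_tuples)  # [(1, 'a'), (2, 'b')]
print(rows_as_dicts)   # [{'id': 1, 'name': 'a'}, {'id': 2, 'name': 'b'}]
```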

asfimport · 2022-01-11T13:35:41Z

Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request 12010
#12010

asfimport closed this as completed Jan 11, 2022

asfimport assigned AlenkaF Jan 10, 2023

This was referenced Jan 11, 2023

[Python] New pyarrow.Table functions: from_pydict(), from_pylist() and to_pylist() #20633

Closed

[Python] Allow creating Table from Python dict #21655

Closed
