[Python] Add from_pylist() and to_pylist() to pyarrow.Table to convert list of records #22407

Closed
asfimport opened this issue Jul 22, 2019 · 8 comments


asfimport commented Jul 22, 2019

I noticed that pyarrow.Table.to_pydict() exists, but pyarrow.Table.from_pydict() doesn't exist. There is a proposed ticket to create one, but it doesn't take into account potential mismatches between column order and number of columns.
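To make the mismatch concrete, here is a minimal plain-Python sketch (no pyarrow involved) of aligning a dictionary of columns to a target list of column names. The helper name `align_columns` is hypothetical and used only for illustration. It assumes that columns missing from the input should be filled with None and that extra columns should be dropped.

```python
def align_columns(pydict, names):
    """Return a new dict whose keys follow `names`, in that order."""
    # Row count is taken from the first column (0 for an empty input).
    num_rows = len(next(iter(pydict.values()), []))
    aligned = {}
    for name in names:
        if name in pydict:
            aligned[name] = list(pydict[name])
        else:
            # Column requested but not present: pad with None.
            aligned[name] = [None] * num_rows
    return aligned


# Input has a different column order, one extra column, and one missing column.
columns = {"b": [1, 2], "a": ["x", "y"], "extra": [True, False]}
aligned = align_columns(columns, ["a", "b", "c"])
print(aligned)
```

Whether extra columns should be dropped silently or raise an error is a design choice; this sketch drops them.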

I'm including some code I've written which I've been using to handle arrow conversions to ordered dictionaries and lists of dictionaries. I've also included an example where this can be used to speed up pandas.to_dict() by a factor of about 6.

 

import pyarrow as pa

def from_pylist(pylist, names=None, schema=None, safe=True):
    """
    Converts a python list of dictionaries to a pyarrow table
    :param pylist: python list of dictionaries
    :param names: list of column names (used only when no schema is given)
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    if schema:
        for column in schema.names:
            # keys missing from a record become nulls
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            # no names given: collect the record keys in first-seen order
            names = list(dict.fromkeys(key for v in pylist for key in v))
        for column in names:
            arrow_columns.append(pa.array([v[column] if column in v else None for v in pylist], safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def to_pylist(arrow_table, index_columns=None):
    """
    Converts a pyarrow table to a python list of dictionaries
    :param arrow_table: arrow table
    :param index_columns: columns to index
    :return: python list of dictionaries
    """
    pydict = arrow_table.to_pydict()
    if index_columns:
        columns = arrow_table.schema.names
        columns.append("_index")
        pylist = [{column: tuple([pydict[index_column][row] for index_column in index_columns]) if column == '_index' else pydict[column][row] for column in columns} for row in range(arrow_table.num_rows)]
    else:
        pylist = [{column: pydict[column][row] for column in arrow_table.schema.names} for row in range(arrow_table.num_rows)]
    return pylist

def from_pydict(pydict, names=None, schema=None, safe=True):
    """
    Converts a python ordered dictionary to a pyarrow table
    :param pydict: ordered dictionary
    :param names: list of column names
    :param schema: pyarrow schema
    :param safe: True or False
    :return: arrow table
    """
    arrow_columns = list()
    dict_columns = list(pydict.keys())
    if schema:
        for column in schema.names:
            if column in pydict:
                arrow_columns.append(pa.array(pydict[column], safe=safe, type=schema.types[schema.get_field_index(column)]))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe, type=schema.types[schema.get_field_index(column)]))
        arrow_table = pa.Table.from_arrays(arrow_columns, schema.names)
    else:
        if not names:
            names = dict_columns
        for column in names:
            if column in dict_columns:
                arrow_columns.append(pa.array(pydict[column], safe=safe))
            else:
                arrow_columns.append(pa.array([None] * len(pydict[dict_columns[0]]), safe=safe))
        arrow_table = pa.Table.from_arrays(arrow_columns, names)
    return arrow_table

def get_indexed_values(arrow_table, index_columns):
    """
    Returns the set of unique value tuples for a list of columns.
    :param arrow_table: arrow_table
    :param index_columns: list of column names
    :return: set of tuples
    """
    pydict = arrow_table.to_pydict()
    index_set = set([tuple([pydict[index_column][row] for index_column in index_columns]) for row in range(arrow_table.num_rows)])
    return index_set

Here are my benchmarks comparing pandas to arrow to python against pandas.to_dict(orient='records'):

 

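import time
import pyarrow as pa

# benchmark pandas conversion to python objects
# (panda_df1 and panda_df4 are existing pandas DataFrames with 1 and 4 million rows)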
print('**benchmark 1 million rows**')
start_time = time.time()
python_df1 = panda_df1.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python: " + str(total_time))

start_time = time.time()
arrow_df1 = pa.Table.from_pandas(panda_df1)
pydict = arrow_df1.to_pydict()
python_df1 = [{column: pydict[column][row] for column in arrow_df1.schema.names} for row in range(arrow_df1.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

print('**benchmark 4 million rows**')
start_time = time.time()
python_df4 = panda_df4.to_dict(orient='records')
total_time = time.time() - start_time
print("pandas to python:: " + str(total_time))

start_time = time.time()
arrow_df4 = pa.Table.from_pandas(panda_df4)
pydict = arrow_df4.to_pydict()
python_df4 = [{column: pydict[column][row] for column in arrow_df4.schema.names} for row in range(arrow_df4.num_rows)]
total_time = time.time() - start_time
print("pandas to arrow to python: " + str(total_time))

  

**benchmark 1 million rows**
pandas to python: 13.204811334609985
pandas to arrow to python: 2.00173282623291
**benchmark 4 million rows**
pandas to python:: 51.655067682266235
pandas to arrow to python: 8.562284231185913

Reporter: David Lee / @davlee1972
Assignee: Alenka Frim / @AlenkaF

Note: This issue was originally created as ARROW-6001. Please see the migration documentation for further details.


Wes McKinney / @wesm:
Hm, so we have Table.from_pydict

https://github.com/apache/arrow/blob/master/python/pyarrow/table.pxi#L1021

I think Table.from_arrays could be improved to accept other Python sequences. Can we split up this ticket into the different improvements you're proposing (and clarify how what you're saying is different from the existing Table.from_pydict function)?


David Lee / @davlee1972:
Current implementation


Antoine Pitrou / @pitrou:
Also cc @jorisvandenbossche


David Lee / @davlee1972:
 

Table.from_pydict in 0.14.1 looks fine. The code I originally reviewed iterated through the ordered dictionary keys instead of the schema field names.

Here are some test samples for to_pylist() and from_pylist():

 

test_schema = pa.schema([
    pa.field('id', pa.int16()),
    pa.field('struct_test', pa.list_(pa.struct([
        pa.field("child_id", pa.int16()),
        pa.field("child_name", pa.string()),
    ]))),
    pa.field('list_test', pa.list_(pa.int16())),
])
test_data = [
    {'id': 1, 'struct_test': [{'child_id': 11, 'child_name': '_11'}, {'child_id': 12, 'child_name': '_12'}], 'list_test': [1, 2, 3]},
    {'id': 2, 'struct_test': [{'child_id': 21, 'child_name': '_21'}], 'list_test': [4, 5]},
]
test_tbl = from_pylist(test_data, schema=test_schema)
test_list = to_pylist(test_tbl)
test_tbl
test_list

 


Joris Van den Bossche / @jorisvandenbossche:
See also ARROW-4032 for similar discussion (I closed that one to not have duplicate issues).

Since we have now to_pydict / from_pydict, repurposing this issue to focus on the to/from list of dicts usecase.


Joris Van den Bossche / @jorisvandenbossche:
I think the functionality to convert to / from a list of dicts (a "list of records") is something nice to have in pyarrow. The question is then where to fit it in or how to call the new method.

> I think Table.from_arrays could be improved to accept other Python sequences

I personally would not add such functionality to from_arrays, which is working column-wise (the arrays you pass make up the columns of the resulting Table). That's a well defined scope, and I would keep functionality to convert row-wise input data in a separate function.

For from_pydict, it is similar: that function also currently works column-wise.

So I think new methods such as from_pylist / to_pylist are the better approach.
The only thing I am not fully sure about is the name "pylist", as that name does not directly reflect that it is a list of rows as dicts (it could also be a list of column-wise arrays). In pandas, this is basically called from_records, but "records" could also be confusing in an Arrow context given that we have RecordBatches (although the method to convert a list of those is already called from_batches).
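The distinction between column-wise and row-wise input can be sketched in plain Python, without pyarrow. The helper names `records_to_columns` and `columns_to_records` are hypothetical and exist only for this illustration. The sketch assumes that a key missing from a record should become None in the column-wise form.

```python
def records_to_columns(records):
    """Row-wise to column-wise: list of dicts -> dict of lists."""
    # Collect column names in first-seen order across all records.
    names = []
    for record in records:
        for key in record:
            if key not in names:
                names.append(key)
    # dict.get returns None for keys a record does not have.
    return {name: [record.get(name) for record in records] for name in names}


def columns_to_records(columns):
    """Column-wise to row-wise: dict of lists -> list of dicts."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    return [{name: columns[name][i] for name in names} for i in range(num_rows)]


records = [{"id": 1, "name": "a"}, {"id": 2}]
columns = records_to_columns(records)
round_trip = columns_to_records(columns)
print(columns)
print(round_trip)
```

Note that the round trip is not lossless: the second record comes back with an explicit `"name": None` entry, because the column-wise form cannot tell a missing key from a null value.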


Antoine Pitrou / @pitrou:
I think calling this from_pylist/to_pylist is fine. I would expect it to mean "a list of individual rows".

However, a question remains: does to_pylist return a list of tuples or a list of dicts?
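The two candidate return shapes can be compared with a small plain-Python sketch, starting from column-wise data like the output of to_pydict(). The variable names are illustrative only, and nothing here is pyarrow API.

```python
# Column-wise data, as a dict of equal-length lists.
columns = {"id": [1, 2], "name": ["a", "b"]}
names = list(columns)

# Option 1: one tuple per row; column names must be carried separately.
rows_as_tuples = list(zip(*(columns[name] for name in names)))

# Option 2: one dict per row; each row repeats the column names.
rows_as_dicts = [dict(zip(names, row)) for row in rows_as_tuples]

print(rows_as_tuples)
print(rows_as_dicts)
```

Tuples are more compact but only make sense alongside the schema, while dicts are self-describing at the cost of repeating the keys in every row.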


Joris Van den Bossche / @jorisvandenbossche:
Issue resolved by pull request #12010
