
[Python] Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow #30648

Open
asfimport opened this issue Dec 16, 2021 · 4 comments

@asfimport

When trying to save a Pandas dataframe with a nested type (list within list, list within dict) using the pyarrow engine, the following error is encountered:

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object')

Repro:

import pandas as pd
x = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")

Doing a bit of googling, it appears that this is a known Arrow shortcoming. However, this is a commonly encountered data structure, and 'fastparquet' handles it seamlessly. Is there a proposed timeline/plan for fixing this?

Reporter: Karthik

Note: This issue was originally created as ARROW-15142. Please see the migration documentation for further details.

@khoatrandata

Hi, is there any update on this issue, please?

@westonpace
Member

Sorry, I'm not entirely sure what type you are looking for. Currently you are providing:

values
24
27
[1, 1]

Columns must be homogeneous within Arrow / Parquet. If you want list-within-list, you should provide:

values
[24]
[27]
[1, 1]

You can achieve this with:

x = pd.DataFrame({"A": [[[24], [27], [1, 1]]]})

@kou kou changed the title Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow [Python] Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow Jan 26, 2023
@khoatrandata

Hi @westonpace, please see this for a more realistic example.

@westonpace
Member

@khoatrandata if I understand that issue correctly, the user is trying to load a column (with type=jsonb) into Arrow. There is no equivalent Arrow data type (and as far as I can tell no one has ever asked for it before). I think a variable-length binary column should be sufficient for many purposes.

It looks like the current approach is to first load the column into Python objects (this gives you a heterogeneous list of Python objects). This list is then passed to pa.array. However, there is no guarantee that this can be turned into an Arrow array, and no way to know in advance what the result will be: if all the values are numbers you'll get an int64 array; if all the values are strings you'll get a string array; if the values are mixed you'll get the reported exception.

If the goal is to go to parquet and back then the safest thing to do would be to load the column as binary and save it in parquet as binary (with your own custom metadata to indicate it is a JSONB field).

You could also create a JSONB extension type based on the variable length binary data type.
