
[Python] Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow #30648

Open
asfimport opened this issue Dec 16, 2021 · 4 comments

@asfimport

When trying to save a Pandas dataframe with a nested type (list within list, list within dict) using the pyarrow engine, the following error is encountered:

ArrowInvalid: ('cannot mix list and non-list, non-null values', 'Conversion failed for column A with type object')

Repro:

import pandas as pd
x = pd.DataFrame({"A": [[24, 27, [1, 1]]]})
x.to_parquet('/tmp/a.pqt', engine="pyarrow")

Doing a bit of googling, it appears that this is a known Arrow shortcoming. However, this is a commonly encountered data structure, and 'fastparquet' handles it seamlessly. Is there a proposed timeline/plan for fixing this?

Reporter: Karthik

Note: This issue was originally created as ARROW-15142. Please see the migration documentation for further details.

@khoatrandata

Hi, is there any update on this issue, please?

@westonpace
Member

Sorry, I'm not entirely sure what type you are looking for. Currently you are providing:

values
24
27
[1, 1]

Columns must be homogeneous within Arrow / Parquet. If you want list-within-list, you should provide:

values
[24]
[27]
[1, 1]

You can achieve this with:

x = pd.DataFrame({"A": [[[24], [27], [1, 1]]]})

@kou kou changed the title Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow [Python] Cannot mix struct and non-struct, non-null values error when saving nested types with PyArrow Jan 26, 2023
@khoatrandata

Hi @westonpace, please see this for a more realistic example.

@westonpace
Member

@khoatrandata if I understand that issue correctly, the user is trying to load a column (with type=jsonb) into Arrow. There is no equivalent Arrow data type (and as far as I can tell no one has ever asked for it before). I think a variable-length binary column should be sufficient for many purposes.

It looks like the current approach is to first load the column into Python objects (this gives you a heterogeneous list of Python objects). This list is then passed to pa.array. However, there is no guarantee that this can be turned into an Arrow array, and no way to know in advance what the result will be: if all the values are numbers you'll get an int64 array; if all the values are strings you'll get a string array; if the values are mixed you'll get the reported exception.

If the goal is to go to parquet and back then the safest thing to do would be to load the column as binary and save it in parquet as binary (with your own custom metadata to indicate it is a JSONB field).

You could also create a JSONB extension type based on the variable length binary data type.
