[Python] Can not refer to field in a list of structs #32794

Open
asfimport opened this issue Aug 26, 2022 · 2 comments

asfimport commented Aug 26, 2022

When a dataset column is a list of structs (list<struct<...>>), we cannot use pyarrow.field(..) or a dotted column name to reference a sub-field of the struct.

For example:

import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd

schema = pa.schema(
    [
        pa.field(
            "objects",
            pa.list_(
                pa.struct(
                    [
                        pa.field("name", pa.utf8()),
                        pa.field("attr1", pa.float32()),
                        pa.field("attr2", pa.int32()),
                    ]
                )
            ),
        )
    ]
)

table = pa.Table.from_pandas(
    pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
)
print(table)

dataset = ds.dataset(table)
print(dataset)
dataset.scanner(columns=["objects.attr2"]).to_table()

which throws an exception:

Traceback (most recent call last):
  File "foo.py", line 31, in <module>
    dataset.scanner(columns=["objects.attr2"]).to_table()
  File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2356, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2202, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in objects: list<item: struct<attr1: double, attr2: int64, name: string>>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string

Reporter: Lei (Eddy) Xu

Related issues:

Note: This issue was originally created as ARROW-17540. Please see the migration documentation for further details.

Miles Granger / @milesgranger:
I think this is a duplicate of ARROW-14596? (Fixing that will also allow providing a dotted path when use_legacy_dataset=False.)

Miles Granger / @milesgranger:
I should also mention that if you are only after a single list element, the following (admittedly ugly) snippet works as a stopgap until this is properly fixed:

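import pyarrow.compute as pc

dataset.to_table(
    columns={
        # Element 0 of each "objects" list, then struct child 1
        # (attr2 in the inferred type shown in the error message).
        "attr2": pc.struct_field(
            pc.list_element(ds.field("objects"), ds.scalar(0)),
            [1],
        )
    }
)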

asfimport added this to the 11.0.0 milestone Jan 11, 2023
raulcd removed this from the 11.0.0 milestone Jan 11, 2023
jorisvandenbossche added this to the 12.0.0 milestone Jan 12, 2023
jorisvandenbossche modified the milestones: 12.0.0, 13.0.0 Apr 6, 2023
AlenkaF removed this from the 13.0.0 milestone Jun 21, 2023