ParquetDataset.read(): selectively reading array column #17836

@asfimport

Scenario:

  • I created a DataFrame in Spark and saved it as Parquet.

  • The columns include simple types, e.g. String, but also an array of doubles.

Issue:

I can read the whole dataset using ParquetDataset in pyarrow.
Selectively reading a simple-typed column works.
Selectively reading the array column raises a KeyError at the following place:

    KeyError: 'c'

    /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
    513 self.column_idx_map[col_bytes] = i
    514
    --> 515 return self.column_idx_map[tobytes(column_name)]
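The traceback shows an exact-match dictionary lookup on the requested column name. A minimal pure-Python sketch of why that lookup can fail for a nested column is given here. It assumes that the map is keyed by leaf column paths and that Spark stores the array as a nested leaf such as `c.list.element`; both are assumptions, not confirmed by the output above. The helper `column_indices_by_prefix` is hypothetical and only illustrates one possible way to resolve a top-level name.

```python
# Hypothetical leaf column paths in file order. The nested path for "c"
# is an assumption about how the array column is laid out in the file.
leaf_paths = ["a", "b", "c.list.element", "d"]

# Mirrors the dict seen in the traceback: leaf path (as bytes) -> leaf index.
column_idx_map = {path.encode("utf8"): i for i, path in enumerate(leaf_paths)}


def column_name_idx(column_name):
    """Exact-match lookup, as in the traceback; raises KeyError on a miss."""
    return column_idx_map[column_name.encode("utf8")]


def column_indices_by_prefix(column_name):
    """Hypothetical alternative: match the name itself or any leaf nested under it."""
    prefix = column_name + "."
    return [
        i
        for i, path in enumerate(leaf_paths)
        if path == column_name or path.startswith(prefix)
    ]


# A simple column resolves by exact match.
print(column_name_idx("a"))

# The top-level array name is not a key in the map, so the lookup fails.
try:
    column_name_idx("c")
except KeyError as exc:
    print("KeyError:", exc)

# Prefix matching finds the nested leaf that belongs to "c".
print(column_indices_by_prefix("c"))
```

If this assumption holds, requesting the full leaf path instead of `c` might avoid the error, but I have not verified that against this pyarrow version.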

When I read the whole dataset instead, I get the correct schema and metadata:

pyarrow.Table
a: string
b: string
c: list<element: double not null>
child 0, element: double
d: int64
metadata

{'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}

I might just be missing the correct naming convention for the array column.
If so, that name should be reflected in the metadata.

Thanks!

Reporter: Young-Jun Ko

Note: This issue was originally created as ARROW-1842. Please see the migration documentation for further details.
