ParquetDataset.read(): selectively reading array column

Scenario:
- I created a DataFrame in Spark and saved it as Parquet.
- The columns include simple types, e.g. string, but also an array of doubles.

Issue:
- Reading the whole dataset with ParquetDataset in pyarrow works.
- Selectively reading a simple-type column works.
- Selectively reading the array column raises a KeyError at the following place:

KeyError: 'c'

  /home/hadoop/Python/lib/python2.7/site-packages/pyarrow/_parquet.pyx in pyarrow._parquet.ParquetReader.column_name_idx (/arrow/python/build/temp.linux-x86_64-2.7/_parquet.cxx:9777)()
      513                 self.column_idx_map[col_bytes] = i
      514 
--> 515         return self.column_idx_map[tobytes(column_name)]
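
The two source lines in the traceback suggest that the reader builds a dictionary from column names to indices and then looks up the requested name directly. Below is a minimal sketch of that pattern. It assumes, which the traceback does not confirm, that the keys are Parquet leaf column paths and that the array values are stored under a path such as `c.list.element`. The helper `build_column_idx_map` and the path list are hypothetical.

```python
# Hypothetical leaf column paths for the file in this report. The nested
# path for "c" is an assumption, not something shown in the traceback.
leaf_paths = ["a", "b", "c.list.element", "d"]


def build_column_idx_map(paths):
    """Map each leaf column path (as bytes) to its position."""
    return {path.encode("utf8"): i for i, path in enumerate(paths)}


column_idx_map = build_column_idx_map(leaf_paths)


def column_name_idx(column_name):
    """Look up a column exactly as requested; unknown names raise KeyError."""
    return column_idx_map[column_name.encode("utf8")]


print(column_name_idx("a"))               # simple column: found
print(column_name_idx("c.list.element"))  # full leaf path: found
try:
    column_name_idx("c")
except KeyError as exc:
    print("KeyError:", exc)               # top-level name alone is not a key
```

Under these assumptions, a plain `'c'` would fail in the same way as the error reported here, while the full leaf path would resolve.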

When I read the whole dataset, I get the correct schema and metadata:


pyarrow.Table
a: string
b: string
c: list<element: double not null>
  child 0, element: double
d: int64
metadata
--------
{'org.apache.spark.sql.parquet.row.metadata': '{"type":"struct","fields":[{"name":"a","type":"string","nullable":true,"metadata":{}},{"name":"b","type":"string","nullable":true,"metadata":{}},{"name":"c","type":{"type":"array","elementType":"double","containsNull":false},"nullable":true,"metadata":{}},{"name":"d","type":"long","nullable":false,"metadata":{}}]}'}
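
The Spark row metadata in this output is a JSON document, so it can be parsed to see exactly which names it records. The sketch below copies that JSON string from the output and lists each top-level field. `describe_fields` is a hypothetical helper, and it only handles the array case that appears here.

```python
import json

# JSON copied from the 'org.apache.spark.sql.parquet.row.metadata' entry.
SPARK_ROW_METADATA = (
    '{"type":"struct","fields":['
    '{"name":"a","type":"string","nullable":true,"metadata":{}},'
    '{"name":"b","type":"string","nullable":true,"metadata":{}},'
    '{"name":"c","type":{"type":"array","elementType":"double",'
    '"containsNull":false},"nullable":true,"metadata":{}},'
    '{"name":"d","type":"long","nullable":false,"metadata":{}}]}'
)


def describe_fields(metadata_json):
    """Return (name, type description) pairs for each top-level field."""
    described = []
    for field in json.loads(metadata_json)["fields"]:
        field_type = field["type"]
        # Array types are nested objects; flatten them to "array<elementType>".
        if isinstance(field_type, dict) and field_type.get("type") == "array":
            field_type = "array<{}>".format(field_type["elementType"])
        described.append((field["name"], field_type))
    return described


for name, field_type in describe_fields(SPARK_ROW_METADATA):
    print(name, field_type)
```

The metadata names only the top-level field `c` with an array type; it records no separate name for the nested element column.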


I might just be missing the correct naming convention for the array column.
If so, that name should be reflected in the metadata.

Thanks!

**Reporter**: [Young-Jun Ko](https://issues.apache.org/jira/browse/ARROW-1842)

<sub>**Note**: *This issue was originally created as [ARROW-1842](https://issues.apache.org/jira/browse/ARROW-1842). Please see the [migration documentation](https://github.com/apache/arrow/issues/14542) for further details.*</sub>

ParquetDataset.read(): selectively reading array column #17836
