[C++] Implement column projection pushdown to ORC reader in Datasets API #29423

Closed
Tracked by #18729
asfimport opened this issue Aug 30, 2021 · 1 comment

@asfimport

ARROW-13572 (#10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the projection requested in the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can optimize this step.

The tricky part is converting a field name in the Arrow schema to the corresponding column index in the ORC schema. Currently, this logic lives in the Python bindings:

    def _traverse(typ, counter):
        if isinstance(typ, Schema) or types.is_struct(typ):
            for field in typ:
                path = (field.name,)
                yield path, next(counter)
                for sub, c in _traverse(field.type, counter):
                    yield path + sub, c
        elif _is_map(typ):
            yield from _traverse(typ.value_type, counter)
        elif types.is_list(typ):
            # Skip one index for list type, since this can never be selected
            # directly
            next(counter)
            yield from _traverse(typ.value_type, counter)
        elif types.is_union(typ):
            # Union types not supported, just skip the indexes
            for dtype in typ:
                next(counter)
                for sub_c in _traverse(dtype, counter):
                    pass


    def _schema_to_indices(schema):
        return {'.'.join(i): c for i, c in _traverse(schema, count(1))}

so it needs to be moved to C++ before the C++ Datasets reader can use it.
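
To make the name-to-index mapping concrete, here is a minimal, dependency-free sketch of the same pre-order numbering. It is an illustration only, not the pyarrow or Arrow C++ implementation. The type encoding is an assumption made for this example (a dict for a struct or schema, a `("list", elem)` or `("map", key, value)` tuple, a string for a primitive), the names `traverse`, `schema_to_indices` and `columns_to_indices` are hypothetical, and union types are left out for brevity.

```python
from itertools import count


def traverse(typ, counter):
    """Yield (path, column_index) pairs, numbering the type tree in pre-order."""
    if isinstance(typ, dict):
        # Struct (or top-level schema): every named field gets its own index.
        for name, child in typ.items():
            path = (name,)
            yield path, next(counter)
            for sub, idx in traverse(child, counter):
                yield path + sub, idx
    elif isinstance(typ, tuple) and typ[0] == "map":
        # A map's children are addressed as "key" and "value".
        _, key, value = typ
        yield from traverse({"key": key, "value": value}, counter)
    elif isinstance(typ, tuple) and typ[0] == "list":
        # The list element occupies an index but has no name to select it by.
        next(counter)
        yield from traverse(typ[1], counter)
    # Primitive types have no children, so nothing is yielded for them.


def schema_to_indices(schema):
    # Index 0 is the root struct, so named fields start at 1.
    return {".".join(path): idx for path, idx in traverse(schema, count(1))}


def columns_to_indices(names, indices):
    """Resolve requested (possibly dotted) field names to column indices."""
    return [indices[name] for name in names]


# Toy schema; dicts preserve insertion order (Python 3.7+).
schema = {
    "id": "int64",
    "point": {"x": "double", "y": "double"},
    "tags": ("list", "string"),
    "attrs": ("map", "string", "int32"),
    "name": "string",
}
indices = schema_to_indices(schema)
selected = columns_to_indices(["point", "name"], indices)
```

With this toy schema, `tags` maps to index 5 and its unnamed element takes index 6, so `attrs` lands on 7. A C++ port would need to reproduce this skipping behaviour to hand the correct indices to the ORC reader.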

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

Note: This issue was originally created as ARROW-13797. Please see the migration documentation for further details.

@asfimport

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11372
#11372
