[C++] Implement column projection pushdown to ORC reader in Datasets API #29423

asfimport · 2021-08-30T16:11:15Z

ARROW-13572 (#10991) added basic support for the ORC file format in the Datasets API, but the reader still reads all columns regardless of the ScanOptions. Since ORC is a columnar format that supports reading only specific fields, we can push the column projection down to the ORC reader.

The tricky part is mapping each field name in the Arrow schema to its column index in the ORC schema. Currently, this logic lives in the Python bindings (

arrow/python/pyarrow/orc.py

Lines 36 to 59 in 5ca62b9

    
    def _traverse(typ, counter):
        if isinstance(typ, Schema) or types.is_struct(typ):
            for field in typ:
                path = (field.name,)
                yield path, next(counter)
                for sub, c in _traverse(field.type, counter):
                    yield path + sub, c
        elif _is_map(typ):
            yield from _traverse(typ.value_type, counter)
        elif types.is_list(typ):
            # Skip one index for list type, since this can never be selected
            # directly
            next(counter)
            yield from _traverse(typ.value_type, counter)
        elif types.is_union(typ):
            # Union types not supported, just skip the indexes
            for dtype in typ:
                next(counter)
                for sub_c in _traverse(dtype, counter):
                    pass


    def _schema_to_indices(schema):
        return {'.'.join(i): c for i, c in _traverse(schema, count(1))}

), so this needs to be moved to C++.
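To illustrate the name-to-index mapping described above, here is a minimal, self-contained sketch in plain Python. It does not use pyarrow: `Primitive`, `StructType`, and `ListType` are hypothetical stand-ins for Arrow types, and only struct and list nesting is handled (map and union are left out). It assumes ORC numbers its columns in pre-order with the root struct at 0, which is why counting starts at 1.

```python
from dataclasses import dataclass
from itertools import count


# Hypothetical stand-ins for Arrow types; not part of pyarrow.
@dataclass
class Primitive:
    name: str


@dataclass
class StructType:
    fields: list  # list of (field_name, type) pairs


@dataclass
class ListType:
    value_type: object


def traverse(typ, counter):
    """Yield (path, column_index) pairs in pre-order."""
    if isinstance(typ, StructType):
        for name, child in typ.fields:
            path = (name,)
            yield path, next(counter)
            for sub, idx in traverse(child, counter):
                yield path + sub, idx
    elif isinstance(typ, ListType):
        # The list element occupies one index but has no name of its own,
        # so consume the index without yielding a path for it.
        next(counter)
        yield from traverse(typ.value_type, counter)
    # Primitive types have no children, so nothing is yielded.


def schema_to_indices(schema):
    # Index 0 is assumed to be the root struct, so fields start at 1.
    return {".".join(path): idx for path, idx in traverse(schema, count(1))}


schema = StructType([
    ("id", Primitive("int64")),
    ("tags", ListType(Primitive("string"))),
    ("point", StructType([("x", Primitive("double")),
                          ("y", Primitive("double"))])),
])

indices = schema_to_indices(schema)
# {'id': 1, 'tags': 2, 'point': 4, 'point.x': 5, 'point.y': 6}

# Translating a projection (list of field names) into column indices:
selected = sorted(indices[name] for name in ["id", "point.x"])
# [1, 5]
```

Index 3 is missing from the result because it belongs to the unnamed element column of `tags`. A C++ port would need to make the same kind of walk over the schema before passing the selected indices to the ORC reader.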

Reporter: Joris Van den Bossche / @jorisvandenbossche
Assignee: Joris Van den Bossche / @jorisvandenbossche

PRs and other links:

GitHub Pull Request #11372

Note: This issue was originally created as ARROW-13797. Please see the migration documentation for further details.

asfimport · 2021-10-11T15:48:40Z

Antoine Pitrou / @pitrou:
Issue resolved by pull request 11372
#11372

asfimport closed this as completed Oct 11, 2021

asfimport assigned jorisvandenbossche Jan 10, 2023

asfimport added this to the 6.0.0 milestone Jan 11, 2023

asfimport mentioned this issue Jan 11, 2023

[C++] Support ORC in Arrow Dataset #18729

Open

7 tasks
