
@yiftizur yiftizur commented Oct 15, 2025

Fixes #953

Rationale for this change

Fixes filtering on nested struct fields when using PyArrow for scan operations.

Are these changes tested?

Yes, by the full test suite plus new tests.

Are there any user-facing changes?

Filtering a scan on a nested struct field now works.

Problem

When filtering on nested struct fields (e.g., parentField.childField == 'value'), PyArrow would fail with:

ArrowInvalid: No match for FieldRef.Name(childField) in ...

The issue occurred because PyArrow requires nested field references as tuples (e.g., ("parent", "child")) rather than dotted strings (e.g., "parent.child").

Solution

  1. Modified _ConvertToArrowExpression to accept an optional Schema parameter
  2. Added _get_field_name() method that converts dotted field paths to tuples for nested struct fields
  3. Updated expression_to_pyarrow() to accept and pass the schema parameter
  4. Updated all call sites to pass the schema when available
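
To make step 2 concrete, here is a minimal, library-free sketch of turning a dotted path into a tuple by walking a schema. It is not the PR's `_get_field_name()` implementation: the function name `to_field_path`, the dict-based schema stand-in, and the longest-match rule (so that a field whose name literally contains a dot still resolves) are all assumptions made for illustration.

```python
from collections.abc import Mapping
from typing import Any, Tuple


def to_field_path(name: str, schema: Mapping[str, Any]) -> Tuple[str, ...]:
    """Resolve a dotted name into path components by walking the schema.

    Hypothetical helper: `schema` is a plain dict in which struct fields map
    to nested dicts and leaf fields map to a type name.
    """
    parts = name.split(".")
    path = []
    node: Any = schema
    i = 0
    while i < len(parts):
        if not isinstance(node, Mapping):
            # Tried to descend into a non-struct field.
            raise KeyError(f"No field {name!r} in schema")
        # Try the longest candidate first so a literal "a.b" field wins
        # over a struct "a" with a child "b" (an assumed tie-break rule).
        for j in range(len(parts), i, -1):
            candidate = ".".join(parts[i:j])
            if candidate in node:
                path.append(candidate)
                node = node[candidate]
                i = j
                break
        else:
            raise KeyError(f"No field {name!r} in schema")
    return tuple(path)


# Illustrative schema: one top-level field, one struct, one dotted name.
example_schema = {
    "id": "long",
    "parent": {"child": "string"},
    "a.b": "string",
}

nested_path = to_field_path("parent.child", example_schema)
flat_path = to_field_path("id", example_schema)
dotted_name_path = to_field_path("a.b", example_schema)
```

A tuple produced this way can be unpacked into `pyarrow.compute.field(*path)`, which accepts multiple names for a nested reference. Passing the schema in lets the converter tell a real nested path apart from a flat field name that happens to contain a dot.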

Changes

  • pyiceberg/io/pyarrow.py:
    • Modified _ConvertToArrowExpression class to handle nested field paths
    • Updated expression_to_pyarrow() signature to accept schema
    • Updated _expression_to_complementary_pyarrow() signature
  • pyiceberg/table/__init__.py:
    • Updated call to _expression_to_complementary_pyarrow() to pass schema
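  • Tests:
    • Added test_ref_binding_nested_struct_field() for comprehensive nested field testing
    • Enhanced test_nested_fields() with issue #953 scenarios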

Example

# Now works correctly:
table.scan(row_filter="parent.child == 'abc123'").to_polars()

The fix changes the field reference passed to PyArrow:

  • Before: FieldRef.Name(child) (fails - field not found)
  • After: FieldRef.Nested(FieldRef.Name(parent) FieldRef.Name(child)) (works)

Yftach Zur and others added 3 commits October 15, 2025 10:02
Fixes filtering on nested struct fields when using PyArrow for scan operations.

## Problem

When filtering on nested struct fields (e.g., `mazeMetadata.run_id == 'value'`),
PyArrow would fail with:
```
ArrowInvalid: No match for FieldRef.Name(run_id) in ...
```

The issue occurred because PyArrow requires nested field references as tuples
(e.g., `("parent", "child")`) rather than dotted strings (e.g., `"parent.child"`).

## Solution

1. Modified `_ConvertToArrowExpression` to accept an optional `Schema` parameter
2. Added `_get_field_name()` method that converts dotted field paths to tuples
   for nested struct fields
3. Updated `expression_to_pyarrow()` to accept and pass the schema parameter
4. Updated all call sites to pass the schema when available

## Changes

- `pyiceberg/io/pyarrow.py`:
  - Modified `_ConvertToArrowExpression` class to handle nested field paths
  - Updated `expression_to_pyarrow()` signature to accept schema
  - Updated `_expression_to_complementary_pyarrow()` signature
- `pyiceberg/table/__init__.py`:
  - Updated call to `_expression_to_complementary_pyarrow()` to pass schema
- Tests:
  - Added `test_ref_binding_nested_struct_field()` for comprehensive nested field testing
  - Enhanced `test_nested_fields()` with issue apache#953 scenarios

## Example

```python
# Now works correctly:
table.scan(row_filter="mazeMetadata.run_id == 'abc123'").to_polars()
```

The fix converts the field reference from the first form to the second:
- ❌ `FieldRef.Name(run_id)` (fails - field not found)
- ✅ `FieldRef.Nested(FieldRef.Name(mazeMetadata) FieldRef.Name(run_id))` (works!)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@yiftizur
Author

@Fokko what do you think of this change?

Development

Successfully merging this pull request may close these issues.

Query on nested struct field with PyIceberg?
