fix: Handle array of strings columns in Athena materialization#6324

Open
alan-gauthier-jt wants to merge 1 commit into feast-dev:master from alan-gauthier-jt:fix-empty-string-array


@alan-gauthier-jt alan-gauthier-jt commented Apr 24, 2026

What this PR does / why we need it

Fixes two related bugs that cause TypeError and ValueError when materializing
feature views with Array(String) columns using the Athena offline store.

Arrow/Athena deserializes Array(String) columns as numpy.ndarray (object dtype)
instead of plain Python lists. This affects three code paths in type_map.py, the
first of which is already guarded:

  1. _convert_scalar_values_to_proto: pd.isnull(ndarray) returns an array of bools,
    and not <array> raises ValueError: The truth value of an empty array is ambiguous.
    → Already guarded by _is_array_like in newer Feast versions; no change needed here.

  2. _convert_list_values_to_proto (generic list path): proto_type(val=ndarray) passes
    the raw numpy array to the protobuf constructor, which only accepts Python lists →
    TypeError: bad argument type for built-in operation. Additionally, Arrow nullable
    columns can yield None elements inside the ndarray, which StringList also rejects.

  3. _validate_collection_item_types: None elements inside an ndarray failed the
    type(item) in valid_types check before reaching the sanitization step.
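
The third failure mode can be illustrated without Feast or NumPy. The sketch below is not the code in type_map.py: strict_check and none_tolerant_check are hypothetical names, and it assumes only what the description states, namely that the validator compares type(item) against a collection of accepted types.

```python
from typing import Iterable, Tuple


def strict_check(items: Iterable[object], valid_types: Tuple[type, ...]) -> None:
    """Reject any element whose exact type is not in valid_types."""
    for item in items:
        if type(item) not in valid_types:
            raise TypeError(f"unexpected element type: {type(item).__name__}")


def none_tolerant_check(items: Iterable[object], valid_types: Tuple[type, ...]) -> None:
    """Same check, but None elements are left for a later sanitization step."""
    for item in items:
        if item is None:
            continue
        if type(item) not in valid_types:
            raise TypeError(f"unexpected element type: {type(item).__name__}")


# A nullable string array as it might look after conversion to a list.
row = ["a", None, "b"]

try:
    strict_check(row, (str,))
    strict_raised = False
except TypeError:
    # type(None) is NoneType, which is not in (str,).
    strict_raised = True

# Does not raise: None is skipped, and "a" and "b" are str.
none_tolerant_check(row, (str,))
```

Skipping None in the validator is only safe if something downstream removes or replaces those elements before they reach the protobuf constructor.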

Changes

feast/type_map.py

  • Add _to_proto_safe_list(value) helper that:

    • Calls .tolist() on any numpy.ndarray to produce a plain Python list
    • Replaces None elements with "" (empty string), which protobuf StringList accepts
    • Is a no-op for plain Python lists and scalar values
  • Use _to_proto_safe_list in the generic list conversion (end of
    _convert_list_values_to_proto) instead of passing value directly to proto_type.

  • Skip None elements in _validate_collection_item_types: None entries are valid
    in nullable Arrow columns and are sanitized by _to_proto_safe_list downstream; raising
    a TypeError on them before that point was incorrect.
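
A minimal sketch of what such a helper might look like, written from the bullet points above rather than from the actual diff. To stay runnable without NumPy it duck-types on .tolist() instead of checking for numpy.ndarray, and FakeObjectArray is a hypothetical stand-in for a 1-D object-dtype array; the name to_proto_safe_list and the none_replacement parameter are likewise assumptions.

```python
from typing import Any


class FakeObjectArray:
    """Hypothetical stand-in for a 1-D object-dtype numpy.ndarray."""

    def __init__(self, items):
        self._items = list(items)

    def tolist(self):
        return list(self._items)


def to_proto_safe_list(value: Any, none_replacement: str = "") -> Any:
    """Return a plain list with None replaced; pass other values through."""
    # Array-like objects expose .tolist(); plain lists and strings do not.
    if hasattr(value, "tolist"):
        value = value.tolist()
    if isinstance(value, list):
        return [none_replacement if item is None else item for item in value]
    return value


from_array = to_proto_safe_list(FakeObjectArray(["a", None, "b"]))
from_empty = to_proto_safe_list(FakeObjectArray([]))
from_list = to_proto_safe_list(["x", None])
from_scalar = to_proto_safe_list("x")
```

One consequence of this design is that a stored "" can no longer be told apart from a genuinely empty string in the source data.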

Testing

Added TestArrowArrayStringListMaterialization in
sdk/python/tests/unit/test_type_map.py covering:

Test | Scenario
test_to_proto_safe_list_ndarray | ndarray → plain list
test_to_proto_safe_list_empty_ndarray | empty ndarray → empty list (was raising ValueError)
test_to_proto_safe_list_ndarray_with_none | None elements replaced with ""
test_to_proto_safe_list_plain_list | plain list passthrough
test_to_proto_safe_list_plain_list_with_none | None in plain list also replaced
test_to_proto_safe_list_scalar_passthrough | non-list values unchanged
test_string_list_from_ndarray | full round-trip via python_values_to_proto_values
test_string_list_from_empty_ndarray | empty ndarray no longer raises ValueError
test_string_list_from_ndarray_with_none_elements | None in ndarray no longer raises TypeError
test_string_list_null_row_produces_empty_proto | None rows produce empty ProtoValue
test_mixed_batch_simulating_athena_chunk | full simulation of a failing Athena materialization batch

pytest sdk/python/tests/unit/test_type_map.py::TestArrowArrayStringListMaterialization -v

Which issues this PR fixes

Fixes #6325

Does this PR introduce a user-facing change?

Yes — materialization of Array(String) (and other array-typed) feature columns from
Athena no longer fails with TypeError or ValueError when a batch contains empty
arrays, None rows, or None elements inside arrays.

Previously:
  TypeError: bad argument type for built-in operation
  ValueError: The truth value of an empty array is ambiguous

After this fix:
  Materialization completes successfully. None elements are stored as "".

devin-ai-integration[bot]

This comment was marked as resolved.

@alan-gauthier-jt alan-gauthier-jt changed the title from "fix: handle numpy.ndarray Array(String) columns in Athena materialization" to "fix: handle numpyndarray Array(String) columns in Athena materialization" on Apr 24, 2026
Signed-off-by: Alan Gauthier <alan.gauthier@jobteaser.com>
@alan-gauthier-jt alan-gauthier-jt changed the title from "fix: handle numpyndarray Array(String) columns in Athena materialization" to "fix: Handle array of strings columns in Athena materialization" on Apr 24, 2026
# Per-type default values substituted for None elements inside list columns.
# Only STRING_LIST uses ""; numeric/bytes types drop None entirely because
# there is no meaningful in-band sentinel (protobuf rejects wrong scalar types).
_LIST_TYPE_NONE_REPLACEMENT: Dict[ValueType, Any] = {

@alan-gauthier-jt The approach used in https://github.com/feast-dev/feast/pull/6327/changes seems safer and preserves list length

none_replacement = _LIST_TYPE_NONE_REPLACEMENT.get(feast_value_type, _DROP_NONE)
if none_replacement is _DROP_NONE:
    return [x for x in value if x is not None]
return [x if x is not None else none_replacement for x in value]

if feast_value_type in _LIST_TYPE_NONE_REPLACEMENT instead?
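
For comparison, here is a sketch of the sentinel-based lookup from the diff next to the membership test suggested above. It is illustrative only: ValueType is a hypothetical two-member stand-in for Feast's enum, the mapping contents are assumed from the code comment that only STRING_LIST uses "", and both function names are invented for this example.

```python
from enum import Enum
from typing import Any, Dict, List


class ValueType(Enum):
    """Hypothetical stand-in for Feast's ValueType enum."""

    STRING_LIST = 1
    INT64_LIST = 2


# Sentinel meaning "no replacement value: drop None elements".
_DROP_NONE = object()

# Assumed contents, based on the comment that only STRING_LIST uses "".
_LIST_TYPE_NONE_REPLACEMENT: Dict[ValueType, Any] = {ValueType.STRING_LIST: ""}


def sanitize_with_sentinel(value: List[Any], feast_value_type: ValueType) -> List[Any]:
    """Variant from the diff: dict.get with a sentinel default."""
    none_replacement = _LIST_TYPE_NONE_REPLACEMENT.get(feast_value_type, _DROP_NONE)
    if none_replacement is _DROP_NONE:
        return [x for x in value if x is not None]
    return [x if x is not None else none_replacement for x in value]


def sanitize_with_membership(value: List[Any], feast_value_type: ValueType) -> List[Any]:
    """Suggested variant: membership test, no sentinel object needed."""
    if feast_value_type in _LIST_TYPE_NONE_REPLACEMENT:
        none_replacement = _LIST_TYPE_NONE_REPLACEMENT[feast_value_type]
        return [x if x is not None else none_replacement for x in value]
    return [x for x in value if x is not None]


strings = ["a", None, "b"]
ints = [1, None, 3]

sentinel_strings = sanitize_with_sentinel(strings, ValueType.STRING_LIST)
membership_strings = sanitize_with_membership(strings, ValueType.STRING_LIST)
sentinel_ints = sanitize_with_sentinel(ints, ValueType.INT64_LIST)
membership_ints = sanitize_with_membership(ints, ValueType.INT64_LIST)
```

Under these assumptions the two variants return the same lists. In both, list length is preserved only for types that have a replacement value; for the others, dropping None shortens the list.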

Development

Successfully merging this pull request may close these issues.

Bug: TypeError / ValueError when materializing Array(String) feature views with Athena offline store
