fix: sanitize Avro field names on write, respect iceberg-field-name on read by SreeramGarlapati · Pull Request #2540 · apache/iceberg-rust

SreeramGarlapati · 2026-05-30T03:40:03Z

Summary

Iceberg field names can be anything (123column, field.with.dots, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this with a sanitize-on-write + restore-on-read protocol using the iceberg-field-name custom Avro property. iceberg-rust was doing neither — writing invalid names directly and ignoring the property on read.

This causes two interop failures:

- Write path: iceberg-rust produces invalid Avro when field names have leading digits or special characters, so strict parsers (Java, Python) reject the file
- Read path: Avro files written by Java with sanitized names are read by iceberg-rust with the sanitized names instead of the original Iceberg field names
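
For reference, the naming rule behind both failures can be sketched as a small predicate. This is a minimal illustration of the [A-Za-z_][A-Za-z0-9_]* check described in the summary, not the code from this PR; the function body is an assumption about how such a check could look.

```rust
/// Sketch only: returns true if `name` matches [A-Za-z_][A-Za-z0-9_]*.
/// The body is illustrative and not copied from the PR.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        // First character must be an ASCII letter or underscore.
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {}
        // Empty names and any other leading character are invalid.
        _ => return false,
    }
    // Remaining characters must be ASCII letters, digits, or underscore.
    chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
}
```

Under this sketch, 123column and field.with.dots are both rejected, while names such as valid_name pass unchanged.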

Changes

Write path (Iceberg→Avro conversion in SchemaToAvroSchema::field):

- Added is_valid_avro_name(): checks [A-Za-z_][A-Za-z0-9_]*
- Added sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize():
  - Leading digit: prefix _ (e.g., 123col → _123col)
  - Special chars: _x<HEX> (e.g., field.name → field_x2Ename)
  - Operates on UTF-16 code units to match Java's charAt() behavior for supplementary chars
- When sanitization is needed: stores the original name in the iceberg-field-name property
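
The rules above can be sketched roughly as follows. This is a simplified illustration, not the PR's implementation: it escapes every non-ASCII UTF-16 code unit, which is an assumption (the PR's handling of non-ASCII letters may differ), and avro_name_and_property is a hypothetical helper name used only to show when the original name would be stored.

```rust
/// Sketch only: applies the two sanitization rules from the list above.
fn sanitize_avro_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    // Walk UTF-16 code units so a supplementary character is escaped
    // as its two surrogate halves, one _x<HEX> escape per half.
    for (i, unit) in name.encode_utf16().enumerate() {
        // Code units below 0x80 are plain ASCII characters.
        let ascii = if unit < 0x80 { Some(unit as u8 as char) } else { None };
        match ascii {
            Some(c) if c == '_' || c.is_ascii_alphabetic() => out.push(c),
            Some(c) if c.is_ascii_digit() => {
                // A leading digit gets an underscore prefix; later digits are kept.
                if i == 0 {
                    out.push('_');
                }
                out.push(c);
            }
            // Assumption: everything else, including all non-ASCII units, is escaped.
            _ => out.push_str(&format!("_x{:X}", unit)),
        }
    }
    out
}

/// Hypothetical helper: returns the Avro name plus the value to store in the
/// iceberg-field-name property, which is set only when the name changed.
fn avro_name_and_property(name: &str) -> (String, Option<String>) {
    let sanitized = sanitize_avro_name(name);
    let original = if sanitized == name { None } else { Some(name.to_string()) };
    (sanitized, original)
}
```

With this sketch, 123col becomes _123col and field.name becomes field_x2Ename, matching the examples in the list, and a name that is already valid produces no property.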

Read path (Avro→Iceberg conversion in AvroSchemaToSchema::record):

Checks iceberg-field-name custom attribute first, falls back to Avro field name
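
The lookup order on the read path can be illustrated with a small sketch. AvroField here is a hypothetical stand-in with a plain string property map, not the Avro library's record field type, and the function name is likewise an assumption.

```rust
use std::collections::HashMap;

/// Property key used by the sanitize-on-write / restore-on-read protocol.
const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Hypothetical stand-in for an Avro record field: its Avro name plus
/// custom string properties. Not the real Avro library type.
struct AvroField {
    name: String,
    props: HashMap<String, String>,
}

/// Sketch only: prefer the original Iceberg name stored in the property,
/// and fall back to the Avro field name when the property is absent.
fn iceberg_field_name(field: &AvroField) -> &str {
    field
        .props
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(field.name.as_str())
}
```

A field written as field_x2Ename with the property set to field.name resolves to field.name, while a field without the property keeps its Avro name.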

Java reference

AvroSchemaUtil.sanitize()
TypeToSchema.struct() — stores ICEBERG_FIELD_NAME_PROP
AvroSchemaUtil.ICEBERG_FIELD_NAME_PROP

Closes #2535

Test plan

- test_is_valid_avro_name: validates detection of invalid names
- test_sanitize_avro_name: ASCII edge cases match Java behavior
- test_sanitize_avro_name_unicode: BMP and supplementary char handling (surrogate pairs)
- test_sanitization_round_trip: Iceberg→Avro→Iceberg preserves original names
- test_avro_to_iceberg_uses_iceberg_field_name_property: reads Java-written schemas correctly
- All 1301 existing tests pass (no regressions in bidirectional schema conversion tests)

…ld-name

Iceberg field names can be arbitrary (leading digits, dots, spaces, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this by sanitizing invalid names on write and storing the original in an "iceberg-field-name" custom property, then checking that property on read.

iceberg-rust was writing unsanitized names directly, which causes Avro validation failures (or produces files unreadable by strict Avro parsers) when field names don't conform to Avro's naming rules.

This adds:
- sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize() logic (prefix _ for leading digits, _x<HEX> for special chars)
- Write path: sanitizes the field name and stores the original in iceberg-field-name when sanitization was needed
- Read path: checks iceberg-field-name property first, falls back to the Avro field name

Closes apache#2535

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>

…tion

Adds test cases for:
- Non-ASCII BMP characters (U+00E9, U+4E2D)
- Supplementary characters (surrogate pair handling, matching Java's UTF-16)
- Empty string edge case
- Read-path: iceberg-field-name property resolution from Java-written schemas
- Verify iceberg-field-name property is set on dotted field names

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>

SreeramGarlapati and others added 2 commits May 29, 2026 21:31

SreeramGarlapati force-pushed the fix/avro-field-name-sanitization branch from fe8c3ff to 12ba3b5 Compare May 30, 2026 04:33

SreeramGarlapati mentioned this pull request May 30, 2026

fix: respect iceberg-field-name Avro property on read path #2539

Closed

4 tasks

SreeramGarlapati wants to merge 2 commits into apache:main from SreeramGarlapati:fix/avro-field-name-sanitization
