fix: sanitize Avro field names on write, respect iceberg-field-name on read #2540

Open
SreeramGarlapati wants to merge 2 commits into apache:main from SreeramGarlapati:fix/avro-field-name-sanitization

Conversation

@SreeramGarlapati
Contributor

Summary

Iceberg field names can be anything (123column, field.with.dots, etc.) but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles this with a sanitize-on-write + restore-on-read protocol using the iceberg-field-name custom Avro property. iceberg-rust was doing neither — writing invalid names directly and ignoring the property on read.

This causes two interop failures:

  1. Write path: iceberg-rust produces invalid Avro when field names have leading digits or special chars → strict parsers (Java, Python) reject the file
  2. Read path: Avro files written by Java with sanitized names get the wrong field names when read by iceberg-rust

Changes

Write path (Iceberg→Avro conversion in SchemaToAvroSchema::field):

  • Added is_valid_avro_name() — checks [A-Za-z_][A-Za-z0-9_]*
  • Added sanitize_avro_name() — matches Java's AvroSchemaUtil.sanitize():
    • Leading digit: prefix _ (e.g., 123col → _123col)
    • Special chars: _x<HEX> (e.g., field.name → field_x2Ename)
    • Operates on UTF-16 code units to match Java's charAt() behavior for supplementary chars
  • When sanitization is needed: stores original name in iceberg-field-name property
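
The write-path rules above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function bodies and the helper `avro_name_and_original` are hypothetical, and the sketch assumes every non-ASCII UTF-16 code unit is escaped, which may differ from how the PR or Java treats non-ASCII letters.

```rust
/// Returns true when `name` matches [A-Za-z_][A-Za-z0-9_]*.
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c.is_ascii_alphabetic() || c == '_' => {
            chars.all(|c| c.is_ascii_alphanumeric() || c == '_')
        }
        // Empty names and names with an invalid first character are rejected.
        _ => false,
    }
}

/// Sketch of the sanitization rules, walking UTF-16 code units so that a
/// supplementary character is escaped as two surrogate halves.
fn sanitize_avro_name(name: &str) -> String {
    let mut out = String::with_capacity(name.len());
    for (i, unit) in name.encode_utf16().enumerate() {
        // Assumption: only ASCII code units can be kept verbatim.
        let ascii = if unit < 0x80 { Some(unit as u8) } else { None };
        match ascii {
            // Letters and underscores are always kept; digits only after position 0.
            Some(b) if b == b'_' || b.is_ascii_alphabetic() || (i > 0 && b.is_ascii_digit()) => {
                out.push(b as char)
            }
            // A leading digit gets an underscore prefix.
            Some(b) if b.is_ascii_digit() => {
                out.push('_');
                out.push(b as char);
            }
            // Everything else becomes _x followed by uppercase hex of the code unit.
            _ => out.push_str(&format!("_x{:X}", unit)),
        }
    }
    out
}

/// Hypothetical helper: returns the Avro-safe name and, only when the name
/// had to change, the original name to store under `iceberg-field-name`.
fn avro_name_and_original(name: &str) -> (String, Option<String>) {
    if is_valid_avro_name(name) {
        (name.to_string(), None)
    } else {
        (sanitize_avro_name(name), Some(name.to_string()))
    }
}
```

Under these assumptions, `sanitize_avro_name("123col")` yields `_123col` and `sanitize_avro_name("field.name")` yields `field_x2Ename`, matching the examples in the list above. A name that is already valid passes through unchanged with no property attached.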

Read path (Avro→Iceberg conversion in AvroSchemaToSchema::record):

  • Checks iceberg-field-name custom attribute first, falls back to Avro field name
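
The read-path lookup can be sketched like this. It is a simplified illustration: `resolve_iceberg_name` is a hypothetical name, and a plain `BTreeMap<String, String>` stands in for whatever attribute type the Avro library exposes on a record field.

```rust
use std::collections::BTreeMap;

/// Custom Avro property that carries the original Iceberg field name.
const ICEBERG_FIELD_NAME_PROP: &str = "iceberg-field-name";

/// Prefer the original name stored in the custom property; fall back to the
/// Avro field name when the property is absent.
fn resolve_iceberg_name<'a>(
    avro_name: &'a str,
    custom_attributes: &'a BTreeMap<String, String>,
) -> &'a str {
    custom_attributes
        .get(ICEBERG_FIELD_NAME_PROP)
        .map(String::as_str)
        .unwrap_or(avro_name)
}
```

With this shape, a field written as `field_x2Ename` with `iceberg-field-name` set to `field.name` resolves back to `field.name`, while a field without the property keeps its Avro name.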

Java reference

Closes #2535

Test plan

  • test_is_valid_avro_name — validates detection of invalid names
  • test_sanitize_avro_name — ASCII edge cases match Java behavior
  • test_sanitize_avro_name_unicode — BMP and supplementary char handling (surrogate pairs)
  • test_sanitization_round_trip — Iceberg→Avro→Iceberg preserves original names
  • test_avro_to_iceberg_uses_iceberg_field_name_property — reads Java-written schemas correctly
  • All 1301 existing tests pass (no regressions in bidirectional schema conversion tests)

SreeramGarlapati and others added 2 commits May 29, 2026 21:31
…ld-name

Iceberg field names can be arbitrary (leading digits, dots, spaces, etc.)
but Avro requires names to match [A-Za-z_][A-Za-z0-9_]*. Java handles
this by sanitizing invalid names on write and storing the original in an
"iceberg-field-name" custom property, then checking that property on read.

iceberg-rust was writing unsanitized names directly, which causes Avro
validation failures (or produces files unreadable by strict Avro parsers)
when field names don't conform to Avro's naming rules.

This adds:
- sanitize_avro_name(): matches Java's AvroSchemaUtil.sanitize() logic
  (prefix _ for leading digits, _x<HEX> for special chars)
- Write path: sanitizes the field name and stores the original in
  iceberg-field-name when sanitization was needed
- Read path: checks iceberg-field-name property first, falls back to
  the Avro field name

Closes apache#2535

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>
…tion

Adds test cases for:
- Non-ASCII BMP characters (U+00E9, U+4E2D)
- Supplementary characters (surrogate pair handling, matching Java's UTF-16)
- Empty string edge case
- Read-path: iceberg-field-name property resolution from Java-written schemas
- Verify iceberg-field-name property is set on dotted field names

Co-authored-by: rawataaryan9 <rawataaryan9@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

avro schema writer does not sanitize field names that violate avro naming rules