problem
when writing manifest files, iceberg-rust copies iceberg field names verbatim into avro record field names. the avro spec requires names to match [A-Za-z_][A-Za-z0-9_]*. partition field names that start with digits (e.g. hash-derived names like 815d3b5701b94c78884835c1bea174bb_day) produce invalid avro schemas.
java handles this correctly in TypeToSchema.java:
String origFieldName = structField.name();
boolean isValidFieldName = AvroSchemaUtil.validAvroName(origFieldName);
String fieldName = isValidFieldName ? origFieldName : AvroSchemaUtil.sanitize(origFieldName);
Schema.Field field = new Schema.Field(fieldName, ...);
if (!isValidFieldName) {
  field.addProp(AvroSchemaUtil.ICEBERG_FIELD_NAME_PROP, origFieldName);
}
sanitization rules (from AvroSchemaUtil.sanitize()):
- leading digit → prefix with _ (e.g. 9col → _9col)
- special chars → _x<hex> (e.g. a.b → a_x2Eb)
the original name is preserved in the iceberg-field-name avro field property.
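a minimal rust sketch of the two rules above, for illustration only. the function names is_valid_avro_name and sanitize are hypothetical and are not existing iceberg-rust APIs. it assumes the ASCII-only pattern quoted in this issue, so it may differ from java for non-ASCII letters and digits, and it does not handle empty names specially.

```rust
/// true if `name` matches [A-Za-z_][A-Za-z0-9_]* (ASCII only, per the pattern in this issue).
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        // first char must be a letter or underscore; the rest may also be digits
        Some(c) if c == '_' || c.is_ascii_alphabetic() => {
            chars.all(|c| c == '_' || c.is_ascii_alphanumeric())
        }
        // empty name or invalid first char
        _ => false,
    }
}

/// rewrites `name` so that it matches the avro name pattern:
/// a leading digit gets a `_` prefix, any other invalid char becomes `_x<HEX>`.
fn sanitize(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 1);
    for (i, c) in name.chars().enumerate() {
        // digits are only allowed after the first position
        let allowed = c == '_' || c.is_ascii_alphabetic() || (i > 0 && c.is_ascii_digit());
        if allowed {
            out.push(c);
        } else if c.is_ascii_digit() {
            // only reachable for a leading digit
            out.push('_');
            out.push(c);
        } else {
            // uppercase hex of the code point, e.g. '.' -> "_x2E"
            out.push_str(&format!("_x{:X}", c as u32));
        }
    }
    out
}
```

with these definitions, sanitize("9col") gives "_9col" and sanitize("a.b") gives "a_x2Eb", matching the examples above.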
relevant code
crates/iceberg/src/avro/schema.rs — schema_to_avro_schema uses field.name.clone() directly as the avro field name without validation or sanitization.
impact
- manifests written by iceberg-rust with digit-leading partition field names are invalid avro
- other engines (spark, trino, flink) using strict avro readers will reject these manifests
- any table using hash-based partition transforms can produce such names
expected behavior
implement the same sanitize-on-write protocol as java:
- validate name against [A-Za-z_][A-Za-z0-9_]*
- if invalid, sanitize and store original in iceberg-field-name property
- always store field-id property (already done)
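the three steps above could look roughly like the sketch below. this is not the actual schema_to_avro_schema code: AvroFieldSketch and to_avro_field are hypothetical stand-ins, properties are kept as plain strings rather than avro/json values, and the helpers assume the ASCII-only pattern quoted in this issue.

```rust
use std::collections::BTreeMap;

/// hypothetical stand-in for an avro record field (not an iceberg-rust or apache-avro type).
#[derive(Debug)]
struct AvroFieldSketch {
    name: String,
    props: BTreeMap<String, String>,
}

/// true if `name` matches [A-Za-z_][A-Za-z0-9_]* (ASCII only).
fn is_valid_avro_name(name: &str) -> bool {
    let mut chars = name.chars();
    match chars.next() {
        Some(c) if c == '_' || c.is_ascii_alphabetic() => {
            chars.all(|c| c == '_' || c.is_ascii_alphanumeric())
        }
        _ => false,
    }
}

/// leading digit -> `_` prefix, any other invalid char -> `_x<HEX>`.
fn sanitize(name: &str) -> String {
    let mut out = String::with_capacity(name.len() + 1);
    for (i, c) in name.chars().enumerate() {
        if c == '_' || c.is_ascii_alphabetic() || (i > 0 && c.is_ascii_digit()) {
            out.push(c);
        } else if c.is_ascii_digit() {
            out.push('_');
            out.push(c);
        } else {
            out.push_str(&format!("_x{:X}", c as u32));
        }
    }
    out
}

/// sanitize-on-write: keep valid names as-is, otherwise sanitize and
/// remember the original name in the iceberg-field-name property.
fn to_avro_field(iceberg_name: &str, field_id: i32) -> AvroFieldSketch {
    let mut props = BTreeMap::new();
    // field-id is always stored (stored as a string here only to keep the sketch simple)
    props.insert("field-id".to_string(), field_id.to_string());
    let name = if is_valid_avro_name(iceberg_name) {
        iceberg_name.to_string()
    } else {
        props.insert("iceberg-field-name".to_string(), iceberg_name.to_string());
        sanitize(iceberg_name)
    };
    AvroFieldSketch { name, props }
}
```

for example, to_avro_field("9col", 1000) yields the avro name "_9col" with iceberg-field-name set to "9col", while a valid name such as "ts_day" is written unchanged and gets no iceberg-field-name property.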