Problem
Iceberg is the dominant open table format adjacent to every existing Spark/Databricks target in daco, but there is no translator for it. Iceberg uses its own JSON schema serialization with explicit, mandatory field IDs (monotonic, deterministic) and v3-only types (variant, geometry, geography, timestamp_ns, timestamptz_ns, unknown) — none of which are emitted by the existing databrickssql / sparksql / databrickspyspark translators.
A user authoring an OpenDPI port today cannot generate an Iceberg schema; they have to translate to Spark SQL DDL and lose Iceberg-specific information (field IDs, v3 types).
Proposed change
New package internal/translate/iceberg/ following the avro pattern (resolver + JSON marshal in Translate, no text/template — Iceberg schemas are structured JSON):
translator.go — implements translate.Translator. FileExtension returns .json. Translate calls translate.Prepare(...) then marshals to the Iceberg schema JSON shape, assigning field IDs sequentially in property order (Prepare already preserves that order, so output is deterministic across runs).
resolver.go — implements translate.TypeResolver:
PrimitiveType: string→string, integer→long (narrowed in EnrichField via Constraints.Minimum/Maximum to int where it fits), number→double (or decimal(P,S) when Constraints.MultipleOf is a decimal fraction), boolean→boolean, format:date→date, format:date-time→timestamptz, format:time→time, format:uuid→uuid.
ArrayType(elem) → list<elem> (marker form, materialized in Translate).
MapType(k,v) → map<k,v> (marker form).
RefType/FormatDefName → PascalCase(defName), must agree (per .claude/rules/translators.md).
EnrichField: integer narrowing (lift inferIntegerType from databrickspyspark/resolver.go into shared internal/translate so it isn't duplicated); decimal precision/scale from MultipleOf (lift computeDecimalScale/computeDecimalPrecision similarly).
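The primitive mapping above can be sketched as a standalone helper. This is a hypothetical illustration only: `icebergPrimitive` and its signature are invented for this sketch; the real resolver implements translate.TypeResolver and also consults Constraints in EnrichField.

```go
package main

import "fmt"

// icebergPrimitive is a hypothetical helper showing the intended
// JSON Schema → Iceberg v2 primitive mapping. Format only applies
// to string-typed properties; everything else maps by type alone.
func icebergPrimitive(jsonType, format string) string {
	if jsonType == "string" {
		switch format {
		case "date":
			return "date"
		case "date-time":
			return "timestamptz"
		case "time":
			return "time"
		case "uuid":
			return "uuid"
		}
		return "string"
	}
	switch jsonType {
	case "integer":
		return "long" // may be narrowed to "int" later in EnrichField
	case "number":
		return "double" // or decimal(P,S) when multipleOf is a decimal fraction
	case "boolean":
		return "boolean"
	}
	return ""
}

func main() {
	fmt.Println(icebergPrimitive("string", "date-time")) // timestamptz
	fmt.Println(icebergPrimitive("integer", ""))         // long
}
```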
- Field IDs assigned in Translate via a counter threaded through marshal — IDs go in data.Extra if needed but are simplest computed inline at marshal time. Prepare/SchemaData shape unchanged.
- Register in cmd/daco/internal/app.go registerTranslators as iceberg.
V3-only types (variant, geometry, geography, timestamp_ns) are out of scope for the initial PR — JSON Schema doesn't natively express them, so they need a daco-side hint mechanism that should be designed separately. The translator should emit v2-compatible output by default.
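The counter-threaded ID assignment can be sketched as follows. The struct shapes here (`icebergField`, `icebergStruct`, `assignIDs`) are stand-ins invented for this sketch, not the real translate.SchemaData types produced by Prepare; they only show how sequential IDs fall out of deterministic property order.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Hypothetical shapes mirroring the Iceberg schema JSON serialization;
// the real translator marshals from the data Prepare produces.
type icebergField struct {
	ID       int    `json:"id"`
	Name     string `json:"name"`
	Required bool   `json:"required"`
	Type     any    `json:"type"`
}

type icebergStruct struct {
	Type     string         `json:"type"`
	SchemaID int            `json:"schema-id"`
	Fields   []icebergField `json:"fields"`
}

// assignIDs threads a counter through marshalling so field IDs are
// sequential in property order; since Prepare preserves that order,
// the output is deterministic across runs.
func assignIDs(names, types []string, next *int) []icebergField {
	fields := make([]icebergField, 0, len(names))
	for i, n := range names {
		(*next)++
		fields = append(fields, icebergField{ID: *next, Name: n, Type: types[i]})
	}
	return fields
}

func main() {
	next := 0
	s := icebergStruct{Type: "struct", SchemaID: 0,
		Fields: assignIDs([]string{"name", "age"}, []string{"string", "long"}, &next)}
	out, _ := json.Marshal(s)
	fmt.Println(string(out))
}
```

A nested struct would keep threading the same `next` pointer, which is how a referenced def's field IDs continue the global counter.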
References
- .claude/rules/translators.md in this repo — package layout, EnrichField mutation order, RefType/FormatDefName symmetry rule
Test cases
Following the shape in internal/translate/pyspark/translator_test.go:
- Simple object — sequential field IDs and root naming
Input: {type:object, properties:{name:{type:string}, age:{type:integer}}}
Expected (substring asserts):
{
"type": "struct",
"schema-id": 0,
"fields": [
{"id": 1, "name": "name", "required": false, "type": "string"},
{"id": 2, "name": "age", "required": false, "type": "long"}
]
}
- Required vs optional — required: ["name"] → "required": true for name, false for age.
- Decimal from multipleOf — {type:number, multipleOf:0.01, minimum:0, maximum:99999.99} → "type": "decimal(7, 2)".
- Integer narrowing — {type:integer, minimum:-128, maximum:127} → "type": "int" (Iceberg has no smaller int; narrows from long).
- Date/time/uuid formats — format:date → "type": "date"; format:date-time → "type": "timestamptz"; format:uuid → "type": "uuid".
- Arrays — {type:array, items:{type:string}} → "type": {"type": "list", "element-id": N, "element": "string", "element-required": ...}.
- $ref + $defs — verifies RefType/FormatDefName agreement: a referenced def is emitted as a nested struct with its own field IDs continuing the global counter.
- Inline nested object — auto-extracted by Prepare to a synthetic def named after the field in PascalCase; emitted as a nested struct with continued IDs.
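The decimal and narrowing test expectations can be checked with a sketch of the arithmetic involved. These helpers are hypothetical stand-ins for the computeDecimalScale/computeDecimalPrecision and inferIntegerType logic this proposal lifts into shared internal/translate; names and signatures are assumptions.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// decimalScale counts fractional digits of multipleOf, e.g. 0.01 → 2.
func decimalScale(multipleOf float64) int {
	s := strconv.FormatFloat(multipleOf, 'f', -1, 64)
	if i := strings.IndexByte(s, '.'); i >= 0 {
		return len(s) - i - 1
	}
	return 0
}

// decimalPrecision is the digit count left of the point in max,
// plus the scale: max 99999.99 with scale 2 → 5 + 2 = 7.
func decimalPrecision(max float64, scale int) int {
	intDigits := 1
	if abs := math.Abs(max); abs >= 1 {
		intDigits = int(math.Floor(math.Log10(abs))) + 1
	}
	return intDigits + scale
}

// narrowInteger picks "int" when the bounds fit in 32 bits, else "long".
func narrowInteger(min, max int64) string {
	if min >= math.MinInt32 && max <= math.MaxInt32 {
		return "int"
	}
	return "long"
}

func main() {
	scale := decimalScale(0.01)
	fmt.Printf("decimal(%d, %d)\n", decimalPrecision(99999.99, scale), scale) // decimal(7, 2)
	fmt.Println(narrowInteger(-128, 127))                                     // int
}
```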