Skip to content

fix(workflow-operator): auto-rename empty CSV column headers#5282

Merged
mengw15 merged 2 commits into
apache:mainfrom
mengw15:fix/5279-csv-empty-header-rename
May 29, 2026
Merged

fix(workflow-operator): auto-rename empty CSV column headers#5282
mengw15 merged 2 commits into
apache:mainfrom
mengw15:fix/5279-csv-empty-header-rename

Conversation

@mengw15
Copy link
Copy Markdown
Contributor

@mengw15 mengw15 commented May 28, 2026

What changes were proposed in this PR?

A CSV with an empty column header (e.g. a trailing comma id,name,,age) produced an Attribute with an empty name. After #3295 every output port writes via IcebergTableWriter → Parquet, where Avro rejects empty names with IllegalArgumentException: Empty name at flush time — losing the operator port's entire result.

Rename blank header positions to column-<index> (the convention used by pandas, Spark, R, and DuckDB) in all three CSV scan operators: CSVScanSourceOpDesc, ParallelCSVScanSourceOpDesc, and CSVOldScanSourceOpDesc. The issue named the first two; the legacy csvOld variant is still registered in LogicalOp and had the same latent bug.

Note: ParallelCSVScanSourceOpDesc is currently commented out of LogicalOp's operator registry (so it is not reachable from the UI), but it is fixed here for consistency and so the bug does not resurface if it is re-enabled for experiments.

Any related issues, documentation, discussions?

Closes #5279.

How was this PR tested?

Added three cases to CSVScanSourceOpDescSpec — one per operator — feeding a CSV with an empty third header (id,name,,age) and asserting the inferred schema is ["id", "name", "column-3", "age"]. WorkflowOperator suite is green (13 passed); scalafmt clean.

Was this PR authored or co-authored using generative AI tooling?

Generated-by: Claude Code (claude-opus-4-7)

A CSV with an empty column header (e.g. a trailing comma `id,name,,age`)
produced an `Attribute` with an empty name. After apache#3295 every output
port writes via `IcebergTableWriter -> Parquet`, where Avro rejects
empty names with `IllegalArgumentException: Empty name` at flush time,
losing the operator port's entire result.

Rename blank header positions to `column-<index>` (matching the
convention used by pandas, Spark, R, and DuckDB) in all three CSV scan
operators: `CSVScanSourceOpDesc`, `ParallelCSVScanSourceOpDesc`, and
`CSVOldScanSourceOpDesc`. `ParallelCSVScanSourceOpDesc` is currently
commented out of `LogicalOp`'s registry (unreachable from the UI) but is
fixed for consistency and in case it is re-enabled for experiments.

Closes apache#5279.
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 28, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 49.12%. Comparing base (85c4b0c) to head (769f418).

Additional details and impacted files
@@             Coverage Diff              @@
##               main    #5282      +/-   ##
============================================
+ Coverage     49.09%   49.12%   +0.03%     
- Complexity     2372     2377       +5     
============================================
  Files          1051     1051              
  Lines         40344    40334      -10     
  Branches       4277     4277              
============================================
+ Hits          19805    19813       +8     
+ Misses        19379    19358      -21     
- Partials       1160     1163       +3     
Flag Coverage Δ *Carryforward flag
access-control-service 39.53% <ø> (ø)
agent-service 33.76% <ø> (ø) Carriedforward from ca829ec
amber 51.62% <ø> (+0.09%) ⬆️
computing-unit-managing-service 0.00% <ø> (ø)
config-service 0.00% <ø> (ø)
file-service 37.99% <ø> (ø)
frontend 41.04% <ø> (-0.03%) ⬇️ Carriedforward from ca829ec
python 90.79% <ø> (+0.04%) ⬆️ Carriedforward from ca829ec
workflow-compiling-service 56.81% <ø> (ø)

*This pull request uses carry forward flags. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mengw15 mengw15 added this pull request to the merge queue May 29, 2026
Merged via the queue into apache:main with commit 59f776c May 29, 2026
18 checks passed
@mengw15 mengw15 deleted the fix/5279-csv-empty-header-rename branch May 29, 2026 00:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

CSV scan does not auto-rename empty column headers; downstream Iceberg writer crashes with "Empty name" at flush time

3 participants