Skip to content

[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close#469

Merged
tiankongdeguiji merged 2 commits into
alibaba:masterfrom
tiankongdeguiji:fix/tdm-parquet-chunked-array
Apr 9, 2026
Merged

[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close#469
tiankongdeguiji merged 2 commits into
alibaba:masterfrom
tiankongdeguiji:fix/tdm-parquet-chunked-array

Conversation

@tiankongdeguiji
Copy link
Copy Markdown
Collaborator

Summary

  • TreeSearch.save_node_feature crashed during TDM tree init with TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array because pa.RecordBatch.from_arrays does not accept ChunkedArray. ParquetWriter, OdpsWriter and CsvWriter now defensively collapse any ChunkedArray column via combine_chunks() before building the batch, so every caller of the writer API is protected (not just TDM).
  • The same run also emitted WARNING: You should close ParquetWriter explicitly because writer.close() was never reached when writer.write() raised. Wrapped the write/close pairs in TreeSearch.save, TreeSearch.save_predict_edge and TreeSearch.save_node_feature with try/finally so the writer is always closed (and the partial file flushed) on the failure path.
  • Added test_parquet_writer_chunked_array and test_csv_writer_chunked_array exercising the new ChunkedArray path end-to-end.

Test plan

  • python -m tzrec.datasets.parquet_dataset_test ParquetWriterTest
  • python -m tzrec.datasets.csv_dataset_test CsvWriterTest
  • python -m tzrec.tools.tdm.gen_tree.tree_search_util_test
  • pre-commit run --files <changed files>
  • Rerun the failing DLC job tdm_trainer_nointention_init_tree_clone against this branch and confirm init_tree reaches "Save nodes and edges table done."

🤖 Generated with Claude Code

…DM writer close

TDM init_tree.save_node_feature crashed with "Cannot convert
pyarrow.lib.ChunkedArray to pyarrow.lib.Array" when a column handed to
ParquetWriter happened to be chunked. pa.RecordBatch.from_arrays only
accepts pa.Array, so ParquetWriter/OdpsWriter/CsvWriter now defensively
collapse ChunkedArray inputs via combine_chunks() before constructing
the batch. Also wrap writer.write/close pairs in TreeSearch.save,
save_predict_edge, and save_node_feature with try/finally so close()
runs (and the partial file is flushed) even when write() raises,
eliminating the "You should close ParquetWriter explicitly" warning
seen alongside the crash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji added the claude-review Let Claude Review label Apr 9, 2026
@github-actions github-actions Bot removed the claude-review Let Claude Review label Apr 9, 2026
Comment thread tzrec/datasets/parquet_dataset.py Outdated
Comment on lines +318 to +322
output_arrays = []
for v in output_dict.values():
if isinstance(v, pa.ChunkedArray):
v = v.combine_chunks()
output_arrays.append(v)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This ChunkedArray flattening loop is duplicated identically in CsvWriter.write() and OdpsWriter.write(). Consider extracting it into a shared helper on BaseWriter (e.g. _flatten_chunked_arrays(output_dict)) to keep it in one place.

Also, BaseWriter.write() in dataset.py:608 still declares OrderedDict[str, pa.Array] — the base class signature should be updated to Union[pa.Array, pa.ChunkedArray] to match the subclasses.

Comment on lines +171 to +174
try:
node_writer.write(node_table_dict)
finally:
node_writer.close()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: The try/finally only wraps write(), but node_writer is already open at this point. If any of the pa.array(...) calls above were to raise, the writer would not be closed. Consider widening the scope to cover the full writer lifecycle — especially relevant for OdpsWriter which holds a remote session.

            try:
                node_table_dict = OrderedDict()
                node_table_dict["id"] = pa.array(ids)
                node_table_dict["weight"] = pa.array(weight)
                node_table_dict["features"] = pa.array(features)
                node_writer.write(node_table_dict)
            finally:
                node_writer.close()

@github-actions
Copy link
Copy Markdown

github-actions Bot commented Apr 9, 2026

Code Review Summary

Clean, well-scoped bugfix. The ChunkedArray coercion via combine_chunks() is the correct PyArrow idiom, and the try/finally additions properly guard writer resources. Tests cover the happy path well. No performance or security concerns introduced.

Two suggestions posted inline:

  1. Extract shared ChunkedArray flattening logic — the same 5-line loop is duplicated across ParquetWriter, CsvWriter, and OdpsWriter. A helper on BaseWriter would centralize it. Also, BaseWriter.write() in dataset.py still declares pa.Array only — should be updated to Union[pa.Array, pa.ChunkedArray] to match subclasses.

  2. Widen try/finally scope in tree_search_util.py — currently only write() is wrapped, but the writer is already open when pa.array(...) calls run above it. If those fail, the writer leaks (particularly relevant for OdpsWriter which holds a remote session).

Minor test coverage note:

  • No OdpsWriter ChunkedArray test (understandable given infra dependency). Extracting the flattening into a shared helper would let you test it once rather than per-writer.

Overall: LGTM with minor suggestions. 👍

🤖 Generated with Claude Code

Extract the duplicated ChunkedArray -> Array loop from ParquetWriter,
CsvWriter, and OdpsWriter into a shared BaseWriter._flatten_chunked_arrays
helper. Also widen BaseWriter.write's signature to
OrderedDict[str, Union[pa.Array, pa.ChunkedArray]] so the base class
matches what its subclasses now accept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@tiankongdeguiji tiankongdeguiji merged commit 11b935b into alibaba:master Apr 9, 2026
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants