[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close by tiankongdeguiji · Pull Request #469 · alibaba/TorchEasyRec

tiankongdeguiji · 2026-04-08T11:27:09Z

Summary

TreeSearch.save_node_feature crashed during TDM tree init with TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array because pa.RecordBatch.from_arrays does not accept ChunkedArray. ParquetWriter, OdpsWriter and CsvWriter now defensively collapse any ChunkedArray column via combine_chunks() before building the batch, so every caller of the writer API is protected (not just TDM).
The same run also emitted WARNING: You should close ParquetWriter explicitly because writer.close() was never reached when writer.write() raised. Wrapped the write/close pairs in TreeSearch.save, TreeSearch.save_predict_edge and TreeSearch.save_node_feature with try/finally so the writer is always closed (and the partial file flushed) on the failure path.
Added test_parquet_writer_chunked_array and test_csv_writer_chunked_array exercising the new ChunkedArray path end-to-end.

Test plan

python -m tzrec.datasets.parquet_dataset_test ParquetWriterTest
python -m tzrec.datasets.csv_dataset_test CsvWriterTest
python -m tzrec.tools.tdm.gen_tree.tree_search_util_test
pre-commit run --files <changed files>
Rerun the failing DLC job tdm_trainer_nointention_init_tree_clone against this branch and confirm init_tree reaches "Save nodes and edges table done."

🤖 Generated with Claude Code

…DM writer close TDM init_tree.save_node_feature crashed with "Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" when a column handed to ParquetWriter happened to be chunked. pa.RecordBatch.from_arrays only accepts pa.Array, so ParquetWriter/OdpsWriter/CsvWriter now defensively collapse ChunkedArray inputs via combine_chunks() before constructing the batch. Also wrap writer.write/close pairs in TreeSearch.save, save_predict_edge, and save_node_feature with try/finally so close() runs (and the partial file is flushed) even when write() raises, eliminating the "You should close ParquetWriter explicitly" warning seen alongside the crash. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-04-09T01:21:43Z

+        output_arrays = []
+        for v in output_dict.values():
+            if isinstance(v, pa.ChunkedArray):
+                v = v.combine_chunks()
+            output_arrays.append(v)


nit: This ChunkedArray flattening loop is duplicated identically in CsvWriter.write() and OdpsWriter.write(). Consider extracting it into a shared helper on BaseWriter (e.g. _flatten_chunked_arrays(output_dict)) to keep it in one place.

Also, BaseWriter.write() in dataset.py:608 still declares OrderedDict[str, pa.Array] — the base class signature should be updated to Union[pa.Array, pa.ChunkedArray] to match the subclasses.

github-actions · 2026-04-09T01:21:48Z

+            try:
+                node_writer.write(node_table_dict)
+            finally:
+                node_writer.close()


nit: The try/finally only wraps write(), but node_writer is already open at this point. If any of the pa.array(...) calls above were to raise, the writer would not be closed. Consider widening the scope to cover the full writer lifecycle — especially relevant for OdpsWriter which holds a remote session.

try: node_table_dict = OrderedDict() node_table_dict["id"] = pa.array(ids) node_table_dict["weight"] = pa.array(weight) node_table_dict["features"] = pa.array(features) node_writer.write(node_table_dict) finally: node_writer.close()

github-actions · 2026-04-09T01:22:01Z

Code Review Summary

Clean, well-scoped bugfix. The ChunkedArray coercion via combine_chunks() is the correct PyArrow idiom, and the try/finally additions properly guard writer resources. Tests cover the happy path well. No performance or security concerns introduced.

Two suggestions posted inline:

Extract shared ChunkedArray flattening logic — the same 5-line loop is duplicated across ParquetWriter, CsvWriter, and OdpsWriter. A helper on BaseWriter would centralize it. Also, BaseWriter.write() in dataset.py still declares pa.Array only — should be updated to Union[pa.Array, pa.ChunkedArray] to match subclasses.
Widen try/finally scope in tree_search_util.py — currently only write() is wrapped, but the writer is already open when pa.array(...) calls run above it. If those fail, the writer leaks (particularly relevant for OdpsWriter which holds a remote session).

Minor test coverage note:

No OdpsWriter ChunkedArray test (understandable given infra dependency). Extracting the flattening into a shared helper would let you test it once rather than per-writer.

Overall: LGTM with minor suggestions. 👍

🤖 Generated with Claude Code

Extract the duplicated ChunkedArray -> Array loop from ParquetWriter, CsvWriter, and OdpsWriter into a shared BaseWriter._flatten_chunked_arrays helper. Also widen BaseWriter.write's signature to OrderedDict[str, Union[pa.Array, pa.ChunkedArray]] so the base class matches what its subclasses now accept. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

tiankongdeguiji added the claude-review Let Claude Review label Apr 9, 2026

github-actions Bot removed the claude-review Let Claude Review label Apr 9, 2026

github-actions Bot reviewed Apr 9, 2026

View reviewed changes

eric-gecheng approved these changes Apr 9, 2026

View reviewed changes

tiankongdeguiji merged commit 11b935b into alibaba:master Apr 9, 2026
6 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close#469

[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close#469
tiankongdeguiji merged 2 commits into
alibaba:masterfrom
tiankongdeguiji:fix/tdm-parquet-chunked-array

tiankongdeguiji commented Apr 8, 2026

Uh oh!

github-actions Bot Apr 9, 2026

Uh oh!

github-actions Bot Apr 9, 2026

Uh oh!

github-actions Bot commented Apr 9, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

tiankongdeguiji commented Apr 8, 2026

Summary

Test plan

Uh oh!

github-actions Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot Apr 9, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions Bot commented Apr 9, 2026

Code Review Summary

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants