[bugfix] accept ChunkedArray in Parquet/Odps/Csv writers and ensure TDM writer close #469
…DM writer close

TDM init_tree.save_node_feature crashed with "Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" when a column handed to ParquetWriter happened to be chunked. pa.RecordBatch.from_arrays only accepts pa.Array, so ParquetWriter/OdpsWriter/CsvWriter now defensively collapse ChunkedArray inputs via combine_chunks() before constructing the batch.

Also wrap writer.write/close pairs in TreeSearch.save, save_predict_edge, and save_node_feature with try/finally so close() runs (and the partial file is flushed) even when write() raises, eliminating the "You should close ParquetWriter explicitly" warning seen alongside the crash.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
output_arrays = []
for v in output_dict.values():
    if isinstance(v, pa.ChunkedArray):
        v = v.combine_chunks()
    output_arrays.append(v)
nit: This ChunkedArray flattening loop is duplicated identically in CsvWriter.write() and OdpsWriter.write(). Consider extracting it into a shared helper on BaseWriter (e.g. _flatten_chunked_arrays(output_dict)) to keep it in one place.
Also, BaseWriter.write() in dataset.py:608 still declares OrderedDict[str, pa.Array] — the base class signature should be updated to Union[pa.Array, pa.ChunkedArray] to match the subclasses.
try:
    node_writer.write(node_table_dict)
finally:
    node_writer.close()
nit: The try/finally only wraps write(), but node_writer is already open at this point. If any of the pa.array(...) calls above were to raise, the writer would not be closed. Consider widening the scope to cover the full writer lifecycle — especially relevant for OdpsWriter which holds a remote session.
try:
    node_table_dict = OrderedDict()
    node_table_dict["id"] = pa.array(ids)
    node_table_dict["weight"] = pa.array(weight)
    node_table_dict["features"] = pa.array(features)
    node_writer.write(node_table_dict)
finally:
    node_writer.close()
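To illustrate why the wider scope matters, here is a small self-contained sketch using a stub writer. StubWriter, save, and failing_build are hypothetical names introduced only for this example; they stand in for the real writer and for the pa.array(...) calls, which are assumed to be able to raise.

```python
class StubWriter:
    """Hypothetical stand-in for a writer; records whether close() ran."""

    def __init__(self):
        self.closed = False
        self.rows = None

    def write(self, table):
        self.rows = table

    def close(self):
        self.closed = True


def save(writer, build_table):
    # The try block covers table construction as well as write(),
    # so close() runs no matter which step fails.
    try:
        table = build_table()
        writer.write(table)
    finally:
        writer.close()


def failing_build():
    # Simulates a column-construction step that raises before write().
    raise ValueError("bad column")


writer = StubWriter()
try:
    save(writer, failing_build)
except ValueError:
    pass  # the error still propagates to the caller; the writer is closed
```

With the narrower scope, the same failure would leave the writer open, because the exception is raised before the try block is entered.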
Code Review Summary
Clean, well-scoped bugfix. The ChunkedArray coercion via
Two suggestions posted inline:
Minor test coverage note:
Overall: LGTM with minor suggestions. 👍
🤖 Generated with Claude Code
Extract the duplicated ChunkedArray -> Array loop from ParquetWriter, CsvWriter, and OdpsWriter into a shared BaseWriter._flatten_chunked_arrays helper. Also widen BaseWriter.write's signature to OrderedDict[str, Union[pa.Array, pa.ChunkedArray]] so the base class matches what its subclasses now accept.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
- TreeSearch.save_node_feature crashed during TDM tree init with "TypeError: Cannot convert pyarrow.lib.ChunkedArray to pyarrow.lib.Array" because pa.RecordBatch.from_arrays does not accept ChunkedArray. ParquetWriter, OdpsWriter and CsvWriter now defensively collapse any ChunkedArray column via combine_chunks() before building the batch, so every caller of the writer API is protected (not just TDM).
- The crash was accompanied by "WARNING: You should close ParquetWriter explicitly" because writer.close() was never reached when writer.write() raised. Wrapped the write/close pairs in TreeSearch.save, TreeSearch.save_predict_edge and TreeSearch.save_node_feature with try/finally so the writer is always closed (and the partial file flushed) on the failure path.
- Added test_parquet_writer_chunked_array and test_csv_writer_chunked_array, exercising the new ChunkedArray path end-to-end.

Test plan
- python -m tzrec.datasets.parquet_dataset_test ParquetWriterTest
- python -m tzrec.datasets.csv_dataset_test CsvWriterTest
- python -m tzrec.tools.tdm.gen_tree.tree_search_util_test
- pre-commit run --files <changed files>
- Run tdm_trainer_nointention_init_tree_clone against this branch and confirm init_tree reaches "Save nodes and edges table done."

🤖 Generated with Claude Code