
Fix convert_to_parquet crash on empty struct fields#1

Merged
iarata merged 1 commit into carp-dk:main from Zeroupper:fix/parquet-empty-struct-crash
Mar 26, 2026

Conversation

@Zeroupper
Contributor

Problem

convert_to_parquet() crashes when the JSON data contains empty dict fields (e.g. "metadata": {}).

PyArrow infers {} as struct<> (a struct with zero child fields), which Parquet cannot represent. This affects data types like completedapptask and image that contain empty metadata dicts.

Error log and stacktrace

Converting: 655902it [00:12, 50554.63it/s]

Traceback (most recent call last):
  File "src/carp/reader.py", line 1034, in convert_to_parquet
    self._flush_buffer_to_parquet(
        safe_name, buffers[safe_name], writers, output_dir
    )
  File "src/carp/reader.py", line 1071, in _flush_buffer_to_parquet
    writers[name] = pq.ParquetWriter(file_path, table.schema)
  File "pyarrow/parquet/core.py", line 1070, in __init__
    self.writer = _parquet.ParquetWriter(
        sink, schema, ...)
  File "pyarrow/_parquet.pyx", line 2363, in pyarrow._parquet.ParquetWriter.__cinit__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Cannot write struct type 'metadata' with no child field to Parquet. Consider adding a dummy child field.

Root cause

Some items in the JSON (e.g. completedapptask, image) contain fields like:

"taskData": { "metadata": {} }

pa.Table.from_pylist() infers this as metadata: struct<>, which pq.ParquetWriter rejects because Parquet has no representation for an empty struct.

Additionally, the previous schema handling silently swallowed cast failures (except: pass), which could lead to silent data loss when schemas drifted across batches (e.g. location type with varying fields).

Fix

  • Replace pa.Table.from_pylist(buffer) with pd.json_normalize(buffer) + pa.Table.from_pandas(df). This flattens nested dicts into dot-separated columns (e.g. measurement.data.steps), avoiding empty struct inference entirely.
  • Replace the silent cast-and-pass schema mismatch handling with pa.unify_schemas(), which merges schemas across batches so new columns get nulls in earlier rows instead of being silently dropped.

Flatten nested dicts via pd.json_normalize before writing to Parquet.
This avoids PyArrow inferring empty struct types (e.g. "metadata": {})
which Parquet cannot represent. Also fixes schema drift across batches
by using pa.unify_schemas instead of silent cast-and-pass.
@bardram bardram requested a review from iarata March 25, 2026 08:29
@iarata iarata merged commit e00b048 into carp-dk:main Mar 26, 2026
