
Fix convert_to_parquet crash on empty struct fields#1

Merged
iarata merged 1 commit into carp-dk:main from Zeroupper:fix/parquet-empty-struct-crash
Mar 26, 2026

Conversation

@Zeroupper
Contributor

Problem

convert_to_parquet() crashes when the JSON data contains empty dict fields (e.g. "metadata": {}).

PyArrow infers {} as struct<> (a struct with zero child fields), which Parquet cannot represent. This affects data types like completedapptask and image that contain empty metadata dicts.

Error log and stacktrace

Converting: 655902it [00:12, 50554.63it/s]

Traceback (most recent call last):
  File "src/carp/reader.py", line 1034, in convert_to_parquet
    self._flush_buffer_to_parquet(
        safe_name, buffers[safe_name], writers, output_dir
    )
  File "src/carp/reader.py", line 1071, in _flush_buffer_to_parquet
    writers[name] = pq.ParquetWriter(file_path, table.schema)
  File "pyarrow/parquet/core.py", line 1070, in __init__
    self.writer = _parquet.ParquetWriter(
        sink, schema, ...)
  File "pyarrow/_parquet.pyx", line 2363, in pyarrow._parquet.ParquetWriter.__cinit__
  File "pyarrow/error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Cannot write struct type 'metadata' with no child field to Parquet. Consider adding a dummy child field.

Root cause

Some items in the JSON (e.g. completedapptask, image) contain fields like:

"taskData": { "metadata": {} }

pa.Table.from_pylist() infers this as metadata: struct<>, which pq.ParquetWriter rejects because Parquet has no representation for an empty struct.

Additionally, the previous schema handling silently swallowed cast failures (except: pass), which could lead to silent data loss when schemas drifted across batches (e.g. location type with varying fields).

Fix

  • Replace pa.Table.from_pylist(buffer) with pd.json_normalize(buffer) + pa.Table.from_pandas(df). This flattens nested dicts into dot-separated columns (e.g. measurement.data.steps), avoiding empty struct inference entirely.
  • Replace the silent cast-and-pass schema mismatch handling with pa.unify_schemas(), which merges schemas across batches so new columns get nulls in earlier rows instead of being silently dropped.

Flatten nested dicts via pd.json_normalize before writing to Parquet.
This avoids PyArrow inferring empty struct types (e.g. "metadata": {})
which Parquet cannot represent. Also fixes schema drift across batches
by using pa.unify_schemas instead of silent cast-and-pass.
@bardram bardram requested a review from iarata March 25, 2026 08:29
@iarata iarata merged commit e00b048 into carp-dk:main Mar 26, 2026
