
[Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file #31678

Open
asfimport opened this issue Apr 22, 2022 · 9 comments


@asfimport

I'm trying to follow the example here: https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files to write an example partitioned dataset, but I consistently get an error about non-equal schemas. Here's an MCVE:

from pathlib import Path
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
size = 100_000_000
partition_col = np.random.randint(0, 10, size)
values = np.random.rand(size)
table = pa.Table.from_pandas(
    pd.DataFrame({"partition_col": partition_col, "values": values})
)
metadata_collector = []
root_path = Path("random.parquet")
pq.write_to_dataset(
    table,
    root_path,
    partition_cols=["partition_col"],
    metadata_collector=metadata_collector,
)

# Write the ``_common_metadata`` parquet file without row group statistics
pq.write_metadata(table.schema, root_path / "_common_metadata")


# Write the ``_metadata`` parquet file with row group statistics of all files
pq.write_metadata(
    table.schema, root_path / "_metadata", metadata_collector=metadata_collector
)

This raises the error

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Input In [92], in <cell line: 1>()
----> 1 pq.write_metadata(
      2     table.schema, root_path / "_metadata", metadata_collector=metadata_collector
      3 )
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/parquet.py:2324, in write_metadata(schema, where, metadata_collector, **kwargs)
   2322 metadata = read_metadata(where)
   2323 for m in metadata_collector:
-> 2324     metadata.append_row_groups(m)
   2325 metadata.write_metadata_file(where)
File ~/tmp/env/lib/python3.8/site-packages/pyarrow/_parquet.pyx:628, in pyarrow._parquet.FileMetaData.append_row_groups()
RuntimeError: AppendRowGroups requires equal schemas. 

But all schemas in the metadata_collector list seem to be the same:

all(metadata_collector[0].schema == meta.schema for meta in metadata_collector)
# True 

Environment: macOS, Python 3.8.10
pyarrow: 7.0.0
pandas: 1.4.2
numpy: 1.22.3
Reporter: Kyle Barron

Note: This issue was originally created as ARROW-16287. Please see the migration documentation for further details.

@david-waterworth

This seems to be related to partition_cols: if you remove that argument from write_to_dataset, the error goes away. I cannot find a working example of writing metadata for a partitioned dataset.

@legout

legout commented Aug 22, 2023

I have the same problem for datasets in which the schemas of the parquet files are identical except for the ordering of the columns.

That means that, currently, I have to rewrite all parquet files with one unified schema (same column ordering). I wonder whether the column ordering really needs to be identical.

@mapleFU
Member

mapleFU commented Aug 22, 2023

@legout Can you show the error you're seeing and the code you're using with the dataset writer? When writing a single file the schema should be the same, but I don't fully understand how you hit this when using the dataset API.

@legout

legout commented Aug 22, 2023

Sorry for my confusing comment. Here are some more details.

The parquet files of the dataset are exports from an Oracle database, written by other software (KNIME). Unfortunately, this leads to the parquet files having different column orderings, although the data types of the columns are identical.

This means, I am able to read the dataset (parquet files) using pyarrow.dataset or pyarrow.read_table.
However, when trying to create the metadata and common metadata files according to https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files, I get this error:

RuntimeError: AppendRowGroups requires equal schemas.

I understand that the data types have to be identical, but I wonder why the column ordering matters here.

I am currently on my mobile. I'll provide some sample code later.

@legout

legout commented Aug 22, 2023

Create a toy dataset with parquet files having identical column types, but different column ordering.

import os
import tempfile

import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as pds

t1 = pa.Table.from_pydict({"A": [1, 2, 3], "B": ["a", "b", "c"]})
t2 = pa.Table.from_pydict({"B": ["a", "b", "c"], "A": [1, 2, 3]})

temp_path = tempfile.mkdtemp()

pq.write_table(t1, os.path.join(temp_path, "t1.parquet"))
pq.write_table(t2, os.path.join(temp_path, "t2.parquet"))

ds = pds.dataset(temp_path)
print(ds.to_table())
pyarrow.Table
A: int64
B: string
----
A: [[1,2,3],[1,2,3]]
B: [["a","b","c"],["a","b","c"]]

Collect metadata of the individual files and create the (global) metadata file.

metadata_collector = [frag.metadata for frag in ds.get_fragments()]

metadata = metadata_collector[0]
metadata.append_row_groups(metadata_collector[1])
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[193], line 2
      1 metadata = metadata_collector[0]
----> 2 metadata.append_row_groups(metadata_collector[1])

File ~/mambaforge/envs/pydala-dev/lib/python3.11/site-packages/pyarrow/_parquet.pyx:793, in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: A,
  path: A,
  physical_type: INT64,
  converted_type: NONE,
  logical_type: None,
  max_definition_level: 1,
  max_repetition_level: 0,
}
column descriptor = {
  name: B,
  path: B,
  physical_type: BYTE_ARRAY,
  converted_type: UTF8,
  logical_type: String,
  max_definition_level: 1,
  max_repetition_level: 0,
}

@mapleFU
Member

mapleFU commented Aug 23, 2023

>>> metadata_collector[0].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3cee80>
required group field_id=-1 schema {
  optional int64 field_id=-1 A;
  optional binary field_id=-1 B (String);
}

>>> metadata_collector[1].schema
<pyarrow._parquet.ParquetSchema object at 0x11e3ceec0>
required group field_id=-1 schema {
  optional binary field_id=-1 B (String);
  optional int64 field_id=-1 A;
}

Oh, I got it. This is not allowed, though it looks like it should be.

Because the Parquet schema is stored at the "FileMetaData" level (see https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1024), all row groups in a file must share the same schema.

@legout

legout commented Aug 23, 2023

Does this mean there is no other solution than rewriting the data with a unified column ordering?

@KernelA

KernelA commented Mar 14, 2024

I have a similar issue.

pyarrow 14.0.2

  parquet.write_metadata(
  File ".../lib/python3.9/site-packages/pyarrow/parquet/core.py", line 3589, in write_metadata
    metadata.append_row_groups(m)
  File "pyarrow/_parquet.pyx", line 807, in pyarrow._parquet.FileMetaData.append_row_groups
RuntimeError: AppendRowGroups requires equal schemas.
The two columns with index 0 differ.
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 0,
  max_repetition_level: 0,
}
column descriptor = {
  name: session_num,
  path: session_num,
  physical_type: INT64,
  converted_type: UINT_64,
  logical_type: Int(bitWidth=64, isSigned=false),
  max_definition_level: 1,
  max_repetition_level: 0,
}

All partitions have equal schemas. The example is taken from https://arrow.apache.org/docs/14.0/python/parquet.html#writing-metadata-and-common-metadata-files

@KernelA

KernelA commented Mar 24, 2024

When all fields in the schema are nullable, this error does not occur. I think it is related to #31957.

@kou kou changed the title PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file [Python] PyArrow: RuntimeError: AppendRowGroups requires equal schemas when writing _metadata file Mar 25, 2024