
[Python] Metadata grows exponentially when using schema from disk #25103

Closed
asfimport opened this issue May 29, 2020 · 2 comments

asfimport commented May 29, 2020

When overwriting parquet files we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).

Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed.

Note: my workaround was to remove the b'ARROW:schema' entry from schema.metadata, which seems to stop the file size from growing (a sketch of this workaround follows the output below). So I wonder if the writer keeps appending to it or something? To be honest I'm not entirely sure, but I have a hunch that ARROW:schema is just the metadata serialised or something.

I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code ('thrift'), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345

import pathlib
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 100000})
    df.to_parquet(fname)
    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
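
For reference, the workaround mentioned in the note above might look roughly like the sketch below. This is not code from the original report; it is a minimal sketch that assumes the test.pq file produced by the script above, and uses Schema.with_metadata, which returns a copy of the schema with the given key/value metadata.

import pyarrow.parquet as pq

fname = "test.pq"
file = pq.ParquetFile(fname)
schema = file.schema.to_arrow_schema()

# schema.metadata is a dict of bytes keys/values; drop the serialized
# Arrow schema entry so it does not get re-embedded on the next write.
metadata = dict(schema.metadata or {})
metadata.pop(b"ARROW:schema", None)
clean_schema = schema.with_metadata(metadata)

writer = pq.ParquetWriter(fname, schema=clean_schema)

Per the report, stripping this entry before creating the writer is enough to keep the file size from growing across rewrites.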

Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-8980. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~kevinglasson] thanks for the report!

A slightly modified example to visualize the issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"
df = pd.DataFrame({"A": [0] * 100000})
df.to_parquet(fname)

# first read
file1 = pq.ParquetFile(fname)
table1 = file1.read()
schema1 = file1.schema.to_arrow_schema()

# writing
writer = pq.ParquetWriter(fname, schema=schema1)
writer.write_table(pa.Table.from_pandas(df))
writer.close()

# second read
file2 = pq.ParquetFile(fname)
table2 = file2.read()
schema2 = file2.schema.to_arrow_schema()

and then looking at the different schemas:

>>> schema1                                                                                                                                                                               
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818

>>> table1.schema                                                                                                                                                                         
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408

>>> schema2  
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

>>> table2.schema
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

So indeed, as you said, it's the ARROW:schema size that is accumulating.

Some observations:

  • In the actual Table.schema, the ARROW:schema field is removed from the metadata after reading. Side note: using this instead of file.schema.to_arrow_schema() could be a temporary workaround for you (see the sketch after this list).
  • When converting the ParquetSchema to a pyarrow Schema, we don't remove the "ARROW:schema" key, which we probably should do? That information is only used to properly reconstruct the arrow schema, so once you have the arrow schema, we can drop this metadata entry, similar to what we do when reading the actual file.
  • When writing with a schema that already has an "ARROW:schema" metadata field, another field (with a duplicated key) gets added. I suppose this might be expected, since the metadata doesn't check for duplicate keys right now, but it would also help in this case if the field were overwritten.
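
A minimal sketch of the workaround from the first bullet, not from the original comment: re-use the schema of the Table that was read (which, as shown above, no longer carries the ARROW:schema entry) instead of file.schema.to_arrow_schema(). It assumes the test_metadata_size.parquet file from the modified example.

import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"

file = pq.ParquetFile(fname)
table = file.read()

# table.schema has the ARROW:schema entry stripped after reading,
# unlike file.schema.to_arrow_schema(), so re-writing with it does
# not accumulate duplicated metadata.
writer = pq.ParquetWriter(fname, schema=table.schema)
writer.write_table(table)
writer.close()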


Wes McKinney / @wesm:
Issue resolved by pull request #7577
