
[Python] Metadata grows exponentially when using schema from disk #25103

Closed
asfimport opened this issue May 29, 2020 · 2 comments

asfimport commented May 29, 2020

When overwriting parquet files we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).

Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed.

Note: my workaround was to remove the b'ARROW:schema' entry from schema.metadata, which seems to stop the file size from growing (a sketch of this workaround follows the output below). So I wonder if the writer keeps appending to it or something? To be honest I'm not entirely sure, but I have a hunch that ARROW:schema is just the metadata serialised or something.

I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code ('thrift'), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345

import pathlib
import sys

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def main():
    print(f"python: {sys.version}")
    print(f"pa version: {pa.__version__}")
    print(f"pd version: {pd.__version__}")

    fname = "test.pq"
    path = pathlib.Path(fname)

    df = pd.DataFrame({"A": [0] * 100000})
    df.to_parquet(fname)
    print(f"Wrote test frame to {fname}")
    print(f"Size of {fname}: {path.stat().st_size}")

    for _ in range(5):
        file = pq.ParquetFile(fname)
        tmp_df = file.read().to_pandas()
        print(f"Number of rows on disk: {tmp_df.shape}")
        print("Reading schema from disk")
        schema = file.schema.to_arrow_schema()
        print("Creating new writer")
        writer = pq.ParquetWriter(fname, schema=schema)
        print("Re-writing the dataframe")
        writer.write_table(pa.Table.from_pandas(df))
        writer.close()
        print(f"Size of {fname}: {path.stat().st_size}")


if __name__ == "__main__":
    main()
(sdm) ➜ ~ python growing_metadata.py
python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Wrote test frame to test.pq
Size of test.pq: 1643
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 3637
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 8327
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 19301
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 44944
Number of rows on disk: (100000, 1)
Reading schema from disk
Creating new writer
Re-writing the dataframe
Size of test.pq: 104815
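
For reference, the workaround mentioned in the note above might look roughly like the sketch below. This is not code from the original report; it is a minimal sketch that assumes the test.pq file produced by the script above, and uses Schema.with_metadata, which returns a copy of the schema with the given key/value metadata.

import pyarrow.parquet as pq

fname = "test.pq"
file = pq.ParquetFile(fname)
schema = file.schema.to_arrow_schema()

# schema.metadata is a dict of bytes keys/values; drop the serialized
# Arrow schema entry so it does not get re-embedded on the next write.
metadata = dict(schema.metadata or {})
metadata.pop(b"ARROW:schema", None)
clean_schema = schema.with_metadata(metadata)

writer = pq.ParquetWriter(fname, schema=clean_schema)

Per the report, stripping this entry before creating the writer is enough to keep the file size from growing across rewrites.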

Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm

Note: This issue was originally created as ARROW-8980. Please see the migration documentation for further details.


Joris Van den Bossche / @jorisvandenbossche:
[~kevinglasson] thanks for the report!

A slightly modified example to visualize the issue:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"
df = pd.DataFrame({"A": [0] * 100000})
df.to_parquet(fname)

# first read
file1 = pq.ParquetFile(fname)
table1 = file1.read()
schema1 = file1.schema.to_arrow_schema()

# writing
writer = pq.ParquetWriter(fname, schema=schema1)
writer.write_table(pa.Table.from_pandas(df))
writer.close()

# second read
file2 = pq.ParquetFile(fname)
table2 = file2.read()
schema2 = file2.schema.to_arrow_schema()

and then looking at the different schemas:

>>> schema1                                                                                                                                                                               
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818

>>> table1.schema                                                                                                                                                                         
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408

>>> schema2  
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

>>> table2.schema
A: int64
  -- field metadata --
  PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

So indeed, as you said, it's the ARROW:schema size that is accumulating.

Some observations:

  • In the actual Table.schema, the ARROW:schema field is removed from the metadata after reading. Side note: using this instead of file.schema.to_arrow_schema() could be a temporary workaround for you (see the sketch after this list).
  • When converting the ParquetSchema to a pyarrow Schema, we don't remove the "ARROW:schema" key, which we probably should do? That information is only used to properly reconstruct the arrow schema, so once you have the arrow schema, we can drop this metadata entry, similar to what we do when reading the actual file.
  • When writing with a schema that already has an "ARROW:schema" metadata field, another field (with a duplicated key) gets added. I suppose this might be expected, since the metadata doesn't check for duplicate keys right now, but it would also help in this case if the field were overwritten.
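
A minimal sketch of the workaround from the first bullet, not from the original comment: re-use the schema of the Table that was read (which, as shown above, no longer carries the ARROW:schema entry) instead of file.schema.to_arrow_schema(). It assumes the test_metadata_size.parquet file from the modified example.

import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"

file = pq.ParquetFile(fname)
table = file.read()

# table.schema has the ARROW:schema entry stripped after reading,
# unlike file.schema.to_arrow_schema(), so re-writing with it does
# not accumulate duplicated metadata.
writer = pq.ParquetWriter(fname, schema=table.schema)
writer.write_table(table)
writer.close()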


Wes McKinney / @wesm:
Issue resolved by pull request #7577
