[Python] Metadata grows exponentially when using schema from disk #25103
Joris Van den Bossche / @jorisvandenbossche: A slightly modified example to visualize the issue (with the missing pandas and pyarrow imports added):

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"
df = pd.DataFrame({"A": [0] * 100000})
df.to_parquet(fname)

# first read
file1 = pq.ParquetFile(fname)
table1 = file1.read()
schema1 = file1.schema.to_arrow_schema()

# writing
writer = pq.ParquetWriter(fname, schema=schema1)
writer.write_table(pa.Table.from_pandas(df))
writer.close()

# second read
file2 = pq.ParquetFile(fname)
table2 = file2.read()
schema2 = file2.schema.to_arrow_schema()

Then looking at the different schemas:

>>> schema1
A: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818
>>> table1.schema
A: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
>>> schema2
A: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////4gCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 818
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130
>>> table2.schema
A: int64
-- field metadata --
PARQUET:field_id: '1'
-- schema metadata --
pandas: '{"index_columns": [{"kind": "range", "name": null, "start": 0, "' + 408
ARROW:schema: '/////2AGAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABAwAQAAAAAAAKAAwAAA' + 2130

So indeed, as you said, it's the ARROW:schema size that is accumulating. Some observations:
When overwriting Parquet files we first read the schema that is already on disk; this is mainly to deal with some type harmonizing between pyarrow and pandas (which I won't go into).

Regardless, here is a simple example (below) with no weirdness. If I continuously re-write the same file by first fetching the schema from disk, creating a writer with that schema, and then writing the same dataframe, the file size keeps growing even though the number of rows has not changed.
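For illustration, a minimal sketch of such a re-write loop (the file name, iteration count and 100000-row dataframe are arbitrary choices); on the affected pyarrow versions the printed file and footer sizes grow on every pass:

import os

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

fname = "test_metadata_size.parquet"
df = pd.DataFrame({"A": [0] * 100000})
df.to_parquet(fname)

for i in range(5):
    # fetch the schema that is currently on disk
    schema = pq.ParquetFile(fname).schema.to_arrow_schema()

    # re-write the same dataframe using that schema
    writer = pq.ParquetWriter(fname, schema=schema)
    writer.write_table(pa.Table.from_pandas(df))
    writer.close()

    # file size on disk and size of the Thrift-encoded Parquet footer
    footer_size = pq.ParquetFile(fname).metadata.serialized_size
    print(i, os.path.getsize(fname), footer_size)

The footer size here comes from FileMetaData.serialized_size, i.e. the Thrift metadata that eventually overflows as mentioned further below.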
Note: my solution was to remove the b'ARROW:schema' entry from the schema.metadata; this seems to stop the file size growing. So I wonder if the writer keeps appending to it or something? TBH I'm not entirely sure, but I have a hunch that the ARROW:schema is just the metadata serialised or something. I should also note that once the metadata gets too big, this leads to a buffer overflow in another part of the code ('thrift'), which was referenced here: https://issues.apache.org/jira/browse/PARQUET-1345
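A minimal sketch of that workaround, reusing fname, df, pa and pq from the loop above; with_metadata drops only the ARROW:schema key and keeps the rest (e.g. the pandas entry), whereas schema.remove_metadata() would drop all key-value metadata:

schema = pq.ParquetFile(fname).schema.to_arrow_schema()

# strip the accumulated ARROW:schema entry before re-using the schema
metadata = dict(schema.metadata or {})
metadata.pop(b"ARROW:schema", None)
clean_schema = schema.with_metadata(metadata)

writer = pq.ParquetWriter(fname, schema=clean_schema)
writer.write_table(pa.Table.from_pandas(df))
writer.close()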
Environment: python: 3.7.3 | packaged by conda-forge | (default, Dec 6 2019, 08:36:57)
[Clang 9.0.0 (tags/RELEASE_900/final)]
pa version: 0.16.0
pd version: 0.25.2
Reporter: Kevin Glasson
Assignee: Wes McKinney / @wesm
Note: This issue was originally created as ARROW-8980. Please see the migration documentation for further details.