Truncate stats from Parquet files #113

Closed
rdblue opened this issue Feb 26, 2019 · 6 comments · Fixed by #254

Comments

@rdblue
Contributor

rdblue commented Feb 26, 2019

Lower and upper bound values from Parquet files are not currently truncated, which takes more space than necessary in manifests. Truncating strings and binary values will probably improve performance for large tables.

@rdblue
Contributor Author

rdblue commented Feb 26, 2019

@aokolnychyi, #78 reminded me about this item as well.

@feng-tao
Member

feng-tao commented Mar 4, 2019

@rdblue, is this a good issue for a newcomer? If so, I'd like to take it :)

@rdblue
Contributor Author

rdblue commented Mar 5, 2019

@feng-tao, I think this could be a good first issue. Let us know if you need any help or context.

@feng-tao
Member

feng-tao commented Mar 5, 2019

@rdblue, I am still reading the codebase. It would be great if you could give me some context or point me to the related code path. Thanks a lot :)

@rdblue
Contributor Author

rdblue commented Mar 5, 2019

DataFile and Metrics are the classes that contain metrics and are good candidates for where truncation could happen. I think we would want truncation to be configurable using settings in TableProperties. Metrics are scraped from Parquet metadata in ParquetMetrics, which is called by ParquetWriter.

You might want to explore passing a truncate length option to ParquetWriter. The writer would pass it to ParquetMetrics to truncate values right away. The setting would come from the table when creating a writer. For that, I think you'd update the write builder in Parquet.
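
For anyone picking this up, here is a rough sketch of what truncating string bounds might look like. It is not Iceberg code: the class `BoundTruncation` and both method names are hypothetical, and it assumes bounds are compared in Unicode code point order (the same order as unsigned UTF-8 bytes). A lower bound can simply be cut to a prefix, since a prefix never sorts above the original. An upper bound cannot, because the prefix may sort below the original, so the last code point that can be incremented is bumped by one; if none can, no truncated upper bound exists.

```java
import java.util.Optional;

/** Hypothetical helper for illustration only; not part of any Iceberg API. */
public final class BoundTruncation {
  private BoundTruncation() {}

  /** Truncates a lower bound to at most {@code length} code points. A prefix always sorts <= the original. */
  public static String truncateLower(String value, int length) {
    if (length <= 0) {
      throw new IllegalArgumentException("length must be positive: " + length);
    }
    if (value.codePointCount(0, value.length()) <= length) {
      return value; // already short enough
    }
    // offsetByCodePoints avoids cutting a surrogate pair in half
    return value.substring(0, value.offsetByCodePoints(0, length));
  }

  /**
   * Truncates an upper bound to at most {@code length} code points while keeping it
   * >= the original in code point order. Returns empty if no such value exists.
   */
  public static Optional<String> truncateUpper(String value, int length) {
    if (length <= 0) {
      throw new IllegalArgumentException("length must be positive: " + length);
    }
    if (value.codePointCount(0, value.length()) <= length) {
      return Optional.of(value);
    }
    StringBuilder prefix = new StringBuilder(truncateLower(value, length));
    while (prefix.length() > 0) {
      int last = prefix.codePointBefore(prefix.length());
      prefix.setLength(prefix.length() - Character.charCount(last)); // drop the last code point
      int next = last + 1;
      if (next == Character.MIN_SURROGATE) {
        next = Character.MAX_SURROGATE + 1; // skip the surrogate range, which holds no valid code points
      }
      if (next <= Character.MAX_CODE_POINT) {
        return Optional.of(prefix.appendCodePoint(next).toString());
      }
      // last code point was already the maximum; retry with the shorter prefix
    }
    return Optional.empty();
  }
}
```

In this sketch a writer would fall back to leaving the upper bound unset (or untruncated) when `truncateUpper` returns empty. How the length reaches the writer, for example through a table property, is left out because the thread does not settle on a name for it.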

@feng-tao
Member

feng-tao commented Mar 6, 2019

Thanks @rdblue, will take a look.
