-
Notifications
You must be signed in to change notification settings - Fork 3.3k
feat(ingestion): add column level description for parquet files #12988
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Codecov ReportAttention: Patch coverage is
❌ Your patch check has failed because the patch coverage (36.36%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage. 📢 Thoughts on this report? Let us know! |
| Returns: | ||
| Dict: Parsed metadata fields dictionary | ||
| """ | ||
| return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the same, right?
| return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"] | |
| return json.loads(schema_metadata.decode("utf-8"))["fields"] |
Unless necessary, I would avoid depending on pandas for this.
For resilience, we should also account for the possibility that the fields field might be missing.
| for name, pyarrow_type in zip(schema.names, schema.types): | ||
| mapped_type = map_pyarrow_type(pyarrow_type) | ||
|
|
||
| description = get_column_metadata(meta_data_fields, name) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of traversing meta_data_fields for every column, you could make parse_metadata to build a dictionary indexed by column name.
| meta_data_fields = parse_metadata( | ||
| schema.metadata[b"org.apache.spark.sql.parquet.row.metadata"] | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a guarantee that this metadata field will always exist? We should consider treating it as optional.
sgomezvillamor
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the contrib
Overall it looks good
Beyond some code suggestions, is there any integration tests that could be added/updated to see the impact of this new feature?
Checklist