@janfrederickk

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions bot added the ingestion and community-contribution labels on Mar 26, 2025
@codecov

codecov bot commented Mar 26, 2025

Codecov Report

Attention: Patch coverage is 36.36364% with 7 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...tahub/ingestion/source/schema_inference/parquet.py | 36.36% | 7 Missing ⚠️

❌ Your patch check has failed because the patch coverage (36.36%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

@datahub-cyborg bot added the needs-review label on Mar 26, 2025
    Returns:
        Dict: Parsed metadata fields dictionary
    """
    return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"]
Contributor

This is the same, right?

Suggested change:
- return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"]
+ return json.loads(schema_metadata.decode("utf-8"))["fields"]

Unless necessary, I would avoid depending on pandas for this.

For resilience, we should also account for the possibility that the "fields" key might be missing.
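
A minimal sketch of the two points above: parsing with the standard library instead of pandas, and tolerating a missing "fields" key. The function name `parse_metadata` and its bytes argument come from the PR; the empty-list fallback and the handling of malformed input are assumptions, not the merged implementation.

```python
import json
from typing import Any, Dict, List


def parse_metadata(schema_metadata: bytes) -> List[Dict[str, Any]]:
    """Parse the raw metadata bytes and return the list of field entries."""
    try:
        parsed = json.loads(schema_metadata.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        # Assumption: malformed metadata is treated the same as no metadata.
        return []
    if not isinstance(parsed, dict):
        return []
    # Fall back to an empty list when the "fields" key is absent.
    fields = parsed.get("fields", [])
    return fields if isinstance(fields, list) else []
```

Returning an empty list keeps the caller's loop unchanged; raising or logging a warning would be an equally valid choice depending on how strict the source should be.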

@datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label on Apr 29, 2025
for name, pyarrow_type in zip(schema.names, schema.types):
    mapped_type = map_pyarrow_type(pyarrow_type)

    description = get_column_metadata(meta_data_fields, name)
Contributor

Instead of traversing meta_data_fields for every column, you could have parse_metadata build a dictionary indexed by column name, so each lookup is a single dictionary access.
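
A rough sketch of that idea, so the per-column lookup becomes a dictionary access rather than a scan. The helper name `parse_metadata_by_column` is hypothetical, and the assumption that each field entry carries its description under `metadata["comment"]` is not confirmed by the PR.

```python
import json
from typing import Dict, Optional


def parse_metadata_by_column(schema_metadata: bytes) -> Dict[str, Optional[str]]:
    """Return a mapping of column name to description (hypothetical helper)."""
    fields = json.loads(schema_metadata.decode("utf-8")).get("fields", [])
    descriptions: Dict[str, Optional[str]] = {}
    for field in fields:
        name = field.get("name")
        if name is None:
            continue
        # Assumption: the description lives under metadata["comment"].
        descriptions[name] = (field.get("metadata") or {}).get("comment")
    return descriptions


# Example input shaped like the assumed metadata; not taken from a real file.
raw = (
    b'{"fields": ['
    b'{"name": "id", "metadata": {"comment": "Primary key"}}, '
    b'{"name": "ts", "metadata": {}}]}'
)
descriptions = parse_metadata_by_column(raw)

# One dictionary lookup per column; unknown columns yield None.
for name in ["id", "ts", "other"]:
    description = descriptions.get(name)
```

The index is built once per schema, so the loop over columns no longer depends on the number of metadata entries.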

Comment on lines +118 to +120
meta_data_fields = parse_metadata(
    schema.metadata[b"org.apache.spark.sql.parquet.row.metadata"]
)
Contributor

Is there a guarantee that this metadata field will always exist? We should consider treating it as optional.
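
One way to treat the key as optional, sketched under assumptions: the helper name `get_spark_row_metadata` and the constant are hypothetical, and the function takes a plain mapping so it can be exercised without pyarrow. In pyarrow, `Schema.metadata` is `None` when no key-value metadata is set, which is why the `None` case is handled explicitly.

```python
from typing import Mapping, Optional

SPARK_ROW_METADATA_KEY = b"org.apache.spark.sql.parquet.row.metadata"


def get_spark_row_metadata(
    metadata: Optional[Mapping[bytes, bytes]],
) -> Optional[bytes]:
    """Return the raw Spark row metadata, or None when it is absent."""
    if not metadata:
        # Covers both None (no metadata at all) and an empty mapping.
        return None
    # .get avoids a KeyError for files that lack this key.
    return metadata.get(SPARK_ROW_METADATA_KEY)
```

The caller can then skip description extraction when the result is `None`, for example `raw = get_spark_row_metadata(schema.metadata)` followed by a check before calling `parse_metadata`.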

Contributor

@sgomezvillamor left a comment

Thanks for the contribution.
Overall it looks good.
Beyond the code suggestions, are there any integration tests that could be added or updated to show the impact of this new feature?

Labels

community-contribution, ingestion, pending-submitter-response

3 participants