feat(ingestion): add column level description for parquet files #12988

janfrederickk · 2025-03-26T14:03:24Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

codecov · 2025-03-26T14:06:10Z

Codecov Report

Attention: Patch coverage is 36.36364% with 7 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
...tahub/ingestion/source/schema_inference/parquet.py	36.36%	7 Missing ⚠️

❌ Your patch check has failed because the patch coverage (36.36%) is below the target coverage (75.00%). You can increase the patch coverage or adjust the target coverage.

sgomezvillamor · 2025-04-29T13:49:11Z

metadata-ingestion/src/datahub/ingestion/source/schema_inference/parquet.py

+    Returns:
+        Dict: Parsed metadata fields dictionary
+    """
+    return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"]


This is the same, right?

Suggested change

-    return pandas.read_json(schema_metadata.decode("utf-8")).to_dict()["fields"]
+    return json.loads(schema_metadata.decode("utf-8"))["fields"]

Unless necessary, I would avoid depending on pandas for this.

For resilience, we should also account for the possibility that the "fields" key might be missing.
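A minimal sketch of what the suggestion could look like with only the standard library. The fallback to an empty list on missing or malformed input is an assumption for illustration, not something the PR specifies; the function name mirrors the one in the diff.

```python
import json
from typing import Any, Dict, List


def parse_metadata(schema_metadata: bytes) -> List[Dict[str, Any]]:
    """Parse the raw metadata bytes and return the "fields" list.

    Assumption: an empty list is an acceptable result when the payload
    is not valid JSON or has no "fields" key.
    """
    try:
        parsed = json.loads(schema_metadata.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError):
        return []
    if not isinstance(parsed, dict):
        return []
    fields = parsed.get("fields", [])
    # Guard against a "fields" value that is not a list.
    return fields if isinstance(fields, list) else []
```

Returning an empty list keeps the caller's loop simple, since columns without metadata just end up without a description.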

sgomezvillamor · 2025-04-29T13:51:33Z

metadata-ingestion/src/datahub/ingestion/source/schema_inference/parquet.py

        for name, pyarrow_type in zip(schema.names, schema.types):
            mapped_type = map_pyarrow_type(pyarrow_type)

+            description = get_column_metadata(meta_data_fields, name)


Instead of traversing meta_data_fields for every column, you could have parse_metadata build a dictionary indexed by column name.
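One way this could look, as a sketch: build the index once and do a constant-time lookup per column. The helper name index_descriptions is hypothetical, and the assumption that each field entry carries its description under metadata["comment"] is not confirmed by the diff shown here.

```python
from typing import Any, Dict, List


def index_descriptions(fields: List[Dict[str, Any]]) -> Dict[str, str]:
    """Map column name -> description, skipping entries without one.

    Assumption: each entry looks like
    {"name": ..., "metadata": {"comment": ...}}.
    """
    descriptions: Dict[str, str] = {}
    for field in fields:
        if not isinstance(field, dict):
            continue
        name = field.get("name")
        metadata = field.get("metadata")
        comment = metadata.get("comment") if isinstance(metadata, dict) else None
        if isinstance(name, str) and isinstance(comment, str):
            descriptions[name] = comment
    return descriptions
```

Inside the column loop, the per-column call would then reduce to descriptions.get(name), which returns None for columns that have no entry.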

sgomezvillamor · 2025-04-29T13:53:16Z

metadata-ingestion/src/datahub/ingestion/source/schema_inference/parquet.py

+        meta_data_fields = parse_metadata(
+            schema.metadata[b"org.apache.spark.sql.parquet.row.metadata"]
+        )


Is there a guarantee that this metadata field will always exist? We should consider treating it as optional.
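A sketch of treating the key as optional. pyarrow exposes schema.metadata as either None or a dict of bytes to bytes, so the helper below takes that value directly; the helper name and constant name are hypothetical.

```python
from typing import Dict, Optional

SPARK_ROW_METADATA_KEY = b"org.apache.spark.sql.parquet.row.metadata"


def get_spark_row_metadata(
    schema_metadata: Optional[Dict[bytes, bytes]],
) -> Optional[bytes]:
    """Return the Spark row metadata bytes, or None when unavailable.

    Covers both a schema with no metadata at all (None) and metadata
    that lacks the Spark key, e.g. files not written by Spark.
    """
    if not schema_metadata:
        return None
    return schema_metadata.get(SPARK_ROW_METADATA_KEY)
```

The caller can then skip description parsing when the result is None instead of raising a KeyError or TypeError.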

sgomezvillamor

Thanks for the contribution.
Overall it looks good.
Beyond the code suggestions, are there any integration tests that could be added or updated to show the impact of this new feature?

feat(ingestion): add column level description for parquet files

3f4cb2a

github-actions bot added the ingestion and community-contribution labels Mar 26, 2025

datahub-cyborg bot added the needs-review label Mar 26, 2025

reformat modified files

945d3db

vercel bot temporarily deployed to Preview March 26, 2025 15:05 Inactive

Merge branch 'master' into master

e036764

vercel bot deployed to Preview March 27, 2025 08:24 View deployment

add check

94b8ff2

vercel bot deployed to Preview March 27, 2025 09:48 View deployment

janwackermark added 2 commits March 27, 2025 10:50

add tests

7db4684

remove

58e78fa

vercel bot deployed to Preview March 27, 2025 10:23 View deployment

sgomezvillamor reviewed Apr 29, 2025

datahub-cyborg bot added the pending-submitter-response label and removed the needs-review label Apr 29, 2025

sgomezvillamor reviewed Apr 29, 2025
