Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingestion/kafka): add description in dataset properties #7974

Conversation

shubhamjagtap639
Copy link
Contributor

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label May 5, 2023
@shubhamjagtap639 shubhamjagtap639 changed the title feat(ingesion/kafka): add description in dataset properties feat(ingestion/kafka): add description in dataset properties May 5, 2023
@@ -175,7 +175,7 @@ def test_kafka_source_workunits_no_platform_instance(mock_kafka, mock_admin_clie
env="PROD",
)

# DataPlatform aspect should be present when platform_instance is configured
# DataPlatform aspect should not be present when platform_instance is configured
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

don't think this change is correct

Copy link
Contributor Author

@shubhamjagtap639 shubhamjagtap639 May 9, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, The comment should be as below:
"DataPlatform aspect should not be present when platform_instance is not configured."
right?
As we are testing kafka source workunits with no platform instance here.

/**
* The native kafka key schema type. This can be AVRO/PROTOBUF/JSON.
*/
keySchemaType: optional string
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what was the motivation for adding these two fields? what other alternatives did you consider?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have two reasons for adding these two fields:

  1. More metadata clarification: If any new user see the kafka ingested metadata, he/she will not be able know exactly which type of schema is associated with topic. As I am new to this, even I felt the same.
  2. To set top-level doc field: The task was to set top-level doc field as description of dataset only if schema type is AVRO. So as we are generating dataset properties at outer function i.e. _extract_records, we will need schema type at outer function for adding condition. Hence I added those fields.

Alternatives:
We can have separated functions as get_description() in kafka.py. But this will lead to adding same code and calling same metadata fetching APIs again.

# Point to note:
# documentSchema and keySchema both can have the doc, however we are retrieving doc
# from documentSchema and setting it as dataset description.
# doc is optional property in both i.e. documentSchema and keySchema
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please make this comment more concise / clear

@hsheth2
Copy link
Collaborator

hsheth2 commented May 9, 2023

@shubhamjagtap639 also it looks like the tests are failing

@hsheth2 hsheth2 merged commit 8cc6606 into datahub-project:master May 17, 2023
47 checks passed
@shubhamjagtap639 shubhamjagtap639 deleted the kafka-dataset-properties-populate-descriptions branch December 11, 2023 11:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants