feat(ingestion/kafka): add description in dataset properties #7974

shubhamjagtap639 · 2023-05-05T04:57:25Z

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Tests for the changes have been added/updated (if applicable)
Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

Set the dataset description in dataset properties from the schema's top-level doc when the schema type is Avro.

hsheth2 · 2023-05-09T02:40:19Z

metadata-ingestion/tests/unit/test_kafka_source.py

@@ -175,7 +175,7 @@ def test_kafka_source_workunits_no_platform_instance(mock_kafka, mock_admin_clie
        env="PROD",
    )

-    # DataPlatform aspect should be present when platform_instance is configured
+    # DataPlatform aspect should not be present when platform_instance is configured


don't think this change is correct

Sorry, the comment should read as follows, right?
"DataPlatform aspect should not be present when platform_instance is not configured."
We are testing Kafka source workunits with no platform instance here.

hsheth2 · 2023-05-09T02:41:41Z

metadata-models/src/main/pegasus/com/linkedin/schema/KafkaSchema.pdl

+  /**
+   * The native kafka key schema type. This can be AVRO/PROTOBUF/JSON.
+   */
+  keySchemaType: optional string


what was the motivation for adding these two fields? what other alternatives did you consider?

I had two reasons for adding these two fields:

1. Clearer metadata: a new user looking at the ingested Kafka metadata cannot tell which schema type is associated with a topic. Being new to this myself, I ran into the same problem.

2. Setting the top-level doc field: the task was to set the top-level doc field as the dataset description only if the schema type is AVRO. Because the dataset properties are generated in the outer function, i.e. _extract_records, the schema type has to be available there for the condition. Hence I added those fields.

Alternatives:
We could add a separate function such as get_description() in kafka.py, but that would duplicate code and call the same metadata-fetching APIs again.
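
For readers following the thread, here is a minimal sketch of the condition being discussed: take the top-level doc of the schema as the dataset description only when the schema type is AVRO. This is not the code from the PR. The function name `top_level_doc`, the `AVRO` constant, and the way the schema string is passed in are all assumptions made for illustration, and the sketch uses only the standard library rather than DataHub or schema registry APIs.

```python
import json
from typing import Optional

# Hypothetical constant for illustration; the PR may spell this differently.
AVRO = "AVRO"


def top_level_doc(schema_type: Optional[str], schema_str: str) -> Optional[str]:
    """Return the top-level `doc` of an Avro schema, or None if unavailable.

    `doc` is an optional attribute of Avro record schemas, so the absence of a
    description is a normal outcome, not an error.
    """
    # Only Avro schemas are considered; other schema types yield no description.
    if schema_type != AVRO:
        return None
    try:
        parsed = json.loads(schema_str)
    except json.JSONDecodeError:
        return None
    # A primitive schema such as "string" parses to a str and has no doc.
    if not isinstance(parsed, dict):
        return None
    doc = parsed.get("doc")
    return doc if isinstance(doc, str) and doc else None


if __name__ == "__main__":
    value_schema = json.dumps(
        {"type": "record", "name": "User", "doc": "User events", "fields": []}
    )
    print(top_level_doc(AVRO, value_schema))
    print(top_level_doc("JSON", value_schema))
```

Keeping the schema type next to the schema string, as the new fields do, lets a check like this run in the outer function without fetching the schema metadata a second time.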

hsheth2 · 2023-05-09T03:59:22Z

metadata-ingestion/src/datahub/ingestion/source/kafka.py

+            # Point to note:
+            # documentSchema and keySchema both can have the doc, however we are retrieving doc
+            # from documentSchema and setting it as dataset description.
+            # doc is optional property in both i.e. documentSchema and keySchema


please make this comment more concise / clear

hsheth2 · 2023-05-09T07:02:09Z

@shubhamjagtap639 also it looks like the tests are failing

shubhamjagtap639 added 5 commits May 3, 2023 16:24

feat(ingesion/kafka): add description in dataset properties

1f510e5

add description in dataset properties as top-level doc if schema type avro

test(ingestion/kafka): add test for updated kafka schema

daaed8a

refactor(ingestion/kafka): add constant in kafka extract record

27b9c7e

refactor(ingestion/kafka): code change as per PR comments

3e3a679

refactor(ingestion/kafka): add comment for set dataset description

37ca7a6

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) May 5, 2023

shubhamjagtap639 changed the title ~~feat(ingesion/kafka): add description in dataset properties~~ feat(ingestion/kafka): add description in dataset properties May 5, 2023

vercel bot had a problem deploying to Preview May 5, 2023 05:06 Failure

hsheth2 reviewed May 9, 2023

Merge branch 'master' into kafka-dataset-properties-populate-descript…

d9a287d

…ions

vercel bot deployed to Preview May 9, 2023 04:11 View deployment

schema fixes

aa61ea3

siddiquebagwan-gslab added 2 commits May 9, 2023 14:03

Merge branch 'kafka-dataset-properties-populate-descriptions' of gith…

3c3acb8

…ub.com:shubhamjagtap639/datahub into kafka-dataset-properties-populate-descriptions

Merge branch 'master' into kafka-dataset-properties-populate-descript…

054d782

…ions

vercel bot deployed to Preview May 9, 2023 09:13 View deployment

refactor(ingestion/kafka): code comments modify

3d3921c

vercel bot deployed to Preview May 9, 2023 12:46 View deployment

shubhamjagtap639 and others added 2 commits May 10, 2023 18:48

Merge pull request #2 from shubhamjagtap639/merge-latest-commits

239c36a

Merge latest commits

Merge branch 'master' into kafka-dataset-properties-populate-descript…

149aa77

…ions

vercel bot deployed to Preview May 11, 2023 04:37 View deployment

Merge branch 'master' into kafka-dataset-properties-populate-descript…

dcdb48f

…ions

vercel bot deployed to Preview May 11, 2023 12:54 View deployment

shubhamjagtap639 and others added 5 commits May 15, 2023 10:59

Merge pull request #4 from shubhamjagtap639/merge-latest-code

4c10468

Merge latest code

Merge branch 'datahub-project:master' into master

fdc73fb

Merge branch 'master' into kafka-dataset-properties-populate-descript…

4d4ce1b

…ions

Merge branch 'master' into kafka-dataset-properties-populate-descript…

52cd975

…ions

update doc

ec0b4e2

vercel bot deployed to Preview May 16, 2023 10:54 View deployment

hsheth2 approved these changes May 17, 2023

hsheth2 merged commit 8cc6606 into datahub-project:master May 17, 2023
47 checks passed

shubhamjagtap639 deleted the kafka-dataset-properties-populate-descriptions branch December 11, 2023 11:05
