
feat(ingest/datahub): Improvements, bug fixes, and docs #8735

Merged · 5 commits · Aug 29, 2023

Conversation

asikowitz (Collaborator, author):

Makes the following changes:

  • Add a CASE expression on version = 0 for improved ordering (see the SQL sketch after this list). Note this only matters when comparing aspects with the exact same createdon for the same urn and aspect; in general, we rely on createdon to impart the correct ordering, not version.
  • Do not overwrite the lastObserved time in system metadata.
  • Allow database_connection and kafka_connection to be None, if you only want to ingest from one of them.
  • Rename mysql -> database.
  • Store stop_time in the report.
  • Report workunits.
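To make the ordering point concrete, here is a hedged sketch of the query shape (table and column names follow DataHub's metadata_aspect_v2 schema; the exact query in this PR may differ):

```python
# Hedged sketch, not the literal query from this PR. In metadata_aspect_v2,
# version 0 always holds the *latest* state of an aspect, so among rows that
# share the same urn, aspect, and createdon, version 0 must sort last for a
# replay to end on the newest value.
DATAHUB_SOURCE_QUERY = """
    SELECT urn, aspect, metadata, systemmetadata, createdon, version
    FROM metadata_aspect_v2
    ORDER BY
        createdon,
        urn,
        aspect,
        CASE WHEN version = 0 THEN 1 ELSE 0 END,  -- version 0 (latest) last
        version
"""
```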

Still need to test with Postgres. In general, I'd like to set up an integration test to make development of this source safer.

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a usage guide has been added for it.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

asikowitz requested a review from hsheth2 on August 28, 2023, 16:32.
github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Aug 28, 2023.
@@ -55,7 +55,6 @@ def assert_metadata_files_equal(
     output = load_json_file(output_path)

     if update_golden and not golden_exists:
-        golden = load_json_file(output_path)
asikowitz (author): Unused

@@ -27,7 +27,7 @@ def _assert_checkpoint_deserialization(
 ) -> Checkpoint:
     # Serialize a checkpoint aspect with the previous state.
     checkpoint_aspect = DatahubIngestionCheckpointClass(
-        timestampMillis=int(datetime.now().timestamp() * 1000),
+        timestampMillis=int(datetime.now(tz=timezone.utc).timestamp() * 1000),
asikowitz (author): As requested in the previous PR.
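For illustration, a standalone sketch (not code from this PR) of why the timezone-aware form is preferred:

```python
from datetime import datetime, timezone

# Naive: .timestamp() must assume the naive datetime is in local time,
# which is platform-dependent and ambiguous around DST transitions.
naive_millis = int(datetime.now().timestamp() * 1000)

# Aware: the UTC offset travels with the object, so the conversion to
# epoch milliseconds is unambiguous.
aware_millis = int(datetime.now(tz=timezone.utc).timestamp() * 1000)

# Both normally agree, since .timestamp() converts through the epoch.
assert abs(aware_millis - naive_millis) < 1000
```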

Comment on lines +82 to +83
+        if self.config.database_connection is None:
+            return
asikowitz (author): This is repetitive, but I prefer it to (i) an assertion or (ii) passing database_connection as an argument.

Reviewer (collaborator): I'm actually fine with an assertion here.

asikowitz (author): I don't really like assertions, because if we ever make code changes where the assertion is no longer valid, it's a really bad error to display to the user. Plus, it bypasses the type system.
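A minimal sketch of the two options being weighed (the method names here are hypothetical, not from the PR):

```python
from typing import Iterable, Optional


class DatabaseConnectionConfig:  # stand-in for the real config class
    pass


class DataHubSourceConfig:
    database_connection: Optional[DatabaseConnectionConfig] = None


class DataHubSource:
    def __init__(self, config: DataHubSourceConfig) -> None:
        self.config = config

    def _get_database_workunits(self) -> Iterable[str]:
        # Explicit check: the type checker narrows the Optional, and a
        # missing connection is a clean no-op instead of a crash.
        if self.config.database_connection is None:
            return
        yield from self._read(self.config.database_connection)

    def _get_database_workunits_with_assert(self) -> Iterable[str]:
        # Assertion variant: if a future refactor makes the assumption
        # false, the user sees an opaque AssertionError.
        assert self.config.database_connection is not None
        yield from self._read(self.config.database_connection)

    def _read(self, conn: DatabaseConnectionConfig) -> Iterable[str]:
        yield "workunit"
```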

@@ -27,24 +28,26 @@ class DataHubKafkaReader(Closeable):
     def __init__(
         self,
         config: DataHubSourceConfig,
+        connection_config: KafkaConsumerConnectionConfig,
asikowitz (author): Passed in separately so that it can be non-optional, avoiding assertions.

Reviewer (collaborator): What if we only passed in one thing, but then did this in __init__:

assert self.config.kafka_connection
self.connection_config = self.config.kafka_connection

asikowitz (author): Same thing here; in general I don't like assertions. This requires callers to pass in a non-optional consumer connection, rather than requiring them to know that the class assumes one exists.
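A sketch of the resulting pattern (simplified; the helper function name is hypothetical): the Optional is unwrapped once at the call site, and the reader's signature guarantees a concrete connection:

```python
from typing import Optional


class KafkaConsumerConnectionConfig:
    bootstrap: str = "localhost:9092"


class DataHubSourceConfig:
    kafka_connection: Optional[KafkaConsumerConnectionConfig] = None


class DataHubKafkaReader:
    def __init__(
        self,
        config: DataHubSourceConfig,
        connection_config: KafkaConsumerConnectionConfig,  # non-optional by signature
    ) -> None:
        self.config = config
        self.connection_config = connection_config


def maybe_build_reader(config: DataHubSourceConfig) -> Optional[DataHubKafkaReader]:
    # The single None check lives with the caller; the reader never has
    # to assert that the connection exists.
    if config.kafka_connection is None:
        return None
    return DataHubKafkaReader(config, config.kafka_connection)
```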

* If you are migrating large amounts of data, consider scaling consumer replicas.
  - Increase the number of gms pods to add redundancy and increase resilience to node evictions.
* If you are migrating large amounts of data, consider increasing Elasticsearch's
  thread count via the `ELASTICSEARCH_THREAD_COUNT` environment variable.
Reviewer (collaborator): This setting primarily helps with read traffic on the Elasticsearch side; writes should already be pretty efficient with the bulkProcessor (especially for non-deletes), but it shouldn't hurt to have this in there.
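For illustration only, one place this environment variable could be set in a docker-compose style deployment (the service name and value are assumptions, not from this PR):

```yaml
# Hypothetical fragment: service name and thread count are illustrative.
services:
  datahub-gms:
    environment:
      - ELASTICSEARCH_THREAD_COUNT=2
```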

(Resolved review threads on metadata-ingestion/docs/sources/datahub/README.md, metadata-ingestion/docs/sources/datahub/datahub_recipe.yml, and metadata-ingestion/docs/sources/datahub/datahub_pre.md)
      enabled: true
      ignore_old_state: false
    extractor_config:
      set_system_metadata: false # Replicate system metadata
Reviewer (collaborator): Eventually I'd like to move these to the "flags" section that we added.

asikowitz (author): Ah, I thought the flags section would be for relatively temporary flags. You're thinking we'd store permanent configs in there as well?

Reviewer (collaborator): Let's talk about it more later. We do need a home for some of these things, but I'm not sure where that should be.
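For reference, a hedged sketch of where this flag sits in a recipe (only the keys quoted above appear in this thread; the surrounding layout is an assumption):

```yaml
# Sketch only: placement of extractor_config relative to stateful_ingestion
# is inferred, and connection details are omitted.
source:
  type: datahub
  config:
    stateful_ingestion:
      enabled: true
      ignore_old_state: false
    extractor_config:
      set_system_metadata: false # Replicate system metadata
```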

asikowitz requested a review from hsheth2 on August 28, 2023, 18:56.
asikowitz merged commit 40d17f0 into datahub-project:master on Aug 29, 2023. 51 checks passed.
asikowitz deleted the update-datahub-source branch on August 29, 2023, 18:33.