feat(ingest): Add DataHub source #8561
Conversation
```diff
@@ -314,7 +314,7 @@ def auto_empty_dataset_usage_statistics(
     logger.warning(
         f"Usage statistics with unexpected timestamps, bucket_duration={config.bucket_duration}:\n"
         ", ".join(
-            str(datetime.fromtimestamp(ts, tz=timezone.utc))
+            str(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))
```
yikes - kinda bad that we missed this
It's just in a warning that shouldn't get hit right now, but yeah, not great, because I believe this will raise an out of bounds exception
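For context: `datetime.fromtimestamp` expects seconds, so feeding it epoch milliseconds lands tens of thousands of years past `datetime`'s supported range. A minimal sketch of the failure and the fix, with an illustrative timestamp:

```python
from datetime import datetime, timezone

ts = 1_690_000_000_000  # epoch milliseconds (illustrative value)

# Milliseconds passed as seconds puts the date past year 9999, so this raises
# ValueError/OverflowError rather than printing a bogus date:
# datetime.fromtimestamp(ts, tz=timezone.utc)

# Converting to seconds first gives the intended result:
print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))
# 2023-07-22 04:26:40+00:00
```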
metadata-ingestion/setup.py
```diff
@@ -255,6 +255,10 @@ def get_long_description():
     "requests",
 }

+mysql = sql_common | {"pymysql>=1.0.2"}
+
+kafka = kafka_common | kafka_protobuf
```
do we need kafka_protobuf for the datahub source?
we should only need kafka_common right?
Ah, I don't know; I just took everything. I'll try with just kafka_common.
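For reference, a self-contained sketch of the trimmed extra (the requirement sets and version pins here are stand-ins for the ones defined earlier in `metadata-ingestion/setup.py`):

```python
from typing import Dict, Set

# Stand-ins for the requirement sets defined earlier in setup.py.
sql_common: Set[str] = {"sqlalchemy>=1.4"}
kafka_common: Set[str] = {"confluent-kafka>=1.5.0"}

mysql = sql_common | {"pymysql>=1.0.2"}

# Per the review: the DataHub source should only need kafka_common,
# not the kafka_protobuf extras.
plugins: Dict[str, Set[str]] = {
    "datahub": mysql | kafka_common,
}
```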
```python
class DataHubSourceConfig(StatefulIngestionConfigBase):
    mysql_connection: MySQLConfig = Field(
        # TODO: Check, do these defaults make sense?
        default=MySQLConfig(username="datahub", password="datahub", database="datahub"),
```
MySQLConfig also has table_pattern, view_pattern, domain, etc
seems like we might need something separate here
Overall - we should always have a split of "connection config" vs "source config", where the latter inherits/embeds the former
also probably doesn't make sense to have a default here
Yup, I'll remove defaults. I guess I can start disentangling our configs here
Changes made, which didn't really make things simpler unfortunately:
- Renamed `SQLAlchemyConfig` -> `SQLCommonConfig`
- Split out the connection parts (i.e. all of them) of `BasicSQLAlchemyConfig` into `SQLAlchemyConnectionConfig`
- Created `MySQLConnectionConfig` off of `SQLAlchemyConnectionConfig`, and `MySQLConfig` now also inherits from `MySQLConnectionConfig`
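A rough sketch of the resulting hierarchy (abbreviated; `ConfigModel` stubbed in for datahub's pydantic base, field lists illustrative):

```python
from pydantic import BaseModel


class ConfigModel(BaseModel):
    """Stub for datahub's ConfigModel base class."""


class SQLAlchemyConnectionConfig(ConfigModel):
    # Pure connection settings, split out of the old BasicSQLAlchemyConfig.
    username: str = ""
    password: str = ""
    host_port: str = "localhost"
    database: str = ""


class MySQLConnectionConfig(SQLAlchemyConnectionConfig):
    # MySQL-specific connection defaults.
    scheme: str = "mysql+pymysql"


class SQLCommonConfig(ConfigModel):
    # Source-level options (table_pattern, view_pattern, domain, ...),
    # formerly SQLAlchemyConfig; abbreviated here.
    pass


class MySQLConfig(MySQLConnectionConfig, SQLCommonConfig):
    # Full MySQL source config: connection settings plus source options.
    pass
```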
```python
    )

    kafka_topic_name: str = Field(
        default="MetadataChangeLog_Timeseries_v1",
```
let's extract these to constants
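i.e. something along these lines (constant name illustrative; `BaseModel` stands in for the real config base):

```python
from pydantic import BaseModel, Field

# Hoist the literal into a module-level constant (name is illustrative).
DEFAULT_DATAHUB_KAFKA_TOPIC = "MetadataChangeLog_Timeseries_v1"


class DataHubSourceConfig(BaseModel):
    kafka_topic_name: str = Field(
        default=DEFAULT_DATAHUB_KAFKA_TOPIC,
        description="Kafka topic to read timeseries MCLs from.",
    )
```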
```python
            yield mcl, msg.offset()

        self.consumer.unassign()
```
should this be in a `finally:` block, or does it not really matter?
Probably best to do so, yeah
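A minimal sketch of why `finally:` helps in a generator: the cleanup runs on normal exhaustion and also when the generator is closed early (the consumer class here is a stand-in for the confluent-kafka Consumer):

```python
from typing import Iterator, Tuple


class StubConsumer:
    # Stand-in for the Kafka consumer held by DataHubKafkaReader.
    def unassign(self) -> None:
        print("consumer unassigned")


class Reader:
    def __init__(self) -> None:
        self.consumer = StubConsumer()

    def get_mcls(self) -> Iterator[Tuple[str, int]]:
        try:
            for offset, mcl in enumerate(["mcl-a", "mcl-b", "mcl-c"]):
                yield mcl, offset
        finally:
            # Runs when the loop finishes and also when the generator is
            # closed early (caller breaks out, or an exception propagates).
            self.consumer.unassign()


for mcl, offset in Reader().get_mcls():
    print(mcl, offset)
    break  # early exit: the finally block still releases the consumer
```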
```python
    @property
    def query(self) -> str:
        return f"""
        SELECT urn, aspect, metadata, createdon
```
we probably also want to copy system metadata
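e.g. something like the following, assuming the standard `metadata_aspect_v2` columns (the exact table and filter details in the PR may differ):

```python
class DataHubDatabaseReader:
    @property
    def query(self) -> str:
        # Also select systemmetadata so it can be re-emitted alongside each
        # aspect instead of being dropped.
        return """
        SELECT urn, aspect, metadata, systemmetadata, createdon
        FROM metadata_aspect_v2
        WHERE version = 0
        ORDER BY createdon
        """


print(DataHubDatabaseReader().query)
```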
```python
        return MetadataChangeProposalWrapper(
            entityUrn=row.urn,
            # TODO: Get rid of deserialization -- create MCPC?
            aspect=ASPECT_MAP[row.aspect].from_obj(json.loads(row.metadata)),
```
needs the post_json_transform here
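That is, run the stored JSON through `post_json_transform` (which undoes the `pre_json_transform` applied when aspects are persisted, e.g. re-shaping unions) before handing it to `from_obj`. A sketch with a trimmed-down `ASPECT_MAP`:

```python
import json

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.emitter.serialization_helper import post_json_transform
from datahub.metadata.schema_classes import DatasetPropertiesClass

# Trimmed illustration of the real ASPECT_MAP (aspect name -> aspect class).
ASPECT_MAP = {"datasetProperties": DatasetPropertiesClass}


def row_to_mcp(row) -> MetadataChangeProposalWrapper:
    # row is a DB result row with .urn, .aspect, and .metadata columns.
    return MetadataChangeProposalWrapper(
        entityUrn=row.urn,
        aspect=ASPECT_MAP[row.aspect].from_obj(
            post_json_transform(json.loads(row.metadata))
        ),
    )
```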
```python
    def commit_checkpoint(self) -> None:
        if self.state_provider.ingestion_checkpointing_state_provider:
            self.state_provider.prepare_for_commit()
            self.state_provider.ingestion_checkpointing_state_provider.commit()
```
wow this just reminds me of how much I dislike our stateful ingestion implementation
Yeah, I kinda had to hack this in because it's not really built to be committed individually -- the only interface is registering committables and committing them all at once. Being able to commit individually is pretty useful; would be nice to add at some point.
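Purely hypothetical, but an interface that allows committing one committable at a time might look like this:

```python
from typing import Dict, Protocol


class Committable(Protocol):
    def commit(self) -> None: ...


class CheckpointingStateProvider:
    def __init__(self) -> None:
        self._committables: Dict[str, Committable] = {}

    def register(self, key: str, committable: Committable) -> None:
        self._committables[key] = committable

    def commit_all(self) -> None:
        # Today's model: flush everything that was registered.
        for committable in self._committables.values():
            committable.commit()

    def commit_one(self, key: str) -> None:
        # The missing piece: commit a single registered committable
        # (e.g. just the kafka state) without flushing the rest.
        self._committables[key].commit()
```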
Some minor questions but overall looking good
```diff
         on_interval = (
             i
             and self.config.commit_state_interval
             and i % self.config.commit_state_interval == 0
         )

-        if not has_errors and (i is None or on_interval):
+        if i is None or on_interval:
```
so commits happen regardless of `commit_with_parse_errors`, while updating the state only happens conditionally?
Yeah, I swapped it because this logic seemed simpler -- it allows me to update the kafka state if there are only mysql errors, and vice versa.
makes sense
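In other words, something like this (names illustrative; the real code tracks the mysql and kafka errors separately):

```python
from typing import Optional


def maybe_commit(
    i: Optional[int],
    commit_state_interval: Optional[int],
    mysql_ok: bool,
    kafka_ok: bool,
) -> None:
    # i is None at the end of ingestion; otherwise commit every N records.
    on_interval = bool(
        i and commit_state_interval and i % commit_state_interval == 0
    )
    if i is None or on_interval:
        # The commit itself always happens; each side's state only advances
        # if that side had no parse errors.
        if mysql_ok:
            print("advance mysql createdon watermark")
        if kafka_ok:
            print("advance kafka offsets")
        print("commit checkpoint")


# Kafka had parse errors, so only the mysql watermark moves forward:
maybe_commit(i=None, commit_state_interval=1000, mysql_ok=True, kafka_ok=False)
```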
```python
        with DataHubKafkaReader(self.config, self.report, self.ctx) as reader:
            mcls = reader.get_mcls(from_offsets=from_offsets, stop_time=stop_time)
            for i, (mcl, offset) in enumerate(mcls):
                mcp = MetadataChangeProposalWrapper.try_from_mcl(mcl)
```
should we add some logging around this - call out that changeType=DELETE is not supported yet?
Yeah, might as well. So if changeType=DELETE, I think I should just drop the MCP?
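Something like this, perhaps (a sketch; `try_from_mcl` is from this PR, and the exact log wording is illustrative):

```python
import logging
from typing import Optional

from datahub.emitter.mcp import MetadataChangeProposalWrapper
from datahub.metadata.schema_classes import ChangeTypeClass, MetadataChangeLogClass

logger = logging.getLogger(__name__)


def convert_mcl(mcl: MetadataChangeLogClass) -> Optional[MetadataChangeProposalWrapper]:
    if mcl.changeType == ChangeTypeClass.DELETE:
        # Deletes aren't supported by this source yet; drop with a warning.
        logger.warning(f"Skipping unsupported changeType=DELETE for {mcl.entityUrn}")
        return None
    return MetadataChangeProposalWrapper.try_from_mcl(mcl)
```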
```diff
@@ -91,7 +91,7 @@ def get_sql_alchemy_url(self):
         pass


-class BasicSQLAlchemyConfig(SQLAlchemyConfig):
+class SQLAlchemyConnectionConfig(ConfigModel):
```
we should map out what we want the full hierarchy to look like in the future
Yeah...
```diff
@@ -27,7 +27,7 @@ def _assert_checkpoint_deserialization(
 ) -> Checkpoint:
     # Serialize a checkpoint aspect with the previous state.
     checkpoint_aspect = DatahubIngestionCheckpointClass(
-        timestampMillis=int(datetime.utcnow().timestamp() * 1000),
+        timestampMillis=int(datetime.now().timestamp() * 1000),
```
missed tz=utc here
Will add in followup
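For reference, the timezone-aware form is a one-liner:

```python
from datetime import datetime, timezone

# Timezone-aware UTC "now"; avoids the ambiguity of a naive datetime.now().
timestamp_millis = int(datetime.now(tz=timezone.utc).timestamp() * 1000)
```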
Some remaining questions on deserialization and workunit ids -- see TODOs. Also needs to be more thoroughly tested; I've only run it on my local machine against a file sink. Already got one error from that test, which makes me think I should probably not try to deserialize into MCPWs.