feat(ingest): Add DataHub source #8561

Merged: 8 commits from asikowitz:datahub-source into datahub-project:master on Aug 15, 2023

Conversation

asikowitz (Collaborator) commented:
Some remaining questions on deserialization and workunit ids -- see TODOs. This also needs more thorough testing; I've only run it on my local machine against a file sink. That run already produced one error:

    'mysql_parse_errors': {'com.linkedin.pegasus2avro.common.CostCost is missing required field: fieldDiscriminator': {'cost': ['urn:li:mlModel:(urn:li:dataPlatform:science,scienceModel,PROD)']}},

which makes me think I should probably not try to deserialize into MCPWs.
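
For context, the alternative hinted at here (and in the TODO further down) would be to skip typed deserialization entirely and emit a generic MCP that carries the raw JSON through untouched. A rough sketch, assuming the row fields shown later in this PR (this is illustrative, not the merged code):

    from datahub.metadata.schema_classes import (
        ChangeTypeClass,
        GenericAspectClass,
        MetadataChangeProposalClass,
    )
    from datahub.utilities.urns.urn import guess_entity_type

    def row_to_mcp(row) -> MetadataChangeProposalClass:
        # Forward the stored JSON bytes as-is instead of parsing them into a
        # typed aspect class, which is what tripped the CostCost error above.
        return MetadataChangeProposalClass(
            entityType=guess_entity_type(row.urn),
            entityUrn=row.urn,
            changeType=ChangeTypeClass.UPSERT,
            aspectName=row.aspect,
            aspect=GenericAspectClass(
                value=row.metadata.encode("utf-8"),
                contentType="application/json",
            ),
        )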

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly the Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable); if a new feature has been added, a Usage Guide has been added for it
  • For any breaking change, potential downtime, deprecation, or big change, an entry has been made in Updating DataHub

asikowitz requested a review from hsheth2 (Aug 3, 2023 10:35)
github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) (Aug 3, 2023)
asikowitz marked this pull request as draft (Aug 3, 2023 10:35)
@@ -314,7 +314,7 @@ def auto_empty_dataset_usage_statistics(
     logger.warning(
         f"Usage statistics with unexpected timestamps, bucket_duration={config.bucket_duration}:\n"
         ", ".join(
-            str(datetime.fromtimestamp(ts, tz=timezone.utc))
+            str(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))
hsheth2 (Collaborator) commented:

yikes - kinda bad that we missed this

asikowitz (Collaborator, Author) replied:

It's just in a warning that shouldn't get hit right now, but yeah, not great: I believe the old code would raise an out-of-range exception.
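
To illustrate the failure mode (a minimal sketch, not code from this PR; the example timestamp is made up): these values are epoch milliseconds, so passing one to fromtimestamp() unscaled treats it as seconds and lands tens of thousands of years in the future.

    from datetime import datetime, timezone

    ts = 1692115200000  # epoch milliseconds for 2023-08-15 16:00:00 UTC
    print(datetime.fromtimestamp(ts / 1000, tz=timezone.utc))  # 2023-08-15 16:00:00+00:00
    # datetime.fromtimestamp(ts, tz=timezone.utc) raises ValueError/OverflowError:
    # the implied year is far beyond datetime.MAXYEAR (9999).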

@@ -255,6 +255,10 @@ def get_long_description():
"requests",
}

mysql = sql_common | {"pymysql>=1.0.2"}

kafka = kafka_common | kafka_protobuf
hsheth2 (Collaborator) commented:

do we need kafka_protobuf for the datahub source? we should only need kafka_common, right?

asikowitz (Collaborator, Author) replied:

Ah, I don't know; I just took everything. I'll try with just kafka_common.

class DataHubSourceConfig(StatefulIngestionConfigBase):
    mysql_connection: MySQLConfig = Field(
        # TODO: Check, do these defaults make sense?
        default=MySQLConfig(username="datahub", password="datahub", database="datahub"),
hsheth2 (Collaborator) commented:

MySQLConfig also has table_pattern, view_pattern, domain, etc., so it seems like we might need something separate here.

Overall, we should always have a split of "connection config" vs. "source config", where the latter inherits/embeds the former.

hsheth2 (Collaborator) commented:

It also probably doesn't make sense to have a default here.

asikowitz (Collaborator, Author) replied:

Yup, I'll remove the defaults. I guess I can start disentangling our configs here.

asikowitz (Collaborator, Author) replied:

Changes made, though unfortunately they didn't really make things simpler:

  • Renamed SQLAlchemyConfig -> SQLCommonConfig
  • Split the connection parts (i.e. all) of BasicSQLAlchemyConfig out into SQLAlchemyConnectionConfig
  • Created MySQLConnectionConfig off of SQLAlchemyConnectionConfig; MySQLConfig now also inherits from MySQLConnectionConfig

The resulting split is sketched below.
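
A minimal sketch of that hierarchy (field names and defaults are illustrative; DataHub's ConfigModel is a pydantic model, so plain pydantic stands in for it here):

    from pydantic import BaseModel

    class SQLAlchemyConnectionConfig(BaseModel):
        # Pure connection parameters: just enough to build a SQLAlchemy URL.
        scheme: str
        username: str
        password: str
        host_port: str
        database: str = ""

    class MySQLConnectionConfig(SQLAlchemyConnectionConfig):
        # MySQL-specific connection defaults; still no filtering or profiling.
        scheme: str = "mysql+pymysql"
        host_port: str = "localhost:3306"

    class MySQLConfig(MySQLConnectionConfig):
        # Full source config: the connection config plus source-level options
        # such as table_pattern, view_pattern, domain, profiling, etc.
        table_pattern: str = ".*"
        view_pattern: str = ".*"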

    )

    kafka_topic_name: str = Field(
        default="MetadataChangeLog_Timeseries_v1",
hsheth2 (Collaborator) commented:

let's extract these to constants
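
A sketch of the suggested refactor (the constant name is an assumption, not from the merged code):

    from pydantic import Field

    DEFAULT_MCL_TIMESERIES_TOPIC = "MetadataChangeLog_Timeseries_v1"

    kafka_topic_name: str = Field(
        default=DEFAULT_MCL_TIMESERIES_TOPIC,
    )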


        yield mcl, msg.offset()

    self.consumer.unassign()
hsheth2 (Collaborator) commented:

should this be in a finally: block, or does it not really matter?

asikowitz (Collaborator, Author) replied:

Probably best to do so, yeah.
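
A sketch of the finally: version (class and method names assumed from the snippet above; the poll loop is simplified and skips deserialization):

    from typing import Iterator, Tuple

    class DataHubKafkaReader:
        def __init__(self, consumer):
            self.consumer = consumer  # a confluent_kafka.Consumer

        def get_mcls(self) -> Iterator[Tuple[object, int]]:
            try:
                while True:
                    msg = self.consumer.poll(timeout=1.0)
                    if msg is None:
                        break
                    yield msg.value(), msg.offset()
            finally:
                # Now runs even if the caller stops iterating early or an
                # exception is raised mid-stream.
                self.consumer.unassign()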

@property
def query(self) -> str:
    return f"""
    SELECT urn, aspect, metadata, createdon
hsheth2 (Collaborator) commented:

we probably also want to copy system metadata
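
That would mean selecting the systemmetadata column as well; a sketch of the query (the table name metadata_aspect_v2 and the WHERE/ORDER BY clauses are assumptions based on DataHub's MySQL schema, not the merged code):

    @property
    def query(self) -> str:
        return """
        SELECT urn, aspect, metadata, systemmetadata, createdon
        FROM metadata_aspect_v2
        WHERE version = 0
        ORDER BY createdon
        """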

return MetadataChangeProposalWrapper(
    entityUrn=row.urn,
    # TODO: Get rid of deserialization -- create MCPC?
    aspect=ASPECT_MAP[row.aspect].from_obj(json.loads(row.metadata)),
hsheth2 (Collaborator) commented:

needs the post_json_transform here
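
A sketch of that fix (post_json_transform is DataHub's helper for converting the Rest.li-flavored JSON stored in the database back into the Avro-compatible shape that from_obj expects; the import path is believed correct but worth verifying):

    import json

    from datahub.emitter.serialization_helper import post_json_transform

    aspect = ASPECT_MAP[row.aspect].from_obj(
        post_json_transform(json.loads(row.metadata))
    )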

def commit_checkpoint(self) -> None:
    if self.state_provider.ingestion_checkpointing_state_provider:
        self.state_provider.prepare_for_commit()
        self.state_provider.ingestion_checkpointing_state_provider.commit()
hsheth2 (Collaborator) commented:

wow, this just reminds me of how much I dislike our stateful ingestion implementation

asikowitz (Collaborator, Author) replied:

Yeah, I kind of had to hack this in because the framework isn't really built for individual commits; the only interface is registering committables and committing them all at once. Being able to commit individually is pretty useful and would be nice to add at some point.

hsheth2 (Collaborator) left a review:

Some minor questions, but overall looking good.

    on_interval = (
        i
        and self.config.commit_state_interval
        and i % self.config.commit_state_interval == 0
    )

-   if not has_errors and (i is None or on_interval):
+   if i is None or on_interval:
hsheth2 (Collaborator) commented:

so commits happen regardless of commit_with_parse_errors, while updating the state only happens conditionally?

asikowitz (Collaborator, Author) replied:

Yeah, I swapped it because this logic seemed simpler; it lets me update the Kafka state when there are only MySQL errors, and vice versa.

hsheth2 (Collaborator) replied:

makes sense

with DataHubKafkaReader(self.config, self.report, self.ctx) as reader:
    mcls = reader.get_mcls(from_offsets=from_offsets, stop_time=stop_time)
    for i, (mcl, offset) in enumerate(mcls):
        mcp = MetadataChangeProposalWrapper.try_from_mcl(mcl)
hsheth2 (Collaborator) commented:

should we add some logging around this - call out that changeType=DELETE is not supported yet?

asikowitz (Collaborator, Author) replied:

Yeah, might as well. So if changeType=DELETE, do you think I should just drop the MCP?
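
A sketch of what that log-and-drop could look like inside the loop above (assumed behavior for illustration, not the merged code):

    import logging

    from datahub.metadata.schema_classes import ChangeTypeClass

    logger = logging.getLogger(__name__)

    # Inside the for loop over (mcl, offset) pairs:
    if mcp.changeType == ChangeTypeClass.DELETE:
        logger.warning(
            f"changeType=DELETE is not supported yet; dropping MCP for {mcp.entityUrn}"
        )
        continue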

@@ -91,7 +91,7 @@ def get_sql_alchemy_url(self):
     pass


-class BasicSQLAlchemyConfig(SQLAlchemyConfig):
+class SQLAlchemyConnectionConfig(ConfigModel):
hsheth2 (Collaborator) commented:

we should map out what we want the full hierarchy to look like in the future

asikowitz (Collaborator, Author) replied:

Yeah...

asikowitz requested a review from hsheth2 (Aug 15, 2023 16:37)

Commit: …nt.py
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>
@@ -27,7 +27,7 @@ def _assert_checkpoint_deserialization(
) -> Checkpoint:
# Serialize a checkpoint aspect with the previous state.
checkpoint_aspect = DatahubIngestionCheckpointClass(
timestampMillis=int(datetime.utcnow().timestamp() * 1000),
timestampMillis=int(datetime.now().timestamp() * 1000),
hsheth2 (Collaborator) commented:

missed tz=utc here

asikowitz (Collaborator, Author) replied:

Will add in a follow-up.
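
The suggested follow-up, sketched (datetime.utcnow() returns a naive datetime, so .timestamp() would interpret it in local time; an explicitly tz-aware now() makes the conversion unambiguous):

    from datetime import datetime, timezone

    timestampMillis = int(datetime.now(tz=timezone.utc).timestamp() * 1000)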

asikowitz merged commit 526e626 into datahub-project:master on Aug 15, 2023 (44 checks passed)
asikowitz deleted the datahub-source branch (Aug 15, 2023 21:49)

asikowitz added a commit to asikowitz/datahub that referenced this pull request (Aug 23, 2023)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>

yoonhyejin pushed a commit that referenced this pull request (Aug 24, 2023)
Co-authored-by: Harshal Sheth <hsheth2@gmail.com>