
feat(ingestion) Allow for ingestion to read files remotely #7552

Merged: 18 commits merged into datahub-project:master on Mar 30, 2023

Conversation

@xiphl (Contributor) commented Mar 12, 2023

This PR proposes allowing the following sources to read from URLs (especially git repos) instead of being constrained to local files:

  1. csv-enricher
  2. business glossary
  3. file
  4. file-lineage

Still a work in progress (especially updating the tests).
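For illustration only, here is a minimal sketch (an editorial example, not the PR's actual implementation) of how a source could accept either a local path or an http(s) URL for its input file; the helper name is hypothetical:

```python
import json
import pathlib
import urllib.request
from urllib.parse import urlparse


def load_json_file(path_or_url: str) -> dict:
    """Hypothetical helper: read JSON from a local path or a remote URL."""
    scheme = urlparse(path_or_url).scheme
    if scheme in ("http", "https"):
        # Remote file, e.g. a raw file served from a git repo.
        with urllib.request.urlopen(path_or_url) as response:
            return json.load(response)
    # Empty or "file" scheme: treat it as a local filesystem path.
    return json.loads(pathlib.Path(path_or_url).read_text(encoding="utf-8"))
```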

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 12, 2023
@xiphl xiphl changed the title Allow for ingestion to read files remotely feat(ingestion) Allow for ingestion to read files remotely Mar 12, 2023
@shirshanka (Contributor) commented:

Excited for this! Support for S3 and GCS would be great to add as well.

@xiphl (Contributor, Author) commented Mar 13, 2023

I will try to finish the missing bits soon!
I'm not sure how to implement this for S3/GCS sources, though.

@xiphl (Contributor, Author) commented Mar 17, 2023

After working on it yesterday, I think I can finish it sometime this weekend!

@xiphl (Contributor, Author) commented Mar 20, 2023

Let me work on the failed tests first.

@xiphl xiphl closed this Mar 20, 2023
@xiphl xiphl reopened this Mar 21, 2023
@vercel bot commented Mar 21, 2023

docs-website: ✅ Ready (preview updated Mar 22, 2023 at 3:55AM UTC)

@xiphl (Contributor, Author) commented Mar 22, 2023

@shirshanka I could do with some advice on the failed assert on the auditStamp time between the golden and generated JSON files.
For some reason the time set by freezegun isn't picked up once execution reaches the file-lineage logic, yet I don't see why that would be the case, since that module generates the auditStamp in a similar way. Reordering the tests doesn't change the fact that it always fails at file-lineage.
Any ideas?


@freeze_time(FROZEN_TIME)
@pytest.mark.integration
def test_remote_ingest(docker_compose_runner, pytestconfig, tmp_path, mock_time):
@asikowitz (Collaborator) commented Mar 22, 2023

Hey @xiphl, the mock_time fixture you're using here also mocks time.time(), which is what generates audit stamps. I think this test will pass if you remove the fixture!

Never mind, it seems that's not the issue, although I do think that mock should no longer be necessary. I believe the issue is that the file-based lineage source defines:

auditStamp = models.AuditStampClass(
    time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
)

at the top of datahub/ingestion/source/metadata/lineage.py. That code runs when the module is imported (which happens when tests/unit/test_file_lineage_source.py is imported), before freeze_time does its monkeypatching for this test.

We should move the definition of that audit stamp elsewhere, perhaps into get_lineage_metadata_change_event_proposal, or just create the audit stamps on demand.
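A minimal sketch of that suggestion (assuming the models and get_sys_time imports already used in lineage.py; the helper name is hypothetical): build the stamp inside a function so that the clock is read at call time, after freezegun has patched it, rather than at import time.

```python
import datahub.metadata.schema_classes as models
from datahub.emitter.mce_builder import get_sys_time


def make_audit_stamp() -> models.AuditStampClass:
    # Evaluated on every call, so a test's frozen clock is picked up correctly.
    return models.AuditStampClass(
        time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
    )
```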

@xiphl (Contributor, Author) replied:

Yup, that was the cause indeed!

@xiphl xiphl closed this Mar 26, 2023
@codecov-commenter commented Mar 26, 2023

Codecov Report

Patch coverage: 71.77%, and project coverage change: +0.04% 🎉

Comparison is base (419bee8) 74.87% compared to head (348723b) 74.92%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7552      +/-   ##
==========================================
+ Coverage   74.87%   74.92%   +0.04%     
==========================================
  Files         353      353              
  Lines       35385    35429      +44     
==========================================
+ Hits        26495    26544      +49     
+ Misses       8890     8885       -5     
Flag Coverage Δ
pytest-testIntegration 51.57% <50.00%> (+0.74%) ⬆️
pytest-testIntegrationBatch1 36.43% <16.93%> (-0.03%) ⬇️
pytest-testQuick 63.51% <58.87%> (-0.04%) ⬇️
pytest-testSlowIntegration 32.90% <12.09%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ata-ingestion/src/datahub/ingestion/source/file.py 75.78% <56.71%> (+1.21%) ⬆️
...gestion/src/datahub/configuration/config_loader.py 93.33% <86.66%> (-2.13%) ⬇️
...stion/src/datahub/ingestion/source/csv_enricher.py 87.68% <88.88%> (-0.11%) ⬇️
...hub/ingestion/source/metadata/business_glossary.py 91.36% <100.00%> (-0.08%) ⬇️
...n/src/datahub/ingestion/source/metadata/lineage.py 91.75% <100.00%> (+16.24%) ⬆️
...rc/datahub/ingestion/source_config/csv_enricher.py 88.88% <100.00%> (ø)

... and 2 files with indirect coverage changes


@xiphl xiphl marked this pull request as ready for review March 26, 2023 13:56
@asikowitz (Collaborator) left a review comment:

This looks great! Thanks so much for working on this and contributing further to DataHub. I left a few docs/style comments, mostly for my own benefit; I think this is good to go.

pathlib.Path(self.config.filename), mode="r", encoding="utf-8-sig"
) as f:
rows = csv.DictReader(f, delimiter=self.config.delimiter)
keep_rows = [row for row in rows]
@asikowitz (Collaborator) commented:

This can be written more succinctly as list(rows). Also, can you add a quick note in this source's docstring that this source will not work with very large CSV files that do not fit into memory?
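For reference, a standalone sketch of the suggested simplification (hypothetical helper name), which also makes the in-memory behaviour explicit:

```python
import csv
from typing import Dict, List


def read_all_rows(filename: str, delimiter: str = ",") -> List[Dict[str, str]]:
    # list(reader) is equivalent to [row for row in reader]; either way the
    # whole CSV is materialized in memory, so very large files will not fit.
    with open(filename, mode="r", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return list(reader)
```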

Comment on lines +598 to +606
for wu in self.get_resource_workunits(
entity_urn=entity_urn,
term_associations=term_associations,
tag_associations=tag_associations,
owners=owners,
domain=domain,
description=description,
):
yield wu
@asikowitz (Collaborator) commented:

I know you didn't write this, but it can be written more concisely as yield from self.get_resource_workunits(...).
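A small self-contained illustration of the point (placeholder functions, not DataHub code): yield from delegates to the inner generator without the explicit loop.

```python
from typing import Iterable, Iterator


def double(items: Iterable[int]) -> Iterator[int]:
    # Stand-in for self.get_resource_workunits(...).
    for item in items:
        yield item * 2


def doubled(items: Iterable[int]) -> Iterator[int]:
    # Equivalent to "for wu in double(items): yield wu", just more concise.
    yield from double(items)


assert list(doubled([1, 2, 3])) == [2, 4, 6]
```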

Comment on lines +192 to +194
for x in list(
self.config.path.glob(f"*{self.config.file_extension}")
)
@asikowitz (Collaborator) commented:

Unnecessary list call.

if self.config.path.is_file():
path_parsed = parse.urlparse(str(self.config.path))
if path_parsed.scheme in ("file", ""):
self.config.path = pathlib.Path(self.config.path)
@asikowitz (Collaborator) commented:

It seems self.config.path is only used in this method. It might be cleaner to always have self.config.path: str and store the Path object in a separate variable.
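A sketch of that idea (hypothetical helper name): keep the configured value as a plain string and only build a pathlib.Path locally where filesystem access is actually needed.

```python
import pathlib
from typing import Optional
from urllib import parse


def as_local_path(configured_path: str) -> Optional[pathlib.Path]:
    # Return a Path for local files (empty or "file" scheme); return None for
    # remote URLs so the caller knows to fetch the resource over the network.
    parsed = parse.urlparse(configured_path)
    if parsed.scheme in ("file", ""):
        return pathlib.Path(parsed.path or configured_path)
    return None
```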

@@ -160,7 +155,6 @@ def _get_entity_urn(entity_config: EntityConfig) -> Optional[str]:
new_upstream = models.UpstreamClass(
dataset=upstream_entity_urn,
type=models.DatasetLineageTypeClass.TRANSFORMED,
auditStamp=auditStamp,
@asikowitz (Collaborator) commented:

Can you keep this, but use an auditStamp created in this function?
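Applying the same on-demand approach at this call site might look roughly like this (a sketch reusing the names from the diff above, wrapped in a hypothetical helper so it is self-contained):

```python
import datahub.metadata.schema_classes as models
from datahub.emitter.mce_builder import get_sys_time


def make_upstream(upstream_entity_urn: str) -> models.UpstreamClass:
    # The audit stamp is created here, at call time, not at module import time.
    audit_stamp = models.AuditStampClass(
        time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
    )
    return models.UpstreamClass(
        dataset=upstream_entity_urn,
        type=models.DatasetLineageTypeClass.TRANSFORMED,
        auditStamp=audit_stamp,
    )
```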

@asikowitz asikowitz merged commit 7d240c6 into datahub-project:master Mar 30, 2023
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
Co-authored-by: xiphl <xiphlerl9@gmail.com>
Allows the CsvEnricher, BusinessGlossary, File, and LineageFile sources to read from URLs.