
feat(ingestion) Allow for ingestion to read files remotely #7552

Merged: 18 commits merged into datahub-project:master on Mar 30, 2023

Conversation

@xiphl (Contributor) commented Mar 12, 2023

This PR proposes allowing the following sources to read from URLs (especially git repos) instead of being constrained to local files:

  1. csv-enricher
  2. business glossary
  3. file
  4. file-lineage

Still a work in progress (especially updating the tests).
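For illustration only, here is a minimal sketch (an editorial example, not the PR's actual implementation) of how a source could accept either a local path or an http(s) URL for its input file; the helper name is hypothetical:

```python
import json
import pathlib
import urllib.request
from urllib.parse import urlparse


def load_json_file(path_or_url: str) -> dict:
    """Hypothetical helper: read JSON from a local path or a remote URL."""
    scheme = urlparse(path_or_url).scheme
    if scheme in ("http", "https"):
        # Remote file, e.g. a raw file served from a git repo.
        with urllib.request.urlopen(path_or_url) as response:
            return json.load(response)
    # Empty or "file" scheme: treat it as a local filesystem path.
    return json.loads(pathlib.Path(path_or_url).read_text(encoding="utf-8"))
```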

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 12, 2023
@xiphl xiphl changed the title Allow for ingestion to read files remotely feat(ingestion) Allow for ingestion to read files remotely Mar 12, 2023
@shirshanka (Contributor) commented:

Excited for this! Support for S3 and GCS would be great to add as well.

@xiphl (Contributor, Author) commented Mar 13, 2023

I will try to finish the missing bits soon!
I'm not sure how to implement this for S3/GCS sources, though.

@xiphl (Contributor, Author) commented Mar 17, 2023

After working on it yesterday, I think I can finish it sometime this weekend!

@xiphl (Contributor, Author) commented Mar 20, 2023

Let me work on the failed tests first.

@xiphl xiphl closed this Mar 20, 2023
@xiphl xiphl reopened this Mar 21, 2023
@vercel bot commented Mar 21, 2023

docs-website: ✅ Ready (preview updated Mar 22, 2023 at 3:55AM UTC)

@xiphl (Contributor, Author) commented Mar 22, 2023

@shirshanka I could do with some advice on the failed assert on the auditStamp time between the golden and generated JSON files.
For some reason the time set by freezegun isn't picked up once execution reaches the file-lineage logic, yet I don't see why that would be the case, since that module generates the auditStamp in a similar way. Reordering the tests doesn't change the fact that it always fails at file-lineage.
Any ideas?


@freeze_time(FROZEN_TIME)
@pytest.mark.integration
def test_remote_ingest(docker_compose_runner, pytestconfig, tmp_path, mock_time):
@asikowitz (Collaborator) commented Mar 22, 2023

Hey @xiphl, the mock_time fixture you're using here also mocks time.time(), which is what generates audit stamps. I think this test will pass if you remove the fixture!

Never mind, it seems that's not the issue, although I do think that mock should no longer be necessary. I believe the issue is that the file-based lineage source defines:

auditStamp = models.AuditStampClass(
    time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
)

at the top of datahub/ingestion/source/metadata/lineage.py. That code runs when the module is imported (which happens when tests/unit/test_file_lineage_source.py is imported), before freeze_time does its monkeypatching for this test.

We should move the definition of that audit stamp elsewhere, perhaps into get_lineage_metadata_change_event_proposal, or just create the audit stamps on demand.
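A minimal sketch of that suggestion (assuming the models and get_sys_time imports already used in lineage.py; the helper name is hypothetical): build the stamp inside a function so that the clock is read at call time, after freezegun has patched it, rather than at import time.

```python
import datahub.metadata.schema_classes as models
from datahub.emitter.mce_builder import get_sys_time


def make_audit_stamp() -> models.AuditStampClass:
    # Evaluated on every call, so a test's frozen clock is picked up correctly.
    return models.AuditStampClass(
        time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
    )
```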

@xiphl (Contributor, Author) replied:

Yup, that was the cause indeed!

@xiphl xiphl closed this Mar 26, 2023
@codecov-commenter commented Mar 26, 2023

Codecov Report

Patch coverage: 71.77%, and project coverage change: +0.04% 🎉

Comparison is base (419bee8) 74.87% compared to head (348723b) 74.92%.


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7552      +/-   ##
==========================================
+ Coverage   74.87%   74.92%   +0.04%     
==========================================
  Files         353      353              
  Lines       35385    35429      +44     
==========================================
+ Hits        26495    26544      +49     
+ Misses       8890     8885       -5     
Flag Coverage Δ
pytest-testIntegration 51.57% <50.00%> (+0.74%) ⬆️
pytest-testIntegrationBatch1 36.43% <16.93%> (-0.03%) ⬇️
pytest-testQuick 63.51% <58.87%> (-0.04%) ⬇️
pytest-testSlowIntegration 32.90% <12.09%> (-0.05%) ⬇️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
...ata-ingestion/src/datahub/ingestion/source/file.py 75.78% <56.71%> (+1.21%) ⬆️
...gestion/src/datahub/configuration/config_loader.py 93.33% <86.66%> (-2.13%) ⬇️
...stion/src/datahub/ingestion/source/csv_enricher.py 87.68% <88.88%> (-0.11%) ⬇️
...hub/ingestion/source/metadata/business_glossary.py 91.36% <100.00%> (-0.08%) ⬇️
...n/src/datahub/ingestion/source/metadata/lineage.py 91.75% <100.00%> (+16.24%) ⬆️
...rc/datahub/ingestion/source_config/csv_enricher.py 88.88% <100.00%> (ø)

... and 2 files with indirect coverage changes


@xiphl xiphl marked this pull request as ready for review March 26, 2023 13:56
@asikowitz (Collaborator) left a review comment:

This looks great! Thanks so much for working on this and contributing further to DataHub. I left a few docs/style comments, mostly for my own benefit; I think this is good to go.

pathlib.Path(self.config.filename), mode="r", encoding="utf-8-sig"
) as f:
rows = csv.DictReader(f, delimiter=self.config.delimiter)
keep_rows = [row for row in rows]
@asikowitz (Collaborator) commented:

This can be written more succinctly as list(rows). Also, can you add a quick note in this source's docstring that this source will not work with very large CSV files that do not fit into memory?
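For reference, a standalone sketch of the suggested simplification (hypothetical helper name), which also makes the in-memory behaviour explicit:

```python
import csv
from typing import Dict, List


def read_all_rows(filename: str, delimiter: str = ",") -> List[Dict[str, str]]:
    # list(reader) is equivalent to [row for row in reader]; either way the
    # whole CSV is materialized in memory, so very large files will not fit.
    with open(filename, mode="r", encoding="utf-8-sig") as f:
        reader = csv.DictReader(f, delimiter=delimiter)
        return list(reader)
```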

Comment on lines +598 to +606
for wu in self.get_resource_workunits(
entity_urn=entity_urn,
term_associations=term_associations,
tag_associations=tag_associations,
owners=owners,
domain=domain,
description=description,
):
yield wu
@asikowitz (Collaborator) commented:

I know you didn't write this, but it can be written more concisely as yield from self.get_resource_workunits(...).
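A small self-contained illustration of the point (placeholder functions, not DataHub code): yield from delegates to the inner generator without the explicit loop.

```python
from typing import Iterable, Iterator


def double(items: Iterable[int]) -> Iterator[int]:
    # Stand-in for self.get_resource_workunits(...).
    for item in items:
        yield item * 2


def doubled(items: Iterable[int]) -> Iterator[int]:
    # Equivalent to "for wu in double(items): yield wu", just more concise.
    yield from double(items)


assert list(doubled([1, 2, 3])) == [2, 4, 6]
```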

Comment on lines +192 to +194
for x in list(
self.config.path.glob(f"*{self.config.file_extension}")
)
@asikowitz (Collaborator) commented:

Unnecessary list call.

if self.config.path.is_file():
path_parsed = parse.urlparse(str(self.config.path))
if path_parsed.scheme in ("file", ""):
self.config.path = pathlib.Path(self.config.path)
@asikowitz (Collaborator) commented:

It seems self.config.path is only used in this method. It might be cleaner to always have self.config.path: str and store the Path object in a separate variable.
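A sketch of that idea (hypothetical helper name): keep the configured value as a plain string and only build a pathlib.Path locally where filesystem access is actually needed.

```python
import pathlib
from typing import Optional
from urllib import parse


def as_local_path(configured_path: str) -> Optional[pathlib.Path]:
    # Return a Path for local files (empty or "file" scheme); return None for
    # remote URLs so the caller knows to fetch the resource over the network.
    parsed = parse.urlparse(configured_path)
    if parsed.scheme in ("file", ""):
        return pathlib.Path(parsed.path or configured_path)
    return None
```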

@@ -160,7 +155,6 @@ def _get_entity_urn(entity_config: EntityConfig) -> Optional[str]:
new_upstream = models.UpstreamClass(
dataset=upstream_entity_urn,
type=models.DatasetLineageTypeClass.TRANSFORMED,
auditStamp=auditStamp,
@asikowitz (Collaborator) commented:

Can you keep this, but use an auditStamp created in this function?
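Applying the same on-demand approach at this call site might look roughly like this (a sketch reusing the names from the diff above, wrapped in a hypothetical helper so it is self-contained):

```python
import datahub.metadata.schema_classes as models
from datahub.emitter.mce_builder import get_sys_time


def make_upstream(upstream_entity_urn: str) -> models.UpstreamClass:
    # The audit stamp is created here, at call time, not at module import time.
    audit_stamp = models.AuditStampClass(
        time=get_sys_time(), actor="urn:li:corpUser:pythonEmitter"
    )
    return models.UpstreamClass(
        dataset=upstream_entity_urn,
        type=models.DatasetLineageTypeClass.TRANSFORMED,
        auditStamp=audit_stamp,
    )
```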

@asikowitz asikowitz merged commit 7d240c6 into datahub-project:master Mar 30, 2023
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
Co-authored-by: xiphl <xiphlerl9@gmail.com>
Allows the CsvEnricher, BusinessGlossary, File, and LineageFile sources to read from URLs.