Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(ingest/postgres): support extracting metadata from all databases in single recipe #7581

Merged

Conversation

mayurinehate
Copy link
Collaborator

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

@mayurinehate mayurinehate changed the title feat(ingest/postgres): support extracting metadata from all databases in single pipeline feat(ingest/postgres): support extracting metadata from all databases in single recipe Mar 14, 2023
@github-actions github-actions bot added the ingestion PR or Issue related to the ingestion of metadata label Mar 14, 2023
@@ -101,13 +105,14 @@ class PostgresConfig(BasicSQLAlchemyConfig):
default=False, description="Include table lineage for views"
)

def get_identifier(self: BasicSQLAlchemyConfig, schema: str, table: str) -> str:
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not required as we override get_identifier within PostgresSource

@mayurinehate mayurinehate marked this pull request as ready for review March 15, 2023 09:22
@mayurinehate mayurinehate force-pushed the postgres_support_all_databases branch from 27a5d6f to b284ed3 Compare March 15, 2023 09:22
Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had some questions about this

@vercel
Copy link

vercel bot commented Mar 24, 2023

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name Status Preview Comments Updated
docs-website ✅ Ready (Inspect) Visit Preview 💬 Add your feedback Mar 24, 2023 at 6:36AM (UTC)

Copy link
Collaborator

@hsheth2 hsheth2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Last question: what does this do to existing users of the postgres source? If they weren't specifying database previously, what was the existing behavior? Will the urns change based on this change?

@mayurinehate
Copy link
Collaborator Author

Last question: what does this do to existing users of the postgres source? If they weren't specifying database previously, what was the existing behavior? Will the urns change based on this change?

The logic to construct urn remains same in this PR. Also, AFAIK, running postgres source without database did not yield any meaningful results earlier - just some containers with "none" in name.

Copy link
Collaborator

@asikowitz asikowitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mostly concerned about whether we close inspector connections. Otherwise a few style comments but LGTM!

logger.debug(f"sql_alchemy_url={url}")
engine = create_engine(url, **self.config.options)
with engine.connect() as conn:
if self.config.database and self.config.database != "":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is redundant -- "" is falsy

inspector = inspect(
create_engine(url, **self.config.options).connect()
)
yield inspector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to keep conn open as we run ingestion on each inspector? If we store databases in memory then we can move the iteration outside the with block to close conn. I don't think it really matters either way though

More importantly, are these inspectors getting closed?

url = self.config.get_sql_alchemy_url()
engine = create_engine(url, **self.config.options)
def _get_view_lineage_elements(
self, inspector: Inspector
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just curious, how come we're using Inspectors over a straight conn (or whatever the result of .connect() is?

Copy link
Collaborator Author

@mayurinehate mayurinehate Apr 26, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using connection directly does make sense indeed. especially for this piece _get_view_lineage_elements, as it does not require inspector. Inspector interface is used in sql_common to get metadata of tables, etc and probably that's why needs to be used here as well.

@jjoyce0510 jjoyce0510 merged commit fc238c2 into datahub-project:master Mar 28, 2023
yoonhyejin pushed a commit that referenced this pull request Apr 3, 2023
… in single recipe (#7581)

Co-authored-by: Tamas Nemeth <treff7es@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ingestion PR or Issue related to the ingestion of metadata
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants