@treff7es
Contributor

@treff7es treff7es commented Oct 2, 2025

Optimize Snowflake Ingestion: Skip Query Fetching When All Features Disabled

Problem

When running Snowflake ingestion with all query-based features disabled, the ingestion process was still:

  • Fetching queries from Snowflake's query_history and access_history tables
  • Parsing and fingerprinting each query
  • Taking several minutes to process queries that would never be used

This happened because the feature flags (include_lineage, include_queries, include_usage_statistics, include_query_usage_statistics, include_operations) only controlled what output was generated, not whether to fetch queries in the first place.

Example Configuration Affected

source:
  type: snowflake
  config:
    include_table_lineage: false
    include_queries: false
    include_usage_stats: false
    include_query_usage_statistics: false
    include_operational_stats: false
    include_views: true
    # ... other settings

Users would see a non-zero 'num_preparsed_queries' value in the logs even though all query features were disabled.

Solution

Added a conditional check in SnowflakeQueriesExtractor.get_workunits_internal() (lines 300-322 in snowflake_queries.py) that evaluates whether any query-based features are enabled before executing expensive Snowflake queries:

needs_query_data = any([
    self.config.include_lineage,
    self.config.include_queries,
    self.config.include_usage_statistics,
    self.config.include_query_usage_statistics,
    self.config.include_operations,
])

if not needs_query_data:
    logger.info("All query-based features are disabled. Skipping expensive query log fetch.")
else:
    # Fetch copy history and query log (existing logic, elided here)
    ...
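
For reviewers who want to exercise the gating logic in isolation, here is a minimal, self-contained sketch of the same idea. It is not the code from snowflake_queries.py: `QueryFeatureFlags`, `fetch_query_log`, and `collect_queries` are hypothetical stand-ins introduced for illustration, and only the five flag names are taken from the snippet above.

```python
import logging
from dataclasses import dataclass
from typing import Callable, Iterable, List

logger = logging.getLogger(__name__)


@dataclass
class QueryFeatureFlags:
    """Hypothetical stand-in for the extractor config; field names mirror the PR."""

    include_lineage: bool = True
    include_queries: bool = True
    include_usage_statistics: bool = True
    include_query_usage_statistics: bool = True
    include_operations: bool = True

    def needs_query_data(self) -> bool:
        # True as soon as any query-based feature is enabled.
        return any(
            [
                self.include_lineage,
                self.include_queries,
                self.include_usage_statistics,
                self.include_query_usage_statistics,
                self.include_operations,
            ]
        )


def fetch_query_log() -> Iterable[str]:
    """Placeholder for the expensive query_history / access_history fetch."""
    return ["SELECT 1", "SELECT 2"]


def collect_queries(
    flags: QueryFeatureFlags,
    fetch: Callable[[], Iterable[str]] = fetch_query_log,
) -> List[str]:
    """Call the (assumed expensive) fetch only when some feature needs its output."""
    if not flags.needs_query_data():
        logger.info(
            "All query-based features are disabled. Skipping expensive query log fetch."
        )
        return []
    return list(fetch())
```

Passing the fetch function as a parameter is a choice made for this sketch only: it lets a unit test substitute a fake and assert that it is never called when every flag is false, which is the kind of branch the coverage report flags as untested.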
@github-actions bot added the ingestion label Oct 2, 2025
@codecov

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 80.00000% with 2 lines in your changes missing coverage. Please review.
✅ All tests successful. No failed tests found.

Files with missing lines Patch % Lines
...ub/ingestion/source/snowflake/snowflake_queries.py 80.00% 2 Missing ⚠️

@datahub-cyborg bot added the needs-review label Oct 2, 2025
Collaborator

@skrydal skrydal left a comment

LGTM. As a next step, let's try to limit the number of flags used to calculate needs_query_data.

@skrydal skrydal merged commit 78d2583 into master Oct 3, 2025
70 checks passed
@skrydal skrydal deleted the snowflake_skip_sql_parsing branch October 3, 2025 09:08
yoonhyejin pushed a commit that referenced this pull request Oct 9, 2025
alplatonov pushed a commit to alplatonov/datahub that referenced this pull request Oct 13, 2025
Labels

ingestion: PR or Issue related to the ingestion of metadata
needs-review: Label for PRs that need review from a maintainer.

3 participants