fix(ingest/bigquery): fix partition and median queries for profiling #8778

mayurinehate (Collaborator) commented on Sep 4, 2023

  1. Remove the unnecessary function wrapped around the partition column (issue reported on Slack).
  2. Use approx_quantiles for the median.
  3. Set partitionSpec if limit..offset is used. This records that profiling did not cover the full table. It also fixes row_count for BigQuery profiling when limit..offset is used; previously, row_count would have been the same as the limit.
  4. Use string constants for dialects, etc.
  5. Remove redundant if checks; they are already covered earlier in the code flow.
  6. Report skipped profiles with reason details.
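
To illustrate item 3, here is a minimal sketch of the intended behavior, not the code from this PR. It assumes the true row count is available separately from the sampled rows (for example from table metadata). `ProfileSummary` and `summarize_profile` are hypothetical names invented for this example and are not DataHub classes or functions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProfileSummary:
    """Hypothetical stand-in for a profile record; not DataHub's real class."""

    row_count: int
    # None means the whole table was profiled.
    partition_spec: Optional[str] = None


def summarize_profile(
    table_row_count: int,
    sampled_row_count: int,
    limit: Optional[int] = None,
    offset: Optional[int] = None,
) -> ProfileSummary:
    """Report the table's row count and mark sampled profiles as partial."""
    if limit is None and offset is None:
        # Full-table profile: the profiled rows are the whole table.
        return ProfileSummary(row_count=sampled_row_count)

    # Sampled profile: describe the sample so consumers know it is partial,
    # and report the table's row count rather than the sample size.
    parts = []
    if limit is not None:
        parts.append(f"LIMIT {limit}")
    if offset is not None:
        parts.append(f"OFFSET {offset}")
    return ProfileSummary(row_count=table_row_count, partition_spec=" ".join(parts))
```

With a limit of 100 on a 1,000-row table, the sketch reports 1,000 rows and a non-empty partition spec, rather than a row count of 100.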

Checklist

  • The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • Links to related issues (if applicable)
  • Tests for the changes have been added/updated (if applicable)
  • Docs related to the changes have been added/updated (if applicable). If a new feature has been added a Usage Guide has been added for the same.
  • For any breaking change/potential downtime/deprecation/big changes an entry has been made in Updating DataHub

github-actions bot added the ingestion label (PR or Issue related to the ingestion of metadata) on Sep 4, 2023
column_profile.median = str(
    self.dataset.engine.execute(
        sa.select(
            sa.text(f"approx_quantiles(`{column}`, 2) [OFFSET (1)]")
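
The expression in the excerpt above relies on BigQuery's `approx_quantiles(x, 2)` returning three approximate boundaries (minimum, median, maximum), so the zero-based `OFFSET(1)` element is the approximate median. The following is a small sketch of how such a query string could be assembled. The helper names `median_expression` and `median_query` are hypothetical and do not appear in the PR.

```python
def median_expression(column: str) -> str:
    """Build a BigQuery expression that approximates the median of a column."""
    # approx_quantiles(x, 2) yields [min, median, max] (all approximate);
    # OFFSET(1) selects the middle element using zero-based indexing.
    return f"approx_quantiles(`{column}`, 2)[OFFSET(1)]"


def median_query(table: str, column: str) -> str:
    """Wrap the median expression in a complete SELECT statement."""
    return f"SELECT {median_expression(column)} AS median FROM `{table}`"
```

This sketch does not escape backticks inside identifiers, so it is only suitable for trusted table and column names.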
Collaborator (Author):

Yes, this is required. As of GX version 0.15.50, the median calculation does not use this BigQuery-patched method.

Collaborator:

Could you add a comment explaining that?

Collaborator:

This can be done in a follow-up too.

hsheth2 merged commit e680a97 into datahub-project:master on Sep 6, 2023
58 checks passed
spadhi7 added a commit to spadhi7/datahub that referenced this pull request Oct 4, 2023
* tag 'v0.11.0': (188 commits)
  fix(spark-test): upgrade gradle and fix spark smoke test (datahub-project#8777)
  fix(gms): Fixed Recently Viewed section for users with '@' in the URN. (datahub-project#8754)
  feat: add feedback widget (datahub-project#8732)
  fix(custom-search): fix custom search to be able to use unquoted query (datahub-project#8805)
  docs(db-retention): update with default setting (datahub-project#8797)
  feat(openapi): entity endpoints & analytics raw (datahub-project#8537)
  feat(search): Also de-duplicate the field queries based on field names (datahub-project#8788)
  fix(ingest): drop `wrap_aspect_as_workunit` method (datahub-project#8766)
  feat(ingest): drop sql_metadata parser (datahub-project#8765)
  docs: minor fix on versioning navbar and dropdown (datahub-project#8790)
  chore(ingest): upgrade sqlglot fork (datahub-project#8775)
  docs: add datahub source to integrations page (datahub-project#8787)
  fix(ingest/bigquery): fix partition and median queries for profiling (datahub-project#8778)
  fix(ingest/tableau): fix tableau native CLL for snowflake, add type annotations (datahub-project#8779)
  refactor(ingest): Add support for group-owners in dataflow entities (datahub-project#8154)
  feat(systemMetadata): Adding a lastRunId field system metadata  (datahub-project#8672)
  feat(airflow-plugin): add package type information (datahub-project#8795)
  fix(ingest/datahub): Support postgres; build(postgres): Modernize postgres docker setup (datahub-project#8762)
  docs(session): add documentation for session token duration and fix default (datahub-project#8791)
  chore(analytics): bump version (datahub-project#8786)
  ...