Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of the at_least_one test #776

Merged
merged 12 commits into from
Jun 2, 2023

Conversation

JoshuaHuntley
Copy link
Contributor

resolves #775

[This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change

All pull requests from community contributors should target the main branch (default).

Description & motivation

Checklist

  • This code is associated with an Issue which has been triaged and accepted for development.
  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt.type_* macros instead of explicit datatypes (e.g. dbt.type_timestamp() instead of TIMESTAMP
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md
    ](https://github.com/JoshuaHuntley)

@JoshuaHuntley
Copy link
Contributor Author

The failing Redshift test is unrelated to these changes. The at_least_one tests in Redshift succeed.

Copy link
Contributor

@joellabes joellabes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes a ton of sense! Thanks @JoshuaHuntley 🎉

A couple of suggested changes - one that is a deferral to a different changeset (if at all... I'm not totally sold yet) and one change from a subquery to a CTE if possible.

macros/generic_tests/at_least_one.sql Outdated Show resolved Hide resolved
macros/generic_tests/at_least_one.sql Outdated Show resolved Hide resolved
@JoshuaHuntley
Copy link
Contributor Author

@joellabes I made the requested changes. FYI, I'm firmly in favor of CTEs but stuck to the original approach as much as possible on the first go-round. I've flattened the code out a bit so there is now only one layer of nesting. I see your position about the all-or-nothing approach to error handling. Pushing it down to the user, especially considering our intended user base, is fair.

@joellabes
Copy link
Contributor

Thanks @JoshuaHuntley! From the CI logs, it looks like you might have to get rid of the test that checks for the tested column being in the group_by columns as well.

@JoshuaHuntley
Copy link
Contributor Author

@joellabes I removed the test. There are Redshift adapter tests that are not related to this PR. Should I remove them or can we ignore them?

@dbeatty10
Copy link
Contributor

@JoshuaHuntley I'm @joellabes's colleague and I was asked to lend a hand on this review.

I pushed a couple commits to do the following:

  • restore the group by test and fix the underlying issue that was causing it to fail
  • update the changelog

The tests are all passing now ✅, so I think we'll be ready to merge soon.

Details

Case-sensitive group by test

The group by test was failing when the column to test also was included in the list of columns to group by. So I added some case-sensitive logic that checks whether it is in that list or not.

Note that other macros like union, unpivot, etc take additional steps to make this comparison case-insensitive.

So users might still get an error like column reference "field" is ambiguous if they do something like this:

  - name: data_test_at_least_one
    columns:
      - name: field
        tests:
          - dbt_utils.at_least_one
          - dbt_utils.at_least_one:
              group_by_columns: ['FIELD']

Currently, I'm more comfortable with that possibility than attempting to handle all the details and complications of a case-insensitive check. We can always update the implementation later if case-insensitive comparisons are needed.

limit in subqueries

I seem to recall that not all databases support limit within subqueries. Maybe MariaDB and MySQL fit in this category? For any databases in which this is the case, they will need to override this implementation in order to work.

@dbeatty10 dbeatty10 self-requested a review June 2, 2023 17:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve the performance of the at_least_one test by pruning early.
3 participants