Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bigquery model-level unique tests on concat() columns #35

Closed
Dimi727 opened this issue Oct 4, 2021 · 6 comments
Closed

Bigquery model-level unique tests on concat() columns #35

Dimi727 opened this issue Oct 4, 2021 · 6 comments
Labels
bug Something isn't working good_first_issue Good for newcomers

Comments

@Dimi727
Copy link

Dimi727 commented Oct 4, 2021

Hi everyone.

I did not find this issue anywhere yet which seems odd to me and I dont know if it is on my side.

Using Bigquery and dbt 0.20.2.

Having defined a test for a model like this:

  - name: base_test_model
    tests:
      - unique:
          column_name: "concat(id, field, changed_at)"

Which was working before dbt 0.20.0. Now I get an error:

Database Error in test unique_base_test_id_field_changed_at (models\base\schema.yml)
SELECT list expression references column id which is neither grouped nor aggregated at [12:5] compiled SQL at

I checked in projects where I use dbt 0.19.x and the test above compiled to this:

select count(*) as validation_errors
from (

    select
        concat(id, field, changed_at)

    from base_test
    where concat(id, field, changed_at) is not null
    group by concat(id, field, changed_at)
    having count(*) > 1

) validation_errors

Now it compiles to this:

select
      count(*) as failures
    from (
      
    
select
    concat(id, field, changed_at) as unique_field,
    count(*) as n_records

from base_test
where concat(id, field, changed_at) is not null
group by concat(id, field, changed_at)
having count(*) > 1

    ) dbt_internal_test

which Bigquery seems not to like. Using the query above and removing the alias unique_field works for Bigquery.

With Postgres and MySQL both of the above queries run. Did not test Snowflake and Redshift.

@Dimi727
Copy link
Author

Dimi727 commented Oct 5, 2021

Seems to be a Bigquery thing:
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#having_clause

If a query contains aliases in the SELECT clause, those aliases override names in a FROM clause.

Therefore the compiled test query for Bigquery adapters should look like this to work:

select
      count(*) as failures
    from (
      
    
select
    concat(id, field, changed_at) as unique_field,
    count(*) as n_records

from base_test
where concat(id, field, changed_at) is not null
group by unique_field
having count(*) > 1

    ) dbt_internal_test

@DigUpTheHatchet
Copy link
Contributor

DigUpTheHatchet commented Oct 5, 2021

I've also experienced a number of tests start failing with the same message:
SELECT list expression references {column} which is neither grouped nor aggregated

After upgrading to version 20.0.2, these errors started appearing. There was no issue when we were running on 20.0.1. We're also running on BigQuery.

It seems to be affecting generic and schema tests where a concat() function is used.

I'm going to quickly check the compiled tests code for both versions and see what has changed.

@jtcohen6
Copy link
Contributor

jtcohen6 commented Oct 7, 2021

@Dimi727 @DigUpTheHatchet Thanks for opening and commenting. This one is on me: dbt-labs/dbt-core#3812. I added an alias to the column produced by the unique test query for the sake of users who wanted to use it in conjunction with the --store-failures feature, on databases that require tables be created with named/aliased columns.

I have a few comments:

  1. We really encourage you to test uniqueness on an actual column, rather than passing a column expression into the unique test. (The ability to do this is indeed documented; I wouldn't say it's highly encouraged.) Is there any reason why you wouldn't want concat(id, field, changed_at) as a field in your model, to serve as a unique identifier for the row?
  2. There is a dedicated generic test for testing uniqueness on a combination of columns: dbt_utils.unique_combination_of_columns. If you know you want to test a combination of columns, without storing a surrogate key column in your model, this is the way I'd recommend.
  3. Having said both of the above, you can get this working in the meantime by overriding the unique test in your root project, following the suggestions I'll leave below.

Per @artamoshin's detailed research in dbt-labs/dbt-core#3905 (comment), the vast majority of databases want us to group by {{ column_name }} rather than group by <alias>. It just so happens that BigQuery is happy with group by {{ column_name }} ... until there's an alias. Ugh!

One way to fix this for sure for sure would be to reimplement the unique test query in the dbt-bigquery adapter plugin so that it includes a CTE:

{% macro bigquery__test_unique(model, column_name) %}

with dbt_test__target as (
  
  select {{ column_name }} as unique_field
  from {{ model }}
  where {{ column_name }} is not null
  
)

select
    unique_field,
    count(*) as n_records

from dbt_test__target
group by unique_field
having count(*) > 1

{% endmacro %}

That CTE-based approach would also have the nice side-benefit of fixing another BigQuery-specific issue (https://github.com/dbt-labs/dbt/issues/3489), when the column name and table name match.

Personally, I'd love to make the CTE-based query the default, but I know that some databases don't support CTEs nested inside subqueries—and dbt now wraps generic test queries in subqueries when executing them. We can't win them all, unfortunately. Given that other databases by and large support the existing syntax, I don't think it's unreasonable to make this a BigQuery-specific change.

Would either of you be interested in contributing:

  • A fix for the above (by adding bigquery__test_unique to the dbt-bigquery plugin)
  • An integration test case that passes an expression into each built-in generic test, at least for BigQuery for starters. We should run this on all of our supported adapters. If it's not tested functionality, it really shouldn't be documented! There's nothing stopping us from accidentally breaking it (again) in the future.

@DigUpTheHatchet
Copy link
Contributor

@jtcohen6 I've been wanting to make a first contribution so I'll give this a go over the weekend.

I can't promise how successful I will be but I will try!

@Dimi727
Copy link
Author

Dimi727 commented Oct 11, 2021

@jtcohen6

there any reason why you wouldn't want concat(id, field, changed_at) as a field in your model, to serve as a unique identifier for the row?

Well :-) I think its kind of redundant if there is no use for this column except an dbt-internal unique test or incremental-model wihch needs an unique field (which is not always used).

@jtcohen6 jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 12, 2021
@jtcohen6 jtcohen6 added bug Something isn't working good_first_issue Good for newcomers labels Oct 12, 2021
@jtcohen6
Copy link
Contributor

Resolved by #10

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working good_first_issue Good for newcomers
Projects
None yet
Development

No branches or pull requests

3 participants