Bigquery uniqueness test fails on column name matching table name #33

abuckenheimer · 2021-06-22T20:59:58Z

Describe the bug

schema test unique cannot be used on columns in bigquery where the column name has the same name as the table. Bigquery yields the following error:

Grouping by expressions of type STRUCT is not allowed at [4:10]

Steps To Reproduce

create table <your dataset>.foo (
    foo int,
    bar string,
);

-- this mimics this existing structure of a uniqueness test on the foo column in this table but fails with a:
-- Grouping by expressions of type STRUCT is not allowed at [4:10]
with validation_errors as (
    select
        foo, count(*)
    from <your dataset>.foo
    group by foo
    having count(*) > 1
)
select *
from validation_errors;

Expected behavior

using an alias for the table name disambiguates the reference for bigquery and allows the test to run succesfully

with validation_errors as (
    select
        model_table.foo, count(*)
    from <your dataset>.foo as model_table
    group by model_table.foo
    having count(*) > 1
)
select *
from validation_errors;

Screenshots and log output

Database Error in test dbt_utils_source_unique_combination_of_columns_... (models/....yml)
  Grouping by expressions of type STRUCT is not allowed at [18:14]
  compiled SQL at target/run/trumid_poc/models/....yml/schema_test/dbt_utils_source_unique_combin_1cc53959c9ef4a8b014db7bc87dadc5d.sql

System information

Which database are you using dbt with?

The output of dbt --version:

installed version: 0.20.0-rc1
   latest version: 0.19.1

Your version of dbt is ahead of the latest release!

Plugins:
  - bigquery: 0.20.0rc1
  - postgres: 0.20.0rc1
  - redshift: 0.20.0rc1
  - snowflake: 0.20.0rc1

The operating system you're using:

Distributor ID: Ubuntu
Description:    Ubuntu 18.04.4 LTS
Release:        18.04
Codename:       bionic

on wsl

The output of python --version:
3.9.2

Additional context

Found this issue in dbt_utils technically and will have to file an issue there as well but it holds for a classic uniqueness test as well

The text was updated successfully, but these errors were encountered:

jtcohen6 · 2021-06-23T20:10:07Z

@abuckenheimer You're totally right.

This is tougher than it should be today because a test doesn't actually have a good link between itself and the model it's defined on. It's going to be even trickier after the introduction of the where config in v0.20.0 (#3392). If this config is set, instead of templating out {{ model }} as <your dataset>.foo, dbt will template it out as (select * from <your dataset>.foo where col = 1) foo.

At the time, I was thinking that it was important for the subquery to be aliased the same as the model identifier, in case the generic test definition depended on using the model identifier as a column alias. But it wouldn't really be a generic test if it did, now would it?

Here are the options I can think of:

Not do this. In general, I don't think it's a good practice for a BigQuery column to share the name of its view/table, for exactly this reason—it causes unpleasant surprises when querying.
Change the {{ model }} templating to always alias the selected-from object/subquery as model, whether or not the where config is supplied. In the unique, not_null, and accepted_values tests, qualify columns with the model alias.
Add another subquery + alias, to wrap around the subquery + alias associated with the optional where config:

{% macro default__test_unique(model, column_name) %}

select
    model.{{ column_name }},
    count(*) as n_records

from (select * from {{ model }}) model
where model.{{ column_name }} is not null
group by model.{{ column_name }}
having count(*) > 1

{% endmacro %}

Option 3 is something you can do right now, in your own project, to get this working in the meantime. But I'm leaning toward option 2, which would require adjusting this (gnarly) code:

https://github.com/fishtown-analytics/dbt/blob/c794600242e734d870d4bf1c8fb4c0349ab961eb/core/dbt/parser/schema_test_builders.py#L381-L391

In order to always alias, and always use the same alias (just the word model):

    def build_model_str(self): 
        targ = self.target 
        cfg_where = "config.get('where')" 
        alias = "model"
        if isinstance(self.target, UnparsedNodeUpdate): 
            identifier = self.target.name 
            target_str = f"{{{{ ref('{targ.name}') }}}}" 
        elif isinstance(self.target, UnpatchedSourceDefinition):
            target_str = f"{{{{ source('{targ.source.name}', '{targ.table.name}') }}}}" 
        unfiltered = f"{target_str} {alias}"
        filtered = f"(select * from {target_str} where {{{{{cfg_where}}}}}) {alias}" 
        return f"{{% if {cfg_where} %}}{filtered}{{% else %}}{unfiltered}{{% endif %}}"

From there, it's as simple as adding model. to qualify the columns referenced in generic test definitions.

I don't love using model as an alias—it's an overloaded term, and the thing being tested could just as easily be a source, snapshot, or seed—but for better or for worse, generic test definitions have been written with {{ model }} in their bodies. I struggle to come up with a better alias name that isn't overly clunky (thing_being_tested).

What do you think? If you're on board with the proposal, we may be able to sneak it in for v0.20.0rc2, since it's a relevant tweak to code that's changing in v0.20.

abuckenheimer · 2021-06-23T20:37:11Z

I think your onto something with the model alias but I actually think instead of picking a natural name you should pick an unnatural one, model is just as likely to run into the disambiguation problem I ran into (presumably multiple people have columns with model as the name), but since its injected by dbt the reasoning behind it may be a bit more obscure. If you use something like __dbt_model as the name your less likely to have a collision and consequently have everything work naturally without having to rely on the user specifying <alias>.<column>. Also think __dbt_model may better hint to the user that this is an implementation detail of dbt and not something they'd normally have to worry about.

abuckenheimer · 2021-06-23T20:46:23Z

ah actually I may be conflating your approaches 2 and 3. I guess what your saying for 2 is that users should always qualify model columns with the model.<column> syntax which may let dbt better understand the test expression in the process. While 3 is that dbt tests macros should always alias to something (which is more relevant for the __dbt_model comment above)

jtcohen6 · 2021-06-23T22:36:29Z

Yes, I think you're right, we should pick an alias that's unlikely to clash. It would be a shame to fix this issue (name + column clash) and create another!

There's another option, which now feels so obvious that I'm embarrassed I didn't think of it earlier. We could rewrite generic tests to use an "import" CTE, similar to our style guide:

with dbt_test__target as (

    select * from {{ model }} 

)

select
    {{ column_name }},
    count(*) as n_records

from dbt_test__target
where {{ column_name }} is not null
group by {{ column_name }}
having count(*) > 1

I don't think we'd need to qualify the column name at all, assuming there's no change the column is also named dbt_test__target.

abuckenheimer · 2021-06-23T23:59:47Z

love it

jtcohen6 · 2021-06-24T01:40:53Z

@abuckenheimer Any interest in a contribution? :)

jtcohen6 · 2021-10-19T08:39:56Z

Resolved by #10

abuckenheimer referenced this issue in abuckenheimer/dbt Jul 8, 2021

using import style cte to fix column/table name ambiguity (#3489)

d18befe

jtcohen6 transferred this issue from dbt-labs/dbt-core Oct 12, 2021

jtcohen6 added bug Something isn't working good_first_issue Good for newcomers labels Oct 12, 2021

DigUpTheHatchet mentioned this issue Oct 18, 2021

Created bigquery__test_unique to handle expressions passed to BQ uniq… #10

Merged

jtcohen6 closed this as completed Oct 19, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bigquery uniqueness test fails on column name matching table name #33

Bigquery uniqueness test fails on column name matching table name #33

abuckenheimer commented Jun 22, 2021

jtcohen6 commented Jun 23, 2021 •

edited

abuckenheimer commented Jun 23, 2021

abuckenheimer commented Jun 23, 2021 •

edited

jtcohen6 commented Jun 23, 2021 •

edited

abuckenheimer commented Jun 23, 2021

jtcohen6 commented Jun 24, 2021

jtcohen6 commented Oct 19, 2021

Bigquery uniqueness test fails on column name matching table name #33

Bigquery uniqueness test fails on column name matching table name #33

Comments

abuckenheimer commented Jun 22, 2021

Describe the bug

Steps To Reproduce

Expected behavior

Screenshots and log output

System information

Additional context

jtcohen6 commented Jun 23, 2021 • edited

abuckenheimer commented Jun 23, 2021

abuckenheimer commented Jun 23, 2021 • edited

jtcohen6 commented Jun 23, 2021 • edited

abuckenheimer commented Jun 23, 2021

jtcohen6 commented Jun 24, 2021

jtcohen6 commented Oct 19, 2021

jtcohen6 commented Jun 23, 2021 •

edited

abuckenheimer commented Jun 23, 2021 •

edited

jtcohen6 commented Jun 23, 2021 •

edited