[WIP] Add compare_relation_columns macro #5

Open
wants to merge 1 commit into
base: master
from

Conversation

@clrcrl (Collaborator) commented Jul 3, 2019

Opening this early to get feedback on whether this is a useful macro to write.

I'm also wondering about the best way to test this, since the results will differ across warehouses due to nuances in their data types.

@clrcrl clrcrl force-pushed the feature/compare-queries branch from 4461b9d to 85848f1 Jul 3, 2019

@clrcrl clrcrl force-pushed the feature/compare-relation-columns branch from 38ebaaa to fb40178 Jul 3, 2019

@clrcrl clrcrl requested a review from drewbanin Jul 3, 2019

@clrcrl clrcrl changed the base branch from feature/compare-queries to master Jul 3, 2019

@ryantuck commented Jul 11, 2019

Hey @clrcrl - this seems like a great idea and is a must-have for us when doing internal validation.

I wrote the macro below to accomplish exactly this a few months back, when testing some model rebuilds in Redshift (I think you helped me debug it - thanks again!)

  • It requires you to define a list of join_cols: the set of primary key columns used to match rows between the two tables
  • It's implemented as old_table left join new_table, so it can't detect rows that exist in the new table but not in the old one (a row-level check could address that separately from a column-comparison macro)
  • The get_cols() macro it references queries information_schema in Redshift to fetch the columns to compare (also included below, though I imagine you've built something similar)
  • It treats timestamp columns as equal when they're within 1s of each other, and otherwise does a plain test for equality. This is likely not adequate for all cases; for example, != never flags a row where either value is NULL.
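
The comparison rules in the list above can be sketched outside the warehouse. The following is a minimal Python illustration, not the macro itself: the names compare_columns, values_differ, and TIMESTAMP_TOLERANCE are hypothetical, rows are assumed to be plain dicts, and NULLs are modelled as None.

```python
from datetime import datetime, timedelta

# Assumption: timestamps less than one second apart count as equal.
TIMESTAMP_TOLERANCE = timedelta(seconds=1)


def values_differ(old_val, new_val):
    """Return True when two non-null values should be reported as a mismatch."""
    if isinstance(old_val, datetime) and isinstance(new_val, datetime):
        return abs(old_val - new_val) >= TIMESTAMP_TOLERANCE
    return old_val != new_val


def compare_columns(old_rows, new_rows, join_cols):
    """Left-join old_rows to new_rows on join_cols and list differing values."""

    def key_of(row):
        return tuple(row[c] for c in join_cols)

    new_by_key = {key_of(row): row for row in new_rows}
    mismatches = []
    for old in old_rows:
        new = new_by_key.get(key_of(old))
        if new is None:
            # Unmatched old row: every new value is NULL, so nothing is flagged.
            continue
        # Like the macro, iterate over the columns of the new table.
        for col, new_val in new.items():
            if col in join_cols:
                continue
            old_val = old.get(col)
            if old_val is None or new_val is None:
                continue  # mirrors SQL, where != against NULL is never true
            if values_differ(old_val, new_val):
                mismatches.append((key_of(old), col, str(old_val), str(new_val)))
    return mismatches
```

Rows that exist only in new_rows are never visited, which is the left-join limitation noted above; a separate row-level check would be needed to catch them.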

I'd be happy to contribute here but wouldn't want to duplicate efforts - let me know if that sounds appealing!

column comparison macro

{% macro generate_validation_unions(old_table, new_table, join_cols) %}

  {% set col_types = ['boolean', 'timestamp', 'other'] %}
  {% set join_cols_concat = '(' + ','.join(join_cols) + ')' %}

  -- iterate through column types
  {% for col_type in col_types %}

    -- get columns
    {% set cols = ['test_col_' ~ col_type] %}
    {% if execute %}
      {% set cols = get_cols(new_table, col_type) %}
    {% endif %}

    {% if cols | length > 0 %}

      {% for c in cols | reject('in', join_cols) %}

        select
          '{{old_table}}' as old_table,
          '{{new_table}}' as new_table,
          '{{c}}' as col_name,
          {% for jc in join_cols %}
            o.{{jc}},
          {% endfor %}
          n.{{join_cols[0]}} is not null as new_exists,
          -- cast to text based on col_type
          {% if col_type == 'boolean' %}
            case o.{{c}}
              when true then 'true'
              when false then 'false'
              end as legacy_val,
            case n.{{c}}
              when true then 'true'
              when false then 'false'
              end as new_val
          {% else %}
            o.{{c}}::text as legacy_val,
            n.{{c}}::text as new_val
          {% endif %}
        from
          {{old_table}} o
          left join
            {{new_table}} n using {{join_cols_concat}}
        where
          -- test for equality based on col_type
          {% if col_type == 'timestamp' %}
            abs(extract(epoch from o.{{c}} - n.{{c}})) >= 1
          {% else %}
            o.{{c}} != n.{{c}}
          {% endif %}

          {% if not loop.last %} union all {% endif %}
      {% endfor %}
    {% if not loop.last %} union all {% endif %}
    {% endif %}
  {% endfor %}

{% endmacro %}
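
The macro stitches its per-column selects together with union all inside two nested loops, which can leave a dangling union all when a column type contributes no columns. One alternative, sketched here in Python rather than Jinja, is to collect the selects in a list and join them at the end. This is a simplified illustration under assumptions: build_validation_sql and text_expr are hypothetical names, cols_by_type stands in for the results of get_cols(), and the select list is shortened compared to the macro.

```python
def text_expr(alias, col, col_type):
    """Render a column as text; booleans use a case expression, as in the macro."""
    if col_type == "boolean":
        return f"case {alias}.{col} when true then 'true' when false then 'false' end"
    return f"{alias}.{col}::text"


def build_validation_sql(old_table, new_table, join_cols, cols_by_type):
    """Build one select per compared column and join them with 'union all'."""
    using = "(" + ", ".join(join_cols) + ")"
    selects = []
    for col_type, cols in cols_by_type.items():
        for col in cols:
            if col in join_cols:
                continue  # join columns are keys, not values to compare
            if col_type == "timestamp":
                predicate = f"abs(extract(epoch from o.{col} - n.{col})) >= 1"
            else:
                predicate = f"o.{col} != n.{col}"
            selects.append(
                f"select '{col}' as col_name, "
                f"{text_expr('o', col, col_type)} as legacy_val, "
                f"{text_expr('n', col, col_type)} as new_val "
                f"from {old_table} o left join {new_table} n using {using} "
                f"where {predicate}"
            )
    # Joining a list places the separator only between selects.
    return "\nunion all\n".join(selects)
```

In a dbt macro the same idea would mean appending each rendered select to a list and emitting the joined list once, instead of relying on loop.last in both loops.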

get_cols() macro

-- macro that retrieves the columns of a given schema-qualified table name, filtered by type
-- col_type should be either 'boolean', 'timestamp', or 'other'
-- currently used for the 'validation' model

{% macro get_cols(qualified_table_name, col_type) %}

  {% call statement('get_columns', fetch_result=True) %}

    {% set schema_name = qualified_table_name.split('.')[0] %}
    {% set table_name = qualified_table_name.split('.')[1] %}

    select
      column_name
    from
      information_schema.columns
    where
      table_schema = '{{schema_name}}'
      and table_name = '{{table_name}}'

      {% if col_type == 'boolean' %}
        and data_type = 'boolean'
      {% elif col_type == 'timestamp' %}
        and data_type like 'timestamp%'
      {% else %}
        and data_type != 'boolean'
        and data_type not like 'timestamp%'
      {% endif %}

  {% endcall %}


  {% set results = ['blah'] %}
  {% if execute %}
    {% set results = load_result('get_columns').table.columns['column_name'].values() %}
  {% endif %}
  {{return(results)}}

{% endmacro %}
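
The way get_cols() splits the table name and buckets columns by data_type can be mirrored in a few lines of Python. This is only a sketch under assumptions: split_qualified_name, bucket_for, and filter_columns are hypothetical names, and the (column_name, data_type) pairs stand in for rows of information_schema.columns.

```python
def split_qualified_name(qualified_table_name):
    """Split 'schema.table' into its two parts.

    Raises ValueError for names that do not have exactly two parts, such as
    'database.schema.table', which the macro's split('.')[0] and [1] would
    silently misread.
    """
    schema_name, table_name = qualified_table_name.split(".")
    return schema_name, table_name


def bucket_for(data_type):
    """Map a data_type string to 'boolean', 'timestamp', or 'other'."""
    if data_type == "boolean":
        return "boolean"
    if data_type.startswith("timestamp"):  # same test as: like 'timestamp%'
        return "timestamp"
    return "other"


def filter_columns(columns, col_type):
    """Return the names of the columns whose data type falls in col_type."""
    return [name for name, data_type in columns if bucket_for(data_type) == col_type]
```

For example, filter_columns(columns, 'timestamp') keeps both 'timestamp without time zone' and 'timestamp with time zone' columns, matching the like 'timestamp%' filter in the macro.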