
Use built-in adapter functionality for datatypes #586

Merged · 12 commits merged into main on Jul 5, 2022

Conversation

@jtcohen6 (Contributor) commented on May 12, 2022

resolves #598

This is a:

  • documentation update
  • bug fix with no breaking changes
  • new functionality
  • a breaking change ??? (maybe???)

All pull requests from community contributors should target the main branch (default).

Description & motivation

Follow-up to TODOs left from #577. Experiment with using api.Column.translate_type for our existing type_* macros.
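As a rough illustration of the shape of this change (the real diff appears further down in this thread), a type_* macro's default implementation can delegate to the adapter's Column class instead of hard-coding a literal; in the PR this is routed through a small get_data_type wrapper, but the sketch below calls translate_type directly:

{# Sketch only: delegate the default implementation to the adapter's type translation #}
{% macro default__type_string() %}
    {{ return(api.Column.translate_type("string")) }}
{% endmacro %}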

Feels good:

  • This is code defined within the adapter plugin, and already used for internal purposes.
  • For the most part, it "just works" as a drop-in replacement. There are a few gaps in the core / adapter methods where we could encode slightly more of this logic.
  • We can preserve dispatch on all legacy dbt_utils macros, so that a project / package maintainer can still intervene / override if needed.

What feels less good:

  • We had some explicit behaviors here previously that are now implicit. Do they still work as before?
  • What's the right interface for package authors to access this functionality? Do they want to call {{ api.Column.translate_type('string') }}, {{ get_data_type('string') }}, or {{ type_string() }}?
  • Within dbt-core: We should aim to reconcile / consolidate the agate type conversion methods with Column class type translation.

Checklist

  • I have verified that these changes work locally on the following warehouses (Note: it's okay if you do not have access to all warehouses, this helps us understand what has been covered)
    • BigQuery
    • Postgres
    • Redshift
    • Snowflake
  • I followed guidelines to ensure that my changes will work on "non-core" adapters by:
    • dispatching any new macro(s) so non-core adapters can also use them (e.g. the star() source)
    • using the limit_zero() macro in place of the literal string: limit 0
    • using dbt_utils.type_* macros instead of explicit datatypes (e.g. dbt_utils.type_timestamp() instead of TIMESTAMP — hah!)
  • I have updated the README.md (if applicable)
  • I have added tests & descriptions to my models (and macros if applicable)
  • I have added an entry to CHANGELOG.md

@dbeatty10 (Contributor) commented:

This turned out awesome! 💪

> We had some explicit behaviors here previously that are now implicit. Do they still work as before?

Is there anything we can do to make this feel more good? e.g., like adding in some kind of unit testing?

> What's the right interface for package authors to access this functionality? Do they want to call {{ api.Column.translate_type('string') }}, {{ get_data_type('string') }}, or {{ type_string() }}?

It seems like package authors will be able to use any of those 3 no matter what, right? But are we just trying to choose which of those we want to encourage?

Will the package authors have a variable in hand (like my_value = 'string')? If so, then either of these options makes sense:

  • {{ get_data_type('string') }}
  • {{ api.Column.translate_type('string') }}

Otherwise, this is nice and pithy:

  • {{ type_string() }}

@jtcohen6 (Contributor Author) replied:

> Is there anything we can do to make this feel more good? e.g., like adding in some kind of unit testing?

Definitely! I'll look into how we might do this

> But are we just trying to choose which of those we want to encourage?

Correct! Which of these do we want to encourage, in a world where they actually all do the same thing? Just a question of the right syntactic sugar to sprinkle in.

@jtcohen6 (Contributor Author) commented on Jun 2, 2022

I've added tests for these data types (and removed the previous placeholders). These tests check both:

  • new (redefined) type_* macros
  • previous type_* macro definitions, manually re-added in each Legacy test case (a sketch of this re-adding pattern is below)
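For context, "re-adding" a legacy definition roughly means defining the old dispatch candidate again inside the Legacy test case's project fixture so that dispatch resolves to it; a minimal sketch, reusing the pre-existing Postgres body from the diff shown later in this thread:

{# Sketch: legacy definition re-registered so dispatch picks it up in the Legacy test case #}
{% macro postgres__type_string() %}
    varchar
{% endmacro %}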

Two hesitations with the current implementation:

  • I'm doing a "more lenient" comparison for the legacy version of Postgres type_string(). Previously this returned varchar, now it returns text. Those are true aliases within Postgres, but (for some complex reasons) dbt doesn't perfectly map between them today. (Related: [CT-636] [Bug] Postgres unlimited varchar default to varchar(256) dbt-core#5238) I opted to make the test xfail, rather than truly pass or truly fail.
  • What should type_timestamp do on BigQuery? The most straightforward answer is to return a timestamp, and that's what it does today. But on other databases, timestamp means (by default) timestamp without time zone. On BigQuery that's datetime, whereas timestamp always means timestamp with time zone. Other macros (e.g. dateadd) return datetime values on BigQuery. I was able to get this working and these tests passing as-is, and I think it's simplest for everyone if timestamp just means timestamp. But I also think we might do well to encourage folks to use more-precise types (datetime, timestampntz, timestamptz) that have clear and consistent analogues across databases. (A possible project-level override for BigQuery is sketched below.)
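For anyone who does want "timestamp without time zone" semantics on BigQuery today, dispatch still leaves room for a project-level override. This is a hypothetical sketch, not something this PR ships, and it assumes the overriding project is added to the dbt_utils dispatch search order:

{# Hypothetical override: make type_timestamp resolve to datetime on BigQuery #}
{# Assumes the overriding project appears in the dispatch search order for dbt_utils #}
{% macro bigquery__type_timestamp() %}
    datetime
{% endmacro %}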

All of the actual changes for this PR are happening in this repo only, for now. Things we could do in core/plugins to make the code here slightly cleaner:

  • Move the "syntactic sugar" of get_data_type(x) into dbt-core — it's just a wrapper for api.Column.translate_type(x), which is already there
  • Update the BigQueryColumn.numeric_type function to never return parametrized numeric types. They're technically supported on BQ, but a real nightmare to use.

Not sure if it's ready for final review, but it's worth another look!

@jtcohen6 jtcohen6 marked this pull request as ready for review June 2, 2022 09:46
@jtcohen6 jtcohen6 requested a review from dbeatty10 June 3, 2022 12:14
{% macro default__type_numeric() %}
-    numeric(28, 6)
+    {{ return(api.Column.numeric_type("numeric", 28, 6)) }}
@jtcohen6 (Contributor Author) commented:

TODO: SparkSQL wants this to be called decimal instead of numeric. Investigate whether that works on other standard DBs, or if we should use translate_type for it

@dbeatty10 (Contributor) left a review comment:

I feel good about this when you do.

Comment on lines 17 to 29
{%- macro get_data_type(dtype) -%}
{# if there is no translation for 'dtype', it just returns 'dtype' #}
{{ return(api.Column.translate_type(dtype)) }}
{%- endmacro -%}

{# string ------------------------------------------------- #}

{%- macro type_string() -%}
{{ return(adapter.dispatch('type_string', 'dbt_utils')()) }}
{%- endmacro -%}

{% macro default__type_string() %}
string
{% endmacro %}

{%- macro redshift__type_string() -%}
varchar
{%- endmacro -%}

{% macro postgres__type_string() %}
varchar
{% endmacro %}

{% macro snowflake__type_string() %}
varchar
{{ return(dbt_utils.get_data_type("string")) }}
@dbeatty10 (Contributor) commented:

I would feel good about pushing both get_data_type(dtype) and type_{X} into dbt-core.

But I'm not sure what risks or maintenance burden having both would impose.

They each seem to have their use-cases:

  1. type_{X} is both pithy and clear but doesn't accept a variable dtype
  2. get_data_type(dtype) is a little more verbose but does accept a variable dtype

If we could only choose one option to push into dbt-core, I would choose type_{X} because:

  • it is compact and clear
  • we can always utilize the api.Column.translate_type(dtype) syntax for cases when we need a variable dtype
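To make that trade-off concrete, here is an illustrative snippet (not prescriptive; my_dtype and some_table are hypothetical names, and either call resolves to the adapter's native type):

{# Illustrative only: 'my_dtype' and 'some_table' are hypothetical #}
{% set my_dtype = 'string' %}
select
    cast(id as {{ type_string() }}) as id_string,                                  {# pithy, type fixed in the macro name #}
    cast(amount as {{ api.Column.translate_type(my_dtype) }}) as amount_converted  {# accepts a variable dtype #}
from some_table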

@jtcohen6 (Contributor Author) replied:

Okay, I think you make a compelling case in favor of type_{X} macros!

api.Column.translate_type(dtype) will accept ANY input string, even api.Column.translate_type('fake_type_xyz'). It will translate that input string if it's recognized, or just return the input string if it isn't.

So, there is benefit to the "stronger typing" achieved by macros that have the standard type right in the name, even if (behind-the-scenes) it still just shells out to api.Column.translate_type.
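A tiny illustration of that pass-through behavior (output values are indicative; the Postgres mapping of 'string' to text is noted in the test discussion above):

{# Recognized names are translated; unrecognized names come back unchanged #}
{{ api.Column.translate_type('string') }}          {# e.g. resolves to text on Postgres #}
{{ api.Column.translate_type('fake_type_xyz') }}   {# no translation: returns 'fake_type_xyz' as-is #}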

@jtcohen6 (Contributor Author) commented:

I removed cross_db_utils/base_cross_db_macro.py and cross_db_utils/fixture_cross_db_macro.py because I realized they were unused. Not exactly related to this PR, though, so I can kick out that change if preferred.

@dbeatty10 (Contributor) commented on Jun 30, 2022

Looks great to me!

@jtcohen6 Do these sound like the right next steps?

  1. Merge "Move data type macros into dbt-core" (dbt-core#5428) (you)
  2. Update requirements.txt in this PR (you or me)
  3. Merge this PR (me)
  4. Open a PR to resolve docs.getdbt.com#1644 (me), review (you), merge (me)
  5. Review "Cross database type_{X} macros" (docs.getdbt.com#1648) (you) and merge (me)

@jtcohen6 (Contributor Author) replied:

@dbeatty10 That sounds right to me! After merging the dbt-core PR, there are also the companion PRs in adapter repos. All of these run the tests added in this PR. Some have small substantive changes.

git+https://github.com/dbt-labs/dbt-bigquery.git
git+https://github.com/dbt-labs/dbt-core.git@jerco/data-type-macros#egg=dbt-core&subdirectory=core
git+https://github.com/dbt-labs/dbt-core.git@jerco/data-type-macros#egg=dbt-tests-adapter&subdirectory=tests/adapter
git+https://github.com/dbt-labs/dbt-bigquery.git@jerco/data-type-macros
@jtcohen6 (Contributor Author) commented on Jun 30, 2022:

Now that the dbt-core PR is merged, only the dbt-bigquery PR (dbt-labs/dbt-bigquery#214) is actually blocking this one, since the Redshift + Snowflake PRs don't have substantive changes (just inheriting tests).

TODO: Add back imports from main. There was no reason to remove these.

git+https://github.com/dbt-labs/dbt-redshift.git
git+https://github.com/dbt-labs/dbt-snowflake.git

@dbeatty10 (Contributor) replied:

@jtcohen6 Just pushed these in a commit and CI is running now.

Also added back this one along with dbt-redshift and dbt-snowflake:

  • git+https://github.com/dbt-labs/dbt-core.git#egg=dbt-postgres&subdirectory=plugins/postgres

Please let me know if I shouldn't have added that one in -- obviously easy to pull it back out.

@dbeatty10 merged commit dcd85fb into main on Jul 5, 2022

Successfully merging this pull request may close these issues:

  • Rewrite type_* macros to use built-in adapter capabilities