[CT-2857] Model contracts: raise warning for numeric types without specified scale #8183
We should:
After poking around this some more yesterday:
Is equivalent to

I'm inclined to say that

If we want to raise a warning for numeric types without precision/scale specified, we don't have a perfect mechanism for doing that right now (short of regex), but the
We discussed this further in a community feedback session. One concern that was brought up is that floats aren't always reliable from a rounding perspective (especially for accounting data).
Thanks @graciegoheen. Here are details, for example, from the BigQuery documentation about the dangers of using floating point types:
https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#floating_point_types

And from Snowflake:

https://docs.snowflake.com/en/sql-reference/data-types-numeric#rounding-errors

In my humble opinion, it's best to avoid floating point data types entirely unless the use case specifically calls for them (e.g. scientific measurements, engineering machinery, perhaps geographic lat/long kinds of representation).
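To make the rounding concern concrete, here is an illustrative query (Snowflake-flavored SQL; exact output formatting varies by platform):

select
    cast(0.1 as float) + cast(0.2 as float) as float_sum,                 -- typically 0.30000000000000004 under IEEE-754 doubles
    cast(0.1 as numeric(38,2)) + cast(0.2 as numeric(38,2)) as fixed_sum  -- exactly 0.30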
Being involved with scientific and engineering measurements, and concerned about issues with precision truncation and overflow, I find it best to avoid decimal data types entirely unless the use case specifically calls for them (e.g., currency, counters). I think the best dbt can do for us here is to enforce that any decimal types specify the precision/scale.
@gnilrets - I get your point that there are cases where fixed-point datatypes would also be problematic. My experience, though, is that most folks who are doing scientific computing tend to understand the "gotchas" of datatypes, and also the underlying issues with rounding that will occur in any computational representation of numbers versus their pure Platonic forms :).

However, I think the reverse is not true - most folks who are doing non-scientific analytics (and I think 90%+, maybe even 95%+, of analytics are non-scientific) don't know much of anything about numeric datatypes, rounding, etc. - e.g. folks doing FP&A modeling or ad spend analysis. In short, if you don't already know what a float is, and when you should and should not use it, you probably shouldn't be using it :). But this happens all the time - I see all kinds of data pipelines using floats all over the place when they're doing financial analysis. I've seen major data replication and ingest vendors, on multiple occasions, cast financial data incoming from source systems with fixed-point representations into FLOAT in Snowflake and BigQuery. So that seems like the much more common error to avoid.

And if those are all reasonable assumptions, then we should not ever automatically exacerbate these issues by silently casting fixed-point to floating-point datatypes.
@graciegoheen - I had another idea on this. What if, when someone enters a datatype in a data contract, you check

So, for example, if the data contract specifies

That would keep you out of the business of trying to specifically parse out the details of particular datatypes on particular database platforms; instead you're just doing a direct lower-cased text string comparison, with no platform-specific logic in it, which is a lot simpler.
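One way to read that suggestion, sketched as a hypothetical Jinja helper (the macro name and arguments here are invented for illustration):

{% macro types_match_exactly(contracted_type, actual_type) %}
{#-- Hypothetical helper: a direct lower-cased string comparison, with no per-platform type parsing --#}
{{ return(contracted_type.strip().lower() == actual_type.strip().lower()) }}
{% endmacro %}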
This has been a great (and humbling) conversation :) I think our goals here should be to:
To that end, I think we could accomplish this with a few tweaks to one macro. If the user has provided a I wish we could use the adapter API method So, while this approach is pretty naive, it's not doing anything functional (just raising a warning), and should do what we want (help users) in the vast majority of cases. {% macro default__get_empty_schema_sql(columns) %}
{%- set col_err = [] -%}
{%- set col_naked_numeric = [] -%}
select
{% for i in columns %}
{%- set col = columns[i] -%}
{%- if col['data_type'] is not defined -%}
{%- do col_err.append(col['name']) -%}
{#-- If this column's type is just 'numeric' missing precision/scale, raise a warning --#}
{#- elif api.Column.create(col['name'], col['data_type'].strip()).is_numeric() -#}
{%- elif col['data_type'].strip().lower() in ('numeric', 'decimal', 'number') -%}
{%- do col_naked_numeric.append(col['name']) -%}
{%- endif -%}
{% set col_name = adapter.quote(col['name']) if col.get('quote') else col['name'] %}
cast(null as {{ col['data_type'] }}) as {{ col_name }}{{ ", " if not loop.last }}
{%- endfor -%}
{%- if (col_err | length) > 0 -%}
{{ exceptions.column_type_missing(column_names=col_err) }}
{%- elif (col_naked_numeric | length) > 0 -%}
{#-- TODO: This should be an actual identifiable warning / log event --#}
{{ log("Detected columns with numeric type and unspecified precision/scale, this can lead to unintended rounding: " ~ col_naked_numeric, info = true) }}
{%- endif -%}
{% endmacro %}

With the example above, if I express the type as
If I switch it to
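Whichever spelling is the bare numeric type would trip the check; with the macro as written above, the warning would look something like this in the logs (the column list here is illustrative):

Detected columns with numeric type and unspecified precision/scale, this can lead to unintended rounding: ['two_thirds']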
Welp, this part is probably the yuckiest bit to swallow. But I do appreciate the naive solution here.

Maybe, at the very least, we could encapsulate this check in a global-level (in dbt-core) macro, so we have something consistent to use if we are going to lean on this 'best effort numeric check' at the global level.
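A sketch of what that encapsulation might look like, assuming a new dbt-core macro (the name is invented) that adapters like dbt-bigquery could call or override:

{% macro is_naked_numeric(data_type) %}
{#-- Hypothetical global macro: 'best effort' check for a numeric type with no precision/scale, reusing the same type list as the macro above --#}
{{ return(data_type is not none and data_type.strip().lower() in ('numeric', 'decimal', 'number')) }}
{% endmacro %}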
dbt-bigquery overrides default__get_empty_schema_sql: https://github.com/dbt-labs/dbt-bigquery/blob/main/dbt/include/bigquery/macros/utils/get_columns_spec_ddl.sql#L7
slack thread & #7824 (comment)
A few data platforms (Snowflake, Redshift) default to scale=0 for their numeric/decimal data type. (It's 9 by default on BigQuery.)

Postgres' docs say:
I agree!
Reproduction case
Thanks @gnilrets for the clear repro case! If we run this on Snowflake:
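The repro SQL didn't survive the copy into this thread; a minimal model along these lines (a hypothetical reconstruction) produces the types described below:

-- models/two_thirds.sql (filename hypothetical)
select 2/3 as two_thirds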
Without contract enforcement, two_thirds is a number(7,6). However, if I enforce the contract and fail to specify the precision, then the build passes, but two_thirds is now a number(38,0), so I've lost all precision and two_thirds = 1!
!Acceptance Criteria
--- updated proposal ---
If the user has provided a numeric data type that should specify precision/scale, but has specified nothing except the type itself, raise a warning.