[CT-2857] Model contracts: raise warning for numeric types without specified scale #8183

Closed
Tracked by #7979
jtcohen6 opened this issue Jul 21, 2023 · 10 comments · Fixed by #8721

Labels: enhancement, Impact: Adapters, model_contracts, multi_project, paper_cut

Comments

@jtcohen6 (Contributor) commented Jul 21, 2023

Slack thread & #7824 (comment)

A few data platforms (Snowflake, Redshift) default to scale=0 for their numeric/decimal data types. (The default scale is 9 on BigQuery.)

Postgres' docs say:

The SQL standard requires a default scale of 0, i.e., coercion to integer precision. We find this a bit useless. If you're concerned about portability, always specify the precision and scale explicitly.

I agree!

Reproduction case

Thanks @gnilrets for the clear repro case! If we run this on Snowflake:

-- models/fct_test_contract.sql
select
   2/3 as two_thirds

Without contract enforcement, two_thirds is a number(7,6). However, if I enforce the contract and fail to specify the precision:

models:
  - name: fct_test_contract
    config:
      contract:
        enforced: true

    columns:
      - name: two_thirds
        data_type: number

Then the build passes, but two_thirds is now a number(38,0), so I’ve lost all precision and two_thirds = 1!
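
One way to see the coercion directly (a sketch, assuming the contracted model above has been built on Snowflake):

-- SYSTEM$TYPEOF shows the type the contract actually produced
select
    two_thirds,                  -- 1, not 0.666667
    system$typeof(two_thirds)    -- NUMBER(38,0)[SB16]
from fct_test_contract;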

Acceptance Criteria

--- updated proposal ---

If the user has provided a numeric data type (one that should carry a precision/scale) but has specified nothing beyond the bare type name, raise a warning.

@jtcohen6 added the bug and model_contracts labels Jul 21, 2023
@github-actions bot changed the title from "Check yaml numeric scale >= SQL numeric scale in model contracts" to "[CT-2857] Check yaml numeric scale >= SQL numeric scale in model contracts" Jul 21, 2023
@graciegoheen (Contributor):

We should:

  • strongly recommend people use float, unless they know the expected scale and can set it via an explicit cast
  • put in some warnings when you specify a scale-less numeric data type

https://community.snowflake.com/s/article/To-Float-or-Not-to-Float-Choosing-Correct-Numeric-Data-Type-in-Snowflake

@jtcohen6 changed the title from "[CT-2857] Check yaml numeric scale >= SQL numeric scale in model contracts" to "[CT-2857] Encourage floats for decimals, and warn for numeric types without scale specified" Aug 1, 2023
@jtcohen6 (Contributor, Author) commented Aug 1, 2023

After poking around this some more yesterday:

create or replace table dbt_jcohen.my_tbl (id numeric not null) as (
    select 2/3 as id
);

This is equivalent to cast(2/3 as numeric), rather than to 2/3::numeric.

I'm inclined to say that float is the way to go, when you don't want to specify a precision & scale — understanding that there might be some performance benefits to fixed-point over floating-point numerics.

If we want to raise a warning for numeric types without precision/scale specified, we don't have a perfect mechanism for doing that right now (short of regex), but the is_numeric adapter method could be a starting point.
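
As a sketch, that starting point might look like the following (hypothetical macro, not in dbt-core; it mirrors the adapter's Column API, and note the Snowflake caveat raised later in this thread):

{% macro warn_on_bare_numeric(column_name, data_type) %}
    {#-- hypothetical: flag any declared type the adapter's Column API considers numeric --#}
    {%- if api.Column.create(column_name, data_type.strip()).is_numeric() -%}
        {{ log("Column '" ~ column_name ~ "' has a numeric type without explicit precision/scale", info=true) }}
    {%- endif -%}
{% endmacro %}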

@graciegoheen (Contributor):

We discussed this further in a community feedback session. One concern that was brought up is that floats aren't always reliable from a rounding perspective (especially for accounting data).

  • perhaps the docs / default we show is a numeric type with scale & precision supplied (e.g. the snippet below)
  • throw a warning if someone supplies a numeric data type without scale and precision (I think this is worth investigating)
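
For the first point, the contract from the repro case above would then spell out precision and scale, e.g. (values borrowed from later in this thread):

columns:
  - name: two_thirds
    data_type: numeric(38, 10)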

@codigo-ergo-sum:

Thanks @graciegoheen. Here are details, for example, from the BigQuery documentation about the dangers of using FLOAT (bolding mine):

Floating point values are approximations.

    The binary format used to represent floating point values can only represent a subset of the numbers between the most positive number and most negative number in the value range. This enables efficient handling of a much larger range than would be possible otherwise. **Numbers that are not exactly representable are approximated by utilizing a close value instead. For example, 0.1 cannot be represented as an integer scaled by a power of 2. When this value is displayed as a string, it is rounded to a limited number of digits, and the value approximating 0.1 might appear as "0.1", hiding the fact that the value is not precise. In other situations, the approximation can be visible**.
    Summation of floating point values might produce surprising results because of [limited precision](https://en.wikipedia.org/wiki/Floating-point_arithmetic#Accuracy_problems). For example, (1e30 + 1) - 1e30 = 0, while (1e30 - 1e30) + 1 = 1.0. This is because the floating point value does not have enough precision to represent (1e30 + 1), and the result is rounded to 1e30. This example also shows that the result of the SUM aggregate function of floating points values depends on the order in which the values are accumulated. In general, this order is not deterministic and therefore the result is not deterministic. Thus, **the resulting SUM of floating point values might not be deterministic and two executions of the same query on the same tables might produce different results**.
    If the above points are concerning, use a [decimal type](https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#decimal_types) instead.

https://cloud.google.com/bigquery/docs/reference/standard-sql/data-types#floating_point_types
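
The non-associativity example from the quote can be run verbatim:

-- floating point summation depends on evaluation order
select
  (1e30 + 1) - 1e30 as grouped_left,   -- 0.0, because 1e30 + 1 rounds back to 1e30
  (1e30 - 1e30) + 1 as grouped_right;  -- 1.0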

And from Snowflake:

Floating point operations can have small rounding errors in the least significant digit(s). Rounding errors can occur in any type of floating-point processing, including trigonometric functions, statistical, and geospatial functions.

Errors can vary each time the query is executed.

Errors can be larger when operands have different precision or scale.

**Errors can accumulate, especially when aggregate functions (e.g. SUM() or AVG()) process large numbers of rows.** Casting to a fixed-point data type before aggregating can reduce or eliminate these errors.

Rounding errors can occur not only when working with SQL, but also when working with other code (e.g. Java, JavaScript, or Python) that runs inside Snowflake (e.g. in [UDFs](https://docs.snowflake.com/en/sql-reference/udf-overview.html) and [stored procedures](https://docs.snowflake.com/en/sql-reference/stored-procedures-overview.html)).

**When comparing two floating-point numbers, Snowflake recommends comparing for approximate equality rather than exact equality.**

https://docs.snowflake.com/en/sql-reference/data-types-numeric#rounding-errors
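
The recommended approximate comparison might look like this (a sketch; the table, columns, and tolerance are placeholders that depend on the data):

-- compare floats for approximate, not exact, equality
select abs(col_a - col_b) < 1e-9 as approximately_equal
from my_table;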

In my humble opinion, it's best to avoid floating point data types entirely unless the use case specifically calls for them (e.g. scientific measurements, engineering machinery, perhaps geographic lat/long kinds of representation).

@gnilrets commented Aug 1, 2023

Being involved with scientific and engineering measurements, and concerned about issues with precision truncation and overflow, I find it best to avoid decimal data types entirely unless the use case specifically calls for them (e.g., currency, counters).

I think the best dbt can do for us here is to enforce that any decimal types specify the precision/scale.

@codigo-ergo-sum:

@gnilrets - I get your point that there are cases where fixed-point datatypes would also be problematic. My experience, though, is that most folks who are doing scientific computing tend to understand the "gotchas" of datatypes, and also the underlying issues with rounding that will occur in any computational representation of numbers versus their pure Platonic forms :).

However, I think the reverse is not true - most folks who are doing non-scientific analytics (and I think 90%+, maybe even 95%+ of analytics are non-scientific) don't know much of anything about numeric datatypes, rounding, etc. E.g. folks doing FP&A modeling or ad spend analysis.

In short, if you don't already know what a float is, and when you should and should not use it, you probably shouldn't be using it :). But this happens all the time - I see all kinds of data pipelines using floats all over the place when they're doing financial analysis. I've seen major data replication and ingest vendors, on multiple occasions, cast financial data incoming from source systems where the data had fixed-point representations into FLOAT in Snowflake and BigQuery. So that seems like the much more common error to avoid. And if those are all reasonable assumptions then we should not ever automatically exacerbate these issues by casting fixed point to floating point datatypes silently.

@codigo-ergo-sum:

@graciegoheen - I had another idea on this. What if, when someone enters a datatype in a data contract, you check INFORMATION_SCHEMA after the table has been created, and throw a warning if the text of the returned datatype doesn't match what was specified in the contract?

So, for example, if the data contract specifies NUMERIC but INFORMATION_SCHEMA in the database in question comes back with NUMERIC(38,0) you could throw a warning saying "Datatype specified in contract for column X does not match that created in database and datatype coercion/conversion may have taken place during model contract execution. It is advisable to check the actual created datatype in the database for this column". Of course you'd want the comparison to be case-insensitive.

That would keep you out of the business of trying to specifically parse out the details of particular datatypes on particular database platforms and instead you're just doing a direct lower-cased text string comparison with no specific logic in it which is a lot simpler.
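
A sketch of that post-build check (standard information_schema columns; the table name is from the repro above):

-- what the database actually created, vs. NUMERIC in the contract
select
    column_name,
    data_type,          -- e.g. NUMBER
    numeric_precision,  -- e.g. 38
    numeric_scale       -- e.g. 0
from information_schema.columns
where lower(table_name) = 'fct_test_contract';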

@jtcohen6 changed the title from "[CT-2857] Encourage floats for decimals, and warn for numeric types without scale specified" to "[CT-2857] Model contracts: raise warning for numeric types without specified scale" Aug 15, 2023
@jtcohen6 (Contributor, Author) commented Aug 31, 2023

This has been a great (and humbling) conversation :)

I think our goals here should be to:

  • Raise a warning that helps users do the thing they want
  • Do the most naive thing possible, rather than trying something cleverer (and leaky)

To that end, I think we could accomplish this with a few tweaks to one macro. If the user has provided a numeric data type (one that should carry a precision/scale) but has specified nothing beyond the bare type name, raise a warning.

I wish we could use the adapter API method Column.is_numeric for this, but it leads to too many false positives on Snowflake, because integer types are actually just aliases for numeric(38, 0) (docs)!! I don't think we should raise a warning when users specify a "naked" integer, when the whole point is that the precision/scale need to be specified.

So, while this approach is pretty naive, it's not doing anything functional (just raising a warning), and should do what we want (help users) in the vast majority of cases.

{% macro default__get_empty_schema_sql(columns) %}
    {%- set col_err = [] -%}
    {%- set col_naked_numeric = [] -%}
    select
    {% for i in columns %}
      {%- set col = columns[i] -%}
      {%- if col['data_type'] is not defined -%}
        {%- do col_err.append(col['name']) -%}
      {#-- If this column's type is just 'numeric' missing precision/scale, raise a warning --#}
      {#- elif api.Column.create(col['name'], col['data_type'].strip()).is_numeric() -#}
      {%- elif col['data_type'].strip().lower() in ('numeric', 'decimal', 'number') -%}
        {%- do col_naked_numeric.append(col['name']) -%}
      {%- endif -%}
      {% set col_name = adapter.quote(col['name']) if col.get('quote') else col['name'] %}
      cast(null as {{ col['data_type'] }}) as {{ col_name }}{{ ", " if not loop.last }}
    {%- endfor -%}
    {%- if (col_err | length) > 0 -%}
      {{ exceptions.column_type_missing(column_names=col_err) }}
    {%- elif (col_naked_numeric | length) > 0 -%}
      {#-- TODO: This should be an actual identifiable warning / log event --#}
      {{ log("Detected columns with numeric type and unspecified precision/scale, this can lead to unintended rounding: " ~ col_naked_numeric, info = true) }}
    {%- endif -%}
{% endmacro %}

With the example above, if I express the type as numeric in the model's contract:

$ dbt run
...
08:05:20 1 of 1 START sql view model dbt_jcohen.fct_test_contract ....................... [RUN]
08:05:21 Detected columns with numeric type and unspecified precision/scale: ['two_thirds']
08:05:21 1 of 1 OK created sql view model dbt_jcohen.fct_test_contract .................. [SUCCESS 1 in 1.08s]
...

If I switch it to numeric(38,10):

$ dbt run
...
08:06:22 1 of 1 START sql view model dbt_jcohen.fct_test_contract ....................... [RUN]
08:06:23 1 of 1 OK created sql view model dbt_jcohen.fct_test_contract .................. [SUCCESS 1 in 1.20s]

@MichelleArk (Contributor):

I wish we could use the adapter API method Column.is_numeric for this, but it leads to too many false positives on Snowflake, because integer types are actually just aliases for numeric(38, 0) (docs)!! I don't think we should raise a warning when users specify a "naked" integer, when the whole point is that the precision/scale need to be specified.

Welp, this part is probably the yuckiest bit to swallow. But I do appreciate the naive solution here.

{%- elif col['data_type'].strip().lower() in ('numeric', 'decimal', 'number') -%}

Maybe, at the very least, we could encapsulate this check in a global-level (dbt-core) macro, so we have something consistent to use if we're going to lean on this 'best effort numeric check' globally.
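
Something like this, perhaps (hypothetical macro name; same tuple check, just centralized):

{% macro is_bare_numeric(data_type) %}
    {#-- true if the declared type is a numeric family name with no precision/scale --#}
    {{ return(data_type.strip().lower() in ('numeric', 'decimal', 'number')) }}
{% endmacro %}

The elif in default__get_empty_schema_sql above would then call this instead of inlining the tuple.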

@jtcohen6 added the enhancement and paper_cut labels and removed the bug label Sep 24, 2023