
[CT-2848] [Bug] dbt show adding decimal places to non-decimal values #8153

Closed
2 tasks done
dave-connors-3 opened this issue Jul 19, 2023 · 10 comments · Fixed by #8306 or #8561
Assignees
Labels
backport 1.6.latest bug Something isn't working help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors Impact: Adapters
Milestone

Comments

@dave-connors-3
Contributor

dave-connors-3 commented Jul 19, 2023

Is this a regression in a recent version of dbt-core?

  • I believe this is a regression in dbt-core functionality
  • I have searched the existing issues, and I could not find an existing issue for this regression

Current Behavior

dbt show is adding decimal places to numeric/integer values incorrectly. This is causing problems for users in the dbt Cloud IDE, as the IDE uses that command to render the preview tab.

Expected/Previous Behavior

Previously (we think!) this was not an issue, and decimal places were properly handled by dbt show.

Steps To Reproduce

  1. use dbt-snowflake>=1.5.0
  2. create a model in your project:
-- in my_model.sql
select
  cast(1 as numeric) as _numeric,
  cast(1 as integer) as _integer,
  cast(1 as decimal) as _decimal
  3. run dbt show -s my_model --output json
  4. get this!
❯ dbt show -s my_model --output json
20:33:15  Running with dbt=1.5.3
20:33:16  Registered adapter: snowflake=1.5.2
20:33:16  Unable to do partial parsing because of a version mismatch
20:33:17  Found 3 models, 4 tests, 0 snapshots, 0 analyses, 436 macros, 0 operations, 0 seed files, 0 sources, 0 exposures, 0 metrics, 0 groups
20:33:17  
20:33:18  Concurrency: 8 threads (target='dev')
20:33:18  
20:33:19  {
  "node": "my_model",
  "show": [
    {
      "_NUMERIC": 1.0,
      "_INTEGER": 1.0,
      "_DECIMAL": 1.0
    }
  ]
}

Relevant log output

No response

Environment

- OS: mac
- Python: Python 3.9.16
- dbt (working version): unknown
- dbt (regression version): 1.5.3

Which database adapter are you using with dbt?

snowflake

Additional Context

No response

@dave-connors-3 dave-connors-3 added bug Something isn't working triage regression labels Jul 19, 2023
@github-actions github-actions bot changed the title [Regression] dbt show adding decimal places to non-decimal values [CT-2848] [Regression] dbt show adding decimal places to non-decimal values Jul 19, 2023
@dave-connors-3
Contributor Author

Did some more local testing on other versions of 1.5 and observed the same behavior; this may not be a regression, but just a regular bug!

@dave-connors-3 dave-connors-3 changed the title [CT-2848] [Regression] dbt show adding decimal places to non-decimal values [CT-2848] [Bug] dbt show adding decimal places to non-decimal values Jul 19, 2023
@jtcohen6
Contributor

Copying from #8167 (great minds think alike @dave-connors-3):

What's going on?

We're just using the agate library's type system and JSON serialization:

table = result.agate_table
# Hack to get Agate table output as string
output = io.StringIO()
if self.args.output == "json":
    table.to_json(path=output)

This is being returned and saved as a decimal.Decimal type, rather than a Python int — and the decimal.Decimal type serializes to a JSON number (1.0) instead of a JSON integer (1).

ipdb> table
<agate.table.Table object at 0x1078d1360>
ipdb> [list(row) for row in table]
[['0005', 'false', Decimal('10'), False, '01T00000aabbccdd'], ['0006', 'true', Decimal('20'), True, '01T00000aabbccde']]
ipdb> table.print_json()
[{"stringnum": "0005", "stringbool": "false", "intfield": 10.0, "boolfield": false, "dtstring": "01T00000aabbccdd"}, {"stringnum": "0006", "stringbool": "true", "intfield": 20.0, "boolfield": true, "dtstring": "01T00000aabbccde"}]
ipdb> type(table[0][2])
<class 'decimal.Decimal'>
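
For illustration, here's a minimal standalone sketch (plain stdlib, not dbt code) of why those Decimal cells come out as 10.0 in the JSON, assuming agate's Number type ultimately serializes each cell via float(d), as in the jsonify fallback discussed below:

import json
from decimal import Decimal

cell = Decimal("10")            # what the adapter hands back for an integer column
print(json.dumps(float(cell)))  # "10.0" -- a float serializes as a JSON number with a decimal point
print(json.dumps(int(cell)))    # "10"   -- what we'd want for integer values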

@graciegoheen graciegoheen self-assigned this Jul 20, 2023
@graciegoheen
Contributor

Thanks to 2 great minds for opening this up!

It does look like this was an issue with prior versions of the IDE (before the dbt show command existed), so it's not a regression.

That being said, we definitely want users to be able to spot-check their data using dbt show / preview in the IDE.

@graciegoheen graciegoheen added help_wanted Trickier changes, with a clear starting point, good for previous/experienced contributors and removed triage labels Jul 20, 2023
@graciegoheen graciegoheen removed their assignment Jul 20, 2023
@booster-analytics-team

Experiencing the same issue here, running dbt version 1.5.0. Following for updates.

@jtcohen6
Contributor

jtcohen6 commented Aug 2, 2023

I think we might be able to fix this with a couple of additional overrides in our custom Number type, which reimplements agate's built-in one:

import agate


class Number(agate.data_types.Number):
    # undo the change in https://github.com/wireservice/agate/pull/733
    # i.e. do not cast True and False to numeric 1 and 0
    def cast(self, d):
        if type(d) == bool:
            raise agate.exceptions.CastError("Do not cast True to 1 or False to 0.")
        # preserve integers as native Python int
        elif type(d) == int:
            return d
        else:
            return super().cast(d)

    def jsonify(self, d):
        if d is None:
            return d
        # do not cast integers to floats
        elif type(d) == int:
            return d
        else:
            return float(d)
$ dbt show -s test --output json
20:08:09  Running with dbt=1.7.0-a1
20:08:09  Registered adapter: postgres=1.7.0-a1
20:08:09  Found 1 seed, 1 model, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
20:08:09
20:08:09  Concurrency: 5 threads (target='dev')
20:08:09
20:08:09  {
  "node": "test",
  "show": [
    {
      "_numeric": 1.0,
      "_integer": 1,
      "_decimal": 1.0
    }
  ]
}

@jtcohen6
Contributor

jtcohen6 commented Aug 6, 2023

@dave-connors-3 Okay, the failing CI in #8306 reminded me of what was missing from my suggestion above. While this fixes the JSON output, it now raises an error during normal pretty-printed output, because agate expects all of its Number data types to be the Python type decimal.Decimal here, where it calls max_precision, which then calls normalize.

Approach 1: During JSON serialization, if it can be an int, make it an int

In this approach, we allow the Number type to always return a Decimal, and never a true Python int, so that Decimal-specific helper methods can be called later on.

Only during JSON serialization, we convert all numbers with 0-scale (no digits after the decimal) to integers instead of floats. (Unfortunately, the agate library seems to call this "precision" instead of "scale"...). This yields a similar result to what agate does when pretty-printing markdown tables: 1.0 just becomes 1, regardless of type.

Downside: This would remove the distinction between the integer and decimal/numeric types in the case where someone selects 1.00::numeric and wants to keep seeing those zeroes.

from decimal import Decimal
from typing import Union

import agate


class Number(agate.data_types.Number):
    # undo the change in https://github.com/wireservice/agate/pull/733
    # i.e. do not cast True and False to numeric 1 and 0
    def cast(self, d) -> Decimal:
        if type(d) == bool:
            raise agate.exceptions.CastError("Do not cast True to 1 or False to 0.")
        else:
            return super().cast(d)

    def jsonify(self, d) -> Union[int, float]:
        if d is None:
            return d
        # do not cast integers to floats
        # (fancier way of doing the same check: if agate.utils.max_precision([d]) == 0)
        elif d == int(d):
            return int(d)
        else:
            return float(d)
$ dbt show -s test --output json
14:49:17  Running with dbt=1.7.0-a1
14:49:17  Registered adapter: postgres=1.7.0-a1
14:49:17  Found 1 model, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
14:49:17
14:49:17  Concurrency: 5 threads (target='dev')
14:49:17
14:49:17  {
  "node": "test",
  "show": [
    {
      "_numeric": 1,
      "_integer": 1,
      "_decimal": 1
    }
  ]
}

It still works for decimals that have nonzero digits after the decimal place, but we're dropping any trailing zeroes. (Remember "significant digits"? Remember high-school science class?)

$ dbt show --output json --inline 'select 1 as a, 1.0 as b, 1.1 as c, 1.10 as d'
15:08:34  Running with dbt=1.7.0-a1
15:08:34  Registered adapter: postgres=1.7.0-a1
15:08:34  Found 1 model, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
15:08:34
15:08:34  Concurrency: 5 threads (target='dev')
15:08:34
15:08:34  {
  "show": [
    {
      "a": 1,
      "b": 1,
      "c": 1.1,
      "d": 1.1
    }
  ]
}
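
For reference, a tiny stdlib sketch (not dbt code) of why the trailing zeroes disappear under this approach: Decimal preserves scale, but the int/float conversion in the jsonify above does not.

from decimal import Decimal

for d in (Decimal("1"), Decimal("1.0"), Decimal("1.1"), Decimal("1.10")):
    # same check as the jsonify above: zero fractional part -> int, otherwise float
    print(d, "->", int(d) if d == int(d) else float(d))

# 1    -> 1
# 1.0  -> 1
# 1.1  -> 1.1
# 1.10 -> 1.1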

Approach 2: Integer != Number

Rather than changing the Number type, we add a new data type Integer, which expects to return a Python int rather than a decimal.Decimal type:

import decimal

import agate


# we do not want integers being coerced to decimals in JSON output
class Integer(agate.data_types.DataType):
    def cast(self, d) -> int:
        # bool is a subclass of int, so check type() rather than isinstance()
        if type(d) == int:
            return d
        else:
            raise agate.exceptions.CastError('Can not parse value "%s" as Integer.' % d)

    def jsonify(self, d) -> int:
        return d


class Number(agate.data_types.Number):
    # undo the change in https://github.com/wireservice/agate/pull/733
    # i.e. do not cast True and False to numeric 1 and 0
    def cast(self, d) -> decimal.Decimal:
        if type(d) == bool:
            raise agate.exceptions.CastError("Do not cast True to 1 or False to 0.")
        else:
            return super().cast(d)

Then, within our type tester further down, we need to try casting to Integer before Number:

def build_type_tester(
    text_columns: Iterable[str], string_null_values: Optional[Iterable[str]] = ("null", "")
) -> agate.TypeTester:

    types = [
        Integer(null_values=("null", "")),
        Number(null_values=("null", "")),
...

Because Integer isn't an instance of agate.data_types.Number, agate won't try to call max_precision/normalize on it here while formatting the markdown table.
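
As a rough standalone sketch of that ordering (hypothetical in-memory data, not dbt code; it assumes agate.TypeTester accepts a custom types list and tries candidates in order):

import agate

class Integer(agate.data_types.DataType):
    def cast(self, d) -> int:
        if type(d) == int:  # bool is a subclass of int, so check type() here
            return d
        raise agate.exceptions.CastError('Can not parse value "%s" as Integer.' % d)

    def jsonify(self, d) -> int:
        return d

# Integer is listed before Number, so an all-integer column resolves to Integer
tester = agate.TypeTester(types=[Integer(), agate.data_types.Number()])
table = agate.Table([[1], [2]], column_names=["n"], column_types=tester)
print(type(table.rows[0]["n"]))  # expected: <class 'int'> rather than decimal.Decimal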

This does preserve the distinction between integers and decimals/numerics, although it doesn't preserve the exact scale of the original result set:

$ psql
psql (14.8 (Homebrew), server 13.11)
Type "help" for help.

jerco=# select
  cast(1 as numeric(38,3)) as _numeric,
  cast(1 as integer) as _integer,
  cast(1 as decimal(38,2)) as _decimal
;
 _numeric | _integer | _decimal
----------+----------+----------
    1.000 |        1 |     1.00
(1 row)
$ dbt show -s test --output json
14:58:28  Running with dbt=1.7.0-a1
14:58:28  Registered adapter: postgres=1.7.0-a1
14:58:28  Unable to do partial parsing because config vars, config profile, or config target have changed
14:58:28  Unable to do partial parsing because a project dependency has been added
14:58:28  Found 1 model, 0 sources, 0 exposures, 0 metrics, 352 macros, 0 groups, 0 semantic models
14:58:28
14:58:28  Concurrency: 5 threads (target='dev')
14:58:28
14:58:28  {
  "node": "test",
  "show": [
    {
      "_numeric": 1.0,
      "_integer": 1,
      "_decimal": 1.0
    }
  ]
}

On its face, this feels like the More Correct approach. HOWEVER!! There is a risk of a very subtle breaking change here. Because we expose agate types to end users (in the form of run_query results), if someone has been in the habit of returning integers and then calling Decimal-type methods on those returned values (methods such as normalize, or anything else from this list), they would start seeing errors like AttributeError: 'int' object has no attribute 'is_zero'.

-- models/test.sql
{% set my_query %}

select
  cast(1 as numeric(38,3)) as _numeric,
  cast(1 as integer) as _integer,
  cast(1 as decimal(38,2)) as _decimal

{% endset %}

{% set results = run_query(my_query) %}

{% for row in results %}

  {{ row[0].is_zero() }}
  {{ row[1].is_zero() }}
  {{ row[2].is_zero() }}
  
{% endfor %}
$ dbt show -s test
...
Runtime Error
  Compilation Error in model test (models/test.sql)
    'int object' has no attribute 'is_zero'. This can happen when calling a macro that does not exist. Check for typos and/or install package dependencies with "dbt deps".

We've never explicitly documented that you can do those things, but it would be a subtle breaking change nevertheless. We don't encounter this issue with Approach 1, because the values remain Decimal type all the way through, until the very moment of serializing to JSON.
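
A quick stdlib illustration of that difference (not dbt code): Decimal values expose methods like is_zero and normalize, while a plain Python int does not.

from decimal import Decimal

print(Decimal("1").is_zero())  # False -- works today, because agate Number cells are Decimals

try:
    (1).is_zero()              # a plain int has no Decimal methods
except AttributeError as exc:
    print(exc)                 # 'int' object has no attribute 'is_zero'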

@greg-mckeon

@jtcohen6 it seems like option 2 is better here, with us documenting the breaking part of the change and only backporting to 1.6. Does that seem reasonable?

@jtcohen6
Contributor

jtcohen6 commented Aug 7, 2023

Yup, that's what I'm thinking too!

@graciegoheen
Contributor

Re-opening: we had to revert this fix due to a number of failing tests and an error when running docs generate in the IDE.

@graciegoheen graciegoheen reopened this Aug 23, 2023
@dave-connors-3
Contributor Author

When running with the associated change in the IDE, I executed dbt docs generate and got the following error:

16:54:21 Running with dbt=1.6.1-rc1
16:54:22 Registered adapter: snowflake=1.6.1
16:54:22 Found 20 models, 32 tests, 14 sources, 0 exposures, 0 metrics, 486 macros, 0 groups, 0 semantic models
16:54:24 Concurrency: 4 threads (target='default')
16:54:27 Building catalog
16:54:33 Encountered an error:
Runtime Error
  Tables contain columns with the same names (stats:row_count:value), but different types (<dbt.clients.agate_helper.Number object at 0x7fa6e98b7040> vs <dbt.clients.agate_helper.Integer object at 0x7fa6e9a375e0>)
16:54:21 Runtime Error
  Tables contain columns with the same names (stats:row_count:value), but different types (<dbt.clients.agate_helper.Number object at 0x7fa6e98b7040> vs <dbt.clients.agate_helper.Integer object at 0x7fa6e9a375e0>)
Sending event: {'category': 'dbt', 'action': 'invocation', 'label': 'start', 'context': [<snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6e2bfe640>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6e0798100>, <snowplow_tracker.self_describing_json.SelfDescribingJson object at 0x7fa6e0798370>]}

This error seems to stem from the agate_helper ColumnTypeBuilder class here.
