
account for changing column types in incremental tables #175

Closed · drewbanin opened this issue Oct 4, 2016 · 3 comments


drewbanin commented Oct 4, 2016

Incremental tables are a powerful dbt feature, but there's at least one edge case that makes working with them difficult. During the first run of an incremental model, Redshift infers a type for every column in the table. Subsequent runs can insert new data that does not conform to the inferred type; one example is a varchar(16) value inserted into a varchar(8) column.
In practice, this error looks like:

Value too long for character type
DETAIL:
  -----------------------------------------------
  error:  Value too long for character type
  code:      8001
  context:   Value too long for type character varying(8)
  query:     3743263
  location:  funcs_string.hpp:392
  process:   query4_35 [pid=18194]
  -----------------------------------------------
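For illustration, a minimal sketch of how the mismatch arises (hypothetical table and column names; assumes psycopg2 and a live Redshift connection):

```python
# Sketch only -- placeholder connection string, hypothetical names.
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=...")
cur = conn.cursor()

# First run: CTAS infers the column type from the data, here varchar(8).
cur.execute("create table my_model as select 'abcdefgh'::varchar(8) as val")

# A subsequent run inserts data wider than the inferred type, raising
# the "Value too long for character type" error shown above.
cur.execute("insert into my_model select 'abcdefghijklmnop'::varchar(16)")
```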

The current recommended workaround is to change the model's materialization to table, re-run dbt, then change it back to incremental.

One possible solution is to use the temp table generated by dbt during incremental model updates. The type of each temp table column can be compared to the type of the corresponding column in the existing table. If the types differ, dbt can alter the offending column to match the temp table column's type.
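A rough sketch of that comparison, assuming psycopg2 and Redshift's information_schema.columns (the table names and temp-table suffix are hypothetical, not dbt's actual implementation):

```python
# Sketch: compare column types between the temp table and the existing
# table, and alter where they differ. Assumes an open psycopg2 cursor `cur`.
TYPE_QUERY = """
    select column_name, data_type, character_maximum_length
    from information_schema.columns
    where table_name = %s
"""

def column_types(cur, table_name):
    cur.execute(TYPE_QUERY, (table_name,))
    return {name: (dtype, maxlen) for name, dtype, maxlen in cur.fetchall()}

existing = column_types(cur, "my_model")
tmp = column_types(cur, "my_model__dbt_tmp")

for name, (dtype, maxlen) in tmp.items():
    if name in existing and existing[name] != (dtype, maxlen):
        # Redshift's ALTER COLUMN ... TYPE support is limited (historically
        # varchar widening only), so this branch is illustrative.
        if dtype == "character varying":
            cur.execute(
                f'alter table my_model alter column "{name}" type varchar({maxlen})'
            )
```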

This is kind of tricky in practice, because it could be that the existing column is varchar(8) while the new column is varchar(1)... Some logic is required to ensure that columns are not altered to more-restrictive data types.
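The "never narrow" rule could be as simple as keeping the larger of the two sizes; a hedged sketch (the helper name is made up):

```python
import re

def widened_varchar(existing_type, new_type):
    """Return the varchar type to alter to, never narrowing.

    E.g. ('varchar(8)', 'varchar(16)') -> 'varchar(16)',
         ('varchar(8)', 'varchar(1)')  -> 'varchar(8)' (no change needed).
    """
    sizes = []
    for t in (existing_type, new_type):
        match = re.match(r"(?:character varying|varchar)\((\d+)\)", t)
        if not match:
            raise ValueError(f"not a sized varchar type: {t}")
        sizes.append(int(match.group(1)))
    return f"varchar({max(sizes)})"
```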

Additionally, all incremental model operations currently occur within a single transaction. If we were to implement this proposed solution, we'd need to run Python code in between the temp table phase and the delete/insert phase of the incremental model update, thus splitting up the transaction.

CC @jthandy


jthandy commented Oct 4, 2016

I don't believe that running Python in the middle of a transaction means we need to "split" it... I agree that there is complexity involved in running this, and that it's similar to what we've spoken about before, where a single operation can be split across multiple nodes in the graph. I think the important thing to do there is to make it so that multiple nodes can participate in a single transaction.
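To illustrate the point: with a DB-API connection, statements issued between commits already share one transaction, so arbitrary Python can run between them. A hedged sketch (hypothetical SQL and names; assumes psycopg2, whose connections open a transaction implicitly on first execute):

```python
# Sketch: the temp-table phase, a Python comparison step, and the
# delete/insert phase all inside one transaction.
import psycopg2

conn = psycopg2.connect("host=... dbname=... user=... password=...")
cur = conn.cursor()

cur.execute("create temp table my_model__dbt_tmp as select * from source_data")

# Arbitrary Python here -- e.g. comparing column types as sketched above --
# does not end the transaction; only commit()/rollback() does.

cur.execute(
    "delete from my_model using my_model__dbt_tmp "
    "where my_model.id = my_model__dbt_tmp.id"
)
cur.execute("insert into my_model select * from my_model__dbt_tmp")

conn.commit()  # everything above commits atomically
```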


jthandy commented Oct 4, 2016

(otherwise, I'm 👍 on everything you outlined)

drewbanin modified the milestone: 0.5.1 Release · Oct 18, 2016
drewbanin (Author) commented

0.5.1 release

yu-iskw pushed a commit to yu-iskw/dbt that referenced this issue Aug 17, 2021
* Add a compare_columns arg to the equality test

The `compare_columns` argument allows a user to specify a subset of
columns to compare when checking the equality of two models' data.
There are a few shortcomings of the previous implementation that this
helps to address. With the previous implementation:

- You cannot compare two ephemeral models, as the comparison depends
  on the ability to introspect the relation.
- If two models differ in a predictable and expected way (e.g., the
  addition of a surrogate key column), you would have to create an
  additional materialized model that selects all but the added column
  in order to compare equality.

* Update documentation for equality test
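As a rough illustration of what an equality test restricted to a column subset might compile to, here is a hypothetical SQL-building helper (not the commit's actual implementation):

```python
def equality_sql(model_a, model_b, compare_columns):
    """Build a symmetric-difference query: any rows returned mean the
    two relations differ on the chosen columns. Hypothetical helper."""
    cols = ", ".join(compare_columns)
    return (
        f"(select {cols} from {model_a} except select {cols} from {model_b})\n"
        f"union all\n"
        f"(select {cols} from {model_b} except select {cols} from {model_a})"
    )

# e.g. compare everything except an added surrogate key column:
print(equality_sql("model_a", "model_b", ["id", "name"]))
```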