You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
primary_key is currently ignored when scd2 is used as merge strategy. This ticket suggests a way to start interpreting the primary_key (or, alternatively, the natural_key) hint, such that users can use a natural key and simple row version column (i.e. "updated_at") as surrogate key.
Cases—currently supported:
(1) no primary_key, no row_version_column_name ➜ combine all user columns in hash, store in _dlt_id and mark with x-row-version hint
(2) no primary_key, yes row_version_column_name ➜ mark column row_version_column_name with x-row-version—values for _dlt_id are random as usual
Cases—to implement in this ticket:
(3) yes primary_key, no row_version_column_name ➜ raise error—we need to be able to make a surrogate key, which is not possible without row_version_column_name
alternative: ignore primary_key and silently resolve to case (1)—not my preference, I'd rather make it explicit and not do things behind the user's back
(4) yes primary_key, yes row_version_column_name ➜ mark column row_version_column_name with x-row-version, combine primary_key column(s) and row_version_column_name column in hash on-the-fly (values are not persisted)—values for _dlt_id are random as usual
natural_key hint?
The primary_key will not be a technical primary key in the destination table using this approach, as we allow multiple records with the same primary_key. Kimball uses the term natural key instead. We could introduce natural_key as a new hint and use that in all cases described above instead of primary_key (and ignore primary_key completeley if provided).
2. Active record literal
Currently we use a high timestamp literal (i.e. 9999-12-31 00:00...) in the "valid to" column to indicate an active record. This is suboptimal in two cases: (1) the destination has a lower max value for timestamp columns (e.g. Clickhouse) and (2) the user prefers a NULL value instead.
We can provide an active_record_literal configuration option as follows:
Possible values for active_record_literal would be null (default) and high_timestamp. The literal used for high_timestamp defaults to 9999-12-31 00:00... for destinations that support it, but can be lower for destinations that have a lower max timestamp (e.g. Clickhouse).
Feature description
This ticket has two components:
primary_key
interpretation1.
primary_key
interpretationprimary_key
is currently ignored whenscd2
is used as merge strategy. This ticket suggests a way to start interpreting theprimary_key
(or, alternatively, thenatural_key
) hint, such that users can use a natural key and simple row version column (i.e. "updated_at") as surrogate key.Cases—currently supported:
(1) no
primary_key
, norow_version_column_name
➜ combine all user columns in hash, store in_dlt_id
and mark withx-row-version
hint(2) no
primary_key
, yesrow_version_column_name
➜ mark columnrow_version_column_name
withx-row-version
—values for_dlt_id
are random as usualCases—to implement in this ticket:
(3) yes
primary_key
, norow_version_column_name
➜ raise error—we need to be able to make a surrogate key, which is not possible withoutrow_version_column_name
primary_key
and silently resolve to case (1)—not my preference, I'd rather make it explicit and not do things behind the user's back(4) yes
primary_key
, yesrow_version_column_name
➜ mark columnrow_version_column_name
withx-row-version
, combineprimary_key
column(s) androw_version_column_name
column in hash on-the-fly (values are not persisted)—values for_dlt_id
are random as usualnatural_key
hint?The
primary_key
will not be a technical primary key in the destination table using this approach, as we allow multiple records with the sameprimary_key
. Kimball uses the term natural key instead. We could introducenatural_key
as a new hint and use that in all cases described above instead ofprimary_key
(and ignoreprimary_key
completeley if provided).2. Active record literal
Currently we use a high timestamp literal (i.e.
9999-12-31 00:00...
) in the "valid to" column to indicate an active record. This is suboptimal in two cases: (1) the destination has a lower max value for timestamp columns (e.g. Clickhouse) and (2) the user prefers aNULL
value instead.We can provide an
active_record_literal
configuration option as follows:Possible values for
active_record_literal
would benull
(default) andhigh_timestamp.
The literal used forhigh_timestamp
defaults to9999-12-31 00:00...
for destinations that support it, but can be lower for destinations that have a lower max timestamp (e.g. Clickhouse).Are you a dlt user?
Yes, I'm already a dlt user.
Use case
No response
Proposed solution
No response
Related issues
#828
The text was updated successfully, but these errors were encountered: