PostgresHook: Add upsert rows support using ON CONFLICT#67045

Draft
SameerMesiah97 wants to merge 3 commits into apache:main from SameerMesiah97:PostgresHook-Upsert-Rows

@SameerMesiah97
Contributor

Description

This change adds a new PostgresHook.upsert_rows method that provides native PostgreSQL UPSERT support using INSERT ... ON CONFLICT.

The new method supports configurable conflict targets through conflict_fields and selective updates through update_fields. When update_fields is omitted or empty, conflicting rows are ignored using DO NOTHING.

upsert_rows reuses the existing batching, transaction handling, serialization, and lineage behavior used by insert_rows, while introducing PostgreSQL-specific UPSERT semantics that are not currently exposed through the generic insert abstraction.
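To make the described semantics concrete, here is a minimal sketch of how an INSERT ... ON CONFLICT statement could be assembled from the three field lists. The helper name `build_upsert_sql` is hypothetical and is not the PR's `_generate_upsert_sql`; identifiers are interpolated without quoting or escaping, which a real implementation would need to handle.

```python
def build_upsert_sql(table, target_fields, conflict_fields, update_fields=None):
    """Hypothetical helper illustrating the described behavior (assumption, not the PR's code)."""
    if not target_fields:
        raise ValueError("target_fields must not be empty")
    if not conflict_fields:
        raise ValueError("conflict_fields must not be empty")

    columns = ", ".join(target_fields)
    # One %s placeholder per target column.
    placeholders = ", ".join(["%s"] * len(target_fields))
    conflict = ", ".join(conflict_fields)

    sql = (
        f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) "
        f"ON CONFLICT ({conflict})"
    )
    if update_fields:
        # EXCLUDED refers to the row that was proposed for insertion.
        assignments = ", ".join(f"{f} = EXCLUDED.{f}" for f in update_fields)
        return f"{sql} DO UPDATE SET {assignments}"
    # Omitted or empty update_fields: conflicting rows are ignored.
    return f"{sql} DO NOTHING"
```

Under these assumptions, `build_upsert_sql("users", ["id", "name"], ["id"], ["name"])` yields a DO UPDATE statement, while omitting the last argument yields DO NOTHING.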

This PR is dependent on PR #66893 merging first.

Rationale

DbApiHook.insert_rows currently supports a generic replace=True abstraction delegated through dialect-specific SQL generation. However, PostgreSQL UPSERT semantics require additional concepts that are not representable through the existing API, including explicit conflict targets and selective update columns.

Supporting PostgreSQL-native UPSERT behavior through insert_rows would require introducing PostgreSQL-specific arguments such as conflict_fields and update_fields into the shared public DbApiHook.insert_rows API. Since DbApiHook is inherited broadly across providers, expanding the generic insert abstraction with provider-specific UPSERT semantics would increase API complexity and introduce ambiguous behavior for non-PostgreSQL hooks.

Adding a dedicated PostgresHook.upsert_rows method keeps PostgreSQL ON CONFLICT semantics explicit and self-contained while avoiding backwards compatibility and abstraction concerns in the shared DbApiHook interface.

The implementation uses PostgreSQL-native INSERT ... ON CONFLICT semantics rather than MERGE, since ON CONFLICT is the established and more broadly compatible UPSERT mechanism across supported PostgreSQL versions.

Tests

Added unit tests verifying that:

  • Standard UPSERT operations correctly generate ON CONFLICT DO UPDATE SQL.
  • UPSERT operations correctly support single and composite conflict fields.
  • UPSERT operations correctly support single and multiple update fields.
  • DO NOTHING behavior is generated when update_fields is omitted.
  • fast_executemany=True uses psycopg2.extras.execute_batch.
  • commit_every correctly chunks UPSERT operations across transactions.
  • Empty row collections do not generate SQL or emit lineage.
  • Empty or invalid target_fields and conflict_fields raise validation errors.

Backwards Compatibility

This change introduces a new provider-specific API and does not modify existing insert_rows behavior or shared DbApiHook interfaces.

Sameer Mesiah added 2 commits May 16, 2026 20:54
…sycopg3. Consolidate duplicated insert/upsert and dialect tests into a shared base class while preserving version-specific behavior and lineage coverage.
…oduce upsert_rows and UPSERT SQL generation with support for DO UPDATE, DO NOTHING, chunked commits, and execute_batch optimization.

Added unit tests for UPSERT SQL generation, DO UPDATE and DO NOTHING behavior, chunked execution, execute_batch execution, validation checks, and transaction handling.
SameerMesiah97 force-pushed the PostgresHook-Upsert-Rows branch from 64de06d to 5709c06 on May 16, 2026 20:14
SameerMesiah97 changed the title from "Postgres hook upsert rows" to "PostgresHook: Add upsert rows support using ON CONFLICT" on May 16, 2026
Contributor

@justinpakzad justinpakzad left a comment

Nice PR. Left a couple of comments. I know it's still in draft but figured I'd leave some feedback anyways.

target_fields: list[str],
conflict_fields: list[str],
update_fields: list[str] | None = None,
**kwargs,
Contributor

I don't think we need the kwargs here as there is nothing consuming them.

def _generate_upsert_sql(
self,
table: str,
values: tuple[Any, ...] | list[Any],
Contributor

Do we need to pass in the values here? The only thing it's used for is to produce the right number of placeholders but I think that can just be done with len(target_fields).
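A small sketch of that suggestion, assuming every row has exactly one value per target field. The function name `make_placeholders` is hypothetical and only illustrates deriving the placeholder list from `target_fields` alone.

```python
def make_placeholders(target_fields):
    """Hypothetical helper: one %s per target column, no row values needed."""
    return ", ".join(["%s"] * len(target_fields))


target_fields = ["id", "name", "email"]
row = (1, "Ada", "ada@example.com")

# For a well-formed row, counting the fields gives the same result as counting the values.
from_fields = make_placeholders(target_fields)
from_values = ", ".join("%s" for _ in row)
assert from_fields == from_values
```

The two only diverge when a row's width does not match `target_fields`, in which case the statement would fail at execution time either way.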

Comment on lines +813 to +819
sql = self._generate_upsert_sql(
table=table,
values=values[0],
target_fields=target_fields,
conflict_fields=conflict_fields,
update_fields=update_fields,
)
Contributor

Related to my comment above - if we don't need to pass values since it's just used for the number of placeholders, I don't think we need to regenerate the same SQL string on every chunk. This can just be done once outside of the loop.
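One way the suggested restructuring might look, as a rough sketch: the SQL string is passed in already built, and the loop only slices rows into chunks and commits after each one. The names `iter_chunks` and `upsert_in_chunks`, the `commit` callable, and the treatment of `commit_every=0` as "one chunk" are assumptions for illustration, not the PR's code.

```python
from itertools import islice


def iter_chunks(rows, size):
    """Yield lists of at most `size` rows; size=None yields everything in one chunk."""
    it = iter(rows)
    while chunk := list(islice(it, size)):
        yield chunk


def upsert_in_chunks(cur, commit, sql, rows, commit_every=1000):
    """Hypothetical loop: `sql` is generated once by the caller, not per chunk."""
    nb_rows = 0
    # Assumption: commit_every=0 means "no chunking", so all rows go in one chunk.
    for chunk in iter_chunks(rows, commit_every or None):
        cur.executemany(sql, chunk)
        commit()
        nb_rows += len(chunk)
    # The caller can check nb_rows > 0 before logging or emitting lineage.
    return nb_rows
```

Because the statement no longer depends on the rows, an empty input simply never enters the loop and `nb_rows` stays at 0.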

Comment on lines +824 to +826
if fast_executemany:
# execute_batch reduces round trips by batching parameter sets.
execute_batch(
Contributor

Do we need a guard here since psycopg3 does not support execute_batch? Maybe using the USE_PSYCOPG3 constant that's used in other parts of the code. Either logging a warning and defaulting back to cur.executemany or raising an error.
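A sketch of the "warn and fall back" option, under stated assumptions: the psycopg 3 flag and the batch function are passed in as parameters (`use_psycopg3`, `execute_batch`) so the example runs without either driver installed. In the hook these would presumably come from the `USE_PSYCOPG3` constant and `psycopg2.extras.execute_batch`; the function name `execute_rows` is hypothetical.

```python
import logging

log = logging.getLogger(__name__)


def execute_rows(cur, sql, rows, *, fast_executemany, use_psycopg3, execute_batch=None):
    """Hypothetical guard: use execute_batch only when it is actually available."""
    if fast_executemany and not use_psycopg3 and execute_batch is not None:
        # psycopg2 path: execute_batch groups parameter sets to reduce round trips.
        execute_batch(cur, sql, rows)
        return "execute_batch"
    if fast_executemany:
        # execute_batch is a psycopg2 helper, so fall back instead of failing.
        log.warning("fast_executemany is not available here; using cursor.executemany")
    cur.executemany(sql, rows)
    return "executemany"
```

Raising an error instead of logging would be the stricter alternative mentioned above; which one fits better depends on whether silently slower execution is acceptable to callers.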

table,
)

if sql:
Contributor

Also related to my comment above, if we construct the query once outside the loop then this would need to be updated. This could be if nb_rows > 0.

@SameerMesiah97
Contributor Author

> Nice PR. Left a couple of comments. I know it's still in draft but figured I'd leave some feedback anyways.

That’s perfectly fine. I will address the feedback once the dependency is merged.


2 participants