
Add merge_operator for Redshift database #753

Merged: 11 commits into main from 612-merge-operator on Sep 1, 2022

Conversation

@pankajkoti (Contributor) commented Aug 26, 2022

Description

Redshift accepts constraints in the table DDL but does not enforce them.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
As a result, in case of conflicts, e.g. on uniqueness constraints, it
allows insertion of duplicate values and expects you to clean up the
duplicates yourself.

Additionally, for merges the Redshift documentation suggests making use
of a staging table.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

The merge implementation involves the following steps (a sketch of the
corresponding SQL appears after the transaction note below):

  i. Create a staging table with a schema identical to the target table.
 ii. Load the data from the target table into the stage table.
     The idea is that we do all the work in the stage table and then finally
     replace the target table data with the stage table data.
iii. Handle conflicts. There are two strategies to handle conflicts,
     explained in detail below:
       a. ignore:
              To ignore rows in case of conflicts, we only fetch the rows
              from the source table which do not have matching rows in the
              stage table (based on the conflict columns) and insert those
              rows into the stage table.
       b. update:
              To update rows in case of conflicts, we first update the given
              target columns of the matching rows in the stage table that are
              present in the source table, using the given
              source_to_target_columns_map, and then additionally insert the
              non-matching rows from the source table into the stage table.
 iv. Truncate the target table.
  v. Load the stage table data into the target table.

The above merge steps happen within a transaction block, meaning that if one of
the statements fails, everything will be rolled back.
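
For illustration, below is a minimal sketch of the statement sequence described above, written as plain SQL strings in Python. All table and column names (target_tbl, source_tbl, stage_tbl, conflict column id, value column name) are hypothetical placeholders, not the SQL the operator actually generates:

# Hypothetical sketch of the merge steps above; the real operator builds these
# statements dynamically from the given tables, conflict columns and column map.
target_tbl, source_tbl, stage_tbl = "target_tbl", "source_tbl", "stage_tbl"

merge_statements = [
    # i. staging table with a schema identical to the target table
    f"CREATE TABLE {stage_tbl} (LIKE {target_tbl});",
    # ii. copy the current target rows into the stage table
    f"INSERT INTO {stage_tbl} SELECT * FROM {target_tbl};",
    # iii.a ignore: insert only the source rows whose conflict column ("id")
    # has no matching row in the stage table
    f"INSERT INTO {stage_tbl} (id, name) "
    f"SELECT s.id, s.name FROM {source_tbl} s "
    f"LEFT JOIN {stage_tbl} t ON s.id = t.id WHERE t.id IS NULL;",
    # iv. empty the target table
    f"TRUNCATE {target_tbl};",
    # v. move the merged rows back into the target table
    f"INSERT INTO {target_tbl} SELECT * FROM {stage_tbl};",
]

# iii.b update would instead first update the matching stage rows from the
# source using the column map and then insert the non-matching source rows:
#   UPDATE stage_tbl SET name = s.name FROM source_tbl s WHERE stage_tbl.id = s.id;
#   INSERT INTO stage_tbl (id, name)
#       SELECT s.id, s.name FROM source_tbl s
#       LEFT JOIN stage_tbl t ON s.id = t.id WHERE t.id IS NULL;

The operator runs such a sequence inside a single transaction block, so a failure in any statement rolls the whole merge back.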

What is the current behavior?

No support for merge_operator for the Redshift database.

What is the new behavior?

Implements support for merge_operator for the Redshift database.

Does this introduce a breaking change?

Checklist

  • Created tests which fail without the change (if possible)

closes: #754

src/astro/databases/aws/redshift.py: 5 review comments (outdated, resolved)
@codecov (bot) commented Aug 30, 2022

Codecov Report

Merging #753 (05b2e1c) into main (4585139) will increase coverage by 0.05%.
The diff coverage is 95.83%.

❗ Current head 05b2e1c differs from pull request most recent head c702980. Consider uploading reports for the commit c702980 to get more accurate results

@@            Coverage Diff             @@
##             main     #753      +/-   ##
==========================================
+ Coverage   93.52%   93.57%   +0.05%     
==========================================
  Files          43       43              
  Lines        1776     1822      +46     
  Branches      216      226      +10     
==========================================
+ Hits         1661     1705      +44     
  Misses         93       93              
- Partials       22       24       +2     
Impacted Files Coverage Δ
src/astro/databases/aws/redshift.py 97.75% <95.83%> (-2.25%) ⬇️


pankajkoti and others added 11 commits September 1, 2022 11:58
Redshift does not enforce constraints although it accepts them
in the table DDL. Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
As a result, in case of conflicts, e.g. on a uniqueness constraint, it
allows insertion of duplicate values and expects you to clean
up the duplicates.

Additionally, for merges the Redshift documentation suggests making use
of a staging table. Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
The merge implementation involves the following steps:
i. Create a staging table with a schema identical to the target table.
ii. Load data from the source table into the stage table, mapping
    source table columns to stage table columns using the given
    source_to_target_columns_map.
iii. Handle conflicts. There are two strategies to handle conflicts,
     explained in detail below:
     a. ignore: To ignore rows in case of conflicts, we remove the
                matching rows from the stage table that are present in the
                target table.
     b. update: To update rows in case of conflicts, we remove the
                matching rows from the target table that are present
                in the stage table.
iv. Load the stage table data into the target table.
Co-authored-by: Utkarsh Sharma <utkarsharma2@gmail.com>
"""
source_table_name = self.get_table_qualified_name(source_table)
target_table_name = self.get_table_qualified_name(target_table)
stage_table_name = self.get_table_qualified_name(Table())
Collaborator


@pankajkoti I think it would be a good idea to create this table in the temp schema. Right now we are creating this in the default database schema. WDYT?

@pankajkoti (Contributor, Author) replied Sep 1, 2022:

Redshift does not accept a schema option for temp tables, as per the docs here: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

TEMPORARY | TEMP
Keyword that creates a temporary table that is visible only within the current session. The table is automatically dropped at the end of the session in which it is created. The temporary table can have the same name as a permanent table. The temporary table is created in a separate, session-specific schema. (You can't specify a name for this schema.) This temporary schema becomes the first schema in the search path, so the temporary table will take precedence over the permanent table unless you qualify the table name with the schema name to access the permanent table. For more information about schemas and precedence, see search_path.
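
For example, a temporary staging table would be created without any schema qualifier; Redshift places it in a session-specific schema and drops it when the session ends. The names below are hypothetical, not the operator's generated SQL:

# Hypothetical illustration: the temp table name cannot be schema-qualified,
# even though the permanent table it copies its schema from may be.
create_stage_sql = "CREATE TEMP TABLE stage_table (LIKE some_schema.target_table);"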

@utkarsharma2 (Collaborator) left a comment:


LGTM! Nice work! :)

@pankajkoti merged commit 65c7a6a into main on Sep 1, 2022
@pankajkoti deleted the 612-merge-operator branch on September 1, 2022 13:26
Successfully merging this pull request may close: Implement merge_table operator for Redshift (#754)