
Add merge_operator for Redshift database #753

Merged: 11 commits into main from 612-merge-operator on Sep 1, 2022

Conversation

@pankajkoti (Contributor) commented Aug 26, 2022

Description

Redshift accepts constraints in the table DDL but does not enforce them.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
As a result, in case of conflicts, e.g. on uniqueness constraints, it
allows insertion of duplicate values and expects you to clean up the
duplicates yourself.

Additionally, for merges the Redshift documentation suggests making use
of a staging table.
Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html

The merge implementation involves the following steps (a sketch of the
corresponding SQL appears after the transaction note below):

  i. Create a staging table with a schema identical to the target table.
 ii. Load the data from the target table into the stage table.
     The idea is that we do all the work in the stage table and then finally
     replace the target table data with the stage table data.
iii. Handle conflicts. There are two strategies to handle conflicts,
     explained in detail below:
       a. ignore:
              To ignore rows in case of conflicts, we only fetch the rows
              from the source table which do not have matching rows in the
              stage table (based on the conflict columns) and insert those
              rows into the stage table.
       b. update:
              To update rows in case of conflicts, we first update the given
              target columns of the matching rows in the stage table that are
              present in the source table, using the given
              source_to_target_columns_map, and then additionally insert the
              non-matching rows from the source table into the stage table.
 iv. Truncate the target table.
  v. Load the stage table data into the target table.

The above merge steps happen within a transaction block, meaning that if one of
the statements fails, everything will be rolled back.
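
For illustration, below is a minimal sketch of the statement sequence described above, written as plain SQL strings in Python. All table and column names (target_tbl, source_tbl, stage_tbl, conflict column id, value column name) are hypothetical placeholders, not the SQL the operator actually generates:

# Hypothetical sketch of the merge steps above; the real operator builds these
# statements dynamically from the given tables, conflict columns and column map.
target_tbl, source_tbl, stage_tbl = "target_tbl", "source_tbl", "stage_tbl"

merge_statements = [
    # i. staging table with a schema identical to the target table
    f"CREATE TABLE {stage_tbl} (LIKE {target_tbl});",
    # ii. copy the current target rows into the stage table
    f"INSERT INTO {stage_tbl} SELECT * FROM {target_tbl};",
    # iii.a ignore: insert only the source rows whose conflict column ("id")
    # has no matching row in the stage table
    f"INSERT INTO {stage_tbl} (id, name) "
    f"SELECT s.id, s.name FROM {source_tbl} s "
    f"LEFT JOIN {stage_tbl} t ON s.id = t.id WHERE t.id IS NULL;",
    # iv. empty the target table
    f"TRUNCATE {target_tbl};",
    # v. move the merged rows back into the target table
    f"INSERT INTO {target_tbl} SELECT * FROM {stage_tbl};",
]

# iii.b update would instead first update the matching stage rows from the
# source using the column map and then insert the non-matching source rows:
#   UPDATE stage_tbl SET name = s.name FROM source_tbl s WHERE stage_tbl.id = s.id;
#   INSERT INTO stage_tbl (id, name)
#       SELECT s.id, s.name FROM source_tbl s
#       LEFT JOIN stage_tbl t ON s.id = t.id WHERE t.id IS NULL;

The operator runs such a sequence inside a single transaction block, so a failure in any statement rolls the whole merge back.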

What is the current behavior?

No support for merge_operator for the Redshift database.

What is the new behavior?

Implements support for merge_operator for the Redshift database.

Does this introduce a breaking change?

Checklist

  • Created tests which fail without the change (if possible)

closes: #754

src/astro/databases/aws/redshift.py: 5 review comments (outdated, resolved)
@codecov (bot) commented Aug 30, 2022

Codecov Report

Merging #753 (05b2e1c) into main (4585139) will increase coverage by 0.05%.
The diff coverage is 95.83%.

❗ Current head 05b2e1c differs from pull request most recent head c702980. Consider uploading reports for the commit c702980 to get more accurate results

@@            Coverage Diff             @@
##             main     #753      +/-   ##
==========================================
+ Coverage   93.52%   93.57%   +0.05%     
==========================================
  Files          43       43              
  Lines        1776     1822      +46     
  Branches      216      226      +10     
==========================================
+ Hits         1661     1705      +44     
  Misses         93       93              
- Partials       22       24       +2     
Impacted Files Coverage Δ
src/astro/databases/aws/redshift.py 97.75% <95.83%> (-2.25%) ⬇️


pankajkoti and others added 11 commits September 1, 2022 11:58
Redshift does not enforce constraints although it accepts them
in the table DDL. Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_Defining_constraints.html
As a result, in case of conflicts, e.g. on a uniqueness constraint, it
allows insertion of duplicate values and expects you to clean
up the duplicates.

Additionally, for merges the Redshift documentation suggests making use
of a staging table. Ref: https://docs.aws.amazon.com/redshift/latest/dg/t_updating-inserting-using-staging-tables-.html
The merge implementation involves the following steps:
i. Create a staging table with a schema identical to the target table.
ii. Load data from the source table into the stage table, mapping
    source table columns to stage table columns using the given
    source_to_target_columns_map.
iii. Handle conflicts. There are two strategies to handle conflicts,
     explained in detail below:
     a. ignore: To ignore rows in case of conflicts, we remove the
                matching rows from the stage table that are present in the
                target table.
     b. update: To update rows in case of conflicts, we remove the
                matching rows from the target table that are present
                in the stage table.
iv. Load the stage table data into the target table.
Co-authored-by: Utkarsh Sharma <utkarsharma2@gmail.com>
"""
source_table_name = self.get_table_qualified_name(source_table)
target_table_name = self.get_table_qualified_name(target_table)
stage_table_name = self.get_table_qualified_name(Table())
Collaborator


@pankajkoti I think it would be a good idea to create this table in the temp schema. Right now we are creating this in the default database schema. WDYT?

@pankajkoti (Contributor, Author) replied Sep 1, 2022:

Redshift does not accept a schema option for temp tables, as per the docs here: https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_TABLE_NEW.html

TEMPORARY | TEMP
Keyword that creates a temporary table that is visible only within the current session. The table is automatically dropped at the end of the session in which it is created. The temporary table can have the same name as a permanent table. The temporary table is created in a separate, session-specific schema. (You can't specify a name for this schema.) This temporary schema becomes the first schema in the search path, so the temporary table will take precedence over the permanent table unless you qualify the table name with the schema name to access the permanent table. For more information about schemas and precedence, see search_path.
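
For example, a temporary staging table would be created without any schema qualifier; Redshift places it in a session-specific schema and drops it when the session ends. The names below are hypothetical, not the operator's generated SQL:

# Hypothetical illustration: the temp table name cannot be schema-qualified,
# even though the permanent table it copies its schema from may be.
create_stage_sql = "CREATE TEMP TABLE stage_table (LIKE some_schema.target_table);"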

@utkarsharma2 (Collaborator) left a comment:


LGTM! Nice work! :)

@pankajkoti merged commit 65c7a6a into main on Sep 1, 2022
@pankajkoti deleted the 612-merge-operator branch on September 1, 2022 13:26
Successfully merging this pull request may close: Implement merge_table operator for Redshift (#754)