Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add validation mode support in COPY INTO in snowflake #1689

Merged
merged 10 commits into from
Feb 3, 2023

Conversation

sunank200
Copy link
Contributor

@sunank200 sunank200 commented Jan 31, 2023

Description

What is the current behavior?

Currently, the load_file operator can hide errors when doing native Snowflake transfers. We should raise an exception and display the errors to the users, not hide this problem.

As it stands, a user may be loading 10k rows into a table, end up with five rows, and not realise there is an issue.

closes: #581

What is the new behaviour?

  • Add validation mode as part of the COPY INTO command. Specify the supported validation mode; RETURN_n_ROWS or RETURN_ERRORS or RETURN_ALL_ERRORS. This instructs the COPY command to validate the data files instead of loading them into the specified table; i.e. the COPY command tests the files for errors but does not load them. Read more at:
    https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#optional-parameters
  • Remove the ON_ERROR=Continue hardcoded for CSV file types
  • Add an option for the user to pass ON_ERROR=Continue as part of copy_options

Example:

aql.load_file(
        input_file=File("s3://astro-sdk/python_sdk/example_dags/data/sample.csv", conn_id="aws_conn"),
        output_table=Table(
            conn_id=SNOWFLAKE_CONN_ID,
        ),
        load_options=[
            SnowflakeLoadOptions(
                file_options={"SKIP_HEADER": 1, "SKIP_BLANK_LINES": True},
                copy_options={"ON_ERROR": "CONTINUE"},
                validation_mode="RETURN_ALL_ERRORS",
            )
        ],
    )

Does this introduce a breaking change?

Yes

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary

@codecov
Copy link

codecov bot commented Jan 31, 2023

Codecov Report

Base: 97.72% // Head: 97.72% // No change to project coverage 👍

Coverage data is based on head (89a5427) compared to base (2c63104).
Patch has no changes to coverable lines.

❗ Current head 89a5427 differs from pull request most recent head dc5f4f2. Consider uploading reports for the commit dc5f4f2 to get more accurate results

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1689   +/-   ##
=======================================
  Coverage   97.72%   97.72%           
=======================================
  Files          21       21           
  Lines         835      835           
=======================================
  Hits          816      816           
  Misses         19       19           

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@utkarsharma2
Copy link
Collaborator

utkarsharma2 commented Feb 1, 2023

@sunank200 Currently, why we don't need ON_ERROR=CONTINUE for other filetypes - NDJSON or PARQUET??
Post this change this option will be applied to all the file types, right?

@sunank200
Copy link
Contributor Author

@sunank200 Currently, why we don't need ON_ERROR=CONTINUE for other filetypes - NDJSON or PARQUET?? Post this change this option will be applied to all the file types, right?

@utkarsharma2 This option was even available previously for all file types. The user could previously pass ON_ERROR=CONTINUE as copy_options

@sunank200
Copy link
Contributor Author

Screenshot 2023-02-02 at 9 09 18 PM

@tatiana added the documentation for `SnowflakeLoadOptions`.

Screenshot 2023-02-02 at 9 10 06 PM

Copy link
Collaborator

@tatiana tatiana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for addressing all the feedback, @sunank200!
It will be essential to add details about this change in the 1.5 changelogs.

@sunank200
Copy link
Contributor Author

Thanks for addressing all the feedback, @sunank200! It will be essential to add details about this change in the 1.5 changelogs.

Sure @tatiana

@sunank200 sunank200 merged commit 1aee3ce into main Feb 3, 2023
@sunank200 sunank200 deleted the snowflake-load-option branch February 3, 2023 03:58
utkarsharma2 pushed a commit that referenced this pull request Feb 8, 2023
# Description
## What is the current behavior?
<!-- Please describe the current behavior that you are modifying. -->
Currently, the load_file operator can hide errors when doing native
Snowflake transfers. We should raise an exception and display the errors
to the users, not hide this problem.

As it stands, a user may be loading 10k rows into a table, end up with
five rows, and not realise there is an issue.

<!--
Issues are required for both bug fixes and features.
Reference it using one of the following:

closes: #ISSUE
related: #ISSUE
-->
closes: #581


## What is the new behaviour?
<!-- Please describe the behaviour or changes that are being added by
this PR. -->
- Add validation mode as part of the `COPY INTO` command. Specify the
supported validation mode; `RETURN_n_ROWS` or `RETURN_ERRORS` or
`RETURN_ALL_ERRORS`. This instructs the COPY command to validate the
data files instead of loading them into the specified table; i.e. the
COPY command tests the files for errors but does not load them. Read
more at:

https://docs.snowflake.com/en/sql-reference/sql/copy-into-table.html#optional-parameters
- Remove the `ON_ERROR=Continue` hardcoded for CSV file types
- Add an option for the user to pass `ON_ERROR=Continue` as part of
`copy_options`

Example:
```
aql.load_file(
        input_file=File("s3://astro-sdk/python_sdk/example_dags/data/sample.csv", conn_id="aws_conn"),
        output_table=Table(
            conn_id=SNOWFLAKE_CONN_ID,
        ),
        load_options=[
            SnowflakeLoadOptions(
                file_options={"SKIP_HEADER": 1, "SKIP_BLANK_LINES": True},
                copy_options={"ON_ERROR": "CONTINUE"},
                validation_mode="RETURN_ALL_ERRORS",
            )
        ],
    )
```

## Does this introduce a breaking change?
Yes

### Checklist
- [x] Created tests which fail without the change (if possible)
- [x] Extended the README / documentation, if necessary
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve native Snowflake load_file to be verbose about ON_ERROR issues
5 participants