
Add redshift native path #700

Merged
merged 50 commits into from
Sep 2, 2022

Conversation

@feluelle (Member) commented Aug 19, 2022

Description

What is the current behavior?

We currently do not have a native path available for transferring data to Redshift.

closes: #613

What is the new behavior?

This PR adds a native path for transferring data to Redshift.

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary
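As a sketch of what the "native path" means here: instead of streaming rows through Python, Redshift is handed a COPY command that loads directly from S3. The helper below is a hypothetical illustration, not the PR's actual API; the function name, parameters, and role ARN are all assumptions.

```python
# Hypothetical sketch of a Redshift "native path": build a COPY statement
# that loads directly from S3, so no rows flow through Python.
# Names, parameters, and the ARN are illustrative, not the PR's real code.

def build_copy_statement(table: str, s3_uri: str, iam_role_arn: str,
                         format_clause: str = "CSV") -> str:
    """Return a Redshift COPY command authorized via an IAM role."""
    return (
        f"COPY {table} FROM '{s3_uri}' "
        f"IAM_ROLE '{iam_role_arn}' "
        f"{format_clause}"
    )

sql = build_copy_statement(
    "public.homes",
    "s3://my-bucket/homes.csv",
    "arn:aws:iam::123456789012:role/RedshiftCopyRole",
)
```

In the real implementation the statement would be executed through the Redshift connection's cursor; IAM-role authorization is what the rest of this thread converges on.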

pankajkoti and others added 29 commits August 10, 2022 15:57
- pre-commit now runs in pre-commit.ci

Co-authored-by: feluelle <felix.uellendall@astronomer.io>
Locally spin up an Airflow container to test the DAG

closes: #572
Adds a `get_file_list` task that can be used with Dynamic Task Mapping, which was introduced in Airflow 2.3.

closes: #507
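The `get_file_list` plus Dynamic Task Mapping idea, sketched without a running Airflow instance: in a real DAG this would be `load_file.expand(input_file=file_list)`; here plain functions stand in for the operators, and all names and paths are assumptions.

```python
# Conceptual sketch of Dynamic Task Mapping over a file listing.
# In Airflow 2.3+ this would be `load_file.expand(input_file=file_list)`;
# plain functions stand in for the real operators (all names assumed).

def get_file_list(prefix: str) -> list[str]:
    # Stand-in for a task that lists files under an S3/GCS prefix.
    return [f"{prefix}/part-{i}.csv" for i in range(3)]

def load_file(path: str) -> str:
    # Stand-in for the mapped load task; one instance runs per file.
    return f"loaded {path}"

files = get_file_list("s3://my-bucket/data")
results = [load_file(f) for f in files]  # what .expand() fans out at runtime
```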
Minor fix to Drop Table operator docs. Wanted to give it a clear name in the docs, "Drop operator" isn't clear.


Co-authored-by: Pankaj Koti <pankajkoti699@gmail.com>
Looks like we already had a custom exceptions file with 2 custom exceptions. So just making it consistent and moving `DatabaseCustomError` to this file
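The consolidation described above might look roughly like this; the module path and the class names other than `DatabaseCustomError` are placeholders, not the SDK's actual exceptions.

```python
# Hypothetical shape of the consolidated exceptions module
# (e.g. src/astro/exceptions.py); class names other than
# DatabaseCustomError are placeholders for the two existing exceptions.

class NonExistentTableException(Exception):
    """Placeholder: raised when an operation targets a missing table."""

class IllegalLoadException(Exception):
    """Placeholder: raised when a load targets an unsupported destination."""

class DatabaseCustomError(Exception):
    """Moved here so all custom exceptions live in one module."""
```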
# Description
## What is the current behavior?
An example:
tests/integration_test_dag.py::test_full_dag[bigquery]

https://github.com/astronomer/astro-sdk/runs/7386289598?check_suite_focus=true
```
[2022-07-18 09:20:54,978] ***debug_executor.py:87*** ERROR - Failed to execute task: Reason: 403 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas.
Traceback (most recent call last):
  File "/home/runner/work/astro-sdk/astro-sdk/.nox/test-3-8-airflow-2-2-5/lib/python3.8/site-packages/pandas_gbq/gbq.py", line 591, in load_data
    chunks = load.load_chunks(
  File "/home/runner/work/astro-sdk/astro-sdk/.nox/test-3-8-airflow-2-2-5/lib/python3.8/site-packages/pandas_gbq/load.py", line 238, in load_chunks
    load_parquet(
  File "/home/runner/work/astro-sdk/astro-sdk/.nox/test-3-8-airflow-2-2-5/lib/python3.8/site-packages/pandas_gbq/load.py", line 130, in load_parquet
    client.load_table_from_dataframe(
  File "/home/runner/work/astro-sdk/astro-sdk/.nox/test-3-8-airflow-2-2-5/lib/python3.8/site-packages/google/cloud/bigquery/job/base.py", line 728, in result
    return super(_AsyncJob, self).result(timeout=timeout, **kwargs)
  File "/home/runner/work/astro-sdk/astro-sdk/.nox/test-3-8-airflow-2-2-5/lib/python3.8/site-packages/google/api_core/future/polling.py", line 137, in result
    raise self._exception
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas
```
There are a couple of ways to improve this, including:

- reducing the number of requests we make to BQ during load (e.g. not iterating through all files within the Python Astro SDK, but using the wildcard support in the native implementations)
- increasing the quota
- running BQ tests sequentially (not sure whether this would work)
- adding a retry to BQ DAGs


closes: #553


## What is the new behavior?
Add a retry to BQ DAGs, but only when a specific error is raised, using Tenacity.

## Does this introduce a breaking change?
Nope

### Checklist
- [x] BigQuery DAGs should no longer fail with the earlier 429 and rate-limit errors.
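The retry-only-on-a-specific-error idea can be sketched with a plain stdlib loop; the PR itself uses Tenacity (`retry_if_exception_type`), and `RateLimitError` below is a stand-in for `google.api_core.exceptions.Forbidden`.

```python
# Minimal stdlib sketch of "retry only on a specific error".
# The PR uses Tenacity for this; RateLimitError stands in for the
# 403 "Exceeded rate limits" Forbidden error from BigQuery.
import time

class RateLimitError(Exception):
    pass

def retry_on_rate_limit(fn, attempts=3, delay=0.0):
    """Retry fn only when RateLimitError is raised; other errors propagate."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == attempts:
                raise
            time.sleep(delay)

calls = {"n": 0}

def flaky_load():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimitError("403 Exceeded rate limits")
    return "loaded"

result = retry_on_rate_limit(flaky_load)  # succeeds on the third attempt
```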
updates:
- [github.com/Lucas-C/pre-commit-hooks: v1.2.0 → v1.3.0](Lucas-C/pre-commit-hooks@v1.2.0...v1.3.0)
- [github.com/psf/black: 22.3.0 → 22.6.0](psf/black@22.3.0...22.6.0)
- [github.com/PyCQA/flake8: 4.0.1 → 5.0.4](PyCQA/flake8@4.0.1...5.0.4)
- https://github.com/timothycrosley/isort → https://github.com/PyCQA/isort
- [github.com/pre-commit/mirrors-mypy: v0.961 → v0.971](pre-commit/mirrors-mypy@v0.961...v0.971)
- [github.com/asottile/pyupgrade: v2.34.0 → v2.37.3](asottile/pyupgrade@v2.34.0...v2.37.3)

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
* Update query for bigquery schema from  to specific cols

* Update tests/databases/test_bigquery.py

Co-authored-by: Felix Uellendall <feluelle@users.noreply.github.com>
Description
What is the current behavior?

There is a failing test case in CI - test_bigquery_create_table_with_columns

Can be because of PR - #641 (comment)
Action run failure - https://github.com/astronomer/astro-sdk/runs/7830499547?check_suite_focus=true

Astro: 1.0.0b1

closes: #660
What is the new behavior?

BigQuery added an extra column to INFORMATION_SCHEMA.COLUMNS and the test broke in CI, since we were checking the entire set of columns. Now we only look for specific columns, which should make the test more robust.
Does this introduce a breaking change?

Nope
Checklist

    Created tests that fail without the change (if possible)
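The test-hardening idea above can be sketched as a query builder: instead of comparing against every column INFORMATION_SCHEMA returns, select only the columns the assertion needs. All identifiers here are illustrative, not the repo's actual test code.

```python
# Sketch of the fix: query only the columns the test asserts on, so the
# test survives BigQuery adding new columns to INFORMATION_SCHEMA.COLUMNS.
# Dataset, table, and column names are illustrative assumptions.

EXPECTED_COLS = ["column_name", "data_type", "is_nullable"]

def schema_query(dataset: str, table: str) -> str:
    cols = ", ".join(EXPECTED_COLS)
    return (
        f"SELECT {cols} "
        f"FROM {dataset}.INFORMATION_SCHEMA.COLUMNS "
        f"WHERE table_name = '{table}'"
    )
```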
Add example docs for run_raw_sql
* Fix development environments

- fix local.mk to respect the correct directory
- fix container.mk to use native docker compose from Docker Desktop
- improve development docs

* Change some rst to md

- fix python version prerequisites
@feluelle feluelle changed the title Feature/613 redshift native path Add redshift native path Aug 19, 2022
@pankajkoti (Contributor)

Guide for creating IAM role with needed permissions for COPY command authorization: https://www.dataliftoff.com/iam-roles-for-loading-data-from-s3-into-redshift/


codecov bot commented Aug 30, 2022

Codecov Report

Merging #700 (f4a3305) into main (4585139) will decrease coverage by 0.24%.
The diff coverage is 78.57%.

❗ Current head f4a3305 differs from pull request most recent head 0a65df5. Consider uploading reports for the commit 0a65df5 to get more accurate results

```
@@            Coverage Diff             @@
##             main     #700      +/-   ##
==========================================
- Coverage   93.52%   93.28%   -0.25%
==========================================
  Files          43       43
  Lines        1776     1801      +25
  Branches      216      219       +3
==========================================
+ Hits         1661     1680      +19
- Misses         93       97       +4
- Partials       22       24       +2
```

| Impacted Files | Coverage Δ |
|---|---|
| src/astro/databases/aws/redshift.py | 91.17% <78.57%> (-8.83%) ⬇️ |


@pankajkoti (Contributor)

Based on the offline discussion, we're removing the options to allow using creds. We're only supporting IAM_ROLE option now.

@kaxil (Collaborator) commented Sep 1, 2022

> Based on the offline discussion, we're removing the options to allow using creds. We're only supporting IAM_ROLE option now.

We should cover it in the docs on why we don't support creds though

@pankajkoti (Contributor)

> > Based on the offline discussion, we're removing the options to allow using creds. We're only supporting IAM_ROLE option now.
>
> We should cover it in the docs on why we don't support creds though

We have enhanced the docs now; it appears as below. Thanks @kaxil for directing me to use the Sphinx note directive.
[Screenshot: rendered docs note, 2022-09-01]

cc: @utkarsharma2

```python
# With this option, matching is case-sensitive. Column names in Amazon Redshift tables are always lowercase,
# so when you use the 'auto' option, matching JSON field names must also be lowercase.
# Refer: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json
FileType.JSON: "JSON 'auto'",
```
Collaborator:

@pankajkoti can you add a test case where we have mixed casing in column names for the load_file operator? Use tests/data/homes_upper.csv.

I just want to make sure this works with mixed column casing.

Contributor:

Added a test. Also using `json 'auto ignorecase'` instead of the `json 'auto'` option, to allow non-lowercase columns as mentioned here: https://docs.aws.amazon.com/redshift/latest/dg/copy-parameters-data-format.html#copy-json
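The change discussed here amounts to mapping each file type to a Redshift COPY format clause, with JSON using `'auto ignorecase'` so mixed-case field names still match Redshift's always-lowercase column names. The mapping below is an illustrative sketch, not the PR's actual code.

```python
# Illustrative mapping from file type to a Redshift COPY format clause.
# JSON uses 'auto ignorecase' so mixed-case JSON field names still match
# the always-lowercase Redshift column names (see the AWS COPY docs above).
FORMAT_CLAUSES = {
    "csv": "CSV",
    "parquet": "FORMAT AS PARQUET",
    "json": "JSON 'auto ignorecase'",
}

def format_clause(file_type: str) -> str:
    return FORMAT_CLAUSES[file_type.lower()]
```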

@utkarsharma2 (Collaborator) left a comment

LGTM 👍

@pankajkoti pankajkoti merged commit bd67729 into main Sep 2, 2022
@pankajkoti pankajkoti deleted the feature/613-redshift-native-path branch September 2, 2022 10:40
Development

Successfully merging this pull request may close these issues.

Add support for Redshift DB - Native Path
5 participants