Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Map File & Table classes to Dataset (AIP-48) #786

Merged
merged 2 commits into from
Sep 8, 2022
Merged

Map File & Table classes to Dataset (AIP-48) #786

merged 2 commits into from
Sep 8, 2022

Conversation

kaxil
Copy link
Collaborator

@kaxil kaxil commented Sep 6, 2022

Part of #611

This PR will map File and Table classes to inherit from Dataset object (if it is available - will be released in Airflow 2.4). This will allow users to schedule DAGs on a File and Table object when used with Airflow 2.4 and above.

As a follow-up to this PR, I will create a PR to automatically add "inlets" and "outlets" to all of the operators in SDK so that users can see the datasets that are produced or consumed by tasks.

Does this introduce a breaking change?

No

Checklist

  • Created tests which fail without the change (if possible)
  • Extended the README / documentation, if necessary -- will do that in a follow-up PR once the entire feature is added

@kaxil
Copy link
Collaborator Author

kaxil commented Sep 6, 2022

Current test failures are because we exceeded rate limit, will re-run in few hours:

pandas_gbq.exceptions.GenericGBQException: Reason: 403 Exceeded rate limits: too many table update operations for this table. For more information, see https://cloud.google.com/bigquery/docs/troubleshoot-quotas

@utkarsharma2
Copy link
Collaborator

@kaxil I'm seeing the below issue in the main branch, I think is related to attrs dep we added as part of #611 ticket

Screenshot 2022-09-07 at 3 24 54 AM
Screenshot 2022-09-07 at 3 24 18 AM

@utkarsharma2
Copy link
Collaborator

@kaxil I think we need to pin attrs to the correct version

Screenshot 2022-09-07 at 3 37 52 AM

@kaxil
Copy link
Collaborator Author

kaxil commented Sep 6, 2022

@kaxil I think we need to pin attrs to the correct version

Screenshot 2022-09-07 at 3 37 52 AM

aah yes, looks like Airflow has a limit set for 2.3.3 - https://github.com/apache/airflow/blob/2.3.3/setup.cfg#L88-L89 but not for 2.3.4.

If you upgrade to Airflow 2.3.4 it will work -- but yeah I will change it to using attr -- not a pblm

@kaxil
Copy link
Collaborator Author

kaxil commented Sep 6, 2022

@utkarsharma2 Updated in ee1389f

@utkarsharma2
Copy link
Collaborator

utkarsharma2 commented Sep 6, 2022

@kaxil Assuming this is the last PR for #611. Should we add a test having dags with inlets and outlets? which can run for airflow 2.4 >=

@kaxil
Copy link
Collaborator Author

kaxil commented Sep 6, 2022

@kaxil Assuming this is the last PR for #611. Should we add a test having dags with inlets and outlets? which can run for airflow 2.4 >=

Not yet, check the PR description - "As a follow-up to this PR, I will create a PR to automatically add "inlets" and "outlets" to all of the operators in SDK so that users can see the datasets that are produced or consumed by tasks."

@codecov
Copy link

codecov bot commented Sep 7, 2022

Codecov Report

Merging #786 (f8cb259) into main (45fd6e1) will decrease coverage by 0.00%.
The diff coverage is 95.74%.

❗ Current head f8cb259 differs from pull request most recent head af1fc05. Consider uploading reports for the commit af1fc05 to get more accurate results

@@            Coverage Diff             @@
##             main     #786      +/-   ##
==========================================
- Coverage   93.31%   93.30%   -0.01%     
==========================================
  Files          43       44       +1     
  Lines        1869     1898      +29     
  Branches      234      235       +1     
==========================================
+ Hits         1744     1771      +27     
- Misses         97       99       +2     
  Partials       28       28              
Impacted Files Coverage Δ
python-sdk/src/astro/files/locations/__init__.py 100.00% <ø> (ø)
python-sdk/src/astro/airflow/datasets.py 83.33% <83.33%> (ø)
python-sdk/src/astro/constants.py 92.50% <83.33%> (-1.62%) ⬇️
python-sdk/src/astro/files/base.py 94.52% <100.00%> (+1.41%) ⬆️
python-sdk/src/astro/sql/table.py 100.00% <100.00%> (ø)
python-sdk/src/astro/sql/operators/raw_sql.py 84.21% <0.00%> (-2.16%) ⬇️
...thon-sdk/src/astro/sql/operators/base_decorator.py 93.00% <0.00%> (-1.06%) ⬇️
python-sdk/src/astro/files/__init__.py 100.00% <0.00%> (ø)
python-sdk/src/astro/files/operators/files.py 100.00% <0.00%> (ø)
... and 1 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

@@ -51,6 +51,10 @@
"show-inheritance",
"show-module-summary",
]

suppress_warnings = [
"autoapi.python_import_resolution",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaxil why are we suppressing this warning?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I should have added a comment -- this is because airflow.dataset is still not available for Airflow -- so Sphinx fails to resolve it -- which is why I am suppressing that error.

python-sdk/src/astro/sql/table.py Show resolved Hide resolved
python-sdk/src/astro/files/base.py Outdated Show resolved Hide resolved
@@ -0,0 +1,9 @@
try:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we planning to introduce more things inside the package airflow inside Astro?

Since this is a configuration DATASET_SUPPORT), to some extent, have you considered adding this to settings.py or constants.py (AIRFLOW_DATASET_SUPPORT)?

Copy link
Collaborator Author

@kaxil kaxil Sep 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I am thinking of adding kwargs_with_datasets --> same as https://github.com/astronomer/astro-sdk/blob/aip-48-mapping/src/astro/utils/compat.py and setting.py for it feels wrong and I am trying to avoid utils.

I made that airflow module to put "custom" airflow related things which in future might include Custom XCom backend or Custom Dataset Event Manager - probably 🤷

# Airflow >= 2.4
from airflow.datasets import Dataset

DATASET_SUPPORT = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is probably worth to run a test in Nox, so we cover this branch

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What sort of test are you thinking about for it?

@kaxil
Copy link
Collaborator Author

kaxil commented Sep 7, 2022

PostgresSQL: astro+postgresql
SQLite: "astro+file"

@tatiana To identify the type of connection from conn_id, we need to make a DB call to Airflow metadata DB --- this is done to avoid it and so that users see the a similar thing to what they pass to Table or File object -- as below

@pytest.mark.parametrize(
"table,dataset_uri",
[
(Table(name="test_table"), "astro://@?table=test_table"),
(
Table(name="test_table", conn_id="test_conn"),
"astro://test_conn@?table=test_table",
),
(
Table(name="test_table", conn_id="test_conn", temp=True),
"astro://test_conn@?table=test_table",
),
(
Table(
name="test_table",
conn_id="test_conn",
metadata=Metadata(schema="schema", database="database"),
),
"astro://test_conn@?schema=schema&database=database&table=test_table",
),
(
Table(
name="test_table",
conn_id="test_conn",
metadata=Metadata(schema="schema"),
),
"astro://test_conn@?schema=schema&table=test_table",
),
],
)

@kaxil kaxil force-pushed the add-datasets branch 2 times, most recently from a783557 to 5fdd872 Compare September 8, 2022 12:35
python-sdk/src/astro/sql/table.py Outdated Show resolved Hide resolved
python-sdk/src/astro/files/base.py Outdated Show resolved Hide resolved
python-sdk/src/astro/sql/table.py Show resolved Hide resolved
Part of #611

This PR will map `File` and `Table` classes to inherit from `Dataset` object (if it is available - will be released in Airflow 2.4). This will allow users to schedule DAGs on a `File` and `Table` object when used with Airflow 2.4 and above.

As a follow-up to this PR, I will create a PR to automatically add "inlets" and "outlets" to all of the operators in SDK so that users can see the datasets that are produced or consumed by tasks.

Fix doc build error

Import from `attr` instead of `attrs`

Looks like Airflow has a limit set for 2.3.3 - https://github.com/apache/airflow/blob/2.3.3/setup.cfg#L88-L89 but not for 2.3.4.

This was fixed in Airflow 2.3.4 but I will change it to using attr.

`attrs` is a new namespace while `attr` is the old but working one
@kaxil kaxil merged commit cc5fbfe into main Sep 8, 2022
@kaxil kaxil deleted the add-datasets branch September 8, 2022 14:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants