
Unify DAG creation/database cleaning fixtures for testing #3361

Merged
merged 7 commits into main from 2622-unify-dag-creation on Dec 14, 2023

Conversation

ngken0995
Collaborator

@ngken0995 ngken0995 commented Nov 15, 2023

Fixes

Fixes #2622 by @AetherUnbound

Description

There is a single clean_db() fixture in the conftest.py file so that tests from multiple test modules in the directory can access it. The Pool and TaskInstance models from airflow.models are wrapped in if conditions because not every test needs to filter and delete them. Fixtures can be overridden at the test module level (link). conftest.py provides default values for get_test_dag_id, get_test_pool, isPool, and isTaskInstance. Each test that relies on clean_db() should override the get_test_dag_id() fixture and can opt in to the Pool and TaskInstance cleanup by overriding the isPool and isTaskInstance fixtures to return True.
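
For illustration, a module-level override looks roughly like this (a sketch; the module path and return values here are hypothetical):

# catalog/tests/dags/common/test_example.py (hypothetical module)
import pytest


# Overrides the same-named default fixture from catalog/tests/conftest.py,
# so clean_db picks up this module's DAG ID instead of the empty default.
@pytest.fixture()
def get_test_dag_id():
    return "my_test_dag"


# Opt in to Pool cleanup for the tests in this module.
@pytest.fixture()
def isPool():
    return True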

Testing Instructions

  • run just down -v
  • run just catalog/up
  • run just catalog/init
  • run just catalog/test

Checklist

  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main) or a parent feature branch.
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated tests for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no visible errors.
  • I ran the DAG documentation generator (if applicable).

Developer Certificate of Origin

Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

@ngken0995 ngken0995 requested a review from a team as a code owner November 15, 2023 18:57
@github-actions github-actions bot added the 🧱 stack: catalog Related to the catalog and Airflow DAGs label Nov 15, 2023
@openverse-bot openverse-bot added 🟩 priority: low Low priority and doesn't need to be rushed ✨ goal: improvement Improvement to an existing user-facing feature 🤖 aspect: dx Concerns developers' experience with the codebase labels Nov 15, 2023
Contributor

@AetherUnbound AetherUnbound left a comment

Thanks for taking this on Kenneth! I think there are a few changes I'd like to see with this. A pytest fixture that lives in the highest-level conftest.py file and also has autouse=True means that the fixture will be used in every test that's run. Most of our tests don't even interact with the Airflow database, so this becomes quite excessive and unnecessary. Additionally, we don't need the extra fixtures for conditionally cleaning task instances and pools if we opt to 1) automatically define a task instance/pool name for each test and 2) always clean them up by selecting records that match those names (whether they exist or not doesn't matter).

I think one way to do this is to create some helper fixtures that generate a unique DAG ID and pool based on the test module. (I say module here instead of test name because we have the test multi-processing binned by module; see here and docs.) Here's an example of what that might look like:

# in conftest.py
import pytest

from airflow.models import DagRun, Pool, TaskInstance
from airflow.utils.session import create_session

def _normalize_test_module_name(request) -> str:
    # Extract the test name
    name = request.module.__name__
    # Replace periods with two underscores
    return name.replace(".", "__")


@pytest.fixture
def get_test_dag_id(request):
    return f"{_normalize_test_module_name(request)}_dag"


@pytest.fixture
def get_test_pool(request):
    return f"{_normalize_test_module_name(request)}_pool"


@pytest.fixture
def clean_db(get_test_dag_id, get_test_pool):
    with create_session() as session:
        # synchronize_session='fetch' required here to refresh models
        # https://stackoverflow.com/a/51222378 CC BY-SA 4.0
        session.query(DagRun).filter(DagRun.dag_id == get_test_dag_id).delete()
        session.query(TaskInstance).filter(
            TaskInstance.dag_id.startswith(get_test_dag_id)
        ).delete(synchronize_session="fetch")
        session.query(Pool).filter(Pool.pool == get_test_pool).delete()

Tests that require a clean database, test DAG, test pool, etc., could then request them as fixtures in their function signatures. For example, the test in test_ingestion_server.py could be changed to:

@pytest.fixture()
def index_readiness_dag(get_test_dag_id, clean_db):
    # Create a DAG that just has an index_readiness_check task
    with DAG(dag_id=get_test_dag_id, schedule=None, start_date=TEST_START_DATE) as dag:
        ingestion_server.index_readiness_check(
            media_type="image", index_suffix="my_test_suffix", timeout=timedelta(days=1)
        )

    return dag

(note the usage of get_test_dag_id and clean_db when creating the fixture)

This might require a few more changes to the affected tests, but I think it's much better than adding the clean_db step to every test!

Full diff I had locally for this test if you're interested
diff --git a/catalog/tests/conftest.py b/catalog/tests/conftest.py
index b5b0f0600..4cb33508b 100644
--- a/catalog/tests/conftest.py
+++ b/catalog/tests/conftest.py
@@ -28,35 +28,30 @@ def pytest_addoption(parser):
 mark_extended = pytest.mark.skipif("not config.getoption('extended')")
 
 
-@pytest.fixture()
-def get_test_dag_id():
-    return ""
+def _normalize_test_module_name(request) -> str:
+    # Extract the test name
+    name = request.module.__name__
+    # Replace periods with two underscores
+    return name.replace(".", "__")
 
 
-@pytest.fixture()
-def get_test_pool():
-    return ""
+@pytest.fixture
+def get_test_dag_id(request):
+    return f"{_normalize_test_module_name(request)}_dag"
 
 
-@pytest.fixture()
-def isTaskInstance():
-    return False
+@pytest.fixture
+def get_test_pool(request):
+    return f"{_normalize_test_module_name(request)}_pool"
 
 
-@pytest.fixture()
-def isPool():
-    return False
-
-
-@pytest.fixture(autouse=True)
-def clean_db(get_test_dag_id, get_test_pool, isTaskInstance, isPool):
+@pytest.fixture
+def clean_db(get_test_dag_id, get_test_pool):
     with create_session() as session:
         # synchronize_session='fetch' required here to refresh models
         # https://stackoverflow.com/a/51222378 CC BY-SA 4.0
         session.query(DagRun).filter(DagRun.dag_id == get_test_dag_id).delete()
-        if isTaskInstance:
-            session.query(TaskInstance).filter(
-                TaskInstance.dag_id == get_test_dag_id
-            ).delete(synchronize_session="fetch")
-        if isPool:
-            session.query(Pool).filter(id == get_test_pool).delete()
+        session.query(TaskInstance).filter(
+            TaskInstance.dag_id == get_test_dag_id
+        ).delete(synchronize_session="fetch")
+        session.query(Pool).filter(Pool.pool == get_test_pool).delete()
diff --git a/catalog/tests/dags/common/test_ingestion_server.py b/catalog/tests/dags/common/test_ingestion_server.py
index 88c2e3b0d..f2ee0a0f0 100644
--- a/catalog/tests/dags/common/test_ingestion_server.py
+++ b/catalog/tests/dags/common/test_ingestion_server.py
@@ -13,23 +13,12 @@ from common import ingestion_server
 
 
 TEST_START_DATE = datetime(2022, 2, 1, 0, 0, 0)
-TEST_DAG_ID = "api_healthcheck_test_dag"
 
 
 @pytest.fixture()
-def get_test_dag_id():
-    return TEST_DAG_ID
-
-
-@pytest.fixture()
-def isTaskInstance():
-    return True
-
-
-@pytest.fixture()
-def index_readiness_dag():
+def index_readiness_dag(get_test_dag_id, clean_db):
     # Create a DAG that just has an index_readiness_check task
-    with DAG(dag_id=TEST_DAG_ID, schedule=None, start_date=TEST_START_DATE) as dag:
+    with DAG(dag_id=get_test_dag_id, schedule=None, start_date=TEST_START_DATE) as dag:
         ingestion_server.index_readiness_check(
             media_type="image", index_suffix="my_test_suffix", timeout=timedelta(days=1)
         )

@ngken0995
Collaborator Author

@AetherUnbound unittest.TestCase tests can only use autouse fixtures (link). In test_single_run_external_dags_sensor.py, TEST_POOL and DAG_PREFIX are hardcoded to the same values that the get_test_dag_id and get_test_pool fixtures would produce. In clean_db, the DagRun, TaskInstance, and Pool queries now filter with startswith to match those name prefixes and delete the records.
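
Roughly, the reworked clean_db looks something like this (a sketch based on the description above, not the exact diff):

# catalog/tests/conftest.py (sketch)
import pytest

from airflow.models import DagRun, Pool, TaskInstance
from airflow.utils.session import create_session


@pytest.fixture
def clean_db(get_test_dag_id, get_test_pool):
    with create_session() as session:
        # startswith() catches any DAG/pool sharing the test module's prefix,
        # including the hardcoded names in test_single_run_external_dags_sensor.py.
        session.query(DagRun).filter(
            DagRun.dag_id.startswith(get_test_dag_id)
        ).delete(synchronize_session="fetch")
        session.query(TaskInstance).filter(
            TaskInstance.dag_id.startswith(get_test_dag_id)
        ).delete(synchronize_session="fetch")
        session.query(Pool).filter(
            Pool.pool.startswith(get_test_pool)
        ).delete(synchronize_session="fetch")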

Contributor

@AetherUnbound AetherUnbound left a comment

Thanks so much for reworking this, and sorry for the delay on review!

That's frustrating about the unittest.TestCase scenario 😭 I'll make an issue for converting all our catalog tests from unittest; I don't think there's a specific reason they need to be in that format (but that's not the purview of this particular PR!). Do you mind adding a "TODO" comment right above the test pool/test prefix constants in test_single_run_external_dags_sensor.py noting that they can be removed and the fixtures can be used instead once this test is converted to pytest?

Also, I apologize for suggesting one thing and then requesting another, but I'm realizing the get_* prefix doesn't really fit our standard for pytest fixtures. Would you mind renaming get_test_dag_id to sample_dag_id_fixture and get_test_pool to sample_pool_fixture?
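
In other words, something like this in conftest.py (a sketch; only the fixture names change from the earlier example):

# catalog/tests/conftest.py -- sketch of the rename
import pytest


def _normalize_test_module_name(request) -> str:
    # Replace periods with two underscores, e.g. "dags.common.test_foo" -> "dags__common__test_foo"
    return request.module.__name__.replace(".", "__")


@pytest.fixture
def sample_dag_id_fixture(request):
    return f"{_normalize_test_module_name(request)}_dag"


@pytest.fixture
def sample_pool_fixture(request):
    return f"{_normalize_test_module_name(request)}_pool"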

@openverse-bot
Collaborator

Based on the low urgency of this PR, the following reviewers are being gently reminded to review this PR:

@obulat
@stacimc
This reminder is being automatically generated due to the urgency configuration.

Excluding weekend¹ days, this PR was ready for review 11 day(s) ago. PRs labelled with low urgency are expected to be reviewed within 5 weekday(s)².

@ngken0995, if this PR is not ready for a review, please draft it to prevent reviewers from getting further unnecessary pings.

Footnotes

  1. Specifically, Saturday and Sunday.

  2. For the purpose of these reminders we treat Monday - Friday as weekdays. Please note that the operation that generates these reminders runs at midnight UTC on Monday - Friday. This means that depending on your timezone, you may be pinged outside of the expected range.

@AetherUnbound AetherUnbound self-requested a review December 1, 2023 01:10
@AetherUnbound AetherUnbound marked this pull request as draft December 1, 2023 01:11
@AetherUnbound
Contributor

(Drafting so our auto-ping ignores this for now, please feel free to undraft again once changes are made Kenneth!)

Contributor

@AetherUnbound AetherUnbound left a comment

Fantastic, everything looks great! Thanks Kenneth!

Contributor

@stacimc stacimc left a comment

This looks great, thanks @ngken0995!

@AetherUnbound AetherUnbound merged commit 80a75e6 into main Dec 14, 2023
42 checks passed
@AetherUnbound AetherUnbound deleted the 2622-unify-dag-creation branch December 14, 2023 23:27