Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reingestion workflows report misleading date range to Slack #3964

Closed
stacimc opened this issue Mar 25, 2024 · 5 comments · Fixed by #4093
Closed

Reingestion workflows report misleading date range to Slack #3964

stacimc opened this issue Mar 25, 2024 · 5 comments · Fixed by #4093
Assignees
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs

Comments

@stacimc
Copy link
Contributor

stacimc commented Mar 25, 2024

Description

At the end of all provider workflows we report the number of records ingested to Slack. This slack message also includes information about the date range covered by the run: https://github.com/WordPress/openverse/blob/main/catalog/dags/providers/provider_dag_factory.py#L337

For reingestion workflows this range does not make sense. Most are run on a weekly basis, so the Slack message reports the week as the data interval; in actuality the DAG is performing ingestion for a series of many disparate reingestion days.

The most sensible thing to do is probably either:

  • change the text of "date range" to something less suggestive. "data interval" is misleading in the same way, unfortunately.
  • remove this part of the message entirely for reingestion workflows

Screenshots

Screenshot 2024-03-25 at 1 14 50 PM
@stacimc stacimc added 🟩 priority: low Low priority and doesn't need to be rushed 🛠 goal: fix Bug fix 💻 aspect: code Concerns the software code in the repository 🧱 stack: catalog Related to the catalog and Airflow DAGs labels Mar 25, 2024
@AetherUnbound AetherUnbound added good first issue New-contributor friendly help wanted Open to participation from the community labels Apr 8, 2024
@BaileyMcKelway
Copy link
Contributor

Hi all, I was hoping to start contributing and I am looking for a first issue. Would updating the create_report_load_completion function to something like this be a proper solution?

It checks for if the DAG has "reingestion" in the id and then conditional adds the date_range_start and date_range_end.

@stacimc @AetherUnbound

def create_report_load_completion(
    dag_id,
    media_types,
    ingestion_metrics,
    dated,
):
    is_reingestion_workflow = "reingestion" in dag_id

    op_kwargs = {
        "dag_id": dag_id,
        "media_types": media_types,
        "duration": ingestion_metrics["duration"],
        "record_counts_by_media_type": ingestion_metrics["record_counts_by_media_type"],
        "dated": dated,
        "is_reingestion_workflow": is_reingestion_workflow,
    }

    if not is_reingestion_workflow:
        op_kwargs["date_range_start"] = "{{ data_interval_start | ds }}"
        op_kwargs["date_range_end"] = "{{ data_interval_end | ds }}"

    return PythonOperator(
        task_id="report_load_completion",
        python_callable=reporting.report_completion,
        op_kwargs=op_kwargs,
        trigger_rule=TriggerRule.ALL_DONE,
    )

@AetherUnbound
Copy link
Contributor

@BaileyMcKelway welcome! That definitely looks like an appropriate start! I have a few suggestions around the structure of the Python itself, but those can be made directly on the PR. Do you mind opening a pull request for this particular change?

@BaileyMcKelway
Copy link
Contributor

@AetherUnbound Thanks!

I have a PR up here #4093.

It is currently in draft because I have a few questions on how to properly test my change. Currently all tests are passing for the DAGs.

I guess it would be better to ask these questions on the Slack? I tried to register an account last week ago and I am still waiting approval.

Screenshot 2024-04-11 at 8 23 09 PM

@AetherUnbound
Copy link
Contributor

That's...really odd 😮 Slack would indeed be the best way, but it's unfortunate you can't get connected 😕 Do feel free to ask your questions here, or if you'd prefer to have a 1:1 conversation about it you can message me on LinkedIn if that's easier!

@BaileyMcKelway
Copy link
Contributor

@AetherUnbound Everything is now sorted! I signed up with a different email and was approved instantly. I now have access to Slack. Not sure why my main gmail didn't work.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
💻 aspect: code Concerns the software code in the repository 🛠 goal: fix Bug fix good first issue New-contributor friendly help wanted Open to participation from the community 🟩 priority: low Low priority and doesn't need to be rushed 🧱 stack: catalog Related to the catalog and Airflow DAGs
Projects
Archived in project
Development

Successfully merging a pull request may close this issue.

3 participants