Wait for iNaturalist load completion before compiling statistics #4104
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Fixes
Fixes #4103 by @AetherUnbound
Description
This has to be the longest time I've spent on such a small change 馃槄
This PR modifies the iNaturalist flow control so that the step which consolidates the statistics from the load data mapped task waits to run until after all tasks have completed. Previous behavior led this task to be marked as
upstream_failed
as soon as any one of the loading tasks within the mapped task failed. See the issue for more information on that.The new approach 1) simplifies an XCom access (just for readability) and, more importantly, 2) ensures that the
consolidate_load_statistics
step waits until allload_transformed_data
tasks complete before moving on. I've left theNONE_SKIPPED
trigger rule on the post ingestion tasks because I think it is worth having this DAG clean up in case anything fails (rather than letting anupstream_failed
cascade to them too if something goes wrong with consolidation). However, we only want it cleaning up after it's actually done running!!The upshot of this is that the all tasks which depend on the task group will now read the task group as having passed, since the direct parent (
consolidate_load_statistics
) will likely always succeed. Again, I think this is okay given the post ingestion tasks are cleanup & reporting. But it's a useful point to convey to @WordPress/openverse-catalog that the last task in a task group will be what any task downstream of the group uses to determine its parent state. I played around with using anEmptyOperator
alongside the final task to try and determine the state, and that helped, so it's an option to look into if it's important we both wait and fail, for instance. But this feels like an Airflow "gotcha" that I wanted to make sure folks saw!Testing Instructions
The easiest way to play around with this actually is to add the following file to your DAGs folder and play around with different executions:
This is what I used to test different assumptions about the trigger rules, how they operated within the task group, etc.
You could also run the iNaturalist DAG locally and cause an error to occur in one of the load mapped tasks, to observe the change.
Checklist
Update index.md
).main
) or a parent feature branch.Developer Certificate of Origin
Developer Certificate of Origin