[WIP] Fix try_number calculation and add a new column poke_number #30669
hussein-awala wants to merge 3 commits into apache:main
ashb
left a comment
Eeek, let's think very very carefully before we repeat the same mistake we had with Try Number with a new column.
The fact that try number is mutated in place has been causing us problems for over 5 years. We really shouldn't ever mutate a row in place like this.
|
I agree, but I will keep working on the PR, at least to identify all the problems, and I'm open to all suggestions; that's why I opened this draft PR. The new column is added to solve the issue with the poke number, whose state is not stored between reschedules. As far as I know, there are two ways to save state between executions: adding a new column in the metadata DB, or storing it in an XCom (not a good solution in our case). |
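To make the state problem concrete, here is a minimal sketch in plain Python. It does not use Airflow at all: `FakeMetadataRow`, `InMemorySensor`, `PersistedSensor`, and `run_reschedules` are hypothetical names invented for illustration, and the "row" is just an object standing in for a persisted record. It only shows why a counter kept on the sensor object resets on every reschedule while a counter kept in external storage survives.

```python
from dataclasses import dataclass


@dataclass
class FakeMetadataRow:
    """Hypothetical stand-in for a persisted record; not Airflow's model."""

    poke_number: int = 0


class InMemorySensor:
    """Keeps the poke counter on the instance, so it is lost with the process."""

    def __init__(self) -> None:
        self.poke_number = 0

    def poke_once(self) -> int:
        self.poke_number += 1
        return self.poke_number


class PersistedSensor:
    """Keeps the poke counter in an external row that outlives the instance."""

    def __init__(self, row: FakeMetadataRow) -> None:
        self.row = row

    def poke_once(self) -> int:
        self.row.poke_number += 1
        return self.row.poke_number


def run_reschedules(make_sensor, n: int) -> list:
    """Build a fresh sensor for each run, mimicking a new process per reschedule."""
    return [make_sensor().poke_once() for _ in range(n)]


row = FakeMetadataRow()
in_memory = run_reschedules(InMemorySensor, 3)
persisted = run_reschedules(lambda: PersistedSensor(row), 3)

print(in_memory)  # counter restarts on every reschedule
print(persisted)  # counter keeps growing across reschedules
```

Note that the persisted variant still updates a record in place, which is exactly the pattern questioned earlier in this thread; the sketch only illustrates where the state lives, not how it should be stored.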
|
Yes. I agree with @ashb that the try_number mutation has been a bummer and has some non-obvious historical connotations, and it should be very, very carefully checked, especially all the more exotic scenarios: retries on failure, backfills, manual runs, etc. I think it might be worth looking at past PRs and issues where "try_num" has been mentioned to see all the times a fix has been attempted. It might be solved, sure, but it should be carefully tested - not only via unit tests but also, likely, by manually going through a set of test cases worked out from that historical context - and maybe even by working out some new test scenarios. For me this is the kind of issue that is close to one of the best comments described here:
Maybe not as difficult, but likely with a similar level of non-obviousness. |
|
Now that we have Another possibility (if doing |
I would be for doing that. This would solve some of the problems (for example retrieving logs from a different celery worker when it is run multiple times) |
|
Hi all, I'm the author of #30653 and just wanted to share results of monkey patching our large airflow instance containing thousands of DAGs with that PR: we've had no more issues and our Sentry alerts for this issue are silent. 🎉 Remember that scheduler_job_runner.py works around this issue. All I did was make backfill job runner behave the same, as I understood from looking at TaskInstanceKey and traversing years of history that a wider change would be difficult. Totally happy to see official airflow get some fix, even if it isn't mine, but thought I'd share results ^ since my PR got closed. Cheers! |
Ah, thanks for the context. I re-opened it then. Maybe it is indeed worth implementing (I will take a closer look), because regardless of the try_num calculation fix, which will only be possible to implement in 2.7, your quick fix to the backfill job might be applicable as a patch in 2.6.* |
|
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions. |
pankajkoti
left a comment
With the recent fixes around try_number from Daniel Standish, would we still need this?
|
cc @dstandish |
looks like we don't need the try number stuff; if we need the other stuff, probably best to make a new pr |
closes: #30552
closes: #30572
related: #26993
related: #18080 (it may close it too, I'll check before merging)
closes: #15645
This PR aims to address two interrelated issues:
try_number to compute the next_poke_interval, which has an impact on the new reschedule_date.
Although the two issues are closely linked, we may choose to divide them into separate pull requests based on the final implementation.
I will fix the unit tests, add new tests, and add the migration scripts for the new column(s) once everything works as expected in my test DAG.
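As a rough illustration of why the counter fed into the interval calculation matters, here is a simplified sketch. It is not Airflow's actual backoff formula: the functions `next_poke_interval` and `reschedule_date`, the plain doubling rule, and the one-hour cap are all assumptions made for this example. It compares a counter that never advances between reschedules with one that increments on every poke.

```python
from datetime import datetime, timedelta, timezone


def next_poke_interval(base_seconds: float, counter: int, max_seconds: float = 3600.0) -> float:
    """Simplified exponential backoff (assumed rule): double per count, capped."""
    if counter < 1:
        raise ValueError("counter must be >= 1")
    return min(base_seconds * 2 ** (counter - 1), max_seconds)


def reschedule_date(now: datetime, base_seconds: float, counter: int) -> datetime:
    """Hypothetical helper: the next run time is now plus the computed interval."""
    return now + timedelta(seconds=next_poke_interval(base_seconds, counter))


now = datetime(2023, 4, 16, 12, 0, tzinfo=timezone.utc)

# A counter that stays at 1 across reschedules: the interval never grows.
stuck = [next_poke_interval(60, 1) for _ in range(4)]

# A counter that tracks the number of pokes: the interval doubles each time.
growing = [next_poke_interval(60, n) for n in range(1, 5)]

print(stuck)
print(growing)
print(reschedule_date(now, 60, 4))
```

Under these assumptions, the choice of counter directly changes the computed reschedule date, which is why the two issues in the description are hard to separate.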