Scheduler encounters database update error, then gets stuck in endless loop, yet still shows as healthy #27300
Comments
It looks like some weird race condition @ashb @ephraimbuddy @uranusjr - we might want to take a close look at the potential culprit (I saw similar issues recently). @ejstembler - can you please provide a more detailed stack trace from the issue you saw? Just one line is not nearly enough to diagnose it. |
Same error message was observed in #27259 |
Incidentally, two Astronomer engineers familiar with the issue: @alex-astronomer and @wolfier |
Running this query returns multiple rows for the same dag_id
|
@ejstembler @lihan - which database are you using? |
Hey @ephraimbuddy @ashb @uranusjr @jedcunningham @Taragolis @alex-astronomer and @wolfier (just raising awareness for those who might have some clues, could do some more thorough investigation, or were mentioned above as familiar with this issue). Maybe some of us already have some ideas. We might want to take a very close look at this one before 2.5.1 and try to investigate it more thoroughly rather than moving it to the next release (as has happened a few times before). It seems to keep happening, and other users report the same problem, for example in #28531. We already had very similar issues reported by other users:
However, #28531 is the same issue happening on a fully supported version of MySQL: I just want to make sure to mention that one, because it impacts the perception of the Airflow scheduler being "stable" and "solid", and I think this should be one of the most important properties of Airflow for us to focus on. |
The OP at least is on postgres - I can't say for certain what version, but the logs in the screenshot are from Astro and we have only ever supported postgres. |
I am also facing a similar issue in Airflow 2.5.0. It happens inconsistently with a DAG that uses dynamic task creation. I am using MySQL 5.7 as the backend. |
Can you please provide all the information, logs, and a description of your circumstances? Everything you can find that could help to diagnose it? I am afraid announcing "I have the same problem" without adding the details does not bring us any closer to diagnosing the problem. On the other hand, if you can spend a little time on providing some evidence, it might actually help those who might be able to solve your problem @radiant0619 |
Hi @potiuk @ashb @ephraimbuddy, I hope my findings may offer some clues about this
I turned on the general log to capture the SQL when the exception happens, and found that two sessions (537138 and 535083) execute the same statement
After digging into the code in
After changing |
Do you still experience the StaleDataError in 2.5.1? It should have been fixed by #28689 |
After upgrading to 2.5.1, I am still experiencing the StaleDataError. |
@tongtie Could you provide a bit more detail about your DB backend? |
My DB is MySQL 5.7. |
That would be enough. The #28689 fix only works for DB backends which support SELECT FOR UPDATE; unfortunately MySQL 5.7 does not support this. Potentially someone could find a solution for MySQL 5.7 before its EOL, but to avoid waiting days or months for that, I would recommend upgrading to MySQL 8.0 now. Or, if you can afford to lose all history and create everything from scratch, you might choose Postgres as the backend. And just in case, I would like to remind anyone who finds this issue that MariaDB is not a supported database backend for Airflow. |
I think I'm experiencing this same issue on Airflow 2.6.1 when using dynamic task mapping over task groups and using backfill to update a month of data. The issue I'm talking about is the stale data issue.
I'm using Airflow 2.6.1 (with astro), postgres: I have an operator that generates a list of dates (dicts), and passes those to the task group (insert_new_daily_data). I use: The definition of the task group:
And inside the task group I have 2 BigQueryJobOperators: |
I've re-run again, and it seems to be ok, but I'm getting this issue, although the task is marked as success:
|
Apache Airflow version
Other Airflow 2 version (please specify below)
What happened
Airflow version: v2.3.3+astro.2
We've encountered this issue twice this year. Something causes the Scheduler to get stuck in an endless loop, yet it shows as healthy even though nothing is being processed.
The last time we encountered this issue was this week. The Scheduler encountered a database update error:
As a result, the Scheduler logs show it's stuck in an endless loop: the same messages repeat over and over.
Because of this, nothing runs, and the entire Airflow instance is considered down.
In this particular case, the issue was resolved by manually deleting the duplicate row in the dag table.
When we encountered a similar case earlier in the year, the root cause was different and required a different solution (upsizing workers).
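For anyone trying to confirm the same symptom, here is a minimal sketch of how duplicate dag_id rows could be spotted. It runs against an in-memory SQLite table that only stands in for the real metadata database; the two-column schema, the sample DAG names, and the helper `find_duplicate_dag_ids` are assumptions for illustration, not Airflow's actual schema or API.

```python
import sqlite3


def find_duplicate_dag_ids(conn):
    """Return (dag_id, row_count) pairs for every dag_id stored more than once."""
    cur = conn.execute(
        "SELECT dag_id, COUNT(*) FROM dag "
        "GROUP BY dag_id HAVING COUNT(*) > 1 ORDER BY dag_id"
    )
    return cur.fetchall()


# Hypothetical stand-in table: no primary key, so duplicates can exist.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dag (dag_id TEXT, is_paused INTEGER)")
conn.executemany(
    "INSERT INTO dag VALUES (?, ?)",
    [("etl_daily", 0), ("etl_daily", 0), ("reports", 1)],
)

duplicates = find_duplicate_dag_ids(conn)
print(duplicates)  # [('etl_daily', 2)]
```

The same GROUP BY / HAVING pattern should carry over to Postgres or MySQL, but take a backup before deleting anything from a production metadata database.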
What you think should happen instead
The Scheduler should not crash or get stuck in an endless loop. It should handle exceptional cases gracefully. It should not be reported as healthy if it is crashing continuously or stuck in an endless loop.
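To illustrate the last point, here is a rough sketch of a health check that looks at progress as well as liveness. It is not how Airflow computes scheduler health; `SchedulerStatus`, `is_healthy`, and both timeout values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class SchedulerStatus:
    # Hypothetical fields, both expressed as seconds on the same clock.
    last_heartbeat: float  # last time the process signalled "I am alive"
    last_progress: float   # last time a scheduling loop finished without error


def is_healthy(status, now, heartbeat_timeout=30.0, progress_timeout=300.0):
    """Healthy only if the process is alive AND has recently made progress."""
    heartbeat_ok = now - status.last_heartbeat <= heartbeat_timeout
    progress_ok = now - status.last_progress <= progress_timeout
    return heartbeat_ok and progress_ok


# A scheduler that keeps heartbeating while every loop iteration fails.
stuck = SchedulerStatus(last_heartbeat=995.0, last_progress=100.0)
# A scheduler that heartbeats and completes loops normally.
working = SchedulerStatus(last_heartbeat=995.0, last_progress=990.0)

print(is_healthy(stuck, now=1000.0))
print(is_healthy(working, now=1000.0))
```

With a heartbeat-only check the stuck scheduler above would pass; requiring recent progress is what separates "alive" from "doing useful work".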
Some strategies for handling this, off the top of my head:
How to reproduce
Enter a duplicate row in the dag table. There are probably other ways. Earlier in the year we encountered this same issue when Workers were not properly upsized.
Operating System
Debian GNU/Linux 11 (bullseye)
Versions of Apache Airflow Providers
apache-airflow-providers-http==2.0.1
apache-airflow-providers-jdbc==2.0.1
simple-salesforce==1.1.0
csvvalidator==1.2
pandas==1.3.5
pre-commit
pylint==2.15
pytest==6.2.5
pyspark==3.3.0
apache-airflow-providers-google==6.4.0
Deployment
Astronomer
Deployment details
Astronomer
Anything else
No response
Are you willing to submit PR?
Code of Conduct