[23.1] Remove rollback from `__check_jobs_at_startup` #17085

mvdbeek · 2023-11-26T15:30:19Z

I think the premise that we want to do a rollback on exceptions in this method is wrong (it may be correct apprach in other places in the codebase e.g. in
Tool.handle_single_execution()). Here it prevents us from comitting anything inside the with statement (as the job_wrapper.fail method does).
Here's the simplified issue:

❯ python -i scripts/db_shell.py -c config/galaxy.yml
>>> with sa_session() as session, session.begin():
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="error"))
...      sa_session.commit()
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="ok"))
...      sa_session.commit()
...
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350>
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<string>", line 2, in execute
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute
    conn = self._connection_for_bind(bind, close_with_result=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    TransactionalContext._trans_ctx_check(self)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.

It is probably still worthwhile to have the job recovery be minimal and do things such as calling the job wrapper fail method that does actual work to the job handler as in
#17083, but that's refactoring that can be done on the dev branch and it still seems risky in the sense that we then need to be very careful in ensuring we don't commit anywhere else inside the scope of the begin() statement.

Finally I don't think it makes sense that the startup check should ever cause the boot process to fail. This isn't a misconfiguration or even anything catastrophic for the remaining jobs and places unnecessary stress on admins and can basically break at any time and shouldn't cause a complete service failure.

Fixes #17079

How to test the changes?

(Select all options that apply)

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

I think the premise that we want to do a rollback on exceptions in this method is wrong (it **may** be correct apprach in other places in the codebase e.g. in `Tool.handle_single_execution()`). Here it prevents us from comitting anything inside the with statement (as the job_wrapper.fail method does). Here's the simplified issue: ```shell ❯ python -i scripts/db_shell.py -c config/galaxy.yml >>> with sa_session() as session, session.begin(): ... sa_session.execute(update(Job).where(Job.id == 1).values(state="error")) ... sa_session.commit() ... sa_session.execute(update(Job).where(Job.id == 1).values(state="ok")) ... sa_session.commit() ... <sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350> Traceback (most recent call last): File "<stdin>", line 4, in <module> File "<string>", line 2, in execute File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute conn = self._connection_for_bind(bind, close_with_result=True) ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind TransactionalContext._trans_ctx_check(self) File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check raise exc.InvalidRequestError( sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager. Please complete the context manager before emitting further commands. ``` It is probably still worthwhile to have the job recovery be minimal and do things such as calling the job wrapper fail method that does actual work to the job handler as in galaxyproject#17083, but that's refactoring that can be done on the dev branch and it still seems risky in the sense that we then need to be very careful in ensuring we don't commit anywhere else inside the scope of the begin() statement. Finally I don't think it makes sense that the startup check should ever cause the boot process to fail. This isn't a misconfiguration or even anything catastrophic for the remaining jobs and places unnecessary stress on admins and can basically break at any time and shouldn't cause a complete service failure. Fixes galaxyproject#17079

mvdbeek · 2023-11-27T10:29:58Z

Test errors are unrelated

davelopez

Thank you Marius!

jdavcs · 2023-11-27T16:54:01Z

@mvdbeek I don't think this is correct. The previous version rolled back one transaction with one job in case of an exception and did not proceed. The new version ignores the error and commits everything - including whatever was related to the exception. In other words, previous version:
job1 (ok) committed
job2 (exception) not committed
job3 (ok) not committed

This version:
job1 (ok) committed
job2 (exception) ALSO committed (this is bad, I think)
job3 (ok) committed

What we need, I suppose, is this:
job1 (ok) committed
job2 (exception) NOT committed
job3 (ok) committed

Here's a gist that demonstrates the problem and (I think) the solution:
https://gist.github.com/jdavcs/1402a63c8afc9150646e809f3d0ae2d9

mvdbeek · 2023-11-27T18:07:14Z

This is correct. Have you read the PR description and the example given ?

job2 (exception) ALSO committed (this is bad, I think)

Yes, this is correct (https://gist.github.com/jdavcs/1402a63c8afc9150646e809f3d0ae2d9?permalink_comment_id=4774237#gistcomment-4774237).

jdavcs · 2023-11-27T18:25:07Z

Yes, I read the PR description. This logic commits all the changes unconditionally, even if some may have caused an exception - which doesn't seem right? The PR description says that the rollback prevents us from committing anything inside the with statement - which is not entirely correct: there are two with blocks, and the rollback prevents the inner block from being committed, not the outer block.

mvdbeek · 2023-11-27T18:29:04Z

Well, I did say

the premise that we want to do a rollback on exceptions in this method is wrong

and

Yes, this is correct

which I don't see acknowledged. Why would we keep doing the rollback on every startup ?

jdavcs · 2023-11-27T18:42:43Z

Yes, you said it's wrong, but the justification didn't seem right to me - which is why I questioned it. I replied to your comment on the gist here.

We might do a rollback on every startup because if we make a change to the database within a transaction, and something causes an exception, I suppose we don't want that change to be written to the database? We either fail fast (which is what we don't want to do), or we rollback the changes that may have led to the exception.

Yes, this is correct

which I don't see acknowledged...

We are discussing it, right?

mvdbeek · 2023-11-27T19:04:50Z

I suppose we don't want that change to be written to the database?

I suppose we do, so that we don't keep on trying to commit and rollback the same exception on startup.

We either fail fast (which is what we don't want to do)

With fail fast you mean no try/except, or to keep running into the same exception on every startup ?

we rollback the changes that may have led to the exception.

It's not changes we've made in the recover method that are leading to an exception, it's the state of the job that leads to the exception. If there's an exception recovering the job -- that's it, we're done. We're not ending up with some weird database state that we have to prevent from reaching. The example in #17079 is a missing job directory, and I have a hard time coming up with a scenario where would not want to commit changes that were made ... and ideally we can just batch that in one commit.

jdavcs · 2023-11-27T19:16:11Z

OK, I think I understand now why we want to commit for each job. I guess I didn't think of the cause of the exception as something that has already happened to the job prior to the check on startup (e.g. like the missing job dir). Thank you for the detailed explanation - this was helpful!

mvdbeek added kind/bug area/jobs labels Nov 26, 2023

github-actions bot added this to the 23.2 milestone Nov 26, 2023

mvdbeek force-pushed the always_commit_no_rollback branch from 57e8fe6 to 62c153a Compare November 27, 2023 09:29

mvdbeek force-pushed the always_commit_no_rollback branch from 62c153a to da387d4 Compare November 27, 2023 09:37

kysrpex mentioned this pull request Nov 27, 2023

Galaxy job handlers may crash during startup #17079

Closed

davelopez approved these changes Nov 27, 2023

View reviewed changes

mvdbeek merged commit 5a1f3f9 into galaxyproject:release_23.1 Nov 27, 2023
35 of 37 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[23.1] Remove rollback from `__check_jobs_at_startup` #17085

[23.1] Remove rollback from `__check_jobs_at_startup` #17085

mvdbeek commented Nov 26, 2023 •

edited

mvdbeek commented Nov 27, 2023

davelopez left a comment

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023 •

edited

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023

jdavcs commented Nov 27, 2023

[23.1] Remove rollback from __check_jobs_at_startup #17085

[23.1] Remove rollback from __check_jobs_at_startup #17085

Conversation

mvdbeek commented Nov 26, 2023 • edited

How to test the changes?

License

mvdbeek commented Nov 27, 2023

davelopez left a comment

Choose a reason for hiding this comment

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023 • edited

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023

jdavcs commented Nov 27, 2023

mvdbeek commented Nov 27, 2023

jdavcs commented Nov 27, 2023

[23.1] Remove rollback from `__check_jobs_at_startup` #17085

[23.1] Remove rollback from `__check_jobs_at_startup` #17085

mvdbeek commented Nov 26, 2023 •

edited

mvdbeek commented Nov 27, 2023 •

edited