Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[23.1] Remove rollback from __check_jobs_at_startup #17085

Merged

Conversation

mvdbeek
Copy link
Member

@mvdbeek mvdbeek commented Nov 26, 2023

I think the premise that we want to do a rollback on exceptions in this method is wrong (it may be correct apprach in other places in the codebase e.g. in
Tool.handle_single_execution()). Here it prevents us from comitting anything inside the with statement (as the job_wrapper.fail method does).
Here's the simplified issue:

❯ python -i scripts/db_shell.py -c config/galaxy.yml
>>> with sa_session() as session, session.begin():
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="error"))
...      sa_session.commit()
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="ok"))
...      sa_session.commit()
...
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350>
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<string>", line 2, in execute
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute
    conn = self._connection_for_bind(bind, close_with_result=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    TransactionalContext._trans_ctx_check(self)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.

It is probably still worthwhile to have the job recovery be minimal and do things such as calling the job wrapper fail method that does actual work to the job handler as in
#17083, but that's refactoring that can be done on the dev branch and it still seems risky in the sense that we then need to be very careful in ensuring we don't commit anywhere else inside the scope of the begin() statement.

Finally I don't think it makes sense that the startup check should ever cause the boot process to fail. This isn't a misconfiguration or even anything catastrophic for the remaining jobs and places unnecessary stress on admins and can basically break at any time and shouldn't cause a complete service failure.

Fixes #17079

How to test the changes?

(Select all options that apply)

  • I've included appropriate automated tests.
  • This is a refactoring of components with existing test coverage.
  • Instructions for manual testing are as follows:
    1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

  • I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

I think the premise that we want to do a rollback on exceptions in this
method is wrong (it **may** be correct apprach in other places in the
codebase e.g. in
`Tool.handle_single_execution()`). Here it prevents us from comitting
anything inside the with statement (as the job_wrapper.fail method
does).
Here's the simplified issue:

```shell
❯ python -i scripts/db_shell.py -c config/galaxy.yml
>>> with sa_session() as session, session.begin():
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="error"))
...      sa_session.commit()
...      sa_session.execute(update(Job).where(Job.id == 1).values(state="ok"))
...      sa_session.commit()
...
<sqlalchemy.engine.cursor.LegacyCursorResult object at 0x11f1be350>
Traceback (most recent call last):
  File "<stdin>", line 4, in <module>
  File "<string>", line 2, in execute
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1711, in execute
    conn = self._connection_for_bind(bind, close_with_result=True)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/orm/session.py", line 1552, in _connection_for_bind
    TransactionalContext._trans_ctx_check(self)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.11/site-packages/sqlalchemy/engine/util.py", line 199, in _trans_ctx_check
    raise exc.InvalidRequestError(
sqlalchemy.exc.InvalidRequestError: Can't operate on closed transaction inside context manager.  Please complete the context manager before emitting further commands.
```

It is probably still worthwhile to have the job recovery be minimal and
do things such as calling the job wrapper fail method that does actual
work to the job handler as in
galaxyproject#17083, but that's
refactoring that can be done on the dev branch and it still seems risky
in the sense that we then need to be very careful in ensuring we don't
commit anywhere else inside the scope of the begin() statement.

Finally I don't think it makes sense that the startup check should
ever cause the boot process to fail. This isn't a misconfiguration
or even anything catastrophic for the remaining jobs and places
unnecessary stress on admins and can basically break at any time
and shouldn't cause a complete service failure.

Fixes galaxyproject#17079
@mvdbeek
Copy link
Member Author

mvdbeek commented Nov 27, 2023

Test errors are unrelated

Copy link
Contributor

@davelopez davelopez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you Marius!

@mvdbeek mvdbeek merged commit 5a1f3f9 into galaxyproject:release_23.1 Nov 27, 2023
35 of 37 checks passed
@jdavcs
Copy link
Member

jdavcs commented Nov 27, 2023

@mvdbeek I don't think this is correct. The previous version rolled back one transaction with one job in case of an exception and did not proceed. The new version ignores the error and commits everything - including whatever was related to the exception. In other words, previous version:
job1 (ok) committed
job2 (exception) not committed
job3 (ok) not committed

This version:
job1 (ok) committed
job2 (exception) ALSO committed (this is bad, I think)
job3 (ok) committed

What we need, I suppose, is this:
job1 (ok) committed
job2 (exception) NOT committed
job3 (ok) committed

Here's a gist that demonstrates the problem and (I think) the solution:
https://gist.github.com/jdavcs/1402a63c8afc9150646e809f3d0ae2d9

@mvdbeek
Copy link
Member Author

mvdbeek commented Nov 27, 2023

This is correct. Have you read the PR description and the example given ?

job2 (exception) ALSO committed (this is bad, I think)

Yes, this is correct (https://gist.github.com/jdavcs/1402a63c8afc9150646e809f3d0ae2d9?permalink_comment_id=4774237#gistcomment-4774237).

@jdavcs
Copy link
Member

jdavcs commented Nov 27, 2023

Yes, I read the PR description. This logic commits all the changes unconditionally, even if some may have caused an exception - which doesn't seem right? The PR description says that the rollback prevents us from committing anything inside the with statement - which is not entirely correct: there are two with blocks, and the rollback prevents the inner block from being committed, not the outer block.

@mvdbeek
Copy link
Member Author

mvdbeek commented Nov 27, 2023

Well, I did say

the premise that we want to do a rollback on exceptions in this method is wrong

and

Yes, this is correct

which I don't see acknowledged. Why would we keep doing the rollback on every startup ?

@jdavcs
Copy link
Member

jdavcs commented Nov 27, 2023

Yes, you said it's wrong, but the justification didn't seem right to me - which is why I questioned it. I replied to your comment on the gist here.

We might do a rollback on every startup because if we make a change to the database within a transaction, and something causes an exception, I suppose we don't want that change to be written to the database? We either fail fast (which is what we don't want to do), or we rollback the changes that may have led to the exception.

Yes, this is correct

which I don't see acknowledged...

We are discussing it, right?

@mvdbeek
Copy link
Member Author

mvdbeek commented Nov 27, 2023

I suppose we don't want that change to be written to the database?

I suppose we do, so that we don't keep on trying to commit and rollback the same exception on startup.

We either fail fast (which is what we don't want to do)

With fail fast you mean no try/except, or to keep running into the same exception on every startup ?

we rollback the changes that may have led to the exception.

It's not changes we've made in the recover method that are leading to an exception, it's the state of the job that leads to the exception. If there's an exception recovering the job -- that's it, we're done. We're not ending up with some weird database state that we have to prevent from reaching. The example in #17079 is a missing job directory, and I have a hard time coming up with a scenario where would not want to commit changes that were made ... and ideally we can just batch that in one commit.

@jdavcs
Copy link
Member

jdavcs commented Nov 27, 2023

OK, I think I understand now why we want to commit for each job. I guess I didn't think of the cause of the exception as something that has already happened to the job prior to the check on startup (e.g. like the missing job dir). Thank you for the detailed explanation - this was helpful!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants