Error in SQLA (DetachedInstanceError) (in AiiDA v1.5 and newly reported in AiiDA v2.0.1) #4596

Closed
giovannipizzi opened this issue Nov 28, 2020 · 11 comments · Fixed by #6208

@giovannipizzi
Member

Describe the bug

Roughly half of my calculations are failing on a SQLA backend:

aiida.orm.nodes.process.calculation.calcjob.CalcJobNode: [ERROR] iteration 4 of do_update excepted, retrying after 160 seconds
Traceback (most recent call last):
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/utils.py", line 172, in exponential_backoff_retry
    for iteration in range(max_attempts):
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/tornado/gen.py", line 307, in wrapper
    yielded = next(result)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/tasks.py", line 184, in do_update
    with job_manager.request_job_info_update(authinfo, job_id) as update_request:
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 279, in request_job_info_update
    with self.get_jobs_list(authinfo).request_job_info_update(job_id) as request:
  File "/usr/lib/python3.7/contextlib.py", line 112, in __enter__
    return next(self.gen)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 165, in request_job_info_update
    self._ensure_updating()
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 188, in _ensure_updating
    self._update_handle = self._loop.call_later(self._get_next_update_delay(), updating)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 221, in _get_next_update_delay
    minimum_interval = self.get_minimum_update_interval()
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/engine/processes/calcjobs/manager.py", line 75, in get_minimum_update_interval
    return self._authinfo.computer.get_minimum_job_poll_interval()
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/orm/authinfos.py", line 81, in computer
    return computers.Computer.from_backend_entity(self._backend_entity.computer)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/orm/implementation/sqlalchemy/authinfos.py", line 76, in computer
    return self.backend.computers.from_dbmodel(self._dbmodel.dbcomputer)
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/orm/implementation/sqlalchemy/utils.py", line 64, in __getattr__
    if self.is_saved() and self._is_mutable_model_field(item) and not self._in_transaction():
  File "/home/pizzi/.virtualenvs/aiida-prod/codes/aiida-core/aiida/orm/implementation/sqlalchemy/utils.py", line 87, in is_saved
    return self._model.id is not None
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 287, in __get__
    return self.impl.get(instance_state(instance), dict_)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/sqlalchemy/orm/attributes.py", line 718, in get
    value = state._load_expired(state, passive)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/sqlalchemy/orm/state.py", line 652, in _load_expired
    self.manager.deferred_scalar_loader(self, toload)
  File "/home/pizzi/.virtualenvs/aiida-prod/lib/python3.7/site-packages/sqlalchemy/orm/loading.py", line 944, in load_scalar_attributes
    "attribute refresh operation cannot proceed" % (state_str(state))
sqlalchemy.orm.exc.DetachedInstanceError: Instance <DbAuthInfo at 0x7f1a6a57da10> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: http://sqlalche.me/e/13/bhk3)

I'm quite sure this was a production AiiDA 1.4.2 environment that I hadn't touched, and things were working fine until a few weeks ago.
While running yesterday, many calculations were paused after 5 consecutive errors, with the error above.
I decided to stop the daemon, reinstall AiiDA 1.5.0 and replay them, but they fail again with the same error.

Any idea of what could be causing this?

@sphuber @CasperWA @chrisjsewell

@chrisjsewell
Member

This is caused by the default user variable set on the UserCollection: if it is not reset when the session/storage is closed, it will point to a detached SqlaUser model.

This is fixed by:

def close(self) -> None:
    if self._session_factory is None:
        return  # the instance is already closed, and so this is a no-op
    # reset the cached default user instance, since it will now have no associated session
    User.objects(self).reset()
    # close the connection
(Note, you should never directly close the SQLA session of a PsqlDosBackend instance, always go through PsqlDosBackend.close)

@giovannipizzi
Member Author

Thanks @chrisjsewell! Just to make sure I understood: is this already fixed in develop, or not yet? Also, thanks for the comment on not closing the session manually, but I don't think that I, as a user, was doing that; it was probably some part of AiiDA doing it?

@chrisjsewell
Member

this is already fixed now in develop

It should be, yes; nothing directly closes the SQLAlchemy session anymore, except for the actual PsqlDosBackend instance (when it is closed)
(well apart from in the REST API, but that's another matter)

Pre v2, it was interfacing with the session all over the place (see the diagrams in #5330), so I don't know exactly where it would have been expunged (which is what happens when it is closed)

@chrisjsewell
Member

But yeah, obviously we can re-test with AiiDA v2 to check for sure that this no longer occurs

@sphuber
Contributor

sphuber commented Apr 28, 2022

Closing this for now. Feel free to reopen if you encounter it with v2.0

@sphuber sphuber closed this as completed Apr 28, 2022
@unkcpz
Member

unkcpz commented May 11, 2022

Unfortunately, I encountered this again when launching > 400 quick calcjobs locally. Here is the traceback. Let me know how I can debug it further.

+-> ERROR at 2022-05-11 11:45:25.792422+02:00
 | Traceback (most recent call last):                                                                                                                                                                                                      
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/utils.py", line 187, in exponential_backoff_retry
 |     result = await coro()                                                                                                                                                                                                               
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/tasks.py", line 189, in do_update
 |     with job_manager.request_job_info_update(authinfo, job_id) as update_request:                                                                                                                                                       
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/contextlib.py", line 119, in __enter__
 |     return next(self.gen)                                                                                                                                                                                                               
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 286, in request_job_info_update
 |     with self.get_jobs_list(authinfo).request_job_info_update(job_id) as request:                                                                                                                                                       
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/contextlib.py", line 119, in __enter__
 |     return next(self.gen)                                                                                                                                                                                                               
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 167, in request_job_info_update
 |     self._ensure_updating()                      
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 195, in _ensure_updating
 |     self._get_next_update_delay(),                     
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 230, in _get_next_update_delay
 |     minimum_interval = self.get_minimum_update_interval()
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/engine/processes/calcjobs/manager.py", line 79, in get_minimum_update_interval
 |     return self._authinfo.computer.get_minimum_job_poll_interval()
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/orm/authinfos.py", line 87, in computer
 |     return computers.Computer.from_backend_entity(self._backend_entity.computer)             
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/authinfos.py", line 74, in computer
 |     return self.backend.computers.ENTITY_CLASS.from_dbmodel(self.model.dbcomputer, self.backend)
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 84, in __getattr__
 |     if self.is_saved() and self._is_mutable_model_field(item) and not self._in_transaction():
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/aiida/storage/psql_dos/orm/utils.py", line 110, in is_saved
 |     return self._model.id is not None
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 481, in __get__
 |     return self.impl.get(state, dict_)
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 941, in get
 |     value = self._fire_loader_callables(state, key, passive)
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/attributes.py", line 972, in _fire_loader_callables
 |     return state._load_expired(state, passive)
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/state.py", line 710, in _load_expired
 |     self.manager.expired_attribute_loader(self, toload, passive)
 |   File "/home/jyu/miniconda3/envs/opsp-ea/lib/python3.9/site-packages/sqlalchemy/orm/loading.py", line 1369, in load_scalar_attributes
 |     raise orm_exc.DetachedInstanceError(
 | sqlalchemy.orm.exc.DetachedInstanceError: Instance <DbAuthInfo at 0x7f75926d1a00> is not bound to a Session; attribute refresh operation cannot proceed (Background on this error at: https://sqlalche.me/e/14/bhk3)

@unkcpz unkcpz reopened this May 11, 2022
@unkcpz unkcpz changed the title Error in SQLA (DetachedInstanceError) (in AiiDA v1.5) Error in SQLA (DetachedInstanceError) (in AiiDA v1.5 and newly reported in AiiDA v2.0.1) May 11, 2022
@unkcpz
Member

unkcpz commented May 11, 2022

I restarted the daemon and the Postgres backend server, and the issue no longer seems to show up.

@sphuber
Contributor

sphuber commented May 17, 2023

I am not sure this is due to the User model being detached, because it is the DbAuthInfo that pops up in the error message. When this happens, whatever the reason, it is because the Sqlalchemy session is in an inconsistent state. The Python interpreter is still holding on to ORM instances that reference a database model that is no longer in the session.
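
For illustration, the failure mode described here can be reproduced outside AiiDA. The following is a minimal sketch, assuming SQLAlchemy 1.4 or later and an in-memory SQLite database; the `Item` model is hypothetical and merely stands in for `DbAuthInfo`.

```python
from sqlalchemy import Column, Integer, String, create_engine
from sqlalchemy.orm import Session, declarative_base
from sqlalchemy.orm.exc import DetachedInstanceError

Base = declarative_base()


class Item(Base):
    """Hypothetical model standing in for DbAuthInfo."""

    __tablename__ = "item"
    id = Column(Integer, primary_key=True)
    name = Column(String)


engine = create_engine("sqlite://")  # in-memory database
Base.metadata.create_all(engine)

session = Session(engine)
item = Item(name="example")
session.add(item)
session.commit()  # by default, commit expires all loaded attributes
session.close()   # closing expunges the instance: it is now detached

try:
    item.id  # an expired attribute needs a refresh, which needs a session
    raised = False
except DetachedInstanceError:
    raised = True

# Attaching the instance to a fresh session makes the refresh possible again.
with Session(engine) as new_session:
    new_session.add(item)
    reloaded_id = item.id
```

The Python object outlives the session that loaded it, and the exception only appears at the first attribute access that requires a database round trip. This matches the `is_saved` frame in the tracebacks, which reads `self._model.id`.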

The only thing that should remedy this is to restart the daemon, as restarting the daemon workers will recreate the session, which should then be in a consistent state again.

Now, as for why the session gets to this state, I am not sure. There used to be a similar bug related to the User model, as mentioned before in this thread. There, the User instance of the default user would be set on the Collection in memory. This wasn't cleared properly when the session got closed, and so when reopened, the same old instance would be used, but its database model was no longer attached to the new session. This problem has been solved, by explicitly unsetting this default user in memory when the storage was closed.
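
The earlier default-user bug can be sketched with plain Python objects. This is a toy model, not AiiDA code: `ToySession`, `ToyUser`, `ToyStorage`, `DetachedError` and the `reset_cache` flag are all hypothetical names introduced for illustration, and the flag exists only to contrast the behavior with and without the reset.

```python
class DetachedError(Exception):
    """Toy stand-in for sqlalchemy.orm.exc.DetachedInstanceError."""


class ToySession:
    def __init__(self):
        self.is_open = True

    def close(self):
        self.is_open = False


class ToyUser:
    """Toy entity whose attribute access requires an open session."""

    def __init__(self, session, email):
        self._session = session
        self._email = email

    @property
    def email(self):
        if not self._session.is_open:
            raise DetachedError("instance is not bound to an open session")
        return self._email


class ToyStorage:
    """Toy storage that caches the default user in memory."""

    def __init__(self):
        self._session = None
        self._default_user = None

    def get_session(self):
        if self._session is None:
            self._session = ToySession()
        return self._session

    def get_default_user(self):
        if self._default_user is None:
            self._default_user = ToyUser(self.get_session(), "user@example.com")
        return self._default_user

    def close(self, reset_cache=True):
        if self._session is None:
            return  # already closed: no-op
        if reset_cache:
            self._default_user = None  # drop the instance tied to the old session
        self._session.close()
        self._session = None


# Without the reset, the cached user survives the close and is now detached.
buggy = ToyStorage()
buggy.get_default_user()
buggy.close(reset_cache=False)
try:
    buggy.get_default_user().email
    buggy_failed = False
except DetachedError:
    buggy_failed = True

# With the reset, the next call builds a new user on a new session.
fixed = ToyStorage()
fixed.get_default_user()
fixed.close()
fixed_email = fixed.get_default_user().email
```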

Here it seems to be about the AuthInfo though. Both in this report and in the duplicate #6024, the exception comes from the JobManager.request_job_info_update method, which is called by the CalcJob in any of its tasks, for example task_update_job. This manager keeps a mapping of JobsList instances, one for each AuthInfo it manages. The error occurs when the JobsList calls get_minimum_update_interval, at which point it accesses the computer attribute of the AuthInfo; this raises the exception, since it needs to access the database at that point.

The behavior could be explained if the storage was closed and reopened during the life of a daemon worker: if the same Runner instance is kept, it still holds a reference to the JobManager, which still has the old _job_lists mapping, where each JobsList still holds the original AuthInfo instance. The only problem with this theory is that the daemon worker should never close the storage during its lifetime, so I don't see how this could happen.
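
The theory above can be sketched as a toy model of a long-lived manager holding on to a stale reference. All class names are hypothetical, the "generation" counter is just a device to mimic a storage that was closed and reopened, and keying the mapping by primary key is an assumption of this sketch rather than a statement about the real `JobManager`.

```python
class DetachedError(Exception):
    """Toy stand-in for sqlalchemy.orm.exc.DetachedInstanceError."""


class ToyStorage:
    """Toy storage; each reopen invalidates entities loaded before it."""

    def __init__(self):
        self.generation = 0

    def reopen(self):
        self.generation += 1

    def load_authinfo(self, pk):
        return ToyAuthInfo(self, pk)


class ToyAuthInfo:
    def __init__(self, storage, pk):
        self._storage = storage
        self._generation = storage.generation
        self.pk = pk

    @property
    def computer(self):
        # Mimics a lazy database access that needs a live session.
        if self._generation != self._storage.generation:
            raise DetachedError(f"AuthInfo<{self.pk}> is not bound to a session")
        return f"computer-of-authinfo-{self.pk}"


class ToyJobsList:
    def __init__(self, authinfo):
        self._authinfo = authinfo  # reference captured once, at construction

    def get_computer(self):
        return self._authinfo.computer


class ToyJobManager:
    def __init__(self):
        self._job_lists = {}

    def get_jobs_list(self, authinfo):
        # Assumption: one entry per AuthInfo primary key, created on first use.
        if authinfo.pk not in self._job_lists:
            self._job_lists[authinfo.pk] = ToyJobsList(authinfo)
        return self._job_lists[authinfo.pk]


storage = ToyStorage()
manager = ToyJobManager()  # lives as long as the (toy) runner

before = manager.get_jobs_list(storage.load_authinfo(1)).get_computer()

storage.reopen()  # the step that, in theory, should never happen in a worker
fresh = storage.load_authinfo(1)

try:
    # The manager returns the cached list, which still holds the old AuthInfo.
    manager.get_jobs_list(fresh).get_computer()
    stale_failed = False
except DetachedError:
    stale_failed = True
```

In this sketch the freshly loaded `fresh` instance works fine on its own; the failure comes purely from the cached list ignoring it in favor of the instance it was constructed with.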

Still, I think it probably has something to do with the JobManager keeping this mapping of JobsList that each holds on to an AuthInfo reference.

@sphuber
Contributor

sphuber commented May 17, 2023

@unkcpz could you please try this branch on your environment: https://github.com/sphuber/aiida-core/tree/fix/4596/db-authinfo-detached

It no longer relies on the AuthInfo used when the JobsList gets constructed at startup of the daemon worker, but rather uses the AuthInfo that is actually passed by the CalcJob when it calls request_job_info_update. Hopefully that instance is still attached to the session, so overwriting the old one should circumvent the problem.
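
The idea of the workaround can be sketched as follows. This is not the code on the branch: the classes, the `attached` flag and the method signature are hypothetical simplifications, shown only to illustrate overwriting the stored reference with the one supplied by the caller.

```python
class DetachedError(Exception):
    """Toy stand-in for sqlalchemy.orm.exc.DetachedInstanceError."""


class ToyAuthInfo:
    """Toy entity; `attached` mimics being bound to a live session."""

    def __init__(self, pk, attached=True):
        self.pk = pk
        self.attached = attached

    def get_poll_interval(self):
        if not self.attached:
            raise DetachedError("instance is not bound to a session")
        return 10.0


class ToyJobsList:
    def __init__(self, authinfo):
        self._authinfo = authinfo

    def request_job_info_update(self, authinfo, job_id):
        # Overwrite the reference captured at construction time with the one
        # supplied by the caller, which is assumed to still be attached.
        self._authinfo = authinfo
        return job_id, self._authinfo.get_poll_interval()


stale = ToyAuthInfo(pk=1, attached=False)  # what the list was built with
jobs_list = ToyJobsList(stale)

fresh = ToyAuthInfo(pk=1)  # what the calling job currently holds
result = jobs_list.request_job_info_update(fresh, job_id="1234")
```

The design choice is to trust the caller's instance over the cached one; it does not explain why the cached instance became detached, it only avoids touching it.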

@unkcpz
Member

unkcpz commented Jun 21, 2023

@sphuber Sorry for the late reply, I didn't notice your message. I cannot reproduce the issue at the moment, but once it appears again, I'll check out your branch and try it. I am using this plugin this week and next, so there is a chance I will encounter the issue again.

@sphuber
Contributor

sphuber commented Jun 22, 2023

Thanks, I will rebase the branch so it is up to date with main.
