Bug with aim.Run when using hash to locate a run #2999

Open
HoBeedzc opened this issue Sep 22, 2023 · 8 comments
Labels
help wanted Extra attention is needed type / bug Issue type: something isn't working

Comments

@HoBeedzc

🐛 Bug

I ran into a bug while using Aim. Calling aim.Run with run_hash to locate an existing run raises the following error:

>>> aim.Run(run_hash='51031438759943878c6f9808')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 70, in wrapper
    _SafeModeConfig.exception_callback(e, func)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 47, in reraise_exception
    raise e
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/exception_resistant.py", line 68, in wrapper
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 828, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, experiment=experiment, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/run.py", line 276, in __init__
    super().__init__(run_hash, repo=repo, read_only=read_only, force_resume=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/sdk/base_run.py", line 50, in __init__
    self._lock.lock(force=force_resume)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/storage/lock_proxy.py", line 38, in lock
    return self._rpc_client.run_instruction(self._hash, self._handler, 'lock', (force,))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 260, in run_instruction
    return self._run_read_instructions(queue_id, resource, method, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/client.py", line 285, in _run_read_instructions
    raise_exception(status_msg.header.exception)
  File "/home/me/miniconda3/lib/python3.11/site-packages/aim/ext/transport/message_utils.py", line 76, in raise_exception
    raise exception(*args) if args else exception()
                                        ^^^^^^^^^^^
TypeError: Timeout.__init__() missing 1 required positional argument: 'lock_file'

To reproduce

It can be reproduced by running the following code (I made sure the code is run from inside the target repository).

import aim
aim.Run(run_hash='51031438759943878c6f9808')

Expected behavior

The run with the target hash is located, as described in the documentation.

Environment

  • Aim Version 3.17.5
  • Python version 3.11.4 (conda)
  • pip version 23.1.2
  • OS (problems with both Linux and Mac)

@HoBeedzc HoBeedzc added help wanted Extra attention is needed type / bug Issue type: something isn't working labels Sep 22, 2023
@alberttorosyan
Member

@HoBeedzc thanks for reporting this issue. As I understand, this happens only for the remote tracking server?

@HoBeedzc
Author

Yes, I am using the remote tracking server. For privacy considerations, I've obscured the IP address of the aim remote repository. The actual code is as follows:

import aim
aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")

I've experimented with various methods, involving both local and remote repositories, and I've arrived at the following findings:

  • With a local repository, the hash can be used to uniquely identify a run. However, if there's already an open run, it must be closed (with the aim.Run.close() method) before the hash can be used to fetch the run again.
  • With a remote repository, the hash cannot be used to identify the run at all, regardless of whether there are any open runs.
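
The local-repository finding above can be sketched as follows. This is only a sketch, not a documented Aim recipe: `reopen_by_hash` is a hypothetical helper, and the run class is passed in as a parameter so the pattern can be tried without a real repository. It assumes the class accepts `run_hash` and `repo` keyword arguments and exposes `hash` and `close()`, as `aim.Run` does.

```python
def reopen_by_hash(run_cls, repo=None):
    """Create a run, close it, then reopen the same run by its hash.

    `run_cls` is expected to behave like aim.Run (hypothetical helper).
    """
    first = run_cls(repo=repo)
    run_hash = first.hash
    # Close the first handle before reopening: per the finding above,
    # fetching by hash fails while a run is still open.
    first.close()
    return run_cls(run_hash=run_hash, repo=repo)
```

With Aim installed, `reopen_by_hash(aim.Run, repo=".")` would exercise the local case. Per the second finding, the same pattern against a remote repository is expected to fail with the error from the original report.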

@sandeep-biddala

Is there any solution for this? I have also been getting this error when I try to identify a run in a remote repository. I have tried closing and releasing the locks, but nothing seems to help.

Has this been fixed in later versions? I am using 3.17.5

@inc0
Contributor

inc0 commented Jan 10, 2024

It looks to me like this function is timing out and the error is then handled incorrectly. Having a short timeout on a lock function is definitely a bug, since the whole idea is to wait until it's safe to acquire the lock.
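
A minimal sketch of how that failure mode could produce the `TypeError` in the traceback. The body of `raise_exception` copies the one line visible in the traceback (`raise exception(*args) if args else exception()`). The `Timeout` class is a stand-in with a required `lock_file` argument, not the real lock library's class, and `describe_failure` is a hypothetical helper for illustration. That the server forwards the exception type without its arguments is an assumption.

```python
class Timeout(TimeoutError):
    """Stand-in for a lock timeout error that requires `lock_file`."""

    def __init__(self, lock_file):
        super().__init__(lock_file)
        self.lock_file = lock_file


def raise_exception(exception, args=None):
    # Same expression as the line shown in the traceback: with no args,
    # the exception class is instantiated with no arguments at all.
    raise exception(*args) if args else exception()


def describe_failure(exception, args=None):
    """Report which error the caller actually sees (hypothetical helper)."""
    try:
        raise_exception(exception, args)
    except TypeError as err:
        # The constructor failed, so the real timeout is masked.
        return f"TypeError: {err}"
    except Timeout as err:
        return f"Timeout on {err.lock_file}"
```

Calling `describe_failure(Timeout)` reports a `TypeError` about the missing `lock_file` argument, while `describe_failure(Timeout, ("some.softlock",))` reports the timeout itself, which suggests the underlying problem is a lock timeout being masked on re-raise.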

@inc0
Contributor

inc0 commented Jan 10, 2024

This also happens for us when running torch-lightning integration and parallel training jobs.

@inc0
Contributor

inc0 commented Jan 11, 2024

Digging deeper into it: I no longer think it's an issue with gRPC, but with the softlock instead. For some reason it uses a softlock, even though this function returns False when called on the repo location:

In [3]: FileSystemInspector.needs_soft_lock("/aim/repo/locks")
Out[3]: False

In [4]: FileSystemInspector.needs_soft_lock("/aim/")
Out[4]: False

In [5]: FileSystemInspector.needs_soft_lock("/aim")
The lock file /aim is on a filesystem of type `overlay` (device id: 207). Using soft file locks to avoid potential data corruption.
Out[5]: True

In [6]: FileSystemInspector.needs_soft_lock("/aim/repo")
Out[6]: False

In [7]:
Do you really want to exit ([y]/n)? y
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks# ls
0b988510edc143148283748b.softlock  6390468e64c24d68b30e9198.softlock  73f944123d904a048019f255.softlock  index
root@bishop-aim-7dd75f756f-vtqnp:/aim/repo/.aim/locks#

There might be a bug either in the softlock mechanism itself or in detecting the correct lock type to use. Notice that /aim is overlay: this server runs on top of Kubernetes, so it uses the container filesystem on root. There is a volume mounted at /aim that's ext4, so directories under /aim should be able to use locks.
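
One way to cross-check which filesystem a lock directory really lives on is to resolve it against the mount table. The sketch below is not how Aim's `FileSystemInspector` works. `parse_mounts` and `fs_type_for` are hypothetical helpers, and `SAMPLE_MOUNTS` is an illustrative table (the device name is made up) that matches the setup described above: overlay on `/`, an ext4 volume at `/aim`.

```python
import posixpath

# Illustrative /proc/mounts-style content; the device name is an assumption.
SAMPLE_MOUNTS = (
    "overlay / overlay rw 0 0\n"
    "/dev/sdb /aim ext4 rw 0 0\n"
)


def parse_mounts(text):
    """Parse /proc/mounts-style text into (mount_point, fs_type) pairs."""
    mounts = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 3:
            mounts.append((fields[1], fields[2]))
    return mounts


def fs_type_for(path, mounts):
    """Return the fs type of the longest mount point containing `path`."""
    path = posixpath.normpath(path)
    best = None
    for mount_point, fs_type in mounts:
        mount_point = posixpath.normpath(mount_point)
        prefix = mount_point if mount_point.endswith("/") else mount_point + "/"
        if path == mount_point or path.startswith(prefix):
            # ">=" lets a later entry for the same mount point win.
            if best is None or len(mount_point) >= len(best[0]):
                best = (mount_point, fs_type)
    return best[1] if best else None
```

On the server itself, `fs_type_for("/aim/repo/.aim/locks", parse_mounts(open("/proc/mounts").read()))` would show what the kernel reports for the lock directory, which can then be compared with the `needs_soft_lock` output above.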

@sandeep-biddala

sandeep-biddala commented Jan 15, 2024

@inc0 Does the fix you linked work in all cases, or only when PyTorch Lightning is used?

@inc0
Contributor

inc0 commented Jan 15, 2024

It'll only fix Lightning, but you can add a similar parameter to your function call, and it should fix your case.
