Bug with aim.Run when using hash to locate a run #2999
Comments
@HoBeedzc thanks for reporting this issue. As I understand, this happens only for the remote tracking server? |
Yes, I am using the remote tracking server. For privacy considerations, I've obscured the IP address of the aim remote repository. The actual code is as follows:
import aim
aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")
I've experimented with various methods, involving both local and remote repositories, and I've arrived at the following findings:
|
Is there any solution for this? I have also been getting this error when I try to identify a run in a remote repository. I have tried closing and releasing the locks, but nothing seems to help. Has this been fixed in later versions? I am using 3.17.5 |
It looks to me like this function is timing out and the error is then handled incorrectly. Having a short timeout on |
This also happens for us when running the torch-lightning integration and parallel training jobs. |
Digging deeper into it, I no longer think it's an issue with gRPC, but with the softlock instead. For some reason it uses the softlock, but this function, when called on the repo location, returns False
There might be a bug somewhere, either in the softlock mechanism itself or in detecting the correct lock type to use. Notice that |
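For readers unfamiliar with the term, here is a minimal sketch of what a soft lock generally is: a lock held by exclusively creating a file, rather than by an OS-level lock on it. This is not Aim's implementation; the `SoftLock` class, `SoftLockTimeout`, and the `run.softlock` file name are hypothetical and exist only for illustration. It shows one way the symptom discussed above could arise: if the holder never removes the file, later attempts time out until the file is deleted.

```python
import os
import tempfile
import time


class SoftLockTimeout(Exception):
    """Raised when the lock file could not be created before the timeout."""


class SoftLock:
    """Illustrative soft lock: whoever creates the file owns the lock."""

    def __init__(self, path, timeout=1.0, poll_interval=0.05):
        self.path = path
        self.timeout = timeout
        self.poll_interval = poll_interval
        self._held = False

    def acquire(self):
        deadline = time.monotonic() + self.timeout
        while True:
            try:
                # O_EXCL makes creation fail if the file already exists.
                fd = os.open(self.path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            except FileExistsError:
                if time.monotonic() >= deadline:
                    raise SoftLockTimeout(self.path)
                time.sleep(self.poll_interval)
            else:
                os.close(fd)
                self._held = True
                return

    def release(self):
        if self._held:
            os.remove(self.path)
            self._held = False


def demo():
    """Return True if a second acquire times out behind an unreleased lock."""
    with tempfile.TemporaryDirectory() as tmp:
        lock_path = os.path.join(tmp, "run.softlock")  # hypothetical name
        first = SoftLock(lock_path, timeout=0.2)
        first.acquire()
        # Simulate a crashed holder: `first` never calls release().
        second = SoftLock(lock_path, timeout=0.2)
        try:
            second.acquire()
        except SoftLockTimeout:
            timed_out = True
        else:
            timed_out = False
        # Removing the stale file by hand lets the next acquire succeed.
        os.remove(lock_path)
        second.acquire()
        second.release()
        return timed_out
```

Because nothing but the file's existence records ownership, a soft lock cannot tell a live holder from one that exited without cleaning up, which is why the choice of lock type matters here.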
@inc0 Does the fix you linked work for all cases, or only when PyTorch Lightning is used? |
It'll only fix Lightning, but you can add a similar parameter to your function call; it should fix your case. |
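Separately from the linked fix, a caller-side stopgap could be to retry opening the run a few times. The sketch below is a different technique from the parameter mentioned above, and it only helps if the lock is eventually released; it will not clear a stale lock. `open_with_retry` is a hypothetical helper, not part of Aim, and it takes the opener as a callable so it makes no assumptions about Aim's API.

```python
import time


def open_with_retry(open_run, attempts=3, delay=0.5, retry_on=(Exception,)):
    """Call open_run() up to `attempts` times, sleeping `delay` seconds between tries.

    Returns whatever open_run() returns; re-raises the last error if all tries fail.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return open_run()
        except retry_on as exc:
            last_error = exc
            # Only wait if another attempt is still to come.
            if attempt < attempts:
                time.sleep(delay)
    raise last_error


# Possible usage with the call from this thread (requires aim to be installed):
# import aim
# run = open_with_retry(
#     lambda: aim.Run(run_hash='51031438759943878c6f9808', repo="aim:ip:port")
# )
```

Narrowing `retry_on` to the specific exception you actually see is safer than the default, which retries on any error.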
🐛 Bug
I encountered a bug while using AIM. When attempting to utilize the aim.Run function to locate a run using its hash, I encountered the following error:
To reproduce
It can be reproduced by simply running the following code (I have ensured that this code has already been placed in the target repository).
Expected behavior
Locate the run with the target hash, as described in the documentation.
Environment
Additional context