-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix tomato disconnection issue #59
Conversation
…o api.connect and api.disconnect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @NukP, thanks for making this PR. I cannot test how this works on the hardware, so I'll have to trust that it works for you. There are some code-quality changes I'd like you to do, though, mainly formatting the code with black
, and checking the interplay of the FileLock
with the large timeouts you allow for.
@@ -35,7 +35,7 @@ | |||
"toml", | |||
"pyyaml", | |||
"psutil", | |||
"yadg>=4.2<5.0", | |||
"yadg>=4.2.4,<5.0", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume the merge issue is here.
src/tomato/drivers/biologic/main.py
Outdated
retries: int = 100, | ||
timeout: int = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default values - retries
and timeout
could be in principle pulled in from the device settings, the same way the dllpath
is. I would say a retries
count of 100 is unreasonably high as a default, but allowing user customisation via the settings file would be an easy way around it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure whether the dllpath is the path to the setting file? If this is the case, then I imagine I would just have to join this dllpath thie setting.yaml. (os.path.join(dllpath,'setting.yaml')) Then I will write a commad to open the setting file and retrive retires and timeout (which would be now be specified by the user there), then use it here. Is this the preferred method you are suggesting?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I have make changes as suggested. Now, the retries and time_out paraemters will be obtained from the setting file (the same way as dllpath). I also put a default value here as reduce the number of default value from 100 to 10.
src/tomato/drivers/biologic/main.py
Outdated
id_, device_info = api.Connect(address) | ||
return id_, device_info | ||
except Exception as e: | ||
time.sleep(timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only thing I worry about with sleep
here, is that I think we keep the device lock on, which might lead to time-outs there. Please check the interplay. It might be prudent to release the lock and then reacquire it after the sleep period.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think is it safer if I remove the sleep here? In this case, if the connection issue is just a glitch in the connection (whatever causes it). Then, multiple retires should be sufficient to solve this. This will ensure that we don't create new issue with device locking here. Say I try without sleep here. Then, if this doesn't solve the issue, then I will re-introduce the sleep function here again and will implement the lock-releasing and reacquire the lock as you suggested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:
with FileLock(lockpath, timeout=60): |
How you deal with this is up to you; you could also wrap the whole with Filelock ...
block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e:
then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a modification by wrapping FileLock into safe_api_connect and safe_api_disconnect function as suggested.
src/tomato/drivers/biologic/main.py
Outdated
except Exception as e: | ||
logger.critical(f"{e=}") | ||
raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad, I initially thought of also raising runtimeerror here saying fail to connect/disconnect. Then I realized such clause should be in the safe api connect/disconnect. So, I rewmoved the exception here but forgot to remove raise. I will clean this up.
Actually I see now that you're including one of my commits, too. You'll probably have to rebase this, either from |
src/tomato/drivers/biologic/main.py
Outdated
id_, device_info = api.Connect(address) | ||
return id_, device_info | ||
except Exception as e: | ||
time.sleep(timeout) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:
with FileLock(lockpath, timeout=60): |
How you deal with this is up to you; you could also wrap the whole with Filelock ...
block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e:
then.
src/tomato/drivers/biologic/main.py
Outdated
retries: int = 100, | ||
timeout: int = 10 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comment below.
… safe_api connect and disconnect functions
Hi @NukP, please test the code on the hardware and make sure it works, then push any changes here and resolve the conflicts, and then we can merge it. I have my doubts that the current modified use of |
Hi @PeterKraus . Ok, I will test this out on the hardware, this might take some time. |
@PeterKraus Quick update on this, Enea has tried to submitting job using this test version (we submitted directly through ketchup and not through Aiida). Enea used one of the payload file form one of his previous run to make sure that there will be no issue with the payload. We found that the job can be submitted. The job id was created but the job stuck in queue and was never sent for cycling. Enea submitted this even before the holidays break and the job was still in the queue. No error was raised. I think, indeed, the issue might have happened form the way I implement the sleep function as you pointed out. I will investigate this a bit more and will get back to you if I have any update. |
@PeterKraus Ok, let's discuss this next week. |
Summary