Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix tomato disconnection issue #59

Closed
wants to merge 6 commits into from

Conversation

NukP
Copy link

@NukP NukP commented Dec 7, 2023

Summary

  • Replace api.connect originally used in the script with safe_api_connect and safe_api_disconnect function.
  • The two functions is made by wrapping api.Connect and api.Disconnect in a try/except loop. Instead of try api.Connect or api.disconnect once, this wrapper function will try to connect/disconnect for 100 times with 10 seconds wait between trial, before rasing an exception.

Copy link
Contributor

@PeterKraus PeterKraus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @NukP, thanks for making this PR. I cannot test how this works on the hardware, so I'll have to trust that it works for you. There are some code-quality changes I'd like you to do, though, mainly formatting the code with black, and checking the interplay of the FileLock with the large timeouts you allow for.

@@ -35,7 +35,7 @@
"toml",
"pyyaml",
"psutil",
"yadg>=4.2<5.0",
"yadg>=4.2.4,<5.0",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume the merge issue is here.

src/tomato/drivers/biologic/main.py Outdated Show resolved Hide resolved
Comment on lines 15 to 16
retries: int = 100,
timeout: int = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The default values - retries and timeout could be in principle pulled in from the device settings, the same way the dllpath is. I would say a retries count of 100 is unreasonably high as a default, but allowing user customisation via the settings file would be an easy way around it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure whether the dllpath is the path to the setting file? If this is the case, then I imagine I would just have to join this dllpath thie setting.yaml. (os.path.join(dllpath,'setting.yaml')) Then I will write a commad to open the setting file and retrive retires and timeout (which would be now be specified by the user there), then use it here. Is this the preferred method you are suggesting?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I have make changes as suggested. Now, the retries and time_out paraemters will be obtained from the setting file (the same way as dllpath). I also put a default value here as reduce the number of default value from 100 to 10.

id_, device_info = api.Connect(address)
return id_, device_info
except Exception as e:
time.sleep(timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The only thing I worry about with sleep here, is that I think we keep the device lock on, which might lead to time-outs there. Please check the interplay. It might be prudent to release the lock and then reacquire it after the sleep period.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think is it safer if I remove the sleep here? In this case, if the connection issue is just a glitch in the connection (whatever causes it). Then, multiple retires should be sufficient to solve this. This will ensure that we don't create new issue with device locking here. Say I try without sleep here. Then, if this doesn't solve the issue, then I will re-introduce the sleep function here again and will implement the lock-releasing and reacquire the lock as you suggested.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:

with FileLock(lockpath, timeout=60):

How you deal with this is up to you; you could also wrap the whole with Filelock ... block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e: then.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made a modification by wrapping FileLock into safe_api_connect and safe_api_disconnect function as suggested.

except Exception as e:
logger.critical(f"{e=}")
raise
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad, I initially thought of also raising runtimeerror here saying fail to connect/disconnect. Then I realized such clause should be in the safe api connect/disconnect. So, I rewmoved the exception here but forgot to remove raise. I will clean this up.

@PeterKraus
Copy link
Contributor

Actually I see now that you're including one of my commits, too. You'll probably have to rebase this, either from master or from the 0.2.x branch...

src/tomato/drivers/biologic/main.py Show resolved Hide resolved
id_, device_info = api.Connect(address)
return id_, device_info
except Exception as e:
time.sleep(timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:

with FileLock(lockpath, timeout=60):

How you deal with this is up to you; you could also wrap the whole with Filelock ... block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e: then.

Comment on lines 15 to 16
retries: int = 100,
timeout: int = 10
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comment below.

@PeterKraus
Copy link
Contributor

Hi @NukP, please test the code on the hardware and make sure it works, then push any changes here and resolve the conflicts, and then we can merge it.

I have my doubts that the current modified use of FileLock will work for what it's meant to do. The whole idea behind using FileLock was that only one process (i.e. one job) talks to the EC-Lib dll file at one time, because with more than one we have seen frequent crashes. The way you have it now, we can have multiple jobs connected to the dll, as long as their connect() calls are aligned before/between the disconnect() calls.

@NukP
Copy link
Author

NukP commented Dec 15, 2023

Hi @PeterKraus . Ok, I will test this out on the hardware, this might take some time.

@NukP
Copy link
Author

NukP commented Jan 8, 2024

@PeterKraus Quick update on this, Enea has tried to submitting job using this test version (we submitted directly through ketchup and not through Aiida). Enea used one of the payload file form one of his previous run to make sure that there will be no issue with the payload. We found that the job can be submitted. The job id was created but the job stuck in queue and was never sent for cycling. Enea submitted this even before the holidays break and the job was still in the queue. No error was raised. I think, indeed, the issue might have happened form the way I implement the sleep function as you pointed out. I will investigate this a bit more and will get back to you if I have any update.

@PeterKraus
Copy link
Contributor

@NukP I'd like to move the biologic (and indeed any other) driver out of the core tomato package, and #70 will have to be addressed. Let's discuss next week, but I'm going to close this one, as it's probably easier to start from scratch.

@PeterKraus PeterKraus closed this Mar 12, 2024
@NukP
Copy link
Author

NukP commented Mar 12, 2024

@PeterKraus Ok, let's discuss this next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants