Fix tomato disconnection issue #59

NukP · 2023-12-07T12:05:07Z

Summary

Replace api.connect originally used in the script with safe_api_connect and safe_api_disconnect function.
The two functions is made by wrapping api.Connect and api.Disconnect in a try/except loop. Instead of try api.Connect or api.disconnect once, this wrapper function will try to connect/disconnect for 100 times with 10 seconds wait between trial, before rasing an exception.

…o api.connect and api.disconnect

PeterKraus

Hi @NukP, thanks for making this PR. I cannot test how this works on the hardware, so I'll have to trust that it works for you. There are some code-quality changes I'd like you to do, though, mainly formatting the code with black, and checking the interplay of the FileLock with the large timeouts you allow for.

PeterKraus · 2023-12-08T10:46:24Z

setup.py

@@ -35,7 +35,7 @@
        "toml",
        "pyyaml",
        "psutil",
-        "yadg>=4.2<5.0",
+        "yadg>=4.2.4,<5.0",


I assume the merge issue is here.

src/tomato/drivers/biologic/main.py

PeterKraus · 2023-12-08T10:49:41Z

src/tomato/drivers/biologic/main.py

+                    retries: int = 100,
+                    timeout: int = 10


The default values - retries and timeout could be in principle pulled in from the device settings, the same way the dllpath is. I would say a retries count of 100 is unreasonably high as a default, but allowing user customisation via the settings file would be an easy way around it.

I am not sure whether the dllpath is the path to the setting file? If this is the case, then I imagine I would just have to join this dllpath thie setting.yaml. (os.path.join(dllpath,'setting.yaml')) Then I will write a commad to open the setting file and retrive retires and timeout (which would be now be specified by the user there), then use it here. Is this the preferred method you are suggesting?

See comment below.

Ok, I have make changes as suggested. Now, the retries and time_out paraemters will be obtained from the setting file (the same way as dllpath). I also put a default value here as reduce the number of default value from 100 to 10.

PeterKraus · 2023-12-08T10:50:27Z

src/tomato/drivers/biologic/main.py

+            id_, device_info = api.Connect(address)
+            return id_, device_info
+        except Exception as e:
+            time.sleep(timeout)


The only thing I worry about with sleep here, is that I think we keep the device lock on, which might lead to time-outs there. Please check the interplay. It might be prudent to release the lock and then reacquire it after the sleep period.

Do you think is it safer if I remove the sleep here? In this case, if the connection issue is just a glitch in the connection (whatever causes it). Then, multiple retires should be sufficient to solve this. This will ensure that we don't create new issue with device locking here. Say I try without sleep here. Then, if this doesn't solve the issue, then I will re-introduce the sleep function here again and will implement the lock-releasing and reacquire the lock as you suggested.

I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:

tomato/src/tomato/drivers/biologic/main.py

Line 48 in 56e9923

with FileLock(lockpath, timeout=60):

How you deal with this is up to you; you could also wrap the whole with Filelock ... block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e: then.

I made a modification by wrapping FileLock into safe_api_connect and safe_api_disconnect function as suggested.

PeterKraus · 2023-12-08T10:52:08Z

src/tomato/drivers/biologic/main.py

        except Exception as e:
            logger.critical(f"{e=}")
+            raise


My bad, I initially thought of also raising runtimeerror here saying fail to connect/disconnect. Then I realized such clause should be in the safe api connect/disconnect. So, I rewmoved the exception here but forgot to remove raise. I will clean this up.

PeterKraus · 2023-12-08T10:56:40Z

Actually I see now that you're including one of my commits, too. You'll probably have to rebase this, either from master or from the 0.2.x branch...

src/tomato/drivers/biologic/main.py

PeterKraus · 2023-12-12T09:28:23Z

src/tomato/drivers/biologic/main.py

+            id_, device_info = api.Connect(address)
+            return id_, device_info
+        except Exception as e:
+            time.sleep(timeout)


I am pretty certain that when you sleep this function for more than 60 seconds, every other job trying to get a lock will fail, see:

tomato/src/tomato/drivers/biologic/main.py

Line 48 in 56e9923

with FileLock(lockpath, timeout=60):

How you deal with this is up to you; you could also wrap the whole with Filelock ... block in a try-except-retries cycle, instead of just the connection. Might want to catch the particular Exception type raised, though, instead of a blank except Exception as e: then.

PeterKraus · 2023-12-12T09:28:36Z

src/tomato/drivers/biologic/main.py

+                    retries: int = 100,
+                    timeout: int = 10


See comment below.

… safe_api connect and disconnect functions

PeterKraus · 2023-12-14T17:28:36Z

Hi @NukP, please test the code on the hardware and make sure it works, then push any changes here and resolve the conflicts, and then we can merge it.

I have my doubts that the current modified use of FileLock will work for what it's meant to do. The whole idea behind using FileLock was that only one process (i.e. one job) talks to the EC-Lib dll file at one time, because with more than one we have seen frequent crashes. The way you have it now, we can have multiple jobs connected to the dll, as long as their connect() calls are aligned before/between the disconnect() calls.

NukP · 2023-12-15T09:43:15Z

Hi @PeterKraus . Ok, I will test this out on the hardware, this might take some time.

NukP · 2024-01-08T12:22:54Z

@PeterKraus Quick update on this, Enea has tried to submitting job using this test version (we submitted directly through ketchup and not through Aiida). Enea used one of the payload file form one of his previous run to make sure that there will be no issue with the payload. We found that the job can be submitted. The job id was created but the job stuck in queue and was never sent for cycling. Enea submitted this even before the holidays break and the job was still in the queue. No error was raised. I think, indeed, the issue might have happened form the way I implement the sleep function as you pointed out. I will investigate this a bit more and will get back to you if I have any update.

PeterKraus · 2024-03-12T14:21:35Z

@NukP I'd like to move the biologic (and indeed any other) driver out of the core tomato package, and #70 will have to be addressed. Let's discuss next week, but I'm going to close this one, as it's probably easier to start from scratch.

NukP · 2024-03-12T14:41:32Z

@PeterKraus Ok, let's discuss this next week.

PeterKraus and others added 3 commits November 28, 2023 08:34

Fix yadg>=4.2.4 for 0.2.x branch.

1b0730f

Initial commit for the branch

30c96e8

Make safe_api_connect and safe_api_disconnect as a wrapper function t…

5c231a6

…o api.connect and api.disconnect

PeterKraus requested changes Dec 8, 2023

View reviewed changes

Clean up the riase statement

c15fcce

PeterKraus reviewed Dec 12, 2023

View reviewed changes

NukP added 2 commits December 14, 2023 00:59

Put FileLock in safe_api_connect and disconnect functions.

fc84bcf

Obtaining time_sleep and retires from setting files before parsing to…

3507937

… safe_api connect and disconnect functions

PeterKraus closed this Mar 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix tomato disconnection issue #59

Fix tomato disconnection issue #59

NukP commented Dec 7, 2023

PeterKraus left a comment

PeterKraus Dec 8, 2023

PeterKraus Dec 8, 2023

NukP Dec 12, 2023

PeterKraus Dec 12, 2023

NukP Dec 14, 2023

PeterKraus Dec 8, 2023

NukP Dec 12, 2023

PeterKraus Dec 12, 2023

NukP Dec 14, 2023

PeterKraus Dec 8, 2023

NukP Dec 11, 2023

PeterKraus commented Dec 8, 2023

PeterKraus Dec 12, 2023

PeterKraus Dec 12, 2023

PeterKraus commented Dec 14, 2023

NukP commented Dec 15, 2023

NukP commented Jan 8, 2024 •

edited

Loading

PeterKraus commented Mar 12, 2024

NukP commented Mar 12, 2024

Fix tomato disconnection issue #59

Fix tomato disconnection issue #59

Conversation

NukP commented Dec 7, 2023

PeterKraus left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PeterKraus commented Dec 8, 2023

Choose a reason for hiding this comment

Choose a reason for hiding this comment

PeterKraus commented Dec 14, 2023

NukP commented Dec 15, 2023

NukP commented Jan 8, 2024 • edited Loading

PeterKraus commented Mar 12, 2024

NukP commented Mar 12, 2024

NukP commented Jan 8, 2024 •

edited

Loading