Datasets cache folder not shared between users #671

Closed
mralgos opened this issue May 9, 2022 · 8 comments

Comments

@mralgos
Contributor

mralgos commented May 9, 2022

Hello,
a discussion on this issue started on Slack here.

I have a ClearML server hosted on AWS with web authentication enabled. Each ML person has:

  • their own username/password used to log in to the ClearML server
  • their own set of ClearML API credentials
  • their own set of AWS credentials
  • their own clearml.conf file in their home directory

The config file defines the path to the cache folder via:

storage {
        cache {
            # Defaults to system temp folder / cache
            default_base_dir: "/scratch/clearml-cache"
        }
}

We have some datasets registered on the ClearML server, and the codebase uses get_local_copy() to download the data onto the machine. The problem manifests when two or more people want to access (read, i.e. the cache already exists and isn't corrupted) the same dataset.
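For context, the access pattern is roughly the following (a minimal sketch; the dataset name and project are placeholders, not our actual values):

from clearml import Dataset

# Sketch of how the codebase fetches a dataset (names are placeholders).
# get_local_copy() downloads and extracts the dataset into the shared cache
# configured via storage.cache.default_base_dir above.
dataset = Dataset.get(dataset_name="my-dataset", dataset_project="my-project")
path = dataset.get_local_copy()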

The execution fails with this error:

Traceback (most recent call last):
...
    path = Dataset.get(dataset_name=dataset_name,
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 567, in get_local_copy
    target_folder = self._merge_datasets(
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1387, in _merge_datasets
    target_base_folder = self._create_ds_target_folder(
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1336, in _create_ds_target_folder
    cache.lock_cache_folder(local_folder)
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/storage/cache.py", line 273, in lock_cache_folder
    lock.acquire(timeout=0)
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/utilities/locks/utils.py", line 130, in acquire
    fh = self._get_fh()
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/utilities/locks/utils.py", line 205, in _get_fh
    return open(self.filename, self.mode, **self.file_open_kwargs)
PermissionError: [Errno 13] Permission denied: '/scratch/clearml-cache/storage_manager/datasets/.lock.000.ds_5f1f42f430b042cfb213e8099cda00b4.clearml'
@jkhenning
Member

@mralgos do I understand it correctly that more than one user is trying to access this dataset from the same workstation?

@mralgos
Contributor Author

mralgos commented May 9, 2022

@jkhenning yes, correct. It is possible that multiple users need to read the dataset cached on the same workstation.

@jkhenning
Member

@mralgos in that case it would seem to be a Linux permissions issue that might be outside the scope of the ClearML code - how would you expect it to work? As far as I know, the lock is required to make sure multiple writers don't compete for the same file.

@mralgos
Contributor Author

mralgos commented May 11, 2022

I'd expect that if someone is writing the dataset, the lock on the cache folder must be acquired in order to prevent other write operations on the same dataset. However, if the dataset already exists, multiple users should be able to read it (so the lock wouldn't be necessary).
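Something along these lines is what I have in mind (just a sketch of the idea using fcntl, not the actual ClearML locking code; the lock path is a placeholder):

import fcntl

lock_path = "/scratch/clearml-cache/storage_manager/datasets/.lock.ds_example"  # placeholder

# Writer: takes an exclusive lock while building the cache entry.
with open(lock_path, "a") as fh:
    fcntl.flock(fh, fcntl.LOCK_EX)
    # ... download / extract the dataset ...
    fcntl.flock(fh, fcntl.LOCK_UN)

# Readers: take a shared lock on a read-only handle, so they neither block
# each other nor need write permission on the lock file.
with open(lock_path, "r") as fh:
    fcntl.flock(fh, fcntl.LOCK_SH)
    # ... read the cached dataset ...
    fcntl.flock(fh, fcntl.LOCK_UN)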

I'm going to check the permissions and the umask settings in the meantime.

@mralgos
Contributor Author

mralgos commented May 18, 2022

@jkhenning After a bit of debugging, I have found that setting the umask to 002 (i.e. granting write permission to the Unix group by default) only partially resolves the problem: while the permission-denied error is gone, I am seeing other strange behaviours. Overall, I have the impression that there is an underlying assumption in the ClearML code that a single user runs a single experiment on a given server/workstation.
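For reference, the Python equivalent of that umask change, applied before any ClearML call, is roughly (a minimal sketch):

import os

# With umask 002, new files default to rw-rw-r-- and new directories to
# rwxrwxr-x, so cache entries created by one user are group-writable.
os.umask(0o002)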

What is your expected behaviour when a user starts two experiments that require the same dataset on the same server and the cache is not yet built? This is the case in which two processes start downloading the same dataset into the same cache location at the same time.

What I'm seeing is a bit unpredictable:

  • Sometimes one of the processes stops with this error:
Traceback (most recent call last):
...
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 736, in get_local_copy
    target_folder = self._merge_datasets(
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1750, in _merge_datasets
    shutil.rmtree(target_base_folder.as_posix())
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/scratch/clearml-cache/storage_manager/datasets/ds_3f31cd7764df45b78f20b3c7291fcbc6'
  • Other times the processes seem to download the same blob of data, in which case the data ends up corrupted and the experiment fails:
-rw-rw-r--  1 ubuntu ubuntu 515M May 18 14:02 6b1bff94fde42c83e26473759753e7ec.dataset.e998d366617043dda205942b6ff6d3c1.zip
-rw-rw-r--  1 ubuntu ubuntu 521M May 18 14:02 6df1395774b301fb68c697e76260e9b7.dataset.e998d366617043dda205942b6ff6d3c1.zip_1652882563.9507067.partially.ADc41ACc
-rw-rw-r--  1 ubuntu ubuntu 513M May 18 14:02 6df1395774b301fb68c697e76260e9b7.dataset.e998d366617043dda205942b6ff6d3c1.zip_1652882563.9513416.partially.BE96D5AA
-rw-rw-r--  1 ubuntu ubuntu 531M May 18 14:01 765fc4c951e0323ee833b83f181503d1.dataset.e998d366617043dda205942b6ff6d3c1.zip

As a final note, having the umask set to 002 creates other headaches, because in our multi-user setup it means that any user belonging to the same Unix group can write to other users' files.
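A possible way to limit that side effect would be to relax the umask only around the dataset calls rather than for the whole session (just a sketch on my side; group_writable_umask is a made-up helper, not something ClearML provides):

import os
from contextlib import contextmanager

@contextmanager
def group_writable_umask(mask=0o002):
    # Temporarily relax the umask, then restore the previous value so the
    # rest of the process keeps its stricter default.
    old = os.umask(mask)
    try:
        yield
    finally:
        os.umask(old)

# usage (dataset name is a placeholder):
# with group_writable_umask():
#     path = Dataset.get(dataset_name="my-dataset").get_local_copy()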

mralgos added a commit to mralgos/clearml that referenced this issue Jun 3, 2022
mralgos added a commit to mralgos/clearml that referenced this issue Jun 4, 2022
@mralgos
Contributor Author

mralgos commented Jun 4, 2022

@jkhenning I've put together a change that would potentially fix the issue (assuming I'm not missing something in the overall design).

Basically, two main changes:

  1. Once the download of a dataset is completed, the lock is released immediately.
  2. While a process is downloading a dataset (i.e. holding the lock), other processes wait until the lock is released (in the current fix, a timeout error is raised after 5 minutes of waiting, but this can be changed); see the sketch below.
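In pseudocode, the waiting behaviour is roughly the following (a sketch of the idea, not the exact patch; try_acquire stands in for whatever non-blocking acquire the cache lock exposes):

import time

def wait_for_lock(try_acquire, timeout=300, poll_interval=1.0):
    # Poll the lock until the process downloading the dataset releases it;
    # give up after `timeout` seconds (5 minutes in the current fix).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if try_acquire():  # returns True if the lock was obtained
            return
        time.sleep(poll_interval)
    raise TimeoutError("Timed out waiting for the dataset cache lock")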

Hope this helps. Let me know what you think.

@eugen-ajechiloae-clearml
Collaborator

Hi @mralgos! Can you please open a PR for your fix? I think it looks good

@jkhenning
Member

Hi @mralgos, thanks for the contribution! Closing this issue 🙂
