Datasets cache folder not shared between users #671

Closed
mralgos opened this issue May 9, 2022 · 8 comments

Comments

@mralgos
Contributor

mralgos commented May 9, 2022

Hello,
a discussion on this issue started on Slack here.

I have a ClearML server hosted on AWS with web authentication enabled. Each ML person has:

  • their own username/password used to log in to the ClearML server
  • their own set of ClearML API credentials
  • their own set of AWS credentials
  • their own clearml.conf file in their home directory

The config file defines the path to the cache folder via:

storage {
        cache {
            # Defaults to system temp folder / cache
            default_base_dir: "/scratch/clearml-cache"
        }
}

We have some datasets registered on the ClearML server, and the codebase uses get_local_copy() to download the data onto the machine. The problem manifests when two or more people want to access (read, i.e. the cache already exists and isn't corrupted) the same dataset.
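For context, the access pattern is roughly the following (a minimal sketch; the dataset name and project are placeholders, not our actual values):

from clearml import Dataset

# Sketch of how the codebase fetches a dataset (names are placeholders).
# get_local_copy() downloads and extracts the dataset into the shared cache
# configured via storage.cache.default_base_dir above.
dataset = Dataset.get(dataset_name="my-dataset", dataset_project="my-project")
path = dataset.get_local_copy()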

The execution fails with this error:

Traceback (most recent call last):
...
    path = Dataset.get(dataset_name=dataset_name,
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 567, in get_local_copy
    target_folder = self._merge_datasets(
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1387, in _merge_datasets
    target_base_folder = self._create_ds_target_folder(
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1336, in _create_ds_target_folder
    cache.lock_cache_folder(local_folder)
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/storage/cache.py", line 273, in lock_cache_folder
    lock.acquire(timeout=0)
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/utilities/locks/utils.py", line 130, in acquire
    fh = self._get_fh()
  File "/scratch/people/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/utilities/locks/utils.py", line 205, in _get_fh
    return open(self.filename, self.mode, **self.file_open_kwargs)
PermissionError: [Errno 13] Permission denied: '/scratch/clearml-cache/storage_manager/datasets/.lock.000.ds_5f1f42f430b042cfb213e8099cda00b4.clearml'
@jkhenning
Member

@mralgos do I understand it correctly that more than one user is trying to access this dataset from the same workstation?

@mralgos
Contributor Author

mralgos commented May 9, 2022

@jkhenning yes, correct. It is possible that multiple users need to read the dataset cached on the same workstation.

@jkhenning
Member

@mralgos in that case it would seem to be a Linux permissions issue that might be outside the scope of the ClearML code - how would you expect it to work? As far as I know, the lock is required to make sure multiple writers don't compete for the same file.

@mralgos
Contributor Author

mralgos commented May 11, 2022

I'd expect that if someone is writing the dataset, the lock on the cache folder must be acquired in order to prevent other write operations on the same dataset. However, if the dataset already exists, multiple users should be able to read it (so the lock wouldn't be necessary).
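Something along these lines is what I have in mind (just a sketch of the idea using fcntl, not the actual ClearML locking code; the lock path is a placeholder):

import fcntl

lock_path = "/scratch/clearml-cache/storage_manager/datasets/.lock.ds_example"  # placeholder

# Writer: takes an exclusive lock while building the cache entry.
with open(lock_path, "a") as fh:
    fcntl.flock(fh, fcntl.LOCK_EX)
    # ... download / extract the dataset ...
    fcntl.flock(fh, fcntl.LOCK_UN)

# Readers: take a shared lock on a read-only handle, so they neither block
# each other nor need write permission on the lock file.
with open(lock_path, "r") as fh:
    fcntl.flock(fh, fcntl.LOCK_SH)
    # ... read the cached dataset ...
    fcntl.flock(fh, fcntl.LOCK_UN)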

I'm going to check the permissions and the umask settings in the meantime.

@mralgos
Contributor Author

mralgos commented May 18, 2022

@jkhenning After a bit of debugging, I have found that setting the umask to 002 (i.e. granting write permission to the Unix group by default) only partially resolves the problem: while the permission-denied error is gone, I am seeing other strange behaviours. Overall, I have the impression that there is an underlying assumption in the ClearML code that a single user runs a single experiment on a given server/workstation.
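For reference, the Python equivalent of that umask change, applied before any ClearML call, is roughly (a minimal sketch):

import os

# With umask 002, new files default to rw-rw-r-- and new directories to
# rwxrwxr-x, so cache entries created by one user are group-writable.
os.umask(0o002)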

What is your expected behaviour when a user starts two experiments that require the same dataset on the same server and the cache is not yet built? This is the case in which two processes start downloading the same dataset into the same cache location at the same time.

What I'm seeing is a bit unpredictable:

  • Sometimes one of the processes stops with this error:
Traceback (most recent call last):
...
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 736, in get_local_copy
    target_folder = self._merge_datasets(
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/site-packages/clearml/datasets/dataset.py", line 1750, in _merge_datasets
    shutil.rmtree(target_base_folder.as_posix())
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/shutil.py", line 722, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/efs/gf/miniconda3/envs/py38-torch/lib/python3.8/shutil.py", line 720, in rmtree
    os.rmdir(path)
OSError: [Errno 39] Directory not empty: '/scratch/clearml-cache/storage_manager/datasets/ds_3f31cd7764df45b78f20b3c7291fcbc6'
  • Other times the processes seem to download the same blob of data, in which case the data ends up corrupted and the experiment fails:
-rw-rw-r--  1 ubuntu ubuntu 515M May 18 14:02 6b1bff94fde42c83e26473759753e7ec.dataset.e998d366617043dda205942b6ff6d3c1.zip
-rw-rw-r--  1 ubuntu ubuntu 521M May 18 14:02 6df1395774b301fb68c697e76260e9b7.dataset.e998d366617043dda205942b6ff6d3c1.zip_1652882563.9507067.partially.ADc41ACc
-rw-rw-r--  1 ubuntu ubuntu 513M May 18 14:02 6df1395774b301fb68c697e76260e9b7.dataset.e998d366617043dda205942b6ff6d3c1.zip_1652882563.9513416.partially.BE96D5AA
-rw-rw-r--  1 ubuntu ubuntu 531M May 18 14:01 765fc4c951e0323ee833b83f181503d1.dataset.e998d366617043dda205942b6ff6d3c1.zip

As a final note, having the umask set to 002 creates other headaches, because in our multi-user setup it means that any user belonging to the same Unix group can write to other users' files.
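A possible way to limit that side effect would be to relax the umask only around the dataset calls rather than for the whole session (just a sketch on my side; group_writable_umask is a made-up helper, not something ClearML provides):

import os
from contextlib import contextmanager

@contextmanager
def group_writable_umask(mask=0o002):
    # Temporarily relax the umask, then restore the previous value so the
    # rest of the process keeps its stricter default.
    old = os.umask(mask)
    try:
        yield
    finally:
        os.umask(old)

# usage (dataset name is a placeholder):
# with group_writable_umask():
#     path = Dataset.get(dataset_name="my-dataset").get_local_copy()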

mralgos added a commit to mralgos/clearml that referenced this issue Jun 3, 2022
mralgos added a commit to mralgos/clearml that referenced this issue Jun 4, 2022
@mralgos
Contributor Author

mralgos commented Jun 4, 2022

@jkhenning I've put together a change that would potentially fix the issue (assuming I'm not missing something in the overall design).

Basically, two main changes:

  1. Once the download of a dataset is completed, the lock is released immediately.
  2. While a process is downloading a dataset (i.e. holding the lock), other processes wait until the lock is released (in the current fix, a timeout error is raised after 5 minutes of waiting, but this can be changed); see the sketch below.
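In pseudocode, the waiting behaviour is roughly the following (a sketch of the idea, not the exact patch; try_acquire stands in for whatever non-blocking acquire the cache lock exposes):

import time

def wait_for_lock(try_acquire, timeout=300, poll_interval=1.0):
    # Poll the lock until the process downloading the dataset releases it;
    # give up after `timeout` seconds (5 minutes in the current fix).
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if try_acquire():  # returns True if the lock was obtained
            return
        time.sleep(poll_interval)
    raise TimeoutError("Timed out waiting for the dataset cache lock")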

Hope this helps. Let me know what you think.

@eugen-ajechiloae-clearml
Collaborator

Hi @mralgos! Can you please open a PR for your fix? I think it looks good

@jkhenning
Member

Hi @mralgos, thanks for the contribution! Closing this issue 🙂
