
Uploading and using custom datasets #363

Closed
ecm200 opened this issue May 18, 2021 · 7 comments
ecm200 commented May 18, 2021

Most examples in the example folders show the use of datasets that are available through some mechanism via a website, or as part of python module like Torchvision.

I would like to know how to make a training script use a bespoke dataset that isn’t available via the web or through a python module. I have managed to create dataset objects for the test and train folders of images, and successfully uploaded them using the Dataset class. It has zipped them and placed them on the file server of the clearml-server.

What I want to do now, when the experiment is run on a compute node using a clearml-agent, is to download the image datasets and unzip them in a known location, ready for passing to the model data loader objects for training and testing.


ecm200 commented May 18, 2021

This is the example code block for setting up the ClearML experiment object in a trainer script I have developed. I first run it locally to create an experiment on the clearml-server, and then use that as the basis for creating remotely deployed versions on compute clearml-agents. The trainer script uses YACS to define the model training configuration, which can be overridden by a list of key/value pairs.

## CLEAR ML
# Connecting with the ClearML process
task = Task.init(project_name='Caltech Birds', task_name='Train network on CUB200')
# Add the local python package as a requirement
task.add_requirements('./cub_tools')
task.add_requirements('git+https://github.com/rwightman/pytorch-image-models.git')
# Setup ability to add configuration parameters control.
params = {'TRAIN.NUM_EPOCHS': 20, 'TRAIN.BATCH_SIZE': 32, 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)
# Convert Params dictionary into a set of key value pairs in a list
params_list = []
for key in params:
    params_list.extend([key,params[key]])

# Check if the task is running locally.
# If not then, get the datasets from the server.
if not task.running_locally():
    print('[INFO] Getting a local copy of the CUB200 birds dataset')
    train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
    train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
    test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
    test_dataset.get_mutable_local_copy(target_folder='./data/images/test')

Most of the block relates to getting the experiment setup.
The final code block is an attempt, when running as a remote experiment, to download the train and test image folders locally, ready for executing training and testing.

I was hoping to download the already uploaded image datasets from the clearml-server to the remote compute resource when the training runs. I can then override the remote directory locations from the experiment configuration. However, the above doesn’t seem to work: the dataset code is not executed on the remote machine, even though I’ve tested for where the experiment is running.

@bmartinn
Member

Hi @ecm200

The final code block is an attempt to get the script when running as a remote experiment, to download the train and test image folders locally ready for executing training and testing.

A quick note: if you have no intention of changing the dataset (i.e. read-only access), you can use train_dataset.get_local_copy(), which takes no input argument and returns the folder of the cached copy of the Dataset on the machine (if the cache is empty, it will download the Dataset).
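To illustrate the difference between the two access patterns, here is a toy sketch in plain Python (hypothetical helpers, not the real clearml internals): the read-only call hands back a shared cached folder, while the mutable call materialises a private copy wherever you ask for it.

```python
import os
import shutil

def cached_copy(dataset_dir, cache_root):
    """Read-only pattern: every caller shares one cached folder per dataset."""
    target = os.path.join(cache_root, os.path.basename(dataset_dir))
    if not os.path.isdir(target):           # first request populates the cache
        shutil.copytree(dataset_dir, target)
    return target                           # later requests reuse the same folder

def mutable_copy(dataset_dir, target_folder):
    """Mutable pattern: a fresh private copy placed where the caller asked."""
    shutil.copytree(dataset_dir, target_folder)
    return target_folder
```

Repeated calls to `cached_copy` return the same folder, so modifying it would affect every consumer; `mutable_copy` is safe to edit but costs a full copy each time.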

I was hoping to download to the remote compute resource already uploaded image datasets from the clearml server when the ...

Dataset.get_local_copy() will download and cache a dataset. The idea is that you do not have to worry about "preparing" the dataset prior to execution; it becomes part of the training code itself. Assuming you have a persistent cache on the agent machine (which is on by default), the main advantage is that no "warm-up" step is needed, while the training code itself exposes the dataset it is using, allowing users to execute it by just replacing a parameter, without worrying about the local environment setup. The caching makes this process fast (e.g. when the same dataset is reused) and maintenance free, as the underlying caching mechanism takes care of deleting old entries, etc.
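The download-once behaviour described above can be sketched in a few lines (a toy model, not the actual clearml cache; `dataset_id`, `download` and `cache_root` are hypothetical names for illustration):

```python
import os

def get_cached_dataset(dataset_id, download, cache_root):
    """Toy model of a persistent dataset cache: fetch on first use only.

    `download` is a caller-supplied function that populates a folder.
    """
    target = os.path.join(cache_root, dataset_id)
    if not os.path.isdir(target):   # cache miss: download exactly once
        os.makedirs(target)
        download(target)
    return target                   # cache hit: served straight from disk
```

Because the cache persists between tasks on the agent machine, re-running an experiment against the same dataset skips the download entirely.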

If you want your code to automatically stop executing on the local machine and re-launch itself on the remote machine, you can use task.execute_remotely() to do exactly that.

What did I miss ?

BTW: Great code example!


ecm200 commented May 19, 2021

@bmartinn thanks for the tips, I am playing around with it now.
Thanks especially for the task.execute_remotely() tip; it's a bit cleaner now, as I don't have to kill the process manually.

With regards to downloading the datasets, I wasn't able to get the clearml-agent to download the dataset remotely; it's as though the code was not being executed the way I had it before.

So I removed the test on task.running_locally() and instead, after the clearml-server task setup and parameter parsing, inserted task.execute_remotely() before the dataset calls. This has had the expected result, and I can see from the agent's terminal output that it has downloaded a local cache of the files.

## CLEAR ML
# Connecting with the ClearML process
task = Task.init(project_name='Caltech Birds', task_name='Train network on CUB200', task_type=Task.TaskTypes.training)
# Add the local python package as a requirement
task.add_requirements('./cub_tools')
task.add_requirements('git+https://github.com/rwightman/pytorch-image-models.git')
# Setup ability to add configuration parameters control.
params = {'TRAIN.NUM_EPOCHS': 20, 'TRAIN.BATCH_SIZE': 32, 'TRAIN.OPTIMIZER.PARAMS.lr': 0.001, 'TRAIN.OPTIMIZER.PARAMS.momentum': 0.9}
params = task.connect(params)  # enabling configuration override by clearml
print(params)  # printing actual configuration (after override in remote mode)
# Convert Params dictionary into a set of key value pairs in a list
params_list = []
for key in params:
    params_list.extend([key,params[key]])

# Execute task remotely
task.execute_remotely()

# Check if the task is running locally.
# If not then, get the datasets from the server.
#if not task.running_locally:
print('[INFO] Getting a local copy of the CUB200 birds dataset')
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
#train_dataset.get_mutable_local_copy(target_folder='./data/images/train')
train_dataset.get_local_copy()
test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
#test_dataset.get_mutable_local_copy(target_folder='./data/images/test')
test_dataset.get_local_copy()

# Create the trainer object
trainer = Ignite_Trainer(config=args.config, cmd_args=params_list) # NOTE: disabled cmd line argument passing but using it to pass ClearML configs.

One Question: What's the best way to get at the paths for the newly downloaded datasets for remote execution?

Originally they are stored in the ./data/ folder of the repository folder structure, hence the configuration points to this directory when executing on the local machine. I either need to cache the local dataset copy in that relative directory, or get at the new location and change the parameter.


ecm200 commented May 19, 2021

I believe I have found a solution for getting at the paths needed to configure remote runs of the algorithm, for both input datasets and output files. I will share it here for the convenience of others, and to give people the opportunity to improve on how I have achieved it.

With regards to getting the input data directories (in this case, the inputs are directories of training and testing images, arranged in directories by split, either train or test, with subdirectories per class name): this dataset is versioned on the clearml-server as per the post above, and stored in a directory on the remote worker. To get access to this directory, I did the following:

import pathlib

# Get the datasets from the clearml-server and cache locally.
print('[INFO] Getting a local copy of the CUB200 birds datasets')
# Train
train_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_train_dataset')
print('[INFO] Default storage location of training dataset: {}'.format(train_dataset.get_default_storage()))
train_dataset_base = train_dataset.get_local_copy()
print('[INFO] Local copy of training dataset: {}'.format(train_dataset_base))

# Test
test_dataset = Dataset.get(dataset_project='Caltech Birds', dataset_name='cub200_2011_test_dataset')
print('[INFO] Default storage location of testing dataset: {}'.format(test_dataset.get_default_storage()))
test_dataset_base = test_dataset.get_local_copy()
print('[INFO] Local copy of testing dataset: {}'.format(test_dataset_base))

# Amend the input data directories and output directories for remote execution
# Modify experiment root dir
params_list = params_list + ['DIRS.ROOT_DIR', '']
# Add data root dir
params_list = params_list + ['DATA.DATA_DIR', str(pathlib.PurePath(train_dataset_base).parent)]
# Add data train dir
params_list = params_list + ['DATA.TRAIN_DIR', str(pathlib.PurePath(train_dataset_base).name)]
# Add data test dir
params_list = params_list + ['DATA.TEST_DIR', str(pathlib.PurePath(test_dataset_base).name)]
# Add working dir
params_list = params_list + ['DIRS.WORKING_DIR', str(task.cache_dir)]

My PyTorch Trainer class uses YACS for configuration files, with parameterisation specified in a hierarchical format. The supplied configuration can be overwritten using a list of key/value pairs referencing the relevant configuration entries. Here, the script overwrites the locally specified input and output directory locations with those specified by the clearml-agent execution. The local cached directory of each dataset is returned by the call that downloads the dataset from the clearml-server:

train_dataset_base = train_dataset.get_local_copy()

The working directory for the model outputs is handled by the task object, and the default output directory was accessed using its cache_dir attribute.
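The parent/name split used in the params_list above comes straight from pathlib; for a hypothetical cached-dataset path it behaves like this (PurePosixPath is used here so the example is deterministic):

```python
import pathlib

# Hypothetical path, as might be returned by train_dataset.get_local_copy()
train_dataset_base = '/home/user/.clearml/cache/datasets/cub200_2011_train'
p = pathlib.PurePosixPath(train_dataset_base)
print(str(p.parent))  # '/home/user/.clearml/cache/datasets' -> DATA.DATA_DIR
print(p.name)         # 'cub200_2011_train' -> DATA.TRAIN_DIR
```

Splitting the returned path this way lets the YACS config keep its "root dir plus subdirectory" layout while pointing at the agent's cache.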


bmartinn commented May 20, 2021

I believe I have found a solution to getting at paths for configuring the remote running of the algorithm for both input datasets and output files. I will share here for convenience of other people, and also for the opportunity for people to improve on how I have achieved it.

Yep, you got it :)
I guess the documentation was not clear enough?

BTW:
You can directly connect a YACS configuration and control everything from the UI:

from yacs.config import CfgNode as CN

_C = CN()
_C.SYSTEM = CN()
_C.SYSTEM.NUM_GPUS = 8
_C.SYSTEM.NUM_WORKERS = 4
_C.TRAIN = CN()
_C.TRAIN.HYPERPARAMETER_1 = 0.1
_C.TRAIN.SCALES = (2, 4, 8, 16)

task.connect(_C, name='YACS')

EDIT:
You could also use connect_configuration and edit the entire YACS config as YAML-like text (note that, at the moment, nested CN() objects seem to be converted to dicts in remote execution):

task.connect_configuration(_C, name='YACS')


ecm200 commented May 20, 2021

@bmartinn that's an awesome tip with regards to YACS, thanks!
I didn't realise that you could do that.


ecm200 commented May 20, 2021

I think this has answered all my queries on uploading and configs, thanks so much @bmartinn for your advice and input.
I will close this ticket.
