Train remotely from a training API #139

abelBEDOYA · 2023-02-06T09:21:48Z

Hi, I've been using ClearML to track my YOLOv8 trainings, which were carried out locally. To do so, I built an API with FastAPI that can launch a training with custom hyperparameters.

Now I'm trying to do the same remotely, so I set up an agent and a queue (cola_yolo8) served by that agent.
However, if I add execute_remotely() in the API training method:

This error shows up in the ClearML task console:

I start the training API with this command: uvicorn main:app --host 0.0.0.0 --port 3000 --reload

I've tried simpler tasks instead of yolov8 training and the error is the same.

Thanks!


thepycoder · 2023-02-06T11:07:36Z

Hi there!

Thanks for filing a bug report :)

First of all, I would never recommend calling model.train in an API route (even if it is never actually executed because of the execute_remotely call). It gives your API ML dependencies like Torch, which can be quite heavy and unnecessary!

What seems to happen is that when executing remotely, ClearML will try to recognize and interpret your local environment, so it can be recreated on the remote machine. But since you're running it from the API, all the API requirements are also detected as dependencies of the task itself :)

That said, I think I see what you're trying to do. I would go for the following flow instead:

1. Have a "template" training task ready to go that you know will run remotely (let me know how that goes using YOLOv8!). You run this template task once with some random parameters, just to verify that it works. This part has nothing to do with the API.
2. Inside the API, you can now clone this task as such: https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk#cloning--executing-tasks
3. The cloned task is now in draft mode, so you can override its original parameters using: https://clear.ml/docs/latest/docs/references/sdk/task/#update_parameters
4. Now enqueue the task using the Task.enqueue function as described in the docs link of 2).
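The clone, override, enqueue flow above could be sketched roughly as follows. This is only a sketch under assumptions: `launch_from_template` and `to_clearml_params` are hypothetical helper names, the `General` section prefix is a guess (check which section your template task's hyperparameters actually live in, in the ClearML UI), and the queue name is taken from the original post. `clearml` is imported inside the function so the pure helper can be used and tested without it.

```python
from typing import Any, Dict


def to_clearml_params(overrides: Dict[str, Any], section: str = "General") -> Dict[str, Any]:
    """Prefix bare keys with a hyperparameter section, e.g. 'epochs' -> 'General/epochs'.

    Keys that already contain a '/' are assumed to be fully qualified and are kept as-is.
    The default section name is an assumption, not something the template dictates.
    """
    return {
        (key if "/" in key else f"{section}/{key}"): value
        for key, value in overrides.items()
    }


def launch_from_template(
    template_task_id: str,
    overrides: Dict[str, Any],
    queue_name: str = "cola_yolo8",  # queue name from the original post
    section: str = "General",        # assumed section, adjust to your template
) -> str:
    """Clone the template task, override its parameters, enqueue it, return the new task id."""
    from clearml import Task  # lazy import: only needed when a launch is requested

    template = Task.get_task(task_id=template_task_id)
    cloned = Task.clone(source_task=template, name=f"{template.name} (API launch)")
    # The clone is in draft mode, so its parameters can still be changed.
    cloned.update_parameters(to_clearml_params(overrides, section))
    Task.enqueue(cloned, queue_name=queue_name)
    return cloned.id
```

An API route would then only need to call `launch_from_template(template_id, {"epochs": 50})` and return the resulting task id, without importing Torch or YOLO itself.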

Now your task should be in the queue, so the API can return successfully and a worker can start working on it. You'll need a second API endpoint that a client can poll (every minute, for example) to get the status. Alternatively, you can return the task_id from the first endpoint when the task is created, so a client can get updates by asking the ClearML server directly.
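The logic behind such a status endpoint might look something like this. Treat it as a sketch: `get_training_status` and `summarize_status` are hypothetical names, and the set of "finished" status strings reflects my understanding of ClearML's task states, so verify it against your server before relying on it.

```python
from typing import Any, Dict

# Status strings assumed to mean the task will make no further progress.
FINISHED_STATUSES = frozenset({"completed", "failed", "stopped", "closed", "published"})


def summarize_status(task_id: str, status: str) -> Dict[str, Any]:
    """Build the JSON-serializable payload a polling client would receive."""
    return {
        "task_id": task_id,
        "status": status,
        "finished": status in FINISHED_STATUSES,
    }


def get_training_status(task_id: str) -> Dict[str, Any]:
    """Ask the ClearML server for the task's current status."""
    from clearml import Task  # lazy import: only needed when a status is requested

    task = Task.get_task(task_id=task_id)
    return summarize_status(task_id, task.get_status())
```

A FastAPI route can return the dictionary from `get_training_status(task_id)` directly, and the client can stop polling once `finished` is true.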

Does this help?

abelBEDOYA · 2023-02-08T12:05:55Z

@thepycoder Good idea! But how can I get that "first" training to use as a template? I have carried out local YOLO trainings (which work correctly without the API), but they don't work as a template :( When I try to clone them and run them in my agent queue from app.clearml, the console says some files are not found, train.yaml for example (see the screenshot).

Thanks for the quick reply!

thepycoder · 2023-02-08T13:01:01Z

Like you were thinking yourself, the agent does need to be able to access your train.yaml file on its local filesystem. You have multiple options to get it there. Here are some:

- Make sure your code is git-tracked and add the train.yaml file to the git repo. The agent will pull the repo and it should find the file!
- Track it separately using e.g. clearml-data, so you can use Dataset.get().get_local_copy() to get a local copy.
- Host it somewhere (e.g. Google Drive) and make sure the agent can download it.

In this way, the agent is no different than e.g. a colab instance, you'll have to give it a way to access your files :)
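For the clearml-data option, the training script could fetch the file at startup along these lines. This is a sketch, assuming train.yaml was uploaded as part of a ClearML dataset; `fetch_train_yaml` and `find_file` are hypothetical helpers, and the project and dataset names are whatever you used when creating the dataset.

```python
from pathlib import Path
from typing import Optional, Union


def find_file(root: Union[str, Path], filename: str) -> Optional[Path]:
    """Return the first file named `filename` anywhere under `root`, or None if absent."""
    matches = sorted(Path(root).rglob(filename))
    return matches[0] if matches else None


def fetch_train_yaml(dataset_project: str, dataset_name: str, filename: str = "train.yaml") -> Path:
    """Download the dataset to the agent's local cache and locate the config file in it."""
    from clearml import Dataset  # lazy import: only needed on the machine that trains

    local_root = Dataset.get(
        dataset_project=dataset_project, dataset_name=dataset_name
    ).get_local_copy()
    found = find_file(local_root, filename)
    if found is None:
        raise FileNotFoundError(f"{filename} not found under {local_root}")
    return found
```

The returned path can then be passed to the training call in place of the hard-coded local path, so the same script works both locally and on the agent.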

abelBEDOYA changed the title ~~Tran remotely from a trainign API~~ Train remotely from a training API Feb 6, 2023
