Train remotely from a training API #139

Open
abelBEDOYA opened this issue Feb 6, 2023 · 3 comments

abelBEDOYA commented Feb 6, 2023

Hi, I've been using ClearML to track my YOLOv8 trainings, which were carried out locally. To do so, I built an API with FastAPI that can launch a training with custom hyperparameters.
[Image: api]

Now I'm trying to do the same remotely, so I set up an agent and a queue (cola_yolo8) served by that agent.
However, if I add execute_remotely() in the API training method:
[Image: metodo_API (the API training method)]

This error shows up in the ClearML task console:
[Image: fallo_clearML (the ClearML error)]

I start the training API with this command: uvicorn main:app --host 0.0.0.0 --port 3000 --reload

I've tried simpler tasks instead of the YOLOv8 training and the error is the same.

Thanks!

@abelBEDOYA changed the title from "Tran remotely from a trainign API" to "Train remotely from a training API" on Feb 6, 2023
@thepycoder

Hi there!

Thanks for filing a bug report :)

First of all, I would never recommend calling model.train in an API route (even if it is never actually executed because of the execute_remotely call). It means your API picks up ML dependencies like Torch, which are heavy and unnecessary there!

What seems to happen is that when executing remotely, ClearML tries to detect your local environment so that it can be recreated on the remote machine. But since you're running it from the API, all of the API's requirements are also detected as dependencies of the task itself :)

That said, I think I see what you're trying to do. I would go for the following flow instead:

  1. Have a "template" training task ready to go that you know will run remotely (let me know how that goes using YOLOv8!). Run this template task once with some arbitrary parameters, just to verify that it works. This part has nothing to do with the API.

  2. Inside the API, you can now clone this task as such: https://clear.ml/docs/latest/docs/clearml_sdk/task_sdk#cloning--executing-tasks

  3. The cloned task is now in draft mode, so you can override its original parameters using: https://clear.ml/docs/latest/docs/references/sdk/task/#update_parameters

  4. Now enqueue the task using the Task.enqueue function as described in the docs link of 2)
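
Steps 2 to 4 could look roughly like the sketch below. It is only an illustration under assumptions: the helper name `launch_from_template`, the task name, and the example parameter key are hypothetical, and the queue name is the one from the original post. `Task.clone`, `update_parameters` and `Task.enqueue` are the SDK calls from the linked docs; the `task_cls` argument exists only so the function can be tried without a ClearML server.

```python
from typing import Any, Dict, Optional


def launch_from_template(
    template_task_id: str,
    overrides: Dict[str, Any],
    queue_name: str = "cola_yolo8",
    task_cls: Optional[Any] = None,
) -> str:
    """Clone a template task, override its parameters, and enqueue the clone."""
    if task_cls is None:
        # Imported lazily: the API process only needs the clearml package,
        # not Torch or any other training dependency.
        from clearml import Task as task_cls

    # Step 2: clone the template; the clone starts out in draft mode.
    cloned = task_cls.clone(
        source_task=template_task_id,
        name="API-launched training",  # hypothetical task name
    )
    # Step 3: override the template's hyperparameters. The keys must match
    # the names shown for the template task (assumed "Section/name" style).
    cloned.update_parameters(overrides)
    # Step 4: put the draft on the queue that the agent is serving.
    task_cls.enqueue(cloned, queue_name=queue_name)
    return cloned.id
```

An API route would then only need something like `launch_from_template("<template-task-id>", {"General/epochs": 50})` and could return the resulting task ID to the client; the ID placeholder and the `General/epochs` key are examples, not values from this thread.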

Now your task should be in the queue, the API can return successfully, and a worker can start working on it. You'll need a second API endpoint that a client can poll (every minute, for example) to get the status. You can of course also just return the task_id from the first endpoint when the task is created, so a client can ask for updates by querying the ClearML server directly.
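
The status lookup behind such a polling endpoint might be as small as the sketch below. The helper name `training_status` and the shape of the returned dictionary are assumptions made for illustration; `Task.get_task` and `get_status` are existing SDK calls, and `task_cls` is again only there so the function can be exercised without a server.

```python
from typing import Any, Dict, Optional


def training_status(task_id: str, task_cls: Optional[Any] = None) -> Dict[str, str]:
    """Look up a task by ID and report its current status string."""
    if task_cls is None:
        # Lazy import keeps the API importable without a configured server.
        from clearml import Task as task_cls

    task = task_cls.get_task(task_id=task_id)
    # get_status() returns the task state as reported by the ClearML server.
    return {"task_id": task_id, "status": str(task.get_status())}
```

A client would call the route wrapping this function with the task ID it received when the training was launched.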

Does this help?

@abelBEDOYA
Author

@thepycoder Good idea! But how can I get that "first" training to use as a template? I have carried out local YOLO trainings (working correctly without the API), but they don't work as a template :( When I try to clone them and run them in my agent queue from app.clearml, the console says some files are not found, the train.yaml for example (check the screenshot).
[Image: Screenshot from 2023-02-08 12-31-14]

Thanks for the quick reply!

@thepycoder

As you suspected, the agent does need to be able to access your train.yaml file on its local filesystem. There are several ways to get it there; here are some:

  • Make sure your code is git-tracked and add the train.yaml file to the git repo. The agent will pull the repo and it should find the file!
  • Track it separately using e.g. clearml-data so you can use Dataset.get().get_local_copy() to get a local copy
  • Host it somewhere (e.g. Google Drive) and make sure the agent can download it.

In this way, the agent is no different from, say, a Colab instance: you'll have to give it a way to access your files :)
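
For the clearml-data option, the training script could resolve the file at start-up along the lines of the sketch below. The helper name `fetch_train_config`, the dataset project and name passed to it, and the assumption that train.yaml sits at the root of the dataset are all hypothetical; `Dataset.get` and `get_local_copy` are the calls mentioned above, and `dataset_cls` is only there so the function can be tried without a ClearML server.

```python
from pathlib import Path
from typing import Any, Optional


def fetch_train_config(
    dataset_project: str,
    dataset_name: str,
    filename: str = "train.yaml",
    dataset_cls: Optional[Any] = None,
) -> Path:
    """Download a ClearML dataset and return the path of one file inside it."""
    if dataset_cls is None:
        # Lazy import so the helper can be imported without clearml installed.
        from clearml import Dataset as dataset_cls

    dataset = dataset_cls.get(
        dataset_project=dataset_project,
        dataset_name=dataset_name,
    )
    # get_local_copy() returns the folder holding the downloaded files.
    local_dir = Path(dataset.get_local_copy())
    config_path = local_dir / filename  # assumes the file is at the dataset root
    if not config_path.is_file():
        raise FileNotFoundError(f"{filename} not found in {local_dir}")
    return config_path
```

Because the lookup runs inside the task itself, it behaves the same way locally and on the agent, which is the point of tracking the file outside the working directory.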
