Bug fix: Auto-scaler should not spin a new instance with task_id = None #534
Hi @tienduccao, I'm looking into it and will update soon. Thanks for contributing! 🙂
Hi @tienduccao, sorry for taking so long - this fix actually interacts with something else we're working on. We're trying to see how to make it work and will be back with an answer soon 🙂
No problem @jkhenning 🙂
Hi @tienduccao. Thanks for the PR! You are right that If you look at the code in |
@@ -39,7 +39,6 @@
export CLEARML_API_ACCESS_KEY='{access_key}'
export CLEARML_API_SECRET_KEY='{secret_key}'
export CLEARML_AUTH_TOKEN='{auth_token}'
source ~/.bashrc
Why remove this line? What issue does it solve?
With this AMI (ami-04129d3de24f88348) there's no .bashrc by default, so it crashes before launching the agent on the target EC2 machine.
@tienduccao in that case we'll need to have some expression there that calls source ~/.bashrc if ~/.bashrc exists, and otherwise does nothing. Simply removing it means it will no longer be called on other AMIs as it is today, which breaks backwards compatibility 🙁
@jkhenning does this sound good to you?
if [ -f ~/.bashrc ];
then
source ~/.bashrc
fi
Yup 🙂
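For reference, the guarded call agreed on in this thread could be sketched as below. One detail worth keeping in mind is that the shell does not expand a tilde inside double quotes, so the sketch tests a path built from `$HOME` instead. The helper name `source_if_exists` is hypothetical and not part of the ClearML code base; this is only an illustration of the idea, not the change that was merged.

```shell
# Hypothetical helper (not part of ClearML): source a file only if it exists.
# Returns 0 when the file is missing, so the calling script keeps going.
source_if_exists() {
    if [ -f "$1" ]; then
        # "." is the POSIX spelling of bash's "source"
        . "$1"
    fi
}
```

In a user data script this would be invoked as `source_if_exists "$HOME/.bashrc"`, which keeps the existing behavior on AMIs that ship a .bashrc and does nothing on those that do not.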
Hi @tebeka , thanks for your reply.
Yes, I tried to run a hyperparameter optimization on the queue created by the Autoscaler. Please take a look at my config file to see if there's something wrong here.
And I'm not sure if this is relevant, but I don't have an AWS root account. I ran all of these experiments using an ARN provided by my company's cloud engineers.
Hi @tienduccao
It's being passed in the
I assume Also, when you run a task, make sure to specify the queue, as in:
from clearml import Task
task = Task.init(project_name='gpu_project', task_name='gpu_task')
task.execute_remotely('gpu_queue') # Will exit the program
Hi @tebeka
I did look at
So with
In fact I launched my tasks via a Hyperparameter optimizer (by following this example). |
Hi @tienduccao
No, it'll launch the agent on the queue and it'll start taking tasks from this queue.
Ah, I see.
Thanks. I'll try to reproduce. |
Hi @tebeka
Sorry, but it's still unclear to me when the task is sent to the queue 🤔.
I meant I was able to launch the instances with the Autoscaler by using the
Thanks |
@tienduccao Just to make sure we're on the same page here - when using ClearML Agents (which is what the autoscaler is designed to run on the instances), they are monitoring the system queues and pulling tasks from them. A Task can be enqueued by manually choosing "Enqueue" in the UI, or by specifying a queue when calling |
Hi @jkhenning, I don't really get this: "An instance running a specific task is an edge case, not the standard use-case 🙂". Which instance and which task are you referring to in my use case?
Well, it seems to me you're trying to make sure that every instance goes up with a specific task ID for it to run, when the autoscaler design (influenced by the ClearML Agent workflow) is to detect when monitored queues have pending tasks, and start instances that should keep taking tasks from the queues - the design is not to explicitly specify which instance runs which task, but specify in the configuration which instance type uses which queues. This way queues are the main means of controlling which resource type your task will run on, so you only have to enqueue your task to the appropriate queue. |
Hi @jkhenning , yes I understand the design. |
I think I finally understood the problem 😄 . |
I think I'll do another test and let you guys know soon. |
@tienduccao thanks! I think the |
Hi @jkhenning and @tebeka, I'd like to discuss the last time about the empty The following script is generated by
Since the
I'm not really sure whether the next line ( |
Hi @tienduccao Here's what I get when I generate the user data script:
As you can see, since What version of clearml are you using? |
@tienduccao Can you attach here the AWS instance log? Something like: |
Hi @tebeka , I did my last test and I could now confirm that there's no problem with the existing Autoscaler code 😄 . |
Hi, after a discussion in the clearml-community Slack channel, I figured out the solution and decided to make this PR.
Thanks in advance for your feedback.