
Bug fix: Auto-scaler should not spin a new instance with task_id = None #534

Closed
wants to merge 2 commits

Conversation

tienduccao

Hi, after a discussion on the clearml-community Slack channel, I figured out the solution and decided to make this PR.
Thanks in advance for your feedback.

@jkhenning

Hi @tienduccao, I'm looking into it - will update soon. Thanks for contributing! 🙂

@jkhenning

Hi @tienduccao, sorry for taking so long - this fix actually interacts with something else we're working on. We're trying to see how to make it work and will be back with an answer soon 🙂

@tienduccao

No problem @jkhenning 🙂

@tebeka

tebeka commented Jan 16, 2022

Hi @tienduccao. Thanks for the PR!

You are right that task_id is always None. This is true for the current code in clearml; however, we have other code extending this, and there task_id can be something else. I apologize for the confusion and will try to make the code more understandable.

If you look at the code in cloud_driver.py you will see that the spawned worker does not serve a single task but reads tasks from the queue, so even when task_id is None the worker is OK. In light of this I guess the underlying issue you have is something else. Did you run the task on the specific queue that the worker is listening on?

@@ -39,7 +39,6 @@
 export CLEARML_API_ACCESS_KEY='{access_key}'
 export CLEARML_API_SECRET_KEY='{secret_key}'
 export CLEARML_AUTH_TOKEN='{auth_token}'
-source ~/.bashrc

Why remove this line? What issue does it solve?

Author

With this AMI ami-04129d3de24f88348 there's no .bashrc by default, so it crashes before launching the agent on the target EC2 machine.

Member

@tienduccao in that case we'll need to have some expression there that calls source ~/.bashrc if ~/.bashrc exists, and otherwise does nothing. Simply removing it means it will not be called in other AMIs as it is today, which breaks backward compatibility 🙁

Author

@jkhenning does this sound good to you?

if [ -f "~/.bashrc" ]; 
then
    source ~/.bashrc
fi

Member

Yup 🙂

@tienduccao

> Hi @tienduccao. Thanks for the PR!
>
> You are right that task_id is always None. This is true for the current code in clearml; however, we have other code extending this, and there task_id can be something else. I apologize for the confusion and will try to make the code more understandable.
>
> If you look at the code in cloud_driver.py you will see that the spawned worker does not serve a single task but reads tasks from the queue, so even when task_id is None the worker is OK. In light of this I guess the underlying issue you have is something else.

Hi @tebeka, thanks for your reply.
Actually I couldn't get my tasks executed in the spawned EC2 instances, so I tried to debug and realized that task_id is always None.
Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?

> Did you run the task on the specific queue that the worker is listening on?

Yes, I tried to do a hyperparameter optimization on the queue created by the Autoscaler.

Please take a look at my config file to see if there's something wrong here.

configurations:
  extra_clearml_conf: ''
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gpu_queue:
    - - aws4gpu
      - 3
  resource_configurations:
    aws4gpu:
      ami_id: ami-04129d3de24f88348
      availability_zone: us-west-2c
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp2
      instance_type: g3s.xlarge
      is_spot: false
      key_name: gpu
      security_group_ids: null
hyper_params:
  cloud_credentials_key: ''
  cloud_credentials_region: us-west-2
  cloud_credentials_secret: ''
  cloud_provider: ''
  default_docker_image: nvidia/cuda:10.1-runtime-ubuntu18.04
  git_user: '<my gitlab user name>'
  git_pass: '<my gitlab personal access token>'
  max_idle_time_min: 10
  max_spin_up_time_min: 30
  polling_interval_time_min: 5
  use_credentials_chain: true
  workers_prefix: dynamic_worker

And I'm not sure if this is relevant but I don't have an AWS root account. I ran all of these experiments using an ARN provided by my company's cloud engineers.

@tebeka

tebeka commented Jan 18, 2022

Hi @tienduccao

> Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?

It's being passed in the {driver_extra} parameter, see cloud_driver.py.

> Please take a look at my config file to see if there's something wrong here.
> ...

I assume cloud_credentials_key and cloud_credentials_secret are not empty in the real configuration; they are used by the underlying boto3 library to spin up machines. Otherwise the configuration looks OK, I'll try to start a scaler with this config later this week to double-check.

Also, when you run a task, make sure to specify the queue, as in:

task = Task.init(project_name='gpu_project', task_name='gpu_task')
task.execute_remotely('gpu_queue')  # Will exit the program

@tienduccao

Hi @tebeka

> Hi @tienduccao
>
> > Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?
>
> It's being passed in the {driver_extra} parameter, see cloud_driver.py.

I did look at cloud_driver.py before making this PR.
Here's what I found:

def driver_bash_extra(self, task_id):
    if not task_id:
        return ''
    return 'python -m clearml_agent --config-file ~/clearml.conf execute --id {}'.format(task_id)

So with task_id = None, the agent couldn't launch the expected task, right?

> Please take a look at my config file to see if there's something wrong here.
> ...
>
> I assume cloud_credentials_key and cloud_credentials_secret are not empty in the real configuration; they are used by the underlying boto3 library to spin up machines. Otherwise the configuration looks OK, I'll try to start a scaler with this config later this week to double-check.
No, they are indeed empty. I used the use_credentials_chain: true option to launch the EC2 instances.
In my previous reply, I mentioned that I couldn't launch EC2 instances directly with my access and secret key, but it's doable via an ARN.
And according to many tests, it did work.
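
For background, with use_credentials_chain enabled the key fields can stay empty and boto3 resolves credentials through its default chain (environment variables, shared AWS config, or an attached/assumed IAM role). A minimal sketch of that behaviour, independent of clearml and using placeholder values:

import boto3

# With no explicit keys passed, boto3 resolves credentials from its default chain:
# environment variables, ~/.aws/credentials, or an attached/assumed IAM role.
ec2 = boto3.client('ec2', region_name='us-west-2')
# Spot-check that the resolved credentials can list instances in the region.
print(ec2.describe_instances(MaxResults=5)['Reservations'])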

> Also, when you run a task, make sure to specify the queue, as in:
>
> task = Task.init(project_name='gpu_project', task_name='gpu_task')
> task.execute_remotely('gpu_queue')  # Will exit the program

In fact I launched my tasks via a Hyperparameter optimizer (by following this example).
And I did assign the correct queue name to my tasks.
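
As an aside, the queue assignment in that workflow typically goes through the optimizer's execution_queue argument; a minimal sketch, assuming the standard clearml HyperParameterOptimizer API and a placeholder base task ID:

from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

# '<base_task_id>' is a placeholder for the template task the optimizer clones.
optimizer = HyperParameterOptimizer(
    base_task_id='<base_task_id>',
    hyper_parameters=[DiscreteParameterRange('General/batch_size', values=[32, 64, 128])],
    objective_metric_title='accuracy',
    objective_metric_series='validation',
    objective_metric_sign='max',
    optimizer_class=RandomSearch,
    execution_queue='gpu_queue',  # the queue the autoscaler's dynamic workers listen on
    max_number_of_concurrent_tasks=3,
)
optimizer.start()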

@tebeka

tebeka commented Jan 20, 2022

Hi @tienduccao

> I did look at cloud_driver.py before making this PR. Here's what I found:
>
> def driver_bash_extra(self, task_id):
>     if not task_id:
>         return ''
>     return 'python -m clearml_agent --config-file ~/clearml.conf execute --id {}'.format(task_id)
>
> So with task_id = None, the agent couldn't launch the expected task, right?

No, it'll launch the agent on the queue and it'll start taking tasks from this queue.

> No, they are indeed empty. I used the use_credentials_chain: true option to launch the EC2 instances.
> In my previous reply, I mentioned that I couldn't launch EC2 instances directly with my access and secret key, but it's doable via an ARN.

Ah, I see.

> And according to many tests, it did work.

Does that mean launching instances manually, or via the autoscaler?

> In fact I launched my tasks via a Hyperparameter optimizer (by following this example). And I did assign the correct queue name to my tasks.

Thanks. I'll try to reproduce.

@tienduccao

tienduccao commented Jan 20, 2022

Hi @tebeka

> No, it'll launch the agent on the queue and it'll start taking tasks from this queue.

Sorry, but it's still unclear to me when the task is sent to the queue 🤔.
And without the modifications from this PR, I saw my EC2 instances keep running for hours with no tasks executed.

> > And according to many tests, it did work.
>
> Does that mean launching instances manually, or via the autoscaler?

I meant I was able to launch the instances with the Autoscaler by using the use_credentials_chain: true option.

> In fact I launched my tasks via a Hyperparameter optimizer (by following this example). And I did assign the correct queue name to my tasks.

> Thanks. I'll try to reproduce.

Thanks

@jkhenning

jkhenning commented Jan 20, 2022

@tienduccao Just to make sure we're on the same page here - when using ClearML Agents (which is what the autoscaler is designed to run on the instances), they monitor the system queues and pull tasks from them. A Task can be enqueued by manually choosing "Enqueue" in the UI, or by specifying a queue when calling task.execute_remotely() as shown here.
An instance running a specific task is an edge case, not the standard use case 🙂
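
For illustration, the manual "Enqueue" action also has a programmatic counterpart; a minimal sketch, assuming the clearml SDK's Task.get_task and Task.enqueue classmethods and placeholder project/task names:

from clearml import Task

# Look up an existing (draft) task by placeholder names and push it onto the GPU queue;
# an agent daemon listening on 'gpu_queue' will then pull and run it.
task = Task.get_task(project_name='gpu_project', task_name='gpu_task')
Task.enqueue(task, queue_name='gpu_queue')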

@tienduccao

> @tienduccao Just to make sure we're on the same page here - when using ClearML Agents (which is what the autoscaler is designed to run on the instances), they monitor the system queues and pull tasks from them. A Task can be enqueued by manually choosing "Enqueue" in the UI, or by specifying a queue when calling task.execute_remotely() as shown here. An instance running a specific task is an edge case, not the standard use case 🙂

Hi @jkhenning, I don't really get "An instance running a specific task is an edge case, not the standard use case". Which instance and which task are you referring to in my use case?

@jkhenning

> Which instance and which task are you referring to in my use case?

Well, it seems to me you're trying to make sure that every instance goes up with a specific task ID for it to run, when the autoscaler design (influenced by the ClearML Agent workflow) is to detect when monitored queues have pending tasks and start instances that keep taking tasks from those queues. The design is not to explicitly specify which instance runs which task, but to specify in the configuration which instance type serves which queues. This way queues are the main means of controlling which resource type your task will run on, so you only have to enqueue your task to the appropriate queue.

@tienduccao

tienduccao commented Jan 20, 2022

Hi @jkhenning, yes, I understand the design.
But it's not clear to me when the tasks were enqueued.

@tienduccao

I think I finally understood the problem 😄.
Thanks @jkhenning and @tebeka for your explanations.
We can close this PR now.

@tienduccao

I think I'll do another test and let you guys know soon.

@jkhenning

@tienduccao thanks! I think the .bashrc fix is a good one, so either we keep it here as the only change or you can open another PR - it would be a shame not to do it 🙂

@tienduccao

Hi @jkhenning and @tebeka, I'd like to discuss the empty task_id one last time.

The following script is generated by bash_script_template from cloud_driver.py, and it is used as the User Data of the EC2 instance.

#!/bin/bash

set -x
set -e

apt-get update
apt-get install -y         build-essential         gcc         git         python3-dev         python3-pip
python3 -m pip install -U pip
python3 -m pip install virtualenv
python3 -m virtualenv clearml_agent_venv
source clearml_agent_venv/bin/activate
python -m pip install clearml-agent
cat << EOF >> ~/clearml.conf

EOF
export CLEARML_API_HOST=https://api.community.clear.ml
export CLEARML_WEB_HOST=https://app.community.clear.ml
export CLEARML_FILES_HOST=https://files.community.clear.ml
export DYNAMIC_INSTANCE_ID=$(curl http://169.254.169.254/latest/meta-data/instance-id)
export CLEARML_WORKER_ID=dynamic_worker:aws4gpu:g3s.xlarge:$DYNAMIC_INSTANCE_ID
export CLEARML_API_ACCESS_KEY='<>'
export CLEARML_API_SECRET_KEY='<>'
export CLEARML_AUTH_TOKEN=''

python -m clearml_agent --config-file ~/clearml.conf execute --id 
python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'gpu_queue' --docker 'nvidia/cuda:10.1-runtime-ubuntu18.04'
shutdown

Since task_id is None, we have this line: python -m clearml_agent --config-file ~/clearml.conf execute --id.
And this command raises an error like this:

usage: clearml-agent execute [-h] --id TASK_ID [--log-file LOG_FILE] [--disable-monitoring] [--full-monitoring] [--require-queue] [--standalone-mode] [--docker [DOCKER [DOCKER ...]]] [--clone] [-O] [--git-user GIT_USER]
                             [--git-pass GIT_PASS] [--log-level {DEBUG,INFO,WARN,WARNING,ERROR,CRITICAL}] [--gpus GPUS] [--cpu-only]
clearml-agent execute: error: argument --id: expected one argument

I'm not really sure whether the next line (python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'gpu_queue' --docker 'nvidia/cuda:10.1-runtime-ubuntu18.04') will be executed or not.
And if it doesn't get executed, we won't have a worker on the launched EC2 instance listening to 'gpu_queue'.
Do you have any idea how to verify this assumption?
Thanks.
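
For what it's worth, because the generated script starts with set -e, a failing command should abort it before the daemon line is reached. A minimal sketch to check that behaviour locally, with placeholder commands standing in for the real user-data script:

#!/bin/bash
set -x
set -e

false                    # stands in for the failing 'clearml_agent execute --id' call
echo "daemon line runs"  # never reached: 'set -e' makes the script exit at 'false' above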

@tebeka

tebeka commented Jan 27, 2022

Hi @tienduccao

Here's what I get when I generate the user data script:

In [1]: import yaml
   ...: from clearml.automation.aws_driver import AWSDriver
In [2]: with open('../aws_autoscaler.yaml') as fp:
   ...:     data = yaml.safe_load(fp)
In [3]: drv = AWSDriver.from_config(data)
In [4]: s = drv.gen_user_data('aws-scaler-', 'q1', None)
In [5]: print(s)
#!/bin/bash

set -x
set -e

apt-get update
apt-get install -y         build-essential         gcc         git         python3-dev         python3-pip
python3 -m pip install -U pip
python3 -m pip install virtualenv
python3 -m virtualenv clearml_agent_venv
source clearml_agent_venv/bin/activate
python -m pip install clearml-agent
cat << EOF >> ~/clearml.conf
agent.git_user=""
agent.git_pass=""



EOF
export CLEARML_API_HOST=https://api.community.clear.ml
export CLEARML_WEB_HOST=https://app.community.clear.ml
export CLEARML_FILES_HOST=https://files.community.clear.ml
export DYNAMIC_INSTANCE_ID=$(curl http://169.254.169.254/latest/meta-data/instance-id)
export CLEARML_WORKER_ID=aws-scaler-:$DYNAMIC_INSTANCE_ID
export CLEARML_API_ACCESS_KEY='XXXX'
export CLEARML_API_SECRET_KEY='XXXX'
export CLEARML_AUTH_TOKEN=''
source ~/.bashrc


python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'q1' 
shutdown

As you can see, since task_id is None, the execute line is not in the script.

What version of clearml are you using?

@tebeka

tebeka commented Jan 27, 2022

@tienduccao Can you attach here the AWS instance log?

Something like: aws ec2 get-console-output --instance-id <INSTANCE ID>

@tienduccao

Hi @tebeka, I did my last test and I can now confirm that there's no problem with the existing Autoscaler code 😄.
It's weird that I couldn't reproduce my problem, but that's the best outcome.
Thank you and @jkhenning for your time.

@tienduccao closed this Jan 27, 2022