
Bug fix: Auto-scaler should not spin a new instance with task_id = None #534

Closed
wants to merge 2 commits

Conversation

tienduccao

Hi, after a discussion on the clearml-community Slack channel, I figured out the solution and decided to make this PR.
Thanks in advance for your feedback.

@jkhenning

Hi @tienduccao, I'm looking into it - will update soon. Thanks for contributing! 🙂

@jkhenning

Hi @tienduccao, sorry for taking so long - this fix actually interacts with something else we're working on. We're trying to see how to make it work and will be back with an answer soon 🙂

@tienduccao

No problem @jkhenning 🙂

@tebeka

tebeka commented Jan 16, 2022

Hi @tienduccao. Thanks for the PR!

You are right that task_id is always None. This is true for the current code in clearml; however, we have other code extending this, and there task_id can be something else. I apologize for the confusion and will try to make the code more understandable.

If you look at the code in cloud_driver.py you will see that the spawned worker does not serve a single task but reads tasks from the queue, so even when task_id is None the worker is OK. In light of this I guess the underlying issue you have is something else. Did you run the task on the specific queue that the worker is listening on?

@@ -39,7 +39,6 @@
 export CLEARML_API_ACCESS_KEY='{access_key}'
 export CLEARML_API_SECRET_KEY='{secret_key}'
 export CLEARML_AUTH_TOKEN='{auth_token}'
-source ~/.bashrc

Why remove this line? What issue does it solve?

Author

With this AMI ami-04129d3de24f88348 there's no .bashrc by default, so it crashes before launching the agent on the target EC2 machine.

Member

@tienduccao in that case we'll need to have some expression there that calls source ~/.bashrc if ~/.bashrc exists, and otherwise does nothing. Simply removing it means it will not be called in other AMIs as it is today, which breaks backward compatibility 🙁

Author

@jkhenning does this sound good to you?

if [ -f "~/.bashrc" ]; 
then
    source ~/.bashrc
fi

Member

Yup 🙂

@tienduccao

> Hi @tienduccao. Thanks for the PR!
>
> You are right that task_id is always None. This is true for the current code in clearml; however, we have other code extending this, and there task_id can be something else. I apologize for the confusion and will try to make the code more understandable.
>
> If you look at the code in cloud_driver.py you will see that the spawned worker does not serve a single task but reads tasks from the queue, so even when task_id is None the worker is OK. In light of this I guess the underlying issue you have is something else.

Hi @tebeka, thanks for your reply.
Actually I couldn't get my tasks executed in the spawned EC2 instances, so I tried to debug and realized that task_id is always None.
Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?

> Did you run the task on the specific queue that the worker is listening on?

Yes, I tried to do a hyperparameter optimization on the queue created by the Autoscaler.

Please take a look at my config file to see if there's something wrong here.

configurations:
  extra_clearml_conf: ''
  extra_trains_conf: ''
  extra_vm_bash_script: ''
  queues:
    gpu_queue:
    - - aws4gpu
      - 3
  resource_configurations:
    aws4gpu:
      ami_id: ami-04129d3de24f88348
      availability_zone: us-west-2c
      ebs_device_name: /dev/sda1
      ebs_volume_size: 100
      ebs_volume_type: gp2
      instance_type: g3s.xlarge
      is_spot: false
      key_name: gpu
      security_group_ids: null
hyper_params:
  cloud_credentials_key: ''
  cloud_credentials_region: us-west-2
  cloud_credentials_secret: ''
  cloud_provider: ''
  default_docker_image: nvidia/cuda:10.1-runtime-ubuntu18.04
  git_user: '<my gitlab user name>'
  git_pass: '<my gitlab personal access token>'
  max_idle_time_min: 10
  max_spin_up_time_min: 30
  polling_interval_time_min: 5
  use_credentials_chain: true
  workers_prefix: dynamic_worker

And I'm not sure if this is relevant but I don't have an AWS root account. I ran all of these experiments using an ARN provided by my company's cloud engineers.

@tebeka

tebeka commented Jan 18, 2022

Hi @tienduccao

> Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?

It's being passed in the {driver_extra} parameter, see cloud_driver.py.

> Please take a look at my config file to see if there's something wrong here.
> ...

I assume cloud_credentials_key and cloud_credentials_secret are not empty in the real configuration; they are used by the underlying boto3 library to spin up machines. Otherwise the configuration looks OK, I'll try to start a scaler with this config later this week to double-check.

Also, when you run a task, make sure to specify the queue, as in:

task = Task.init(project_name='gpu_project', task_name='gpu_task')
task.execute_remotely('gpu_queue')  # Will exit the program

@tienduccao

Hi @tebeka

> Hi @tienduccao
>
> > Could you elaborate in more detail about how the correct task_id is sent to an EC2 instance with the current code self.driver.spin_up_worker(resource_conf, worker_prefix, queue, task_id=None)?
>
> It's being passed in the {driver_extra} parameter, see cloud_driver.py.

I did look at cloud_driver.py before making this PR.
Here's what I found:

def driver_bash_extra(self, task_id):
    if not task_id:
        return ''
    return 'python -m clearml_agent --config-file ~/clearml.conf execute --id {}'.format(task_id)

So with task_id = None, the agent couldn't launch the expected task, right?

> Please take a look at my config file to see if there's something wrong here.
> ...
>
> I assume cloud_credentials_key and cloud_credentials_secret are not empty in the real configuration; they are used by the underlying boto3 library to spin up machines. Otherwise the configuration looks OK, I'll try to start a scaler with this config later this week to double-check.
No, they are indeed empty. I used the use_credentials_chain: true option to launch the EC2 instances.
In my previous reply, I mentioned that I couldn't launch EC2 instances directly with my access and secret key, but it's doable via an ARN.
And according to many tests, it did work.
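
For background, with use_credentials_chain enabled the key fields can stay empty and boto3 resolves credentials through its default chain (environment variables, shared AWS config, or an attached/assumed IAM role). A minimal sketch of that behaviour, independent of clearml and using placeholder values:

import boto3

# With no explicit keys passed, boto3 resolves credentials from its default chain:
# environment variables, ~/.aws/credentials, or an attached/assumed IAM role.
ec2 = boto3.client('ec2', region_name='us-west-2')
# Spot-check that the resolved credentials can list instances in the region.
print(ec2.describe_instances(MaxResults=5)['Reservations'])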

> Also, when you run a task, make sure to specify the queue, as in:
>
> task = Task.init(project_name='gpu_project', task_name='gpu_task')
> task.execute_remotely('gpu_queue')  # Will exit the program

In fact I launched my tasks via a Hyperparameter optimizer (by following this example).
And I did assign the correct queue name to my tasks.
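
As an aside, the queue assignment in that workflow typically goes through the optimizer's execution_queue argument; a minimal sketch, assuming the standard clearml HyperParameterOptimizer API and a placeholder base task ID:

from clearml.automation import DiscreteParameterRange, HyperParameterOptimizer, RandomSearch

# '<base_task_id>' is a placeholder for the template task the optimizer clones.
optimizer = HyperParameterOptimizer(
    base_task_id='<base_task_id>',
    hyper_parameters=[DiscreteParameterRange('General/batch_size', values=[32, 64, 128])],
    objective_metric_title='accuracy',
    objective_metric_series='validation',
    objective_metric_sign='max',
    optimizer_class=RandomSearch,
    execution_queue='gpu_queue',  # the queue the autoscaler's dynamic workers listen on
    max_number_of_concurrent_tasks=3,
)
optimizer.start()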

@tebeka

tebeka commented Jan 20, 2022

Hi @tienduccao

> I did look at cloud_driver.py before making this PR. Here's what I found:
>
> def driver_bash_extra(self, task_id):
>     if not task_id:
>         return ''
>     return 'python -m clearml_agent --config-file ~/clearml.conf execute --id {}'.format(task_id)
>
> So with task_id = None, the agent couldn't launch the expected task, right?

No, it'll launch the agent on the queue and it'll start taking tasks from this queue.

> No, they are indeed empty. I used the use_credentials_chain: true option to launch the EC2 instances.
> In my previous reply, I mentioned that I couldn't launch EC2 instances directly with my access and secret key, but it's doable via an ARN.

Ah, I see.

> And according to many tests, it did work.

Does that mean launching instances manually, or via the autoscaler?

> In fact I launched my tasks via a Hyperparameter optimizer (by following this example). And I did assign the correct queue name to my tasks.

Thanks. I'll try to reproduce.

@tienduccao

tienduccao commented Jan 20, 2022

Hi @tebeka

> No, it'll launch the agent on the queue and it'll start taking tasks from this queue.

Sorry, but it's still unclear to me when the task is sent to the queue 🤔.
And without the modifications from this PR, I saw my EC2 instances keep running for hours with no tasks executed.

> > And according to many tests, it did work.
>
> Does that mean launching instances manually, or via the autoscaler?

I meant I was able to launch the instances with the Autoscaler by using the use_credentials_chain: true option.

> In fact I launched my tasks via a Hyperparameter optimizer (by following this example). And I did assign the correct queue name to my tasks.

> Thanks. I'll try to reproduce.

Thanks

@jkhenning

jkhenning commented Jan 20, 2022

@tienduccao Just to make sure we're on the same page here - when using ClearML Agents (which is what the autoscaler is designed to run on the instances), they monitor the system queues and pull tasks from them. A Task can be enqueued by manually choosing "Enqueue" in the UI, or by specifying a queue when calling task.execute_remotely() as shown here.
An instance running a specific task is an edge case, not the standard use case 🙂
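
For illustration, the manual "Enqueue" action also has a programmatic counterpart; a minimal sketch, assuming the clearml SDK's Task.get_task and Task.enqueue classmethods and placeholder project/task names:

from clearml import Task

# Look up an existing (draft) task by placeholder names and push it onto the GPU queue;
# an agent daemon listening on 'gpu_queue' will then pull and run it.
task = Task.get_task(project_name='gpu_project', task_name='gpu_task')
Task.enqueue(task, queue_name='gpu_queue')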

@tienduccao

> @tienduccao Just to make sure we're on the same page here - when using ClearML Agents (which is what the autoscaler is designed to run on the instances), they monitor the system queues and pull tasks from them. A Task can be enqueued by manually choosing "Enqueue" in the UI, or by specifying a queue when calling task.execute_remotely() as shown here. An instance running a specific task is an edge case, not the standard use case 🙂

Hi @jkhenning, I don't really get "An instance running a specific task is an edge case, not the standard use case". Which instance and which task are you referring to in my use case?

@jkhenning

> Which instance and which task are you referring to in my use case?

Well, it seems to me you're trying to make sure that every instance goes up with a specific task ID for it to run, when the autoscaler design (influenced by the ClearML Agent workflow) is to detect when monitored queues have pending tasks and start instances that keep taking tasks from those queues. The design is not to explicitly specify which instance runs which task, but to specify in the configuration which instance type serves which queues. This way queues are the main means of controlling which resource type your task will run on, so you only have to enqueue your task to the appropriate queue.

@tienduccao

tienduccao commented Jan 20, 2022

Hi @jkhenning, yes, I understand the design.
But it's not clear to me when the tasks were enqueued.

@tienduccao

I think I finally understood the problem 😄.
Thanks @jkhenning and @tebeka for your explanations.
We can close this PR now.

@tienduccao

I think I'll do another test and let you guys know soon.

@jkhenning

@tienduccao thanks! I think the .bashrc fix is a good one, so either we keep it here as the only change or you can open another PR - it would be a shame not to do it 🙂

@tienduccao

Hi @jkhenning and @tebeka, I'd like to discuss the empty task_id one last time.

The following script is generated by bash_script_template from cloud_driver.py, and it is used as the User Data of the EC2 instance.

#!/bin/bash

set -x
set -e

apt-get update
apt-get install -y         build-essential         gcc         git         python3-dev         python3-pip
python3 -m pip install -U pip
python3 -m pip install virtualenv
python3 -m virtualenv clearml_agent_venv
source clearml_agent_venv/bin/activate
python -m pip install clearml-agent
cat << EOF >> ~/clearml.conf

EOF
export CLEARML_API_HOST=https://api.community.clear.ml
export CLEARML_WEB_HOST=https://app.community.clear.ml
export CLEARML_FILES_HOST=https://files.community.clear.ml
export DYNAMIC_INSTANCE_ID=$(curl http://169.254.169.254/latest/meta-data/instance-id)
export CLEARML_WORKER_ID=dynamic_worker:aws4gpu:g3s.xlarge:$DYNAMIC_INSTANCE_ID
export CLEARML_API_ACCESS_KEY='<>'
export CLEARML_API_SECRET_KEY='<>'
export CLEARML_AUTH_TOKEN=''

python -m clearml_agent --config-file ~/clearml.conf execute --id 
python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'gpu_queue' --docker 'nvidia/cuda:10.1-runtime-ubuntu18.04'
shutdown

Since task_id is None, we have this line: python -m clearml_agent --config-file ~/clearml.conf execute --id.
And this command raises an error like this:

usage: clearml-agent execute [-h] --id TASK_ID [--log-file LOG_FILE] [--disable-monitoring] [--full-monitoring] [--require-queue] [--standalone-mode] [--docker [DOCKER [DOCKER ...]]] [--clone] [-O] [--git-user GIT_USER]
                             [--git-pass GIT_PASS] [--log-level {DEBUG,INFO,WARN,WARNING,ERROR,CRITICAL}] [--gpus GPUS] [--cpu-only]
clearml-agent execute: error: argument --id: expected one argument

I'm not really sure whether the next line (python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'gpu_queue' --docker 'nvidia/cuda:10.1-runtime-ubuntu18.04') will be executed or not.
And if it doesn't get executed, we won't have a worker on the launched EC2 instance listening to 'gpu_queue'.
Do you have any idea how to verify this assumption?
Thanks.
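
For what it's worth, because the generated script starts with set -e, a failing command should abort it before the daemon line is reached. A minimal sketch to check that behaviour locally, with placeholder commands standing in for the real user-data script:

#!/bin/bash
set -x
set -e

false                    # stands in for the failing 'clearml_agent execute --id' call
echo "daemon line runs"  # never reached: 'set -e' makes the script exit at 'false' above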

@tebeka

tebeka commented Jan 27, 2022

Hi @tienduccao

Here's what I get when I generate the user data script:

In [1]: import yaml
   ...: from clearml.automation.aws_driver import AWSDriver
In [2]: with open('../aws_autoscaler.yaml') as fp:
   ...:     data = yaml.safe_load(fp)
In [3]: drv = AWSDriver.from_config(data)
In [4]: s = drv.gen_user_data('aws-scaler-', 'q1', None)
In [5]: print(s)
#!/bin/bash

set -x
set -e

apt-get update
apt-get install -y         build-essential         gcc         git         python3-dev         python3-pip
python3 -m pip install -U pip
python3 -m pip install virtualenv
python3 -m virtualenv clearml_agent_venv
source clearml_agent_venv/bin/activate
python -m pip install clearml-agent
cat << EOF >> ~/clearml.conf
agent.git_user=""
agent.git_pass=""



EOF
export CLEARML_API_HOST=https://api.community.clear.ml
export CLEARML_WEB_HOST=https://app.community.clear.ml
export CLEARML_FILES_HOST=https://files.community.clear.ml
export DYNAMIC_INSTANCE_ID=$(curl http://169.254.169.254/latest/meta-data/instance-id)
export CLEARML_WORKER_ID=aws-scaler-:$DYNAMIC_INSTANCE_ID
export CLEARML_API_ACCESS_KEY='XXXX'
export CLEARML_API_SECRET_KEY='XXXX'
export CLEARML_AUTH_TOKEN=''
source ~/.bashrc


python -m clearml_agent --config-file ~/clearml.conf daemon --queue 'q1' 
shutdown

As you can see, since task_id is None, the execute line is not in the script.

What version of clearml are you using?

@tebeka

tebeka commented Jan 27, 2022

@tienduccao Can you attach here the AWS instance log?

Something like: aws ec2 get-console-output --instance-id <INSTANCE ID>

@tienduccao

Hi @tebeka, I did my last test and I can now confirm that there's no problem with the existing Autoscaler code 😄.
It's weird that I couldn't reproduce my problem, but that's the best outcome.
Thank you and @jkhenning for your time.

@tienduccao closed this Jan 27, 2022