Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(aws): add market type to allow usage of spot instances #24

Merged
merged 1 commit into from Oct 22, 2018

Conversation

Pajk
Copy link
Contributor

@Pajk Pajk commented Oct 16, 2018

No description provided.

Copy link
Contributor

@tboerger tboerger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please keep the imports like before, besides that LGTM

@Pajk
Copy link
Contributor Author

Pajk commented Oct 16, 2018

@tboerger I noticed that - VSCode with Go extension formatted it like that automatically. I was not sure if such PR would be welcomed so I didn't bother with details.

The source code of the autoscaler docs is not public, is it?

@tboerger
Copy link
Contributor

The source code of the autoscaler docs is not public, is it?

Not sure where @bradrydzewski have put the sources, asking him :)

@Pajk Pajk force-pushed the feat/spot-instances branch 2 times, most recently from 9a9bbd5 to 648faac Compare Oct 17, 2018
@bradrydzewski
Copy link
Member

the code looks good, but I would caution against using spot instances for auto-scaling. The challenge you will run into is that we have no way to know if / when the spot instance has been terminated, which will prevent the autoscaler from being able to correctly calculate capacity. Also if an instance is terminated while a build is running it will be stuck, and will require manual intervention.

@tboerger
Copy link
Contributor

Shot instances are automatically killed?

@bradrydzewski
Copy link
Member

yes, spot instances are terminated when your bid price falls below market price

@Pajk Pajk force-pushed the feat/spot-instances branch from 648faac to a4b5dfa Compare Oct 18, 2018
@Pajk
Copy link
Contributor Author

Pajk commented Oct 19, 2018

@bradrydzewski We're using spot instances also for our production services and we're quite happy with the way how they work. We don't specify the maximum price and the default is on demand price so it never happens that it's terminated because of the market price. However Amazon terminates it anyway from time to time. From our experience it's usually several weeks. What is worse is that they don't have unlimited capacity of spot instances so if you require only one specific instance type it might happen that they refuse to create an instance for you. So the best practice is to allow amazon to choose from several types of instances or manually try different one when the provisioning fails.

But for our use case what is in this PR is enough. I'm scaling CI quite aggressively and we're using quite a large instances (c5.2xlarge). Only one is running all the time (and I plan to scale it to 0 outside of working hours) and it handles max 2 tasks. If there're more jobs to be processed a new instance is created and it's killed after 1 hour if it's not needed anymore. With spot instances the price is $0.1326/hr instead of $0.5006 which is quite a big difference.

Build artefacts can be cached on S3. The only thing that sucks is that all the docker images has to be re-downloaded on the fresh instances so the first build of each project can be slower. But there's probably no easy way how to avoid that is it?

I have one off topic question - is the autoscaling possible when Drone is deployed in Kubernetes cluster? What would be the recommended way? It would be great to utilize the k8s horizontal pod autoscaler and cluster autoscaler but Drone is not able to run the pipeline as k8s pods, is it?

Btw Drone is great, thanks for it! We're really looking forward to 0.9 and beyond.

@tboerger
Copy link
Contributor

There is a Kubernetes runtime which would use the Kubernetes resources, but AFAIK it's not in a somehow finished state.

@bradrydzewski
Copy link
Member

We don't specify the maximum price and the default is on demand price so it never happens that it's terminated because of the market price

very cool, I did not know it worked this way :)

The challenge you will run into is that we have no way to know if / when the spot instance has been terminated

I am replying to my own comment here ... I forgot that I implemented a pinger that periodically checks to make sure your servers are available. I never had time to test it in production, so it is currently behind a feature flag DRONE_ENABLE_PINGER=true

The only thing that sucks is that all the docker images has to be re-downloaded on the fresh instances so the first build of each project can be slower. But there's probably no easy way how to avoid that is it

You could probably create an ami with all of your commonly used images, and the use this as the default ami to reduce startup time.

@Pajk
Copy link
Contributor Author

Pajk commented Oct 19, 2018

I have the "pinger" enabled but so far I think the instances were always terminated by the autoscaler so the pinger hasn't had a change to show off.

Custom AMI is a good idea but too much hassle I guess.

An option to inject a custom code to the autoscaler's userdata could be just as useful and easier to use. Probably an idea for another PR:).

@bradrydzewski
Copy link
Member

bradrydzewski commented Oct 19, 2018

An option to inject a custom code to the autoscaler's userdata could be just as useful and easier to use. Probably an idea for another PR:).

no need, you can do this today :)

create a cloud-init file on your host machine (where you have the autoscaler installed). Make any customizations that you see fit.

#cloud-config

apt_reboot_if_required: false
package_update: false
package_upgrade: false

apt:
  sources:
    docker.list:
      source: deb [arch=amd64] https://download.docker.com/linux/ubuntu $RELEASE stable
      keyid: 0EBFCD88

packages:
  - docker-ce

write_files:
  - path: /etc/systemd/system/docker.service.d/override.conf
    content: |
      [Service]
      ExecStart=
      ExecStart=/usr/bin/dockerd
  - path: /etc/default/docker
    content: |
      DOCKER_OPTS=""
  - path: /etc/docker/daemon.json
    content: |
      {
        "dns": [ "8.8.8.8", "8.8.4.4" ],
        "hosts": [ "0.0.0.0:2376", "unix:///var/run/docker.sock" ],
        "tls": true,
        "tlsverify": true,
        "tlscacert": "/etc/docker/ca.pem",
        "tlscert": "/etc/docker/server-cert.pem",
        "tlskey": "/etc/docker/server-key.pem"
      }
  - path: /etc/docker/ca.pem
    encoding: b64
    content: {{ .CACert | base64 }}
  - path: /etc/docker/server-cert.pem
    encoding: b64
    content: {{ .TLSCert | base64 }}
  - path: /etc/docker/server-key.pem
    encoding: b64
    content: {{ .TLSKey | base64 }}

runcmd:
  - [ systemctl, daemon-reload ]
  - [ systemctl, restart, docker ]

You will need to provide your custom cloud-init file to the autoscale server. You can mount the file as a volume:

--volume=/path/on/host/init.yml:/path/in/container/init.yml

You will also need to tell the autoscale server where to find the file inside the container:

DRONE_AMAZON_USERDATA_FILE=/path/inside/container/init.yml

@Pajk
Copy link
Contributor Author

Pajk commented Oct 19, 2018

How could I missed that?! Thanks again.

@tboerger
Copy link
Contributor

@bradrydzewski you should remove the docker restart, otherwise it's restarted twice. Totally forgot about it, I have dropped that from my custom cloud init.

Docker gets started after the installation (at least on Ubuntu)

@bradrydzewski
Copy link
Member

ok, I think we are good to merge based on my updated understanding of how spot instances work, and the fact that we have the instance pinger in place. This could still result in some stuck builds, but that is something we should be able to automate (detect and cancel or restart) in a separate patch, once I've had some more time to think through how that would work.

@bradrydzewski bradrydzewski merged commit 2a1e2cf into drone:master Oct 22, 2018
@Pajk Pajk deleted the feat/spot-instances branch Oct 23, 2018
@yurymkomarov
Copy link

Hmm... looks like somebody forgot to update the autoscale documentation :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants