Refactor worker nodes into ECS #29

brietaylor · 2020-04-09T17:37:34Z

There are a few nuisances with our current worker strategy that I think would be eased by moving most of what we've done to an orchestration system like ECS:

1. Log streams are currently generated per instance, so the logs from all N workers get interleaved, which makes it hard to find errors.
2. When a worker crashes, it never gets replaced.
3. Updating the container images is a right pain: docker kill, docker rm, docker pull, find / -name part-001 (the cloud-init script), /path/to/part-001, versus pushing a new launch template and having fresh images in a couple of minutes.
4. Ugly names for the ASGs, like tf-asg-tf-serratus-dl-20200304125312000001, which means we have to "discover" the names before we can adjust desired sizes. This is currently necessary so that all instances get replaced when we change the user_data in the launch configuration. ECS would deal with sending the correct arguments to our scripts.
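To illustrate the name "discovery" mentioned in item 4, here is a minimal sketch. It assumes the list of ASG names has already been fetched from AWS by some other means, and that the generated numeric suffix sorts chronologically. The function name `newest_asg` and every sample name other than the one quoted above are hypothetical.

```python
def newest_asg(names, prefix):
    """Return the newest ASG name that starts with `prefix`, or None.

    Assumption: the generated suffix is a fixed-width timestamp plus counter,
    so the lexicographically largest match is the most recently created one.
    """
    matches = [n for n in names if n.startswith(prefix)]
    return max(matches) if matches else None


# Sample data: the first name is from the issue text, the others are made up.
names = [
    "tf-asg-tf-serratus-dl-20200304125312000001",
    "tf-asg-tf-serratus-dl-20200301090000000001",     # hypothetical older group
    "tf-asg-tf-serratus-align-20200304125312000002",  # hypothetical other worker type
]

print(newest_asg(names, "tf-asg-tf-serratus-dl-"))
```

With ECS services the service name can stay fixed, so a lookup step like this would likely not be needed.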

There are a couple of things to work out first, though:

Will we use the Daemon or Replica scheduling strategy? Daemon doesn't solve 1, but Replica doesn't solve 4. We need a way to force all images to be replaced if we change them.
ECS + CloudWatch Logs
...and more, maybe?
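As a rough sketch of the ECS + CloudWatch Logs item, the snippet below builds the `logConfiguration` part of an ECS container definition using the `awslogs` log driver. With a stream prefix set, each task writes to its own stream named `prefix/container-name/task-id`, which would address the interleaving in item 1. The container name, image, log group, region, and task ID are placeholder assumptions, not values from this project.

```python
import json


def container_definition(name, image, log_group, region, stream_prefix):
    """Build an ECS container definition that logs through the awslogs driver."""
    return {
        "name": name,
        "image": image,
        "essential": True,
        "logConfiguration": {
            "logDriver": "awslogs",
            "options": {
                "awslogs-group": log_group,
                "awslogs-region": region,
                "awslogs-stream-prefix": stream_prefix,
            },
        },
    }


def expected_stream_name(stream_prefix, container_name, task_id):
    """Stream name format ECS uses when awslogs-stream-prefix is set."""
    return f"{stream_prefix}/{container_name}/{task_id}"


# All values below are hypothetical placeholders.
definition = container_definition(
    name="serratus-dl",
    image="example/serratus-dl:latest",
    log_group="serratus-workers",
    region="us-east-1",
    stream_prefix="dl",
)
print(json.dumps(definition, indent=2))
print(expected_stream_name("dl", "serratus-dl", "0123456789abcdef"))
```

The same structure could be expressed in the Terraform `container_definitions` JSON; Python is used here only to keep the sketch self-contained.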

brietaylor added the Terraform and enhancement labels Apr 9, 2020

mathemage added this to Open Tasks in TODO List via automation Apr 18, 2020

ababaian closed this as completed Dec 9, 2020

TODO List automation moved this from Open Tasks to Completed Tasks Dec 9, 2020

ababaian removed this from Completed Tasks in TODO List Dec 9, 2020
