
Update the deploy.yaml workflow to run the model on CI #26

Conversation

@jeancochrane (Contributor) commented Oct 25, 2023:

This PR updates the deploy.yaml workflow added in #25 to run the model on CI using AWS Batch jobs. It also adds a cleanup-terraform.yaml workflow to delete any created AWS resources when PRs are closed.

All of the Batch resources required by PR runs are provisioned by the Terraform code added in this PR; however, some resources that remain consistent across Batch jobs (e.g. IAM roles and policies) are managed outside of this repo and referenced here using Terraform data sources. We will manage those shared resources in the https://github.com/ccao-data/aws-infrastructure repo.
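
For illustration, referencing an externally managed role from this repo's Terraform might look roughly like the following (a minimal sketch; the role name and job definition below are hypothetical, not the actual resources in this PR):

data "aws_iam_role" "batch_execution" {
  # Role is managed in ccao-data/aws-infrastructure, not this repo;
  # the name here is a placeholder for illustration
  name = "ccao-batch-execution-role"
}

resource "aws_batch_job_definition" "example" {
  name                  = "example-model-run"
  type                  = "container"
  platform_capabilities = ["FARGATE"]

  container_properties = jsonencode({
    image            = "ubuntu:22.04"
    executionRoleArn = data.aws_iam_role.batch_execution.arn
    resourceRequirements = [
      { type = "VCPU", value = "1" },
      { type = "MEMORY", value = "2048" }
    ]
  })
}

Because the role comes in through a data source, per-PR terraform destroy runs (like the one in cleanup-terraform.yaml) never touch it.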

Base automatically changed from jeancochrane/22-infra-updates-build-and-push-a-docker-image-for-the-model to master on October 25, 2023 20:13.
@jeancochrane force-pushed the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch three times on November 2, 2023: from 18b5e85 to 1686c4d (20:29), from 1686c4d to 5de0712 (20:35), and from 5de0712 to 202f941 (20:40).
@jeancochrane temporarily deployed to the deploy environment with GitHub Actions on November 2, 2023 20:40.
@jeancochrane marked this pull request as ready for review on November 3, 2023 14:45.
@dfsnow (Member) left a comment:

Awesome stuff @jeancochrane, thanks for putting this together. No major issues but some tweaks around the edges. Really excited to have this up this year.

@@ -0,0 +1,87 @@
name: Setup Terraform
@dfsnow (Member):

I think factoring this out would be wise, considering it's a pattern we're very likely to re-use. How about we make a ccao-data/actions repo, since we're starting to replicate lots of actions in general.
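
As a sketch of what that might look like, the reusable piece could live at something like setup-terraform/action.yaml in the proposed repo (the repo, action name, and inputs below are all hypothetical):

# Hypothetical: ccao-data/actions/setup-terraform/action.yaml
name: Setup Terraform
description: Install Terraform and initialize the working directory
inputs:
  terraform-version:
    description: Terraform version to install
    required: false
    default: "1.5.0"
runs:
  using: composite
  steps:
    - uses: hashicorp/setup-terraform@v2
      with:
        terraform_version: ${{ inputs.terraform-version }}
    # Composite run steps must declare a shell explicitly
    - run: terraform init
      shell: bash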

#
# Images are built on every commit to a PR or main branch in order to ensure
# that the build continues to work properly, but Batch jobs are gated behind
# a `deploy` environment that requires manual approval from a codeowner.
@dfsnow (Member):

Suggested change:
- # a `deploy` environment that requires manual approval from a codeowner.
+ # a `deploy` environment that requires manual approval from an @ccao-data/core-team member.

@jeancochrane (Author):

Fixed in 2a85c14.
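
For reference, gating a job behind a manually approved environment only requires the job to declare it (job and step names below are illustrative, not the workflow's actual contents):

jobs:
  run-model:
    runs-on: ubuntu-latest
    # The job pauses here until a required reviewer approves the run,
    # per the `deploy` environment's protection rules
    environment: deploy
    steps:
      - run: echo "Submitting Batch job..."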

Comment on lines 14 to 17
# "*-assessment-year" are long-lived branches containing the most up-to-date
# models for a given assessment cycle, and hence we consider them to be
# main branches
branches: [master, '*-assessment-year']
@dfsnow (Member):

suggestion: We actually probably don't need a long-lived *-assessment-year branch. We used this pattern in the GitLab days because we had a secondary staging branch on most repos. I thought we might replicate it for modeling just to keep the master branch stable, but now that our PR practices are much tighter I don't think we need it.

@jeancochrane (Author):

Removed in 2a85c14.

(Three resolved review threads on .github/workflows/deploy.yaml)
# workflow that provisions these resources
resource "aws_batch_job_definition" "main" {
  name                  = var.batch_job_name
  platform_capabilities = ["FARGATE"]
@dfsnow (Member):

question: How difficult is it to switch to jobs run by EC2? Just curious since we may want the GPU capabilities.

@jeancochrane (Author):

I think it should be feasible! I switched to Fargate for the MVP since I was finding EC2 to be difficult to debug, but I'd be interested in taking another crack at it now that I have a better sense of how the IAM permissions and networking need to be configured in the Fargate context. I think the key challenge will be making sure that we can get instance logging configured in such a way that failures are debuggable. I opened up #57 to track this effort.

@dfsnow (Member):

Great! Not especially worried about this yet but it may become important once we start training lots of models (December).
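
A rough sketch of what the EC2 variant might look like, should #57 pan out (the instance types, allocation strategy, and variable names below are illustrative):

resource "aws_batch_compute_environment" "ec2" {
  compute_environment_name = "ec2-gpu"
  type                     = "MANAGED"
  service_role             = var.batch_service_role_arn

  compute_resources {
    type                = "EC2"
    allocation_strategy = "BEST_FIT_PROGRESSIVE"
    instance_type       = ["g4dn.xlarge"] # GPU-capable instance family
    min_vcpus           = 0
    max_vcpus           = 64
    # Unlike Fargate, EC2 needs an instance profile for the host instances
    instance_role       = var.ecs_instance_profile_arn
    subnets             = var.subnet_ids
    security_group_ids  = [var.security_group_id]
  }
}

The job definition's platform_capabilities would flip from ["FARGATE"] to ["EC2"] to match.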

(Resolved review threads on terraform/main.tf, .github/workflows/deploy.yaml, and .github/workflows/cleanup-terraform.yaml)
@@ -0,0 +1,119 @@
#!/usr/bin/env bash
@jeancochrane (Author):

Since this logic is shared between the steps that check for job startup and job completion, I factored it out into a shared script.
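
Presumably the shared piece is a polling loop along these lines (function and variable names here are illustrative, not the script's actual contents):

# Shared polling loop: wait for a Batch job to reach a target status.
# Usage: poll_job JOB_ID TARGET_STATUS MAX_RETRIES POLL_INTERVAL_SECONDS
#   e.g. poll_job "$JOB_ID" RUNNING 60 60 for the startup check, or
#        poll_job "$JOB_ID" SUCCEEDED 60 60 for the completion check
poll_job() {
  local job_id="$1" target="$2" max_retries="$3" interval="$4"
  local status i
  for ((i = 0; i < max_retries; i++)); do
    status="$(aws batch describe-jobs \
      --jobs "$job_id" \
      --query 'jobs[0].status' \
      --output text)"
    [[ "$status" == "$target" ]] && return 0
    [[ "$status" == "FAILED" ]] && { echo "Job $job_id failed" >&2; return 1; }
    sleep "$interval"
  done
  echo "Timed out waiting for job $job_id to reach $target" >&2
  return 1
}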

# derive a timeout in second units. There is no equivalent timeout for running
# jobs, because those timeouts can be set on the Batch level, whereas startup
# timeouts are not controllable by Batch
BATCH_JOB_POLL_STARTUP_MAX_RETRIES=60
@jeancochrane (Author):

It could make sense to eventually make these attributes configurable, particularly once we factor this logic out into a shared action or workflow, but for the sake of faster iteration I decided to hardcode them for now since bash argument parsing is so annoying.
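
In other words, the hardcoded constants imply a startup timeout roughly like this (the interval value below is illustrative):

# Hardcoded for now; could become script arguments later
BATCH_JOB_POLL_STARTUP_MAX_RETRIES=60
BATCH_JOB_POLL_INTERVAL_SECONDS=60

# Effective startup timeout in seconds: 60 retries * 60s interval = 3600s
BATCH_JOB_POLL_STARTUP_TIMEOUT=$((BATCH_JOB_POLL_STARTUP_MAX_RETRIES * BATCH_JOB_POLL_INTERVAL_SECONDS))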

@jeancochrane (Author):

Re-requesting review from @dfsnow one more time to make sure I didn't miss any comments!

@dfsnow (Member) left a comment:

Looks great @jeancochrane. Nice work

@@ -0,0 +1,87 @@
name: Setup Terraform
@dfsnow (Member):

I think the pattern in r-lib/actions works pretty well. Minor versions can change the code underlying the tag and only breaking changes would warrant a major version bump. I think it makes sense to basically factor out stuff into composite actions in their own sub-directories. Curious what you think
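
Under that scheme, consumers would pin a floating major tag, r-lib style (the repo and action path below are the proposed ones, not yet real):

steps:
  # v1 floats across non-breaking releases; a breaking change cuts v2
  - uses: ccao-data/actions/setup-terraform@v1
    with:
      terraform-version: "1.5.0"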

compute_resources {
  type      = "FARGATE"
  min_vcpus = 0
  max_vcpus = 64 # Max across all jobs, not within one job
@dfsnow (Member):

Totally makes sense, and I like the flexibility of being able to change the environment per repo and per branch. Nice thinking 👍

@jeancochrane merged commit f51c35b into master on Nov 6, 2023 (3 checks passed).
@jeancochrane deleted the jeancochrane/23-infra-updates-add-a-workflow-to-run-the-model-on-prs-and-workflow_dispatch branch on November 6, 2023 23:01.
Linked issue closed by this PR: [Infra updates] Add a workflow to run the model on PRs and workflow_dispatch