
Clean up Nomad job artifact handling#2312

Merged
sitole merged 5 commits into main from chore/nomad-jobs-assets-cleanup on Apr 7, 2026

@sitole sitole commented Apr 7, 2026

Summary

  • Replace the custom checksum.sh bash script and data "external" lookup with the GCS object's native .generation field as the version/checksum
  • Always include the version query param in artifact source URLs (remove the dev/prod conditional logic)
  • Move clean-nfs-cache Nomad job setup from its own clean-nfs-cache.tf file into main.tf alongside other jobs
  • Propagate artifact_source URL from Terraform into the HCL job templates for template-manager and clean-nfs-cache instead of constructing it inside the template
  • Remove the template_manager_checksum variable (superseded by version in the URL)
  • Set artifact mode = "file" and an explicit destination for the template-manager and clean-nfs-cache jobs
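
The Terraform side of these bullets might look roughly like the sketch below. It is an illustration under assumptions, not the PR's actual code: the data source name, bucket variable, and URL shape are copied from the Bugbot preview diff in this thread, while the object name passed to the data source, the template path, and the artifact_source template variable are hypothetical.

```hcl
variable "fc_env_pipeline_bucket_name" {
  type = string
}

# Look up the uploaded binary. Its `generation` changes whenever the
# object is overwritten, so it can stand in for a version/checksum.
data "google_storage_bucket_object" "filestore_cleanup" {
  name   = "clean-nfs-cache" # assumed object name
  bucket = var.fc_env_pipeline_bucket_name
}

locals {
  # Always include ?version=, with no dev/prod conditional.
  clean_nfs_cache_artifact_source = "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/clean-nfs-cache?version=${data.google_storage_bucket_object.filestore_cleanup.generation}"
}

resource "nomad_job" "clean_nfs_cache" {
  # Hypothetical template path; the fully built URL is passed in
  # instead of being assembled inside the job template.
  jobspec = templatefile("${path.module}/jobs/clean-nfs-cache.hcl", {
    artifact_source = local.clean_nfs_cache_artifact_source
  })
}
```

Because the generation is part of the URL, a new upload changes the rendered job spec, which is what lets the separate checksum variable go away.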

sitole added 4 commits April 7, 2026 11:15
…omad jobs

Same fix as 5c599a2 for orchestrator: set mode=file and explicit destination
to prevent Nomad from switching to folder mode when multiple objects share a prefix.
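
For reference, an artifact block with mode = "file" and an explicit destination could look like the following job template fragment. This is a minimal sketch, assuming a raw_exec batch job and a local/clean-nfs-cache destination; the job layout and the artifact_source template variable are hypothetical and are not taken from the repository.

```hcl
job "clean-nfs-cache" {
  type = "batch" # assumed job type

  group "clean-nfs-cache" {
    task "clean-nfs-cache" {
      driver = "raw_exec" # assumed driver

      artifact {
        # Rendered by Terraform's templatefile() before submission.
        source = "${artifact_source}"

        # Treat the source as a single file and write it to exactly
        # this path, instead of letting the fetcher pick file vs.
        # directory handling on its own.
        mode        = "file"
        destination = "local/clean-nfs-cache"
      }

      config {
        command = "local/clean-nfs-cache"
      }
    }
  }
}
```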
… clean-nfs-cache

Move env-conditional version query param logic out of HCL job templates into Terraform,
matching the orchestrator pattern. Removes checksum options block from template-manager
and external checksum data source from clean-nfs-cache in favor of GCS object generation.
Removed the bash script and replaced it with the native object version taken
from the object data source.

Non-dev environments now use the object version as well.
cursor bot commented Apr 7, 2026

PR Summary

Medium Risk
Changes how Nomad jobs fetch artifacts across AWS/GCP (moving to versioned URLs and removing explicit checksums), which could affect rollout behavior or cause jobs to pull the wrong binary if the URL/versioning is miscomputed. Scope is limited to IaC/job templates but impacts deploy-time reliability.

Overview
Standardizes Nomad job artifact fetching by removing explicit checksum variables/scripts and instead embedding object version identifiers directly into artifact_source URLs (GCS generation / S3 etag) for template-manager, its autoscaler plugin, orchestrator, and clean-nfs-cache. It also fixes artifact download semantics by explicitly setting destination/mode = "file", and relocates the GCP clean-nfs-cache job wiring into main.tf while passing the fully constructed artifact_source from Terraform into the job templates.
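
On the AWS side, the same idea with an S3 etag might be sketched as below. Everything here is an assumption for illustration: the bucket variable, the object key, the local's name, and the URL form are hypothetical, and the thread does not say whether the fetcher interprets the etag parameter or whether it only serves to make the URL (and so the job spec) change when the object changes.

```hcl
variable "bucket_name" {
  type = string # hypothetical variable name
}

# The S3 object data source exposes the object's ETag.
data "aws_s3_object" "orchestrator" {
  bucket = var.bucket_name
  key    = "orchestrator" # assumed object key
}

locals {
  # Embed the etag in the URL so a re-uploaded binary yields a new URL.
  orchestrator_artifact_source = "s3::https://s3.amazonaws.com/${var.bucket_name}/orchestrator?etag=${data.aws_s3_object.orchestrator.etag}"
}
```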

Reviewed by Cursor Bugbot for commit 4e003f9.

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 29c3c09b89


@cursor cursor bot left a comment


Cursor Bugbot has reviewed your changes and found 2 potential issues.


Bugbot Autofix prepared fixes for both issues found in the latest run.

  • ✅ Fixed: Dev/prod conditional retained for clean-nfs-cache artifact URL
    • Updated clean_nfs_cache_artifact_source to always include the ?version= generation parameter for all environments.
  • ✅ Fixed: GCS generation number used as MD5 checksum
    • Replaced the checksum input with a real MD5 hex digest derived from the GCS object's md5hash via an external data step and passed that to the autoscaler job template.


Or push these changes by commenting:

@cursor push 21ea3a2166
Preview (21ea3a2166)
diff --git a/iac/provider-gcp/nomad/main.tf b/iac/provider-gcp/nomad/main.tf
--- a/iac/provider-gcp/nomad/main.tf
+++ b/iac/provider-gcp/nomad/main.tf
@@ -485,6 +485,27 @@
   bucket = var.fc_env_pipeline_bucket_name
 }
 
+data "external" "nomad_nodepool_apm_checksum" {
+  count = var.template_manages_clusters_size_gt_1 ? 1 : 0
+
+  program = [
+    "python3",
+    "-c",
+    <<-EOT
+import base64
+import json
+import sys
+
+query = json.load(sys.stdin)
+print(json.dumps({"hex": base64.b64decode(query["md5hash"]).hex()}))
+EOT
+  ]
+
+  query = {
+    md5hash = data.google_storage_bucket_object.nomad_nodepool_apm[0].md5hash
+  }
+}
+
 module "template_manager_autoscaler" {
   source = "../../modules/job-template-manager-autoscaler"
   count  = var.template_manages_clusters_size_gt_1 ? 1 : 0
@@ -493,7 +514,7 @@
   autoscaler_version         = var.nomad_autoscaler_version
   nomad_token                = var.nomad_acl_token_secret
   apm_plugin_artifact_source = "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/nomad-nodepool-apm?version=${data.google_storage_bucket_object.nomad_nodepool_apm[0].generation}"
-  apm_plugin_checksum        = data.google_storage_bucket_object.nomad_nodepool_apm[0].generation
+  apm_plugin_checksum        = data.external.nomad_nodepool_apm_checksum[0].result.hex
 }
 
 module "loki" {
@@ -609,7 +630,7 @@
 }
 
 locals {
-  clean_nfs_cache_artifact_source = var.environment == "dev" ? "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/clean-nfs-cache?version=${data.google_storage_bucket_object.filestore_cleanup.generation}" : "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/clean-nfs-cache"
+  clean_nfs_cache_artifact_source = "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/clean-nfs-cache?version=${data.google_storage_bucket_object.filestore_cleanup.generation}"
 }
 
 resource "nomad_job" "clean_nfs_cache" {

Reviewed by Cursor Bugbot for commit 29c3c09.

- Remove apm_plugin_checksum from autoscaler module; rely on ?etag=/?version= query param for artifact pinning (consistent with orchestrator/template-manager)
- Fix locals. → local. typo in AWS autoscaler module call
- Remove obsolete template_manager_checksum argument from AWS template_manager module call
- Inline autoscaler artifact source URL directly (was a local with empty-string fallback)
- Fix clean_nfs_cache_artifact_source to always include ?version= (remove dev-only conditional)
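
Going by this list, the GCP autoscaler module call would end up without apm_plugin_checksum, roughly as in the fragment below. It is reconstructed from the lines visible in the Bugbot preview diff above and is only a sketch: arguments hidden between the diff hunks are omitted, and the merged code may differ.

```hcl
module "template_manager_autoscaler" {
  source = "../../modules/job-template-manager-autoscaler"
  count  = var.template_manages_clusters_size_gt_1 ? 1 : 0

  # Arguments not shown in the diff hunks are left out of this sketch.
  autoscaler_version = var.nomad_autoscaler_version
  nomad_token        = var.nomad_acl_token_secret

  # The generation in the URL pins the artifact; no separate checksum input.
  apm_plugin_artifact_source = "gcs::https://www.googleapis.com/storage/v1/${var.fc_env_pipeline_bucket_name}/nomad-nodepool-apm?version=${data.google_storage_bucket_object.nomad_nodepool_apm[0].generation}"
}
```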
@sitole sitole changed the title from "chore: clean up Nomad job artifact handling" to "Clean up Nomad job artifact handling" Apr 7, 2026
@sitole sitole merged commit 8f7060f into main Apr 7, 2026
44 checks passed
@sitole sitole deleted the chore/nomad-jobs-assets-cleanup branch April 7, 2026 12:55