Skip to content

Commit

Permalink
WX-1595 Remove genomics usage from GCP Batch (#7411)
Browse files Browse the repository at this point in the history
The Batch backend configuration depends on entries prefixed by "genomics" which is the wrong term, this entry has been renamed to "batch-api".

Also, the usages for these configs were renamed to "batch".

At last, there was an usage involving lifesciences which has been removed too.

NOTE: This is a breaking change, existing projecs using batch backend will need to update their configuration.
  • Loading branch information
AlexITC committed May 15, 2024
1 parent c8d36d5 commit e9db9a8
Show file tree
Hide file tree
Showing 13 changed files with 72 additions and 67 deletions.
4 changes: 4 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,10 @@
Cromwell now allows opting into configured soft links on shared file systems such as HPC environments. More details can
be found [here](https://cromwell.readthedocs.io/en/stable/backends/HPC/#optional-docker-soft-links).

### GCP Batch

The `genomics` configuration entry was renamed to `batch`, see [ReadTheDocs](https://cromwell.readthedocs.io/en/stable/backends/GCPBatch/) for more information.

## 87 Release Notes

### `upgrade` command removed from Womtool
Expand Down
1 change: 1 addition & 0 deletions centaur/src/main/resources/reference.conf
Original file line number Diff line number Diff line change
Expand Up @@ -55,6 +55,7 @@ centaur {
genomics.endpoint-url = "Error: BA-6546 The environment variable CROMWELL_BUILD_PAPI_ENDPOINT_URL must be set/export pointing to a valid endpoint such as 'https://lifesciences.googleapis.com/'"
genomics.endpoint-url = ${?CROMWELL_BUILD_PAPI_ENDPOINT_URL}
genomics.location = "us-central1"
batch.location = "us-central1"
auth = "Error: BA-6546 The environment variable CROMWELL_BUILD_PAPI_AUTH_MODE must be set/export pointing to a valid auth such as 'application-default'"
auth = ${?CROMWELL_BUILD_PAPI_AUTH_MODE}
json-dir = "Error: BA-6546 The environment variable CROMWELL_BUILD_RESOURCES_DIRECTORY must be set/export pointing to a valid path such as 'target/ci/resources'"
Expand Down
10 changes: 6 additions & 4 deletions docs/backends/GCPBatch.md
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
**Google Cloud Backend**
**Google Cloud Batch Backend (alpha)**

[//]:
Google Cloud Batch is a fully managed service that lets you schedule, queue, and execute batch processing workloads on Google Cloud resources. Batch provisions resources and manages capacity on your behalf, allowing your batch workloads to run at scale.
Expand All @@ -9,6 +9,8 @@ authentication modes. Before reading further in this section please see the
and detailed instructions for the application default authentication scheme in particular.
The instructions below assume you have created a Google Cloud Storage bucket and a Google project enabled for the appropriate APIs.

*NOTE*: Google Cloud Batch is still in alpha version, this means that there could be breaking changes, be sure to review the [GCP Batch CHANGELOG](https://github.com/broadinstitute/cromwell/blob/develop/CHANGELOG.md#gcp-batch) carefully before upgrading.

**Configuring Authentication**

The `google` stanza in the Cromwell configuration file defines how to authenticate to Google. There are four different
Expand Down Expand Up @@ -84,7 +86,7 @@ Creating the account will cause the JSON file to be downloaded. The structure o
Most importantly, the value of the `client_email` field should go into the `service-account-id` field in the configuration (see below). The
`private_key` portion needs to be pulled into its own file (e.g. `my-key.pem`). The `\n`s in the string need to be converted to newline characters.

While technically not part of Service Account authentication mode, one can also override the default service account that the compute VM is started with via the configuration option `GCPBATCH.config.genomics.compute-service-account` or through the workflow options parameter `google_compute_service_account`. The service account you provide must have been granted Service Account Actor role to Cromwell's primary service account. As this only affects Google Batch API and not GCS, it's important that this service account, and the service account specified in `GCPBATCH.config.genomics.auth` can both read/write the location specified by `GCPBATCH.config.root`
While technically not part of Service Account authentication mode, one can also override the default service account that the compute VM is started with via the configuration option `GCPBATCH.config.batch.compute-service-account` or through the workflow options parameter `google_compute_service_account`. The service account you provide must have been granted Service Account Actor role to Cromwell's primary service account. As this only affects Google Batch API and not GCS, it's important that this service account, and the service account specified in `GCPBATCH.config.batch.auth` can both read/write the location specified by `GCPBATCH.config.root`

**User Service Account**

Expand Down Expand Up @@ -280,7 +282,7 @@ google {

Cromwell can be configured to use GCS parallel composite uploads which can greatly improve delocalization performance. This feature
is turned off by default but can be enabled backend-wide by specifying a `gsutil`-compatible memory specification for the key
`genomics.parallel-composite-upload-threshold` in backend configuration. This memory value represents the minimum size an output file
`batch.parallel-composite-upload-threshold` in backend configuration. This memory value represents the minimum size an output file
must have to be a candidate for `gsutil` parallel composite uploading:

```
Expand All @@ -292,7 +294,7 @@ backend {
actor-factory = "cromwell.backend.google.batch.GcpBatchLifecycleActorFactory"
config {
...
genomics {
batch {
...
parallel-composite-upload-threshold = 150M
...
Expand Down
2 changes: 1 addition & 1 deletion src/ci/resources/gcp_batch_provider_config.inc.conf
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ root = "gs://cloud-cromwell-dev-self-cleaning/cromwell_execution/ci"
maximum-polling-interval = 600
concurrent-job-limit = 1000

genomics {
batch {
auth = "service_account"
location = "us-central1"
}
Expand Down
2 changes: 1 addition & 1 deletion src/ci/resources/gcp_batch_shared_application.inc.conf
Original file line number Diff line number Diff line change
Expand Up @@ -26,7 +26,7 @@ backend {
include "dockerhub_provider_config_v2.inc.conf"
# This SA does not have permission to bill this project when accessing RP buckets.
# This is on purpose so that we can assert the failure (see requester_pays_localization_negative)
genomics.compute-service-account = "centaur@broad-dsde-cromwell-dev.iam.gserviceaccount.com"
batch.compute-service-account = "centaur@broad-dsde-cromwell-dev.iam.gserviceaccount.com"
filesystems.http {}
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -72,8 +72,8 @@ class GcpBatchInitializationActor(batchParams: GcpBatchInitializationActorParams
private lazy val gcsCredentials: Future[Credentials] = gcpBatchConfiguration.batchAttributes.auths.gcs
.retryCredentials(workflowOptions, List(StorageScopes.DEVSTORAGE_FULL_CONTROL))

// Credentials object for the Genomics API
private lazy val genomicsCredentials: Future[Credentials] = gcpBatchConfiguration.batchAttributes.auths.genomics
// Credentials object for the Batch API
private lazy val batchCredentials: Future[Credentials] = gcpBatchConfiguration.batchAttributes.auths.batch
.retryCredentials(workflowOptions,
List(
CloudLifeSciencesScopes.CLOUD_PLATFORM,
Expand Down Expand Up @@ -230,11 +230,11 @@ class GcpBatchInitializationActor(batchParams: GcpBatchInitializationActorParams

override lazy val workflowPaths: Future[GcpBatchWorkflowPaths] = for {
gcsCred <- gcsCredentials
genomicsCred <- genomicsCredentials
batchCred <- batchCredentials
validatedPathBuilders <- pathBuilders
} yield new GcpBatchWorkflowPaths(workflowDescriptor,
gcsCred,
genomicsCred,
batchCred,
gcpBatchConfiguration,
validatedPathBuilders,
standardStreamNameToFileNameMetadataMapper
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -2,4 +2,4 @@ package cromwell.backend.google.batch.authentication

import cromwell.cloudsupport.gcp.auth.GoogleAuthMode

case class GcpBatchAuths(genomics: GoogleAuthMode, gcs: GoogleAuthMode)
case class GcpBatchAuths(batch: GoogleAuthMode, gcs: GoogleAuthMode)
Original file line number Diff line number Diff line change
Expand Up @@ -98,15 +98,15 @@ object GcpBatchConfigurationAttributes
"project",
"root",
"maximum-polling-interval",
"genomics",
"genomics.location",
"genomics.compute-service-account",
"genomics.auth",
"genomics.restrict-metadata-access",
"genomics.enable-fuse",
"genomics-api-queries-per-100-seconds",
"genomics.localization-attempts",
"genomics.parallel-composite-upload-threshold",
"batch",
"batch.location",
"batch.compute-service-account",
"batch.auth",
"batch.restrict-metadata-access",
"batch.enable-fuse",
"batch-queries-per-100-seconds",
"batch.localization-attempts",
"batch.parallel-composite-upload-threshold",
"filesystems",
"filesystems.drs.auth",
"filesystems.gcs.auth",
Expand All @@ -133,7 +133,7 @@ object GcpBatchConfigurationAttributes
)

private val deprecatedJesKeys: Map[String, String] = Map(
"genomics.default-zones" -> "default-runtime-attributes.zones"
"batch.default-zones" -> "default-runtime-attributes.zones"
)

def apply(googleConfig: GoogleConfiguration,
Expand Down Expand Up @@ -205,19 +205,19 @@ object GcpBatchConfigurationAttributes
backendConfig.as[String]("root")
}
val location: ErrorOr[String] = validate {
backendConfig.as[String]("genomics.location")
backendConfig.as[String]("batch.location")
}
val maxPollingInterval: Int = backendConfig.as[Option[Int]]("maximum-polling-interval").getOrElse(600)
val computeServiceAccount: String =
backendConfig.as[Option[String]]("genomics.compute-service-account").getOrElse("default")
val genomicsAuthName: ErrorOr[String] = validate {
backendConfig.as[String]("genomics.auth")
backendConfig.as[Option[String]]("batch.compute-service-account").getOrElse("default")
val batchAuthName: ErrorOr[String] = validate {
backendConfig.as[String]("batch.auth")
}
val genomicsRestrictMetadataAccess: ErrorOr[Boolean] = validate {
backendConfig.as[Option[Boolean]]("genomics.restrict-metadata-access").getOrElse(false)
val batchRestrictMetadataAccess: ErrorOr[Boolean] = validate {
backendConfig.as[Option[Boolean]]("batch.restrict-metadata-access").getOrElse(false)
}
val genomicsEnableFuse: ErrorOr[Boolean] = validate {
backendConfig.as[Option[Boolean]]("genomics.enable-fuse").getOrElse(false)
val batchEnableFuse: ErrorOr[Boolean] = validate {
backendConfig.as[Option[Boolean]]("batch.enable-fuse").getOrElse(false)
}

val dockerhubToken: ErrorOr[String] = validate {
Expand Down Expand Up @@ -252,11 +252,11 @@ object GcpBatchConfigurationAttributes
}

val parallelCompositeUploadThreshold =
validateGsutilMemorySpecification(backendConfig, "genomics.parallel-composite-upload-threshold")
validateGsutilMemorySpecification(backendConfig, "batch.parallel-composite-upload-threshold")

val localizationAttempts: ErrorOr[Int Refined Positive] = backendConfig
.as[Option[Int]]("genomics.localization-attempts")
.map(attempts => validatePositiveInt(attempts, "genomics.localization-attempts"))
.as[Option[Int]]("batch.localization-attempts")
.map(attempts => validatePositiveInt(attempts, "batch.localization-attempts"))
.getOrElse(DefaultGcsTransferAttempts.validNel)

val gcsTransferConfiguration: ErrorOr[GcsTransferConfiguration] =
Expand Down Expand Up @@ -303,7 +303,7 @@ object GcpBatchConfigurationAttributes
def authGoogleConfigForBatchConfigurationAttributes(
project: String,
bucket: String,
genomicsName: String,
batchName: String,
location: String,
restrictMetadata: Boolean,
dockerhubToken: String,
Expand All @@ -318,17 +318,17 @@ object GcpBatchConfigurationAttributes
referenceDiskLocalizationManifestFilesOpt: Option[List[ManifestFile]],
dockerImageCacheManifestFileOpt: Option[ValidFullGcsPath]
): ErrorOr[GcpBatchConfigurationAttributes] =
(googleConfig.auth(genomicsName), googleConfig.auth(gcsName)) mapN { (genomicsAuth, gcsAuth) =>
(googleConfig.auth(batchName), googleConfig.auth(gcsName)) mapN { (batchAuth, gcsAuth) =>
val generatedReferenceFilesMappingOpt = referenceDiskLocalizationManifestFilesOpt map {
generateReferenceFilesMapping(genomicsAuth, _)
generateReferenceFilesMapping(batchAuth, _)
}
val dockerImageToCacheDiskImageMappingOpt = dockerImageCacheManifestFileOpt map {
generateDockerImageToDiskImageMapping(genomicsAuth, _)
generateDockerImageToDiskImageMapping(batchAuth, _)
}
models.GcpBatchConfigurationAttributes(
project = project,
computeServiceAccount = computeServiceAccount,
auths = GcpBatchAuths(genomicsAuth, gcsAuth),
auths = GcpBatchAuths(batchAuth, gcsAuth),
restrictMetadataAccess = restrictMetadata,
dockerhubToken = dockerhubToken,
enableFuse = enableFuse,
Expand All @@ -351,11 +351,11 @@ object GcpBatchConfigurationAttributes

(project,
executionBucket,
genomicsAuthName,
batchAuthName,
location,
genomicsRestrictMetadataAccess,
batchRestrictMetadataAccess,
dockerhubToken,
genomicsEnableFuse,
batchEnableFuse,
gcsFilesystemAuthName,
qpsValidation,
duplicationStrategy,
Expand Down Expand Up @@ -429,12 +429,12 @@ object GcpBatchConfigurationAttributes
def validateQps(config: Config): ErrorOr[Int Refined Positive] = {
import eu.timepit.refined._

val qp100s = config.as[Option[Int]]("genomics-api-queries-per-100-seconds").getOrElse(BatchApiDefaultQps)
val qp100s = config.as[Option[Int]]("batch-queries-per-100-seconds").getOrElse(BatchApiDefaultQps)
val qpsCandidate = qp100s / 100

refineV[Positive](qpsCandidate) match {
case Left(_) =>
s"Calculated QPS for Google Genomics API ($qpsCandidate/s) was not a positive integer (supplied value was $qp100s per 100s)".invalidNel
s"Calculated QPS for Google Batch API ($qpsCandidate/s) was not a positive integer (supplied value was $qp100s per 100s)".invalidNel
case Right(refined) => refined.validNel
}
}
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ object GcpBatchWorkflowPaths {
}
case class GcpBatchWorkflowPaths(workflowDescriptor: BackendWorkflowDescriptor,
gcsCredentials: Credentials,
genomicsCredentials: Credentials,
batchCredentials: Credentials,
gcpBatchConfiguration: GcpBatchConfiguration,
override val pathBuilders: PathBuilders,
// This allows for the adjustment of the standard stream file names in PAPI v1 to match the
Expand Down Expand Up @@ -55,19 +55,19 @@ case class GcpBatchWorkflowPaths(workflowDescriptor: BackendWorkflowDescriptor,
.get(GcpBatchWorkflowPaths.AuthFilePathOptionKey) getOrElse defaultBucket.pathAsString

/*
* This is an "exception". The filesystem used here is built from genomicsAuth
* This is an "exception". The filesystem used here is built from batchAuth
* unlike everywhere else where the filesystem used is built from gcsFileSystemAuth
*/
val pathBuilderWithGenomicsAuth = GcsPathBuilder.fromCredentials(
genomicsCredentials,
val pathBuilderWithBatchAuth = GcsPathBuilder.fromCredentials(
batchCredentials,
gcpBatchConfiguration.googleConfig.applicationName,
RetrySettings.newBuilder().build(),
GcsStorage.DefaultCloudStorageConfiguration,
workflowOptions,
Option(gcpBatchConfiguration.batchAttributes.project)
)

val authBucket = pathBuilderWithGenomicsAuth.build(bucket) recover { case ex =>
val authBucket = pathBuilderWithBatchAuth.build(bucket) recover { case ex =>
throw new Exception(s"Invalid gcs auth_bucket path $bucket", ex)
} get

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -139,12 +139,10 @@ object GcpBatchInitializationActorSpec {
| // This is the maximum polling interval (in seconds):
| maximum-polling-interval = 600
|
| genomics {
| // A reference to an auth defined in the `google` stanza at the top. This auth is used to create
| // Pipelines and manipulate auth JSONs.
| batch {
| // A reference to an auth defined in the `google` stanza at the top.
| // This auth is used to create jobs and manipulate auth JSONs.
| auth = "application-default"
| // Endpoint for APIs, no reason to change this unless directed by Google.
| endpoint-url = "https://genomics.googleapis.com/"
| }
|
| default-runtime-attributes {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -110,15 +110,15 @@ class GcpBatchConfigurationAttributesSpec

it should "parse compute service account" in {

val backendConfig = ConfigFactory.parseString(configString(genomics = """compute-service-account = "testing" """))
val backendConfig = ConfigFactory.parseString(configString(batch = """compute-service-account = "testing" """))

val gcpBatchAttributes = GcpBatchConfigurationAttributes(googleConfig, backendConfig, "batch")
gcpBatchAttributes.computeServiceAccount should be("testing")
}

it should "parse restrict-metadata-access" in {

val backendConfig = ConfigFactory.parseString(configString(genomics = "restrict-metadata-access = true"))
val backendConfig = ConfigFactory.parseString(configString(batch = "restrict-metadata-access = true"))

val gcpBatchAttributes = GcpBatchConfigurationAttributes(googleConfig, backendConfig, "batch")
gcpBatchAttributes.restrictMetadataAccess should be(true)
Expand All @@ -127,7 +127,7 @@ class GcpBatchConfigurationAttributesSpec

it should "parse localization-attempts" in {

val backendConfig = ConfigFactory.parseString(configString(genomics = "localization-attempts = 31380"))
val backendConfig = ConfigFactory.parseString(configString(batch = "localization-attempts = 31380"))

val gcpBatchAttributes = GcpBatchConfigurationAttributes(googleConfig, backendConfig, "batch")
gcpBatchAttributes.gcsTransferConfiguration.transferAttempts.value should be(31380)
Expand Down Expand Up @@ -250,7 +250,7 @@ class GcpBatchConfigurationAttributesSpec
val nakedConfig =
ConfigFactory.parseString("""
|{
| genomics {
| batch {
|
| }
|}
Expand All @@ -262,22 +262,22 @@ class GcpBatchConfigurationAttributesSpec
val errorsList = exception.errorMessages.toList
errorsList should contain("String: 2: No configuration setting found for key 'project'")
errorsList should contain("String: 2: No configuration setting found for key 'root'")
errorsList should contain("String: 3: No configuration setting found for key 'genomics.auth'")
errorsList should contain("String: 3: No configuration setting found for key 'batch.auth'")
errorsList should contain("String: 2: No configuration setting found for key 'filesystems'")
}

def configString(customContent: String = "", genomics: String = ""): String =
def configString(customContent: String = "", batch: String = ""): String =
s"""
|{
| project = "myProject"
| root = "gs://myBucket"
| maximum-polling-interval = 600
| $customContent
| genomics {
| batch {
| // A reference to an auth defined in the `google` stanza at the top. This auth is used to create
| // Pipelines and manipulate auth JSONs.
| auth = "mock"
| $genomics
| $batch
| endpoint-url = "http://myEndpoint"
| location = "us-central1"
| }
Expand Down

0 comments on commit e9db9a8

Please sign in to comment.