
ID-1276 Introduce Bard Service for sending metrics #7434

Open · wants to merge 37 commits into base: develop
Conversation

tlangs (Contributor) commented May 14, 2024

https://broadworkbench.atlassian.net/browse/ID-1276

Introducing a BardService to get Cromwell to start sending events to Bard. This PR includes the introduction of a Bard client, the Service to use that client, and the Actor to use the Service.

For now, this only supports collecting CPU hours for GCP; a follow-up PR will add Azure support.

Comment on lines 1477 to 1484
def tellBard(metadataKeyValues: Map[String, Any]): Unit =
  serviceRegistryActor ! BardEventRequest(
    TaskSummaryEvent(
      workflowDescriptor.id,
      Option(jobDescriptor.key.propertiesToMap).getOrElse(Map()).asJava,
      metadataKeyValues.asJava
    )
  )

tlangs (Author):

This seems like the canonical way to get task metadata out of Cromwell, but I'm a little hesitant about the arbitrary key-values. What info is in there? Does it have everything we need? Does it have way more than we need?

tlangs (Author):

Is metadata standardized per-cloud? Across clouds?

Reviewer (Contributor):

I think your hesitation is warranted. This is one small collection of metadata that backends can choose to create on job termination; it's not standardized across backends and doesn't tell the full story of the job.

What's the current thinking around requirements here? What information do you need about the job?

tlangs (Author):

We just need CPU Hours and Docker Image, but I'd love to include everything you use to generate the Monthly GCP Cromwell Spreadsheet so that we can automate that for you!

Reviewer (Contributor):

Cool! While we would love allll the metrics immediately, let's start with Docker image and CPU hours. Can maybe throw in memory hours, since it operates the same way as CPU.

Docker should be easy, you can get it from the runtime attributes in the job descriptor via:

RuntimeAttributesValidation.extract(DockerValidation.instance, validatedRuntimeAttributes)

For CPU or memory hours, you need both the CPU/memory count for the task and the cloud runtime of the task. The CPU and memory counts are in runtime attributes alongside Docker image, so you can get them as above. Cloud runtime is going to be backend-dependent and I'm not sure we have a representation of it in the GCP backend right now[1]. Katrina is actually adding one for Azure/TES in #7415, so you can probably use or iterate on that.

[1] You might ask, then how do we compute core hours for the spreadsheet? We do it by looking at the time between the earliest and latest events that we get from GCP Lifesciences during the task run. We store these events in metadata as a giant bucket of executionEvents. Since we're doing this in source code rather than analyzing metadata after the fact, I'm hoping we can do something better.
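To make the executionEvents approach above concrete, here is a minimal standalone sketch of that computation. The `ExecutionEvent` type and field names here are simplified stand-ins, not Cromwell's actual metadata classes:

```scala
import java.time.{Duration, Instant}

// Hypothetical event shape; Cromwell's real executionEvents carry more fields.
final case class ExecutionEvent(description: String, timestamp: Instant)

// Approximate core hours the way the spreadsheet does: wall time between the
// earliest and latest cloud events, multiplied by the task's CPU count.
def approximateCoreHours(events: Seq[ExecutionEvent], cpuCount: Int): Double =
  if (events.isEmpty || cpuCount <= 0) 0.0
  else {
    val start = events.map(_.timestamp).min
    val end = events.map(_.timestamp).max
    val seconds = Duration.between(start, end).getSeconds.toDouble
    cpuCount * seconds / 3600.0
  }
```

As the comment notes, this conflates provisioning and teardown time with actual execution, which is one reason a purpose-built runtime representation would be better.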

tlangs (Author):

How would I get a value for GCP vs. Azure? We also want to know what cloud we're on.

Reviewer (Contributor):

For cloud identity, we have a slightly-used notion of platform that we use when interpreting runtime attributes. Currently I think we only use it to distinguish between Azure-TES and Not-Azure-TES. You could set the cloud provider in each backend actor implementation, and use that platform value to detect when on Azure. We could also choose to fully plumb platform through for GCP and depend fully on it.

https://github.com/broadinstitute/cromwell/pull/7380/files
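A sketch of the platform idea described above, with each backend exposing its cloud identity so the Bard event can carry it. All names here are hypothetical, not Cromwell's actual types:

```scala
// Hypothetical platform ADT; Cromwell's real "platform" notion may differ.
sealed trait Platform { def name: String }
case object Gcp extends Platform { val name = "gcp" }
case object Azure extends Platform { val name = "azure" }
case object Aws extends Platform { val name = "aws" }

// Each backend actor implementation could set its platform, and the Bard
// event could carry this label so consumers know which cloud ran the task.
def cloudLabel(platform: Option[Platform]): String =
  platform.map(_.name).getOrElse("unknown")
```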


override def receive: Receive = {
  case BardEventRequest(event) if bardConfig.enabled =>
    bardService.sendEvent(event)
Reviewer (Contributor):

Warning! A blocking action in a receive method means the actor's mailbox can overflow. Are we sure that the bard sending can keep up with the bard request generation?

Incidentally, it also means that all requests are processed in series. Which is fine, but why do we need a connection-pool-size option?
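One generic way to keep a blocking call out of `receive` is to run it on a dedicated `ExecutionContext` and fire-and-forget the resulting `Future`. This is only a sketch of that pattern, not the PR's code, and note that it moves the queue into the thread pool rather than solving backpressure outright:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Dedicated pool so slow Bard sends don't starve the actor dispatcher.
// The pool size (10) is illustrative, not a value from the PR.
val bardEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(10))

// Wrap a (possibly blocking) send so receive returns immediately.
def sendEventNonBlocking(send: () => Unit)(implicit ec: ExecutionContext): Future[Unit] =
  Future(send())
```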

tlangs (Author):

It's a standard tuning option. The HTTP connection pool means that connections can be re-used across multiple requests, improving performance.
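For context, the tuning option under discussion would live in configuration; a hypothetical HOCON shape (the actual key paths in the PR's reference.conf may differ):

```hocon
# Hypothetical Bard client settings; key paths are illustrative.
bard {
  enabled: true
  connection-pool-size: 10
}
```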

Reviewer (Contributor):

I'm also nervous about this, but OK starting here and seeing how it performs in scale testing.

@Ghost-in-a-Jar force-pushed the tl_ID-1276_cromwell_bard_service branch from 12ab572 to 365ecda on June 6, 2024 15:15
@Ghost-in-a-Jar force-pushed the tl_ID-1276_cromwell_bard_service branch from 365ecda to dd91660 on June 6, 2024 15:17
@Ghost-in-a-Jar marked this pull request as ready for review on June 6, 2024 19:20
@Ghost-in-a-Jar requested a review from a team as a code owner on June 6, 2024 19:20
catch {
  // Sending events to Bard is a best-effort affair. If it fails, log the error and move on.
  case e: Exception =>
    logger.error(s"Failed to send event to Bard: ${e.getMessage}", e)
Reviewer (Contributor):

It would be great to emit a metric counting successes and failures here, so we have something to monitor to ensure the integration is working.
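A standalone sketch of the suggested success/failure counting; Cromwell has its own instrumentation service, so this self-contained counter only illustrates the shape of the metric:

```scala
import java.util.concurrent.atomic.LongAdder

// Illustrative metric holder; in the real PR this would route through
// Cromwell's instrumentation rather than local counters.
final class BardSendMetrics {
  private val successes = new LongAdder
  private val failures = new LongAdder

  // Run the send and record whether it succeeded or threw.
  def recordSend(send: () => Unit): Unit =
    try { send(); successes.increment() }
    catch { case _: Exception => failures.increment() }

  def successCount: Long = successes.sum()
  def failureCount: Long = failures.sum()
}
```

Monitoring the failure count would reveal a broken integration even though individual failures are swallowed as best-effort.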



@@ -32,7 +32,6 @@
package cromwell.backend.impl.aws

import java.util.UUID

Reviewer (Contributor):

Can you back out the whitespace changes to these files that were not otherwise changed?

@@ -1745,6 +1750,169 @@ class PipelinesApiAsyncBackendJobExecutionActorSpec

}

private def setupBackend: TestablePipelinesApiJobExecutionActor = {
Reviewer (Contributor):

Nice tests!
