ID-1276 Introduce Bard Service for sending metrics #7434
base: develop
Conversation
```scala
def tellBard(metadataKeyValues: Map[String, Any]): Unit =
  serviceRegistryActor ! BardEventRequest(
    TaskSummaryEvent(workflowDescriptor.id,
                     Option(jobDescriptor.key.propertiesToMap).getOrElse(Map()).asJava,
                     metadataKeyValues.asJava
    )
  )
```
This seems like the canonical way to get task metadata out of Cromwell, but I'm a little hesitant about the arbitrary key-values. What info is in there? Does it have everything we need? Does it have way more than we need?
Is metadata standardized per-cloud? Across clouds?
I think your hesitation is warranted. This is one small collection of metadata that backends can choose to create on job termination, it's not standardized across backends and doesn't tell the full story of the job.
What's the current thinking around requirements here? What information do you need about the job?
We just need CPU Hours and Docker Image, but I'd love to include everything you use to generate the Monthly GCP Cromwell Spreadsheet so that we can automate that for you!
Cool! While we would love allll the metrics immediately, let's start with Docker image and CPU hours. Can maybe throw in memory hours, since it operates the same way as CPU.
Docker should be easy, you can get it from the runtime attributes in the job descriptor via:
```scala
RuntimeAttributesValidation.extract(DockerValidation.instance, validatedRuntimeAttributes)
```
For CPU or memory hours, you need both the CPU/memory count for the task and the cloud runtime of the task. The CPU and memory counts are in runtime attributes alongside Docker image, so you can get them as above. Cloud runtime is going to be backend-dependent and I'm not sure we have a representation of it in the GCP backend right now[1]. Katrina is actually adding one for Azure/TES in #7415, so you can probably use or iterate on that.
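The arithmetic being described (resource count × cloud runtime) could be sketched like this. This is an illustrative standalone helper, not Cromwell's actual API; the names `cpuHours` and `memoryGbHours` are assumptions.

```scala
import java.time.{Duration, Instant}

// Illustrative sketch: CPU-hours and memory-GB-hours are the task's resource
// counts (from runtime attributes) multiplied by its cloud runtime.
object ResourceHours {
  def cpuHours(cpuCount: Int, start: Instant, end: Instant): Double =
    cpuCount * Duration.between(start, end).toMillis / 3600000.0

  def memoryGbHours(memoryGb: Double, start: Instant, end: Instant): Double =
    memoryGb * Duration.between(start, end).toMillis / 3600000.0
}
```

For example, a 4-CPU task that ran for two hours would report 8.0 CPU-hours.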
[1] You might ask, then how do we compute core hours for the spreadsheet? We do it by looking at the time between the earliest and latest events that we get from GCP Lifesciences during the task run. We store these events in metadata as a giant bucket of `executionEvents`. Since we're doing this in source code rather than analyzing metadata after the fact, I'm hoping we can do something better.
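The spreadsheet approach described in the footnote amounts to taking the span between the earliest and latest events. A minimal sketch, with an illustrative event shape that stands in for Cromwell's real metadata model:

```scala
import java.time.{Duration, Instant}

// Illustrative event record; Cromwell's executionEvents metadata has more fields.
final case class ExecutionEvent(description: String, timestamp: Instant)

object EventSpan {
  // Task runtime approximated as earliest-to-latest event span, or None if
  // the backend produced no events at all.
  def runtime(events: Seq[ExecutionEvent]): Option[Duration] =
    if (events.isEmpty) None
    else {
      val sorted = events.map(_.timestamp).sortBy(_.toEpochMilli)
      Some(Duration.between(sorted.head, sorted.last))
    }
}
```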
How would I get a value of `gcp` vs `azure`? We also want to know what cloud we're on.
For cloud identity, we have a slightly-used notion of `platform` that we use when interpreting runtime attributes. Currently I think we only use it to distinguish between Azure-TES and Not-Azure-TES. You could set the cloud provider in each backend actor implementation, and use that `platform` value to detect when on Azure. We could also choose to fully plumb `platform` through for GCP and depend fully on it.
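The plumbing being suggested could look roughly like this. This is a hedged standalone sketch; the `Platform` ADT and `BackendPlatform` trait are illustrative names, not Cromwell's existing types.

```scala
// Illustrative platform ADT: each backend actor implementation declares its
// cloud once, and the Bard event can carry `platform.name`.
sealed trait Platform { def name: String }
case object Gcp extends Platform { val name = "gcp" }
case object Azure extends Platform { val name = "azure" }
case object Aws extends Platform { val name = "aws" }

// Hypothetical mix-in each backend would implement.
trait BackendPlatform { def platform: Platform }

object ExampleGcpBackend extends BackendPlatform { val platform: Platform = Gcp }
```

With something like this in place, "what cloud are we on" becomes a field lookup rather than an inference from backend-specific details.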
```scala
override def receive: Receive = {
  case BardEventRequest(event) if bardConfig.enabled =>
    bardService.sendEvent(event)
```
Warning! A blocking action in a `receive` method means the actor's mailbox can overflow. Are we sure that the Bard sending can keep up with the Bard request generation?
Incidentally, it also means that all requests are processed in series. Which is fine, but why do we need a `connection-pool-size` option?
It's a standard tuning option. The HTTP connection pool means that connections can be re-used for multiple requests, improving performance.
I'm also nervous about this, but OK starting here and seeing how it performs in scale testing.
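One way to address the blocking-receive concern, if scale testing shows the mailbox falling behind, is to run the send on a separate execution context and react to the resulting `Future` instead of blocking the actor. A minimal sketch, where `sendEvent` is a stand-in for the real Bard client call:

```scala
import scala.concurrent.{ExecutionContext, Future}

// Illustrative sketch: wrap the blocking send in a Future so the caller's
// thread (e.g. an actor processing its mailbox) is not blocked. In an Akka
// actor you would typically pipe the result back as a message.
object AsyncSend {
  def sendEventAsync(sendEvent: () => Unit)(implicit ec: ExecutionContext): Future[Unit] =
    Future(sendEvent())
}
```

The trade-off is losing strict in-order, one-at-a-time processing, which for best-effort metrics is usually acceptable.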
Files with resolved review threads:
- services/src/main/scala/cromwell/services/metrics/bard/BardService.scala
- ...nds/aws/src/main/scala/cromwell/backend/impl/aws/AwsBatchAsyncBackendJobExecutionActor.scala
- .../main/scala/cromwell/backend/google/batch/actors/GcpBatchAsyncBackendJobExecutionActor.scala (outdated)
- ...ala/cromwell/backend/google/pipelines/common/PipelinesApiAsyncBackendJobExecutionActor.scala (outdated)
Force-pushed the tl_ID-1276_cromwell_bard_service branch from 12ab572 to 365ecda, then from 365ecda to dd91660.
```scala
catch {
  // Sending events to Bard is a best-effort affair. If it fails, log the error and move on.
  case e: Exception =>
    logger.error(s"Failed to send event to Bard: ${e.getMessage}", e)
```
It would be great to emit a metric counting successes and failures here, so we have something to monitor to ensure the integration is working.
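The suggested monitoring could be sketched as a thin counting wrapper around the send. This is illustrative only; `BardSendMetrics` is a hypothetical name, and a real implementation would feed whatever metrics backend Cromwell already uses rather than bare atomics.

```scala
import java.util.concurrent.atomic.AtomicLong

// Illustrative success/failure counters around the best-effort send, so the
// integration has something to monitor.
final class BardSendMetrics {
  val successes = new AtomicLong(0)
  val failures = new AtomicLong(0)

  def sendCounted(send: () => Unit): Unit =
    try { send(); successes.incrementAndGet(); () }
    catch { case _: Exception => failures.incrementAndGet(); () }
}
```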
```scala
@@ -32,7 +32,6 @@
package cromwell.backend.impl.aws

import java.util.UUID
```
Can you back out the whitespace changes to these files that were not otherwise changed?
```scala
@@ -1745,6 +1750,169 @@ class PipelinesApiAsyncBackendJobExecutionActorSpec
}

private def setupBackend: TestablePipelinesApiJobExecutionActor = {
```
Nice tests!
https://broadworkbench.atlassian.net/browse/ID-1276
Introducing a `BardService` to get Cromwell to start sending events to Bard. This PR includes the introduction of a Bard client, the Service that uses that client, and the Actor that uses the Service. This will only support collecting CPU hours for GCP for now, but there will be a follow-up PR for Azure.