
ID-1276 Introduce Bard Service for sending metrics #7434

Open · wants to merge 37 commits into base: develop
Conversation

tlangs (Contributor) commented May 14, 2024

https://broadworkbench.atlassian.net/browse/ID-1276

Introducing a BardService to get Cromwell to start sending events to Bard. This PR includes the introduction of a Bard client, the Service to use that client, and the Actor to use the Service.

For now, this only supports collecting CPU hours for GCP; a follow-up PR will add Azure support.

Comment on lines 1477 to 1484
def tellBard(metadataKeyValues: Map[String, Any]): Unit =
  serviceRegistryActor ! BardEventRequest(
    TaskSummaryEvent(
      workflowDescriptor.id,
      Option(jobDescriptor.key.propertiesToMap).getOrElse(Map()).asJava,
      metadataKeyValues.asJava
    )
  )

tlangs (Author):

This seems like the canonical way to get task metadata out of Cromwell, but I'm a little hesitant about the arbitrary key-values. What info is in there? Does it have everything we need? Does it have way more than we need?

tlangs (Author):

Is metadata standardized per-cloud? Across clouds?

Reviewer (Contributor):

I think your hesitation is warranted. This is one small collection of metadata that backends can choose to create on job termination; it's not standardized across backends and doesn't tell the full story of the job.

What's the current thinking around requirements here? What information do you need about the job?

tlangs (Author):

We just need CPU Hours and Docker Image, but I'd love to include everything you use to generate the Monthly GCP Cromwell Spreadsheet so that we can automate that for you!

Reviewer (Contributor):

Cool! While we would love allll the metrics immediately, let's start with Docker image and CPU hours. Can maybe throw in memory hours, since it operates the same way as CPU.

Docker should be easy, you can get it from the runtime attributes in the job descriptor via:

RuntimeAttributesValidation.extract(DockerValidation.instance, validatedRuntimeAttributes)

For CPU or memory hours, you need both the CPU/memory count for the task and the cloud runtime of the task. The CPU and memory counts are in runtime attributes alongside Docker image, so you can get them as above. Cloud runtime is going to be backend-dependent and I'm not sure we have a representation of it in the GCP backend right now[1]. Katrina is actually adding one for Azure/TES in #7415, so you can probably use or iterate on that.

[1] You might ask, then how do we compute core hours for the spreadsheet? We do it by looking at the time between the earliest and latest events that we get from GCP Lifesciences during the task run. We store these events in metadata as a giant bucket of executionEvents. Since we're doing this in source code rather than analyzing metadata after the fact, I'm hoping we can do something better.
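To make the executionEvents approach above concrete, here is a minimal standalone sketch of that computation. The `ExecutionEvent` type and field names here are simplified stand-ins, not Cromwell's actual metadata classes:

```scala
import java.time.{Duration, Instant}

// Hypothetical event shape; Cromwell's real executionEvents carry more fields.
final case class ExecutionEvent(description: String, timestamp: Instant)

// Approximate core hours the way the spreadsheet does: wall time between the
// earliest and latest cloud events, multiplied by the task's CPU count.
def approximateCoreHours(events: Seq[ExecutionEvent], cpuCount: Int): Double =
  if (events.isEmpty || cpuCount <= 0) 0.0
  else {
    val start = events.map(_.timestamp).min
    val end = events.map(_.timestamp).max
    val seconds = Duration.between(start, end).getSeconds.toDouble
    cpuCount * seconds / 3600.0
  }
```

As the comment notes, this conflates provisioning and teardown time with actual execution, which is one reason a purpose-built runtime representation would be better.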

tlangs (Author):

How would I get a value for GCP vs. Azure? We also want to know what cloud we're on.

Reviewer (Contributor):

For cloud identity, we have a slightly-used notion of platform that we use when interpreting runtime attributes. Currently I think we only use it to distinguish between Azure-TES and Not-Azure-TES. You could set the cloud provider in each backend actor implementation, and use that platform value to detect when on Azure. We could also choose to fully plumb platform through for GCP and depend fully on it.

https://github.com/broadinstitute/cromwell/pull/7380/files
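A sketch of the platform idea described above, with each backend exposing its cloud identity so the Bard event can carry it. All names here are hypothetical, not Cromwell's actual types:

```scala
// Hypothetical platform ADT; Cromwell's real "platform" notion may differ.
sealed trait Platform { def name: String }
case object Gcp extends Platform { val name = "gcp" }
case object Azure extends Platform { val name = "azure" }
case object Aws extends Platform { val name = "aws" }

// Each backend actor implementation could set its platform, and the Bard
// event could carry this label so consumers know which cloud ran the task.
def cloudLabel(platform: Option[Platform]): String =
  platform.map(_.name).getOrElse("unknown")
```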


override def receive: Receive = {
  case BardEventRequest(event) if bardConfig.enabled =>
    bardService.sendEvent(event)
Reviewer (Contributor):

Warning! A blocking action in a receive method means the actor's mailbox can overflow. Are we sure that the bard sending can keep up with the bard request generation?

Incidentally, it also means that all requests are processed in series. Which is fine, but why do we need a connection-pool-size option?
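One generic way to keep a blocking call out of `receive` is to run it on a dedicated `ExecutionContext` and fire-and-forget the resulting `Future`. This is only a sketch of that pattern, not the PR's code, and note that it moves the queue into the thread pool rather than solving backpressure outright:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{ExecutionContext, Future}

// Dedicated pool so slow Bard sends don't starve the actor dispatcher.
// The pool size (10) is illustrative, not a value from the PR.
val bardEc: ExecutionContext =
  ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(10))

// Wrap a (possibly blocking) send so receive returns immediately.
def sendEventNonBlocking(send: () => Unit)(implicit ec: ExecutionContext): Future[Unit] =
  Future(send())
```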

tlangs (Author):

It's a standard tuning option. The HTTP connection pool means that connections can be re-used across multiple requests, improving performance.
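For context, the tuning option under discussion would live in configuration; a hypothetical HOCON shape (the actual key paths in the PR's reference.conf may differ):

```hocon
# Hypothetical Bard client settings; key paths are illustrative.
bard {
  enabled: true
  connection-pool-size: 10
}
```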

Reviewer (Contributor):

I'm also nervous about this, but OK starting here and seeing how it performs in scale testing.

@Ghost-in-a-Jar force-pushed the tl_ID-1276_cromwell_bard_service branch from 12ab572 to 365ecda on June 6, 2024 15:15
@Ghost-in-a-Jar force-pushed the tl_ID-1276_cromwell_bard_service branch from 365ecda to dd91660 on June 6, 2024 15:17
@Ghost-in-a-Jar marked this pull request as ready for review on June 6, 2024 19:20
@Ghost-in-a-Jar requested a review from a team as a code owner on June 6, 2024 19:20
catch {
  // Sending events to Bard is a best-effort affair. If it fails, log the error and move on.
  case e: Exception =>
    logger.error(s"Failed to send event to Bard: ${e.getMessage}", e)
Reviewer (Contributor):

It would be great to emit a metric counting successes and failures here, so we have something to monitor to ensure the integration is working.
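A standalone sketch of the suggested success/failure counting; Cromwell has its own instrumentation service, so this self-contained counter only illustrates the shape of the metric:

```scala
import java.util.concurrent.atomic.LongAdder

// Illustrative metric holder; in the real PR this would route through
// Cromwell's instrumentation rather than local counters.
final class BardSendMetrics {
  private val successes = new LongAdder
  private val failures = new LongAdder

  // Run the send and record whether it succeeded or threw.
  def recordSend(send: () => Unit): Unit =
    try { send(); successes.increment() }
    catch { case _: Exception => failures.increment() }

  def successCount: Long = successes.sum()
  def failureCount: Long = failures.sum()
}
```

Monitoring the failure count would reveal a broken integration even though individual failures are swallowed as best-effort.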



@@ -32,7 +32,6 @@
package cromwell.backend.impl.aws

import java.util.UUID

Reviewer (Contributor):

Can you back out the whitespace changes to these files that were not otherwise changed?

@@ -1745,6 +1750,169 @@ class PipelinesApiAsyncBackendJobExecutionActorSpec

}

private def setupBackend: TestablePipelinesApiJobExecutionActor = {
Reviewer (Contributor):

Nice tests!
