Only execute isAlive once per timeout #4220

Merged: 8 commits into broadinstitute:develop on Oct 25, 2018

Conversation

@ffinfo (Contributor, Author) commented Oct 10, 2018

Follow-up on #4112.

This will considerably reduce the load on the JVM.

I did run a stress test on our system with 50,000 async qsub/qstat jobs, but that was outside the JVM. Inside the JVM these calls end up blocking Cromwell's threads.

With the timeout set to 120 seconds, isAlive will only run once every 120 seconds.
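A minimal sketch of the idea, assuming the SharedFileSystemRunStatus shape from the diff below; the expired helper mirrors the reviewed code, while PollSketch, checkStatus, and the Failed fallback are illustrative:

import java.time.{LocalDateTime, ZoneId}
import java.util.Calendar

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  // True once timeoutSeconds have elapsed since this status was recorded.
  def expired(timeoutSeconds: Int): Boolean =
    LocalDateTime
      .ofInstant(date.toInstant, ZoneId.systemDefault())
      .plusSeconds(timeoutSeconds)
      .isBefore(LocalDateTime.now())
}

object PollSketch {
  // Hypothetical polling helper: only run the expensive isAlive check once
  // the cached status is older than the timeout; otherwise reuse the cache.
  def checkStatus(cached: SharedFileSystemRunStatus,
                  timeoutSeconds: Int,
                  isAlive: () => Boolean): SharedFileSystemRunStatus =
    if (cached.expired(timeoutSeconds)) {
      if (isAlive()) SharedFileSystemRunStatus(cached.status, Calendar.getInstance())
      else SharedFileSystemRunStatus("Failed", Calendar.getInstance())
    } else cached
}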

@ffinfo (Contributor, Author) commented Oct 10, 2018

OK, this also does not fix the CPU load. I'm trying to debug this but can't see where the load comes from. Maybe the Calendar call?

Right now I'm trying to create an extra poll queue just for isAlive, although with that the status is no longer passed to the normal polling. I think this is because the handle stays in memory until the jobActor is finished, so starting a new queue means making a copy of the handle.

@geoffjentry (Contributor) commented:

Hi @ffinfo - to help preserve the sanity of @gemmalam (and thus indirectly my own sanity!) would you be amenable to closing this until you think it's ready? If I've misunderstood your comment and you think this is ready to go, feel free to ignore me. :)

@ffinfo (Contributor, Author) commented Oct 15, 2018

@geoffjentry We came to the conclusion that my earlier change did not cause the performance issue. I will open a new issue for that performance problem.

Still, the code I made here is useful and ready to review/merge.

@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  def experired(timeoutSeconds: Int): Boolean = {
Reviewer (Contributor):

expired ?

@ffinfo (Author):

Will fix this

@@ -1002,7 +1002,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
the state names.
*/
val prevStateName = previousStatus.map(_.toString).getOrElse("-")
jobLogger.info(s"Status change from $prevStateName to $status")
if (prevStateName == status.toString) jobLogger.debug(s"Status change from $prevStateName to $status")
else jobLogger.info(s"Status change from $prevStateName to $status")
Reviewer (Contributor):

This has been changed recently in 3fd5b04 (which is likely the reason why your PR has a merge conflict).

@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  def experired(timeoutSeconds: Int): Boolean = {
    val currentDate = Calendar.getInstance()
Reviewer (Contributor):

I think something like LocalDateTime would make this a bit more readable:

def expired(timeoutInSeconds: Int): Boolean = {
    // requires: import java.time.{LocalDateTime, ZoneId}
    LocalDateTime
      .ofInstant(date.toInstant, ZoneId.systemDefault())
      .plusSeconds(timeoutInSeconds)
      .isBefore(LocalDateTime.now())
  }
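For example (illustrative values), a status stamped just now only reports expired once the window has elapsed:

SharedFileSystemRunStatus("Running", Calendar.getInstance()).expired(120) // false until ~120 s have passed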

@ffinfo (Author):

Good one, didn't like the version but it did work. Will change this


else SharedFileSystemRunStatus("WaitingForReturnCode")
exitCodeTimeout match {
  case Some(timeout) =>
    if (s.experired(timeout)) s
Reviewer (Contributor):

I'm not sure I understand the logic; this seems to say "if we're past the timeout, do nothing, otherwise check if the job is still alive". Isn't that the opposite of the intent?

@ffinfo (Author):

Sorry about this, the logic was indeed inverted here, will fix this

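A minimal sketch of the intended (un-inverted) branch, reusing the definitions from the sketch in the description; the resolve wrapper and the isAlive stand-in for the backend's check-alive call are illustrative:

// Hypothetical wrapper around the reviewed branch.
def resolve(s: SharedFileSystemRunStatus,
            exitCodeTimeout: Option[Int],
            isAlive: () => Boolean): SharedFileSystemRunStatus =
  exitCodeTimeout match {
    case Some(timeout) =>
      if (s.expired(timeout)) {
        // Past the timeout: actually verify the job is still alive.
        if (isAlive()) SharedFileSystemRunStatus(s.status, Calendar.getInstance())
        else SharedFileSystemRunStatus("Failed", Calendar.getInstance())
      } else s // Within the window: skip isAlive and reuse the cached status.
    case None => s
  }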

@ffinfo (Contributor, Author) commented Oct 22, 2018

@Horneth Sorry for the spam caused by the GitHub issue ;)

I have processed the comments, could you take another look?

@cpavanrun (Contributor) commented:

Looking forward to this feature. When enabled, the check-alive command invoked via exit-code-timeout-seconds currently polls on average once every 10 seconds per running job (under minimum load).

@Horneth (Contributor) left a review:

LGTM once the logging change is reverted

@@ -1006,7 +1006,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
// the state names.
// This logging and metadata publishing assumes that StandardAsyncRunState subtypes `toString` nicely to state names.
val prevStatusName = previousState.map(_.toString).getOrElse("-")
jobLogger.info(s"Status change from $prevStatusName to $state")
if (prevStatusName == state.toString) jobLogger.debug(s"Status change from $prevStatusName to $state")
Reviewer (Contributor):

I don't think this change is needed; the if (!(previousState exists statusEquivalentTo(state))) on line 1004 already makes sure we log this at the right time.

@ffinfo (Author):

I was not aware this had been changed; all I did was fix the merge conflicts ;)
I removed this change again, and it does indeed seem to work fine.

@ffinfo (Contributor, Author) commented Oct 24, 2018

@Horneth @cpavanrun
Should be good to merge now. I only added an entry to the changelog and fixed a link in the release 36 changelog ;)

CHANGELOG.md (outdated review thread, resolved)
case _ => s
}
case Some(s) if s.status == "Done" => s // Nothing to be done here
jobLogger.error(s"Return file not found after ${exitCodeTimeout.getOrElse("-")} seconds, assuming external kill")
Reviewer (Contributor):

Just thinking out loud -- at this point, the exitCodeTimeout has to exist, is that right? We can't enter this case unless the job has exceeded the timeout? So theoretically one should never see an error that looks like: Return file not found after - seconds, assuming external kill

@ffinfo (Author):

This is true when exit-code-timeout-seconds is not set. Once it is set, the condition becomes true when the timeout has passed, and so we enter this case.
Under the right conditions, where a job is never lost, this case would never be reached. Sadly, not all systems are that reliable ;)
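A sketch of why the "-" fallback should not surface in practice, assembling the match shape from the diff fragments quoted above and reusing the earlier definitions; resolveMissingReturnCode and the logError stand-in for jobLogger.error are illustrative:

def resolveMissingReturnCode(previousStatus: Option[SharedFileSystemRunStatus],
                             exitCodeTimeout: Option[Int],
                             logError: String => Unit): SharedFileSystemRunStatus =
  previousStatus match {
    case Some(s) if s.status == "Done" => s // Nothing to be done here
    case Some(s) =>
      exitCodeTimeout match {
        // This branch requires both Some(timeout) and an expired timeout, so
        // a getOrElse("-") fallback on exitCodeTimeout can never print "-" here.
        case Some(timeout) if s.expired(timeout) =>
          logError(s"Return file not found after $timeout seconds, assuming external kill")
          SharedFileSystemRunStatus("Failed", Calendar.getInstance())
        case _ => s
      }
    case None => SharedFileSystemRunStatus("WaitingForReturnCode", Calendar.getInstance())
  }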

@ruchim (Contributor) commented Oct 24, 2018

👍

Something to consider for the future -- a user may want to see how often their jobs failed due to timeouts, so it might be interesting to mark this as a new state other than Failed, but it works perfectly well for the goal of this PR.

Approved with PullApprove

Co-Authored-By: ffinfo <pjrvanthof@gmail.com>
@ruchim merged commit 592f5c6 into broadinstitute:develop on Oct 25, 2018