Only execute isAlive once per timeout #4220

Merged: 8 commits into broadinstitute:develop on Oct 25, 2018

Conversation

@ffinfo (Contributor, Author) commented Oct 10, 2018

Follow-up on #4112.

This will considerably reduce the load on the JVM.

I did run a stress test on our system with 50,000 async qsub/qstat jobs, but that was outside the JVM. Inside the JVM these calls end up blocking Cromwell's threads.

With the timeout set to 120 seconds, isAlive will only run once every 120 seconds.
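A minimal sketch of the idea, assuming the SharedFileSystemRunStatus shape from the diff below; the expired helper mirrors the reviewed code, while PollSketch, checkStatus, and the Failed fallback are illustrative:

import java.time.{LocalDateTime, ZoneId}
import java.util.Calendar

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  // True once timeoutSeconds have elapsed since this status was recorded.
  def expired(timeoutSeconds: Int): Boolean =
    LocalDateTime
      .ofInstant(date.toInstant, ZoneId.systemDefault())
      .plusSeconds(timeoutSeconds)
      .isBefore(LocalDateTime.now())
}

object PollSketch {
  // Hypothetical polling helper: only run the expensive isAlive check once
  // the cached status is older than the timeout; otherwise reuse the cache.
  def checkStatus(cached: SharedFileSystemRunStatus,
                  timeoutSeconds: Int,
                  isAlive: () => Boolean): SharedFileSystemRunStatus =
    if (cached.expired(timeoutSeconds)) {
      if (isAlive()) SharedFileSystemRunStatus(cached.status, Calendar.getInstance())
      else SharedFileSystemRunStatus("Failed", Calendar.getInstance())
    } else cached
}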

@ffinfo (Contributor, Author) commented Oct 10, 2018

OK, this also does not fix the CPU load. I'm trying to debug this but can't see where the load comes from. Maybe the Calendar call?

Right now I'm trying to create an extra poll queue just for isAlive, although with that the status is no longer passed to the normal polling. I think this is because the handle stays in memory until the jobActor is finished, so starting a new queue means making a copy of the handle.

@geoffjentry (Contributor) commented:

Hi @ffinfo - to help preserve the sanity of @gemmalam (and thus indirectly my own sanity!) would you be amenable to closing this until you think it's ready? If I've misunderstood your comment and you think this is ready to go, feel free to ignore me. :)

@ffinfo (Contributor, Author) commented Oct 15, 2018

@geoffjentry We came to the conclusion that my earlier change did not cause the performance issue. I will open a new issue for that performance problem.

Still, the code I made here is useful and ready to review/merge.

@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  def experired(timeoutSeconds: Int): Boolean = {
Reviewer (Contributor):

expired ?

@ffinfo (Author):

Will fix this

@@ -1002,7 +1002,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
the state names.
*/
val prevStateName = previousStatus.map(_.toString).getOrElse("-")
jobLogger.info(s"Status change from $prevStateName to $status")
if (prevStateName == status.toString) jobLogger.debug(s"Status change from $prevStateName to $status")
else jobLogger.info(s"Status change from $prevStateName to $status")
Reviewer (Contributor):

This has been changed recently in 3fd5b04 (which is likely the reason why your PR has a merge conflict).

@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

case class SharedFileSystemRunStatus(status: String, date: Calendar) {
  override def toString: String = status

  def experired(timeoutSeconds: Int): Boolean = {
    val currentDate = Calendar.getInstance()
Reviewer (Contributor):

I think something like LocalDateTime would make this a bit more readable:

def expired(timeoutInSeconds: Int): Boolean = {
    // requires: import java.time.{LocalDateTime, ZoneId}
    LocalDateTime
      .ofInstant(date.toInstant, ZoneId.systemDefault())
      .plusSeconds(timeoutInSeconds)
      .isBefore(LocalDateTime.now())
  }
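For example (illustrative values), a status stamped just now only reports expired once the window has elapsed:

SharedFileSystemRunStatus("Running", Calendar.getInstance()).expired(120) // false until ~120 s have passed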

@ffinfo (Author):

Good one, didn't like the version but it did work. Will change this


else SharedFileSystemRunStatus("WaitingForReturnCode")
exitCodeTimeout match {
  case Some(timeout) =>
    if (s.experired(timeout)) s
Reviewer (Contributor):

I'm not sure I understand the logic; this seems to say "if we're past the timeout, do nothing, otherwise check if the job is still alive". Isn't that the opposite of the intent?

@ffinfo (Author):

Sorry about this, the logic was indeed inverted here, will fix this

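A minimal sketch of the intended (un-inverted) branch, reusing the definitions from the sketch in the description; the resolve wrapper and the isAlive stand-in for the backend's check-alive call are illustrative:

// Hypothetical wrapper around the reviewed branch.
def resolve(s: SharedFileSystemRunStatus,
            exitCodeTimeout: Option[Int],
            isAlive: () => Boolean): SharedFileSystemRunStatus =
  exitCodeTimeout match {
    case Some(timeout) =>
      if (s.expired(timeout)) {
        // Past the timeout: actually verify the job is still alive.
        if (isAlive()) SharedFileSystemRunStatus(s.status, Calendar.getInstance())
        else SharedFileSystemRunStatus("Failed", Calendar.getInstance())
      } else s // Within the window: skip isAlive and reuse the cached status.
    case None => s
  }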

@ffinfo (Contributor, Author) commented Oct 22, 2018

@Horneth Sorry for the spam caused by the GitHub issue ;)

I have processed the comments, could you take another look?

@cpavanrun (Contributor) commented:

Looking forward to this feature. When enabled, the check-alive command invoked via exit-code-timeout-seconds currently polls on average once every 10 seconds per running job (under minimum load).

@Horneth (Contributor) left a review:

LGTM once the logging change is reverted

@@ -1006,7 +1006,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
// the state names.
// This logging and metadata publishing assumes that StandardAsyncRunState subtypes `toString` nicely to state names.
val prevStatusName = previousState.map(_.toString).getOrElse("-")
jobLogger.info(s"Status change from $prevStatusName to $state")
if (prevStatusName == state.toString) jobLogger.debug(s"Status change from $prevStatusName to $state")
Reviewer (Contributor):

I don't think this change is needed; the if (!(previousState exists statusEquivalentTo(state))) on line 1004 already makes sure we log this at the right time.

@ffinfo (Author):

I was not aware this had been changed; all I did was fix the merge conflicts ;)
I removed this change again, and it does indeed seem to work fine.

@ffinfo (Contributor, Author) commented Oct 24, 2018

@Horneth @cpavanrun
Should be good to merge now. I only added an entry to the changelog and fixed a link in the release 36 changelog ;)

CHANGELOG.md (outdated review thread, resolved)
case _ => s
}
case Some(s) if s.status == "Done" => s // Nothing to be done here
jobLogger.error(s"Return file not found after ${exitCodeTimeout.getOrElse("-")} seconds, assuming external kill")
Reviewer (Contributor):

Just thinking out loud -- at this point, the exitCodeTimeout has to exist, is that right? We can't enter this case unless the job has exceeded the timeout? So theoretically one should never see an error that looks like: Return file not found after - seconds, assuming external kill

@ffinfo (Author):

This is true when exit-code-timeout-seconds is not set. Once it is set, the condition becomes true when the timeout has passed, and so we enter this case.
Under the right conditions, where a job is never lost, this case would never be reached. Sadly, not all systems are that reliable ;)
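A sketch of why the "-" fallback should not surface in practice, assembling the match shape from the diff fragments quoted above and reusing the earlier definitions; resolveMissingReturnCode and the logError stand-in for jobLogger.error are illustrative:

def resolveMissingReturnCode(previousStatus: Option[SharedFileSystemRunStatus],
                             exitCodeTimeout: Option[Int],
                             logError: String => Unit): SharedFileSystemRunStatus =
  previousStatus match {
    case Some(s) if s.status == "Done" => s // Nothing to be done here
    case Some(s) =>
      exitCodeTimeout match {
        // This branch requires both Some(timeout) and an expired timeout, so
        // a getOrElse("-") fallback on exitCodeTimeout can never print "-" here.
        case Some(timeout) if s.expired(timeout) =>
          logError(s"Return file not found after $timeout seconds, assuming external kill")
          SharedFileSystemRunStatus("Failed", Calendar.getInstance())
        case _ => s
      }
    case None => SharedFileSystemRunStatus("WaitingForReturnCode", Calendar.getInstance())
  }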

@ruchim (Contributor) commented Oct 24, 2018

👍

Something to consider for the future -- a user may want to see how often their jobs failed due to timeouts, so it might be interesting to mark this as a new state other than Failed, but it works perfectly well for the goal of this PR.

Approved with PullApprove

Co-Authored-By: ffinfo <pjrvanthof@gmail.com>
@ruchim merged commit 592f5c6 into broadinstitute:develop on Oct 25, 2018