Only execute isAlive once per timeout #4220
Conversation
OK, this also does not fix the CPU load. I'm trying to debug this but can't see where the load comes from. Maybe the Calendar call? Right now I'm trying to create an extra poll queue just for isAlive. Although with that, the status is no longer passed to the normal polling. I think this is because the handle stays in memory until the jobActor is finished, so starting a new queue means a copy of the handle.
@geoffjentry We came to the conclusion that my earlier change did not cause the performance issue. I will make a new issue for this performance problem. Still, the code I made here is useful and is ready to review/merge.
```diff
@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

 case class SharedFileSystemRunStatus(status: String, date: Calendar) {
   override def toString: String = status

   def experired(timeoutSeconds: Int): Boolean = {
```
`expired`?
Will fix this
```diff
@@ -1002,7 +1002,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
     the state names.
     */
     val prevStateName = previousStatus.map(_.toString).getOrElse("-")
-    jobLogger.info(s"Status change from $prevStateName to $status")
+    if (prevStateName == status.toString) jobLogger.debug(s"Status change from $prevStateName to $status")
+    else jobLogger.info(s"Status change from $prevStateName to $status")
```
This was changed recently in 3fd5b04 (which is likely the reason your PR has a merge conflict).
```diff
@@ -17,6 +17,12 @@ import scala.util.{Failure, Success, Try}

 case class SharedFileSystemRunStatus(status: String, date: Calendar) {
   override def toString: String = status

   def experired(timeoutSeconds: Int): Boolean = {
     val currentDate = Calendar.getInstance()
```
I think something like `LocalDateTime` would make this a bit more readable:

```scala
def expired(timeoutInSeconds: Int) = {
  LocalDateTime
    .ofInstant(date.toInstant, ZoneId.systemDefault())
    .plusSeconds(timeoutInSeconds)
    .isBefore(LocalDateTime.now())
}
```
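For illustration, a hypothetical call site for this helper (the 120-second value comes from the PR description; the renamed `expired` and the surrounding polling code are assumptions):

```scala
// Usage sketch: isAlive only needs to run again once the cached
// timestamp is older than the configured timeout.
val status = SharedFileSystemRunStatus("WaitingForReturnCode", Calendar.getInstance())
if (status.expired(120)) {
  // more than 120 seconds since the last check: re-run isAlive
}
```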
Good one, I didn't like that version but it did work. Will change this.
```diff
         else SharedFileSystemRunStatus("WaitingForReturnCode")
       exitCodeTimeout match {
         case Some(timeout) =>
           if (s.experired(timeout)) s
```
I'm not sure I understand the logic; this seems to say "if we're past the timeout, do nothing, otherwise check if the job is still alive". Isn't that the opposite of the intent?
Sorry about this, the logic was a bit inverted here, will fix this.
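A minimal sketch of what the un-inverted branch could look like, assuming the names from the diff above (`isAlive` is passed by name as a stand-in for the backend-specific liveness check, and the `"Failed"` status string is an assumption):

```scala
// Sketch only: relies on the SharedFileSystemRunStatus case class
// from this PR, with the expired method name fixed.
def checkStatus(s: SharedFileSystemRunStatus,
                exitCodeTimeout: Option[Int],
                isAlive: => Boolean): SharedFileSystemRunStatus =
  exitCodeTimeout match {
    case Some(timeout) if s.expired(timeout) =>
      // Past the timeout: run the expensive liveness check, and refresh
      // the timestamp so the next check waits a full window again.
      if (isAlive) SharedFileSystemRunStatus(s.status, Calendar.getInstance())
      else SharedFileSystemRunStatus("Failed", Calendar.getInstance())
    case _ => s // within the window (or no timeout set): reuse cached status
  }
```

Refreshing the timestamp after a successful check is what makes isAlive run at most once per timeout window.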
@Horneth Sorry for the spam caused by the GitHub issue ;) I have processed the comments, could you look at it again?
Looking forward to this feature. When enabled the …
LGTM once the logging change is reverted
```diff
@@ -1006,7 +1006,8 @@ trait StandardAsyncExecutionActor extends AsyncBackendJobExecutionActor with Sta
     // the state names.
     // This logging and metadata publishing assumes that StandardAsyncRunState subtypes `toString` nicely to state names.
     val prevStatusName = previousState.map(_.toString).getOrElse("-")
-    jobLogger.info(s"Status change from $prevStatusName to $state")
+    if (prevStatusName == state.toString) jobLogger.debug(s"Status change from $prevStatusName to $state")
```
I don't think this change is needed; the `if (!(previousState exists statusEquivalentTo(state)))` on line 1004 already makes sure we log this at the right time.
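For reference, a minimal sketch of the guard being referred to (the condition and log line are taken from the quoted comment and diff; the surrounding method is assumed):

```scala
// If the new state is equivalent to the previous one, the whole block,
// including the log statement, is skipped, so the info log only fires
// on an actual status change.
if (!(previousState exists statusEquivalentTo(state))) {
  val prevStatusName = previousState.map(_.toString).getOrElse("-")
  jobLogger.info(s"Status change from $prevStatusName to $state")
}
```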
Was not aware this was changed. All I did was fix the merge conflicts ;)
I removed this change again; it does seem to work fine indeed.
@Horneth @cpavanrun
```diff
           case _ => s
         }
       case Some(s) if s.status == "Done" => s // Nothing to be done here
       jobLogger.error(s"Return file not found after ${exitCodeTimeout.getOrElse("-")} seconds, assuming external kill")
```
Just thinking out loud: at this point, `exitCodeTimeout` has to exist, is that right? We can't enter this case unless the job has exceeded the timeout? So theoretically one should never see an error that looks like `Return file not found after - seconds, assuming external kill`.
This is true when `exit-code-timeout-seconds` is not set yet. Once it is set, the value becomes `Some(true)` when the timeout has passed, and so we enter this case.
Under the right conditions, where a job is never lost, this case will never be reached. Sadly, not all systems are that reliable ;)
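For context, the timeout discussed here is driven by backend configuration; a hypothetical stanza might look like the following (the provider name and config path are assumptions; only the `exit-code-timeout-seconds` key and the 120-second value come from this thread):

```hocon
# Hypothetical backend stanza; only exit-code-timeout-seconds is
# taken from the discussion above.
backend.providers.SGE.config {
  # Treat a job as externally killed if its return-code file has not
  # appeared within this many seconds.
  exit-code-timeout-seconds = 120
}
```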
Co-Authored-By: ffinfo <pjrvanthof@gmail.com>
Follow up on #4112
This will reduce the load on the JVM a lot.
I did indeed run a stress test on our system with 50,000 async qsub/qstat jobs, but this was outside the JVM. Inside the JVM this ends up blocking threads in Cromwell.
When the timeout is set to 120 seconds, `isAlive` will only run once every 120 seconds.
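To make the effect concrete, here is a small self-contained sketch of the throttling pattern (simplified names; an illustration of the idea, not the merged code):

```scala
import java.time.{LocalDateTime, ZoneId}
import java.util.Calendar

// Simplified stand-in for SharedFileSystemRunStatus.
case class RunStatus(status: String, date: Calendar) {
  def expired(timeoutSeconds: Int): Boolean =
    LocalDateTime
      .ofInstant(date.toInstant, ZoneId.systemDefault())
      .plusSeconds(timeoutSeconds)
      .isBefore(LocalDateTime.now())
}

// Polling may fire every few seconds, but the expensive liveness
// check (passed by name) runs at most once per timeout window.
def poll(cached: RunStatus, timeoutSeconds: Int)(isAlive: => Boolean): RunStatus =
  if (!cached.expired(timeoutSeconds)) cached                  // fresh: skip isAlive
  else if (isAlive) cached.copy(date = Calendar.getInstance()) // refresh the window
  else RunStatus("Failed", Calendar.getInstance())             // assume external kill
```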