-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
log job lifecycle #9130
log job lifecycle #9130
Conversation
Build failed.
|
Build failed.
|
Build failed.
|
What are you thoughts on the information output to if there is additional information helpful for debugging purposes, we could include them. |
@fosterseth that looks really good and should be super helpful, this should be exposed to awx logger by default right? |
awx/main/models/unified_jobs.py
Outdated
@@ -397,6 +397,7 @@ def create_unified_job(self, **kwargs): | |||
# credentials and labels | |||
unified_job.save() | |||
|
|||
logger_job_lifecycle.info(f"{unified_job._meta.model_name}-{unified_job.id} created", extra={'type': unified_job._meta.model_name, 'template_name': unified_job.unified_job_template.name, 'job_id': unified_job.id, 'state': 'created'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this a function on model.UnifiedJob
and use it everywhere below e.g.,
def log_lifecycle(self, status, **extra):
....
uj.log_lifecycle('created')
uj.log_lifecycle('blocked', blocked_by='blah blah blah')
awx/main/scheduler/task_manager.py
Outdated
logger.debug("{} is blocked from running".format(task.log_format)) | ||
logger_job_lifecycle.info(f"{task._meta.model_name}-{task.id} blocked by {blocked_by._meta.model_name}-{blocked_by.id}", extra={'type': task._meta.model_name, 'template_name': task.unified_job_template.name, 'job_id': task.id, 'state': 'blocked', 'blocked_by': f"{blocked_by._meta.model_name}-{blocked_by.id}"}) | ||
job_explanation = str.format(f"waiting for {blocked_by._meta.model_name}-{blocked_by.id} to finish") | ||
if task.job_explanation != job_explanation and task.created < (tz_now() - timedelta(seconds=30)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, I think this is a reasonable approach to avoid a stampede of updates.
awx/main/scheduler/task_manager.py
Outdated
@@ -539,6 +552,10 @@ def process_pending_tasks(self, pending_tasks): | |||
logger.debug("No instance available in group {} to run job {} w/ capacity requirement {}".format( | |||
rampart_group.name, task.log_format, task.task_impact)) | |||
if not found_acceptable_queue: | |||
job_explanation = "This job is not ready to start because there is not enough available capacity." | |||
if task.job_explanation != job_explanation and task.created < (tz_now() - timedelta(seconds=30)): | |||
task.job_explanation = "This job is not ready to start because there is not enough available capacity." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nitpick, but task.job_explanation = job_explanation
no need to duplicate this string.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch thanks. I also realize I need to unset/change this job_explanation when we start the task
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, good call. We may want to make sure that when a job goes into waiting
, we unset this explicitly.
awx/main/scheduler/task_manager.py
Outdated
@@ -539,6 +552,10 @@ def process_pending_tasks(self, pending_tasks): | |||
logger.debug("No instance available in group {} to run job {} w/ capacity requirement {}".format( | |||
rampart_group.name, task.log_format, task.task_impact)) | |||
if not found_acceptable_queue: | |||
job_explanation = "This job is not ready to start because there is not enough available capacity." |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This string should be wrapped in i18n _(...)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if this is still relevant
Line 806 in dfa6522
# Because this string is saved in the db in the source language, |
does this mean we shouldn't mark job_explanation for translation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey, good catch! Yep, I think you're right about this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in this case we might actually want to wrap this in ugettext_noop
(not ugettext_lazy
). This will mark the string for translation (so it shows up in our generated i18n files), but it won't actually translate it here.
https://docs.djangoproject.com/en/2.2/ref/utils/#django.utils.translation.ugettext_noop
awx/main/tasks.py
Outdated
@@ -1186,16 +1186,19 @@ def pre_run_hook(self, instance, private_data_dir): | |||
''' | |||
Hook for any steps to run before the job/task starts | |||
''' | |||
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} pre run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'pre_run'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the use of the hooks for this 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of 'job_id'
could we use self.event_data_key
for accuracy for all types of "jobs"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thinking about parsing this json -- would we want a fixed key here across all entries? maybe it could be named task_id
to be more generic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yea, that's a good point. If we want to stick with job_id
, that probably makes sense considering I think there's additional context in here that tells you what type of job it is.
awx/main/tasks.py
Outdated
|
||
def post_run_hook(self, instance, status): | ||
''' | ||
Hook for any steps to run before job/task is marked as complete. | ||
''' | ||
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} post run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'post_run'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as above here.
awx/main/tasks.py
Outdated
|
||
def final_run_hook(self, instance, status, private_data_dir, fact_modification_times, isolated_manager_instance=None): | ||
''' | ||
Hook for any steps to run after job/task is marked as complete. | ||
''' | ||
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} finalize run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'finalize_run'}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(and here)
awx/settings/defaults.py
Outdated
@@ -940,6 +943,15 @@ def IS_TESTING(argv=None): | |||
'backupCount': 5, | |||
'formatter':'simple', | |||
}, | |||
'job_lifecycle': { | |||
'level': 'INFO', | |||
'class':'logging.handlers.RotatingFileHandler', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to sync up w/ @chrismeyersfsu on his impending logrotate work, as he's been planning on removing these rotating file handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This definitely needs an update now that @chrismeyersfsu has merged his log rotation PR, but I don't expect anything super complicated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is coming along nicely - I've got some minor feedback.
Build failed.
|
Build succeeded.
|
Build failed.
|
I like it 😄 |
if self.unified_job_template: | ||
extra["template_name"] = self.unified_job_template.name | ||
if state == "blocked" and blocked_by: | ||
blocked_by_msg = f"{blocked_by._meta.model_name}-{blocked_by.id}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've got a property that's pretty similar to this that we use in other logging contexts:
UnifiedJob.log_format
It also includes the current status of the job. If that's too verbose, that's fine, but it might be an opportunity for reuse here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [1]: Job.objects.first().log_format
Out[1]: 'job 1 (successful)'```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In [4]: [j.log_format for j in UnifiedJob.objects.all()]
Out[4]:
['job 1 (successful)',
'project_update 2 (successful)',
'ad_hoc_command 3 (successful)',
'ad_hoc_command 4 (successful)',
'ad_hoc_command 5 (successful)',
'job 6 (successful)',
'project_update 7 (successful)']```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really great work on this - I like how it turned out, and this is going to be really useful to figure out why jobs aren't running yet.
I'd say once we've tossed this through an integration run, I'm a 👍 on merging this.
@jakemcdermott had mentioned to me the possibility of folding his work on #8726 into this PR and verifying and landing them around the same time. Is this something you two have discussed as an option? |
I've gone back and forth on whether we should provide a waiting_on = models.ForeignKey(
'self',
null=True,
editable=False,
related_name='+',
on_delete=polymorphic.SET_NULL,
) Currently, we can stuff that info into the Because waiting_on could be a projectupdate/job/inventorysource, we would also need another field in the json that identifies the type of resource, so that we can get the appropriate link. @jakemcdermott and I brainstormed some ideas how to display this link. One thought was, if waiting_on is null, then just display job_explanation. If not null, then have a predefined string with the link to the waiting_on resource (e.g. "waiting on <project update 123>"). The downside of this is that there might be case we want job_explanation information to take a priority over waiting_on information, so having these two fields compete for the same space in the UI is not ideal. Having a separate "Waiting On" field in the UI solves this, but then again it's yet another field on the detail page, introducing more clutter. Maybe we should punt the UI link idea until later, and settle with the generic text in the job_explanation for now? |
@fosterseth before we merge this, can we squash a few commits (or at least, improve a few of the latter commit messages)? |
@ryanpetrello Yep I plan to squash them all into one |
158595f
to
269facf
Compare
Build failed.
|
Build succeeded.
|
states for all jobs
special processing for some jobs
stuck in pending
|
blocked_by = self.job_blocked_by(task) | ||
if blocked_by: | ||
task.log_lifecycle("blocked", blocked_by=blocked_by) | ||
job_explanation = gettext_noop(f"waiting for {blocked_by._meta.model_name}-{blocked_by.id} to finish") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to make the blocked by job name to become a link in the UI so we can easily navigate to it and follow if we want to?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
exactly, we plan to make this a link in the UI, which would be helpful to users
Build succeeded.
|
f30baff
to
17e2129
Compare
Build succeeded.
|
6dfb396
to
82db4bd
Compare
Build succeeded.
|
Build succeeded.
|
was curious how big this file gets -- here it is after running the yolo tests
|
0686062
to
227750a
Compare
Build succeeded.
|
227750a
to
978b2b1
Compare
Build succeeded.
|
978b2b1
to
78f4f1e
Compare
Build succeeded.
|
Various points (e.g. created, running, processing events), are structured into json format and output to /var/log/tower/job_lifecycle.log As part of this work, the DependencyGraph is reworked to return which job object is doing the blocking, rather than a boolean.
78f4f1e
to
41d0a2f
Compare
Build succeeded.
|
Build succeeded (gate pipeline).
|
feature issue #8749
SUMMARY
example output in
/var/log/tower/job_lifecycle.log
these events also show up in the stdout log
awx_1 | 2021-02-03 18:41:05,656 DEBUG awx.analytics.job_lifecycle job-740 created
Dependency Graph Changes
The
DependencyGraph
tells us what is blocked, but doesn't tell us what is doing the blocking. I rewrote it to expose that information.e.g. here is the new representation in
dependency_graph.data
now instead of
is_job_blocked
in the task manager, we havejob_blocked_by
which returns aUnifiedJob
object, orNone
if it is not blocked.Job Explanation Changes
We can also update
job_explanation
with some information about why a job is blocked.e.g.
waiting until the project update finishes
To reduce performance loss with
task.save()
, we only update thisjob_explanation
if it is different from the current job_explanation, and if more than 30 seconds have passed sincejob.created
.There is a PR open that will display job explanation in the UI, so this might be our mechanism of displaying information to the user.
#8726
ISSUE TYPE
COMPONENT NAME
AWX VERSION