Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

log job lifecycle #9130

Conversation

fosterseth
Copy link
Member

@fosterseth fosterseth commented Jan 21, 2021

feature issue #8749

SUMMARY

example output in /var/log/tower/job_lifecycle.log

{"type": "job", "task_id": 353, "state": "created", "template_name": "sleep", "time": "2021-01-28T04:25:46.661029+00:00"}
{"type": "job", "task_id": 353, "state": "acknowledged", "template_name": "sleep", "time": "2021-01-28T04:25:47.125391+00:00"}
{"type": "job", "task_id": 350, "state": "blocked", "template_name": "sleep", "blocked_by": "projectupdate-351", "time": "2021-01-28T04:25:47.166337+00:00"}
{"type": "job", "task_id": 352, "state": "blocked", "template_name": "sleep", "blocked_by": "projectupdate-351", "time": "2021-01-28T04:25:47.176640+00:00"}
{"type": "job", "task_id": 353, "state": "blocked", "template_name": "sleep", "blocked_by": "projectupdate-351", "time": "2021-01-28T04:25:47.177712+00:00"}

these events also show up in the stdout log
awx_1 | 2021-02-03 18:41:05,656 DEBUG awx.analytics.job_lifecycle job-740 created

Dependency Graph Changes

The DependencyGraph tells us what is blocked, but doesn't tell us what is doing the blocking. I rewrote it to expose that information.

e.g. here is the new representation in dependency_graph.data

{'inventory_source_ids': {},
 'inventory_source_updates': {},
 'inventory_updates': {},
 'job_template_jobs': {},
 'project_updates': {12: <ProjectUpdate: 2021-01-14 05:48:48.599418+00:00-2-pending>},
 'system_job': None,
 'workflow_job_template_jobs': {}}

now instead of is_job_blocked in the task manager, we have job_blocked_by which returns a UnifiedJob object, or None if it is not blocked.

Job Explanation Changes

We can also update job_explanation with some information about why a job is blocked.
e.g.
waiting until the project update finishes

To reduce performance loss with task.save(), we only update this job_explanation if it is different from the current job_explanation, and if more than 30 seconds have passed since job.created.

There is a PR open that will display job explanation in the UI, so this might be our mechanism of displaying information to the user.

#8726

ISSUE TYPE
  • Feature Pull Request
COMPONENT NAME
  • API
AWX VERSION
awx: 16.0.0

@fosterseth fosterseth changed the title Feat wip job lifecycle [wip] job lifecycle Jan 21, 2021
@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@fosterseth
Copy link
Member Author

@gamuniz

What are you thoughts on the information output to /var/log/tower/job_lifecycle.log (see above)

if there is additional information helpful for debugging purposes, we could include them.

@gamuniz
Copy link
Contributor

gamuniz commented Jan 25, 2021

@fosterseth that looks really good and should be super helpful, this should be exposed to awx logger by default right?

@@ -397,6 +397,7 @@ def create_unified_job(self, **kwargs):
# credentials and labels
unified_job.save()

logger_job_lifecycle.info(f"{unified_job._meta.model_name}-{unified_job.id} created", extra={'type': unified_job._meta.model_name, 'template_name': unified_job.unified_job_template.name, 'job_id': unified_job.id, 'state': 'created'})
Copy link
Contributor

@ryanpetrello ryanpetrello Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we make this a function on model.UnifiedJob and use it everywhere below e.g.,

def log_lifecycle(self, status, **extra):
    ....

uj.log_lifecycle('created')
uj.log_lifecycle('blocked', blocked_by='blah blah blah')

logger.debug("{} is blocked from running".format(task.log_format))
logger_job_lifecycle.info(f"{task._meta.model_name}-{task.id} blocked by {blocked_by._meta.model_name}-{blocked_by.id}", extra={'type': task._meta.model_name, 'template_name': task.unified_job_template.name, 'job_id': task.id, 'state': 'blocked', 'blocked_by': f"{blocked_by._meta.model_name}-{blocked_by.id}"})
job_explanation = str.format(f"waiting for {blocked_by._meta.model_name}-{blocked_by.id} to finish")
if task.job_explanation != job_explanation and task.created < (tz_now() - timedelta(seconds=30)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I think this is a reasonable approach to avoid a stampede of updates.

@@ -539,6 +552,10 @@ def process_pending_tasks(self, pending_tasks):
logger.debug("No instance available in group {} to run job {} w/ capacity requirement {}".format(
rampart_group.name, task.log_format, task.task_impact))
if not found_acceptable_queue:
job_explanation = "This job is not ready to start because there is not enough available capacity."
if task.job_explanation != job_explanation and task.created < (tz_now() - timedelta(seconds=30)):
task.job_explanation = "This job is not ready to start because there is not enough available capacity."
Copy link
Contributor

@ryanpetrello ryanpetrello Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nitpick, but task.job_explanation = job_explanation no need to duplicate this string.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch thanks. I also realize I need to unset/change this job_explanation when we start the task

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, good call. We may want to make sure that when a job goes into waiting, we unset this explicitly.

@@ -539,6 +552,10 @@ def process_pending_tasks(self, pending_tasks):
logger.debug("No instance available in group {} to run job {} w/ capacity requirement {}".format(
rampart_group.name, task.log_format, task.task_impact))
if not found_acceptable_queue:
job_explanation = "This job is not ready to start because there is not enough available capacity."
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This string should be wrapped in i18n _(...)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if this is still relevant

# Because this string is saved in the db in the source language,

does this mean we shouldn't mark job_explanation for translation?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey, good catch! Yep, I think you're right about this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well.

Copy link
Contributor

@ryanpetrello ryanpetrello Jan 26, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in this case we might actually want to wrap this in ugettext_noop (not ugettext_lazy). This will mark the string for translation (so it shows up in our generated i18n files), but it won't actually translate it here.

https://docs.djangoproject.com/en/2.2/ref/utils/#django.utils.translation.ugettext_noop

@@ -1186,16 +1186,19 @@ def pre_run_hook(self, instance, private_data_dir):
'''
Hook for any steps to run before the job/task starts
'''
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} pre run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'pre_run'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the use of the hooks for this 👍

Copy link
Contributor

@ryanpetrello ryanpetrello Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of 'job_id' could we use self.event_data_key for accuracy for all types of "jobs"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thinking about parsing this json -- would we want a fixed key here across all entries? maybe it could be named task_id to be more generic

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, that's a good point. If we want to stick with job_id, that probably makes sense considering I think there's additional context in here that tells you what type of job it is.


def post_run_hook(self, instance, status):
'''
Hook for any steps to run before job/task is marked as complete.
'''
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} post run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'post_run'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above here.


def final_run_hook(self, instance, status, private_data_dir, fact_modification_times, isolated_manager_instance=None):
'''
Hook for any steps to run after job/task is marked as complete.
'''
logger_job_lifecycle.info(f"{instance._meta.model_name}-{instance.id} finalize run", extra={'type': instance._meta.model_name, 'template_name': instance.unified_job_template.name, 'job_id': instance.id, 'state': 'finalize_run'})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(and here)

@@ -940,6 +943,15 @@ def IS_TESTING(argv=None):
'backupCount': 5,
'formatter':'simple',
},
'job_lifecycle': {
'level': 'INFO',
'class':'logging.handlers.RotatingFileHandler',
Copy link
Contributor

@ryanpetrello ryanpetrello Jan 25, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to sync up w/ @chrismeyersfsu on his impending logrotate work, as he's been planning on removing these rotating file handlers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely needs an update now that @chrismeyersfsu has merged his log rotation PR, but I don't expect anything super complicated.

Copy link
Contributor

@ryanpetrello ryanpetrello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is coming along nicely - I've got some minor feedback.

@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@fosterseth fosterseth changed the title [wip] job lifecycle log job lifecycle Jan 28, 2021
@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@fosterseth
Copy link
Member Author

image

in development environment, job_lifecycle logs can be blue, if we want that

@ryanpetrello
Copy link
Contributor

in development environment, job_lifecycle logs can be blue, if we want that

I like it 😄

if self.unified_job_template:
extra["template_name"] = self.unified_job_template.name
if state == "blocked" and blocked_by:
blocked_by_msg = f"{blocked_by._meta.model_name}-{blocked_by.id}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've got a property that's pretty similar to this that we use in other logging contexts:

UnifiedJob.log_format

It also includes the current status of the job. If that's too verbose, that's fine, but it might be an opportunity for reuse here.

Copy link
Contributor

@ryanpetrello ryanpetrello Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [1]: Job.objects.first().log_format
Out[1]: 'job 1 (successful)'```

Copy link
Contributor

@ryanpetrello ryanpetrello Jan 28, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In [4]: [j.log_format for j in UnifiedJob.objects.all()]
Out[4]:
['job 1 (successful)',
 'project_update 2 (successful)',
 'ad_hoc_command 3 (successful)',
 'ad_hoc_command 4 (successful)',
 'ad_hoc_command 5 (successful)',
 'job 6 (successful)',
 'project_update 7 (successful)']```

Copy link
Contributor

@ryanpetrello ryanpetrello left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really great work on this - I like how it turned out, and this is going to be really useful to figure out why jobs aren't running yet.

I'd say once we've tossed this through an integration run, I'm a 👍 on merging this.

@ryanpetrello
Copy link
Contributor

@fosterseth,

@jakemcdermott had mentioned to me the possibility of folding his work on #8726 into this PR and verifying and landing them around the same time. Is this something you two have discussed as an option?

@fosterseth
Copy link
Member Author

fosterseth commented Jan 28, 2021

I've gone back and forth on whether we should provide a waiting_on field that would identify the job that is causing the block. We could then provide that job as a link in the ui, which I think would be really nice

    waiting_on = models.ForeignKey(
        'self',
        null=True,
        editable=False,
        related_name='+',
        on_delete=polymorphic.SET_NULL,
    )

e.g.
image

Currently, we can stuff that info into the job_explanation, but it won't be a link in the UI or anything.

Because waiting_on could be a projectupdate/job/inventorysource, we would also need another field in the json that identifies the type of resource, so that we can get the appropriate link.

@jakemcdermott and I brainstormed some ideas how to display this link. One thought was, if waiting_on is null, then just display job_explanation. If not null, then have a predefined string with the link to the waiting_on resource (e.g. "waiting on <project update 123>"). The downside of this is that there might be case we want job_explanation information to take a priority over waiting_on information, so having these two fields compete for the same space in the UI is not ideal. Having a separate "Waiting On" field in the UI solves this, but then again it's yet another field on the detail page, introducing more clutter.

Maybe we should punt the UI link idea until later, and settle with the generic text in the job_explanation for now?

@ryanpetrello
Copy link
Contributor

ryanpetrello commented Jan 29, 2021

@fosterseth before we merge this, can we squash a few commits (or at least, improve a few of the latter commit messages)?

@fosterseth
Copy link
Member Author

@ryanpetrello Yep I plan to squash them all into one

@softwarefactory-project-zuul
Copy link
Contributor

Build failed.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@fosterseth
Copy link
Member Author

states for all jobs

  • created
  • acknowledged
  • waiting
  • pre run
  • preparing playbook
  • running playbook
  • post run
  • finalize run

special processing for some jobs

  • start job fact cache
  • finish job fact cache
  • event processing cache
  • notifications sent

stuck in pending

  • blocked by
  • needs capacity

blocked_by = self.job_blocked_by(task)
if blocked_by:
task.log_lifecycle("blocked", blocked_by=blocked_by)
job_explanation = gettext_noop(f"waiting for {blocked_by._meta.model_name}-{blocked_by.id} to finish")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to make the blocked by job name to become a link in the UI so we can easily navigate to it and follow if we want to?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, we plan to make this a link in the UI, which would be helpful to users

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@fosterseth
Copy link
Member Author

was curious how big this file gets -- here it is after running the yolo tests

[root@ip-10-0-4-116 tower]# ls -lh
total 1.7M
-rw-r-----. 1 awx  awx   52K Feb  3 02:58 callback_receiver.log
-rw-r-----. 1 awx  awx   99K Feb  3 02:58 dispatcher.log
-rw-r-----. 1 awx  awx     0 Feb  3 01:55 isolated_manager.log
-rw-r-----. 1 awx  awx  667K Feb  3 02:58 job_lifecycle.log
-rw-r-----. 1 awx  awx     0 Feb  3 01:55 management_playbooks.log
drwxr-xr-x. 2 awx  awx     6 Feb  3 02:14 playbook_profiling
-rw-------. 1 root root 658K Feb  3 02:11 setup-2021-02-03-01:50:34.log
-rw-r-----. 1 awx  awx   32K Feb  3 02:58 task_system.log
-rw-r-----. 1 awx  awx  212K Feb  3 03:05 tower.log
-rw-r-----. 1 awx  awx     0 Feb  3 01:55 tower_rbac_migrations.log
-rw-r-----. 1 awx  awx     0 Feb  3 01:55 tower_system_tracking_migrations.log
-rw-r-----. 1 awx  awx     0 Feb  3 01:55 wsbroadcast.log

@fosterseth fosterseth force-pushed the feat_wip_job_lifecycle branch 2 times, most recently from 0686062 to 227750a Compare February 3, 2021 16:45
@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

Various	points (e.g. created, running, processing events), are
structured into	json format and	output to /var/log/tower/job_lifecycle.log

As part	of this	work, the DependencyGraph is reworked to return
which job object is doing the blocking, rather than a boolean.
@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded.

@softwarefactory-project-zuul
Copy link
Contributor

Build succeeded (gate pipeline).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants