Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sched: Add pending job cancellation support #291

Merged
merged 6 commits into from Mar 22, 2018

Conversation

Projects
None yet
5 participants
@dongahn
Copy link
Contributor

dongahn commented Mar 20, 2018

This adds support for cancelling pending jobs in response to the flux wreck cancel command. (Please see flux-core Issue #1359).

Note that this PR requires flux-core PR #1374 to be merged into flux-core first.

Two more items:

  • add errno support when resource loading fails. flux-core Issue #1361.

  • add some log print statements which can be used to help identify the performance/scalability bottleneck of resource loading. flux-core Issue #1363.

Maybe, @SteVwonder and/or @grondo can review this?

@dongahn dongahn force-pushed the dongahn:track_cancel branch 2 times, most recently from 048b75f to 842b8e5 Mar 20, 2018

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Mar 20, 2018

@dongahn, looks like this PR needs to be rebased on flux-sched/master. I'll try to review the PR soon (and give it a spin on my desktop or a production machine)

@dongahn dongahn force-pushed the dongahn:track_cancel branch from 842b8e5 to 7ef27b3 Mar 20, 2018

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Mar 20, 2018

@grondo: Thanks. Odd, though. I actually freshly created a fork from the flux-sched/master but still,
the latest pull commit (a4dae4) is missing from this PR.

Rebased and forced a push.

@coveralls

This comment has been minimized.

Copy link

coveralls commented Mar 20, 2018

Coverage Status

Coverage decreased (-0.08%) to 75.883% when pulling 1898ee5 on dongahn:track_cancel into a4dae42 on flux-framework:master.

@dongahn dongahn force-pushed the dongahn:track_cancel branch from 7ef27b3 to b633dba Mar 21, 2018

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Mar 21, 2018

Codecov Report

Merging #291 into master will decrease coverage by 0.05%.
The diff coverage is 65.38%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #291      +/-   ##
==========================================
- Coverage   74.19%   74.13%   -0.06%     
==========================================
  Files          49       49              
  Lines        9176     9257      +81     
==========================================
+ Hits         6808     6863      +55     
- Misses       2368     2394      +26
Impacted Files Coverage Δ
resource/dfu_traverse_impl.hpp 100% <100%> (ø) ⬆️
resource/scoring_api.hpp 89.72% <100%> (ø) ⬆️
sched/sched.c 73.77% <59.3%> (-0.68%) ⬇️
sched/flux-waitjob.c 85.12% <92.85%> (+0.3%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a4dae42...1898ee5. Read the comment docs.

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Mar 21, 2018

Hmmm I think the new reap support in sched is causing a slight coverage difference. But there isn't an easy way to test this... (until reap is implemented in flux-core).

I can remove the code to increase the coverage but I am not convinved that it is a good reason to remove code.

q_move_to_cqueue (ctx, job);
} else {
/* free resource here if reap is not enabled */
q_rm_from_pqueue (ctx, job);

This comment has been minimized.

@SteVwonder

SteVwonder Mar 21, 2018

Member

At this point, I believe the job is actually in the rqueue not the pqueue, and thus I think the function call should be q_rm_from_rqueue.

The q_move_to_cqueue function, which is called in the if block above, pulls jobs out of the rqueue and moves them into the cqueue.

This comment has been minimized.

@dongahn

dongahn Mar 21, 2018

Author Contributor

Great catch! Will fix it.

flux_log (h, LOG_ERR, "job state is %d.", job->state);
goto error;
}
zlist_remove (ctx->p_queue, job);

This comment has been minimized.

@SteVwonder

SteVwonder Mar 21, 2018

Member

This should be replaced with a call to q_rm_from_pqueue, which will also set the pq_state to true automatically.

This comment has been minimized.

@dongahn

dongahn Mar 21, 2018

Author Contributor

Yes another good catch! Removing a job from a pending queue should be a scheduleable event. Will fix.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Mar 21, 2018

My other recommendation is to take the chunk of code below that is copied and pasted several times in the scheduler state machine (whenever a job is reaped, or if reaping is disabled, when the job is cancelled) and replace it with a flux_lwj_t deconstructor, which can later be expanded to include other fields in the flux_lwj_t struct like user, account, and resrc_tree.

            if (job->req)
                free (job->req);
            key = xasprintf ("%"PRId64"", job->lwj_id);
            zhash_delete (ctx->job_index, key);
            free (key);
            free (job);
@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Mar 21, 2018

Yes, thought was to do it this way and refactor later. But I will have to fix other issues, I'd be happy to look into this as well.

Thanks @SteVwonder.

@grondo
Copy link
Contributor

grondo left a comment

Ran through the tests here and everything seems to be working. Just a few inline comments.

goto error;
} else if (job->state != J_SCHEDREQ) {
errno = EINVAL;
flux_log (h, LOG_ERR, "job state is %d.", job->state);

This comment has been minimized.

@grondo

grondo Mar 21, 2018

Contributor

I'm not sure these calls to flux_log (LOG_ERR) belong here. The real error is being returned to the caller via the appropriate errno, and it is not a "system error" so probably shouldn't go the system error log (if that makes sense). If you want these events to be logged, I'd suggest maybe LOG_DEBUG level messages like "Attempt to cancel [nonexistent|state=STATE] job", perhaps including the uid that made the request.

This comment has been minimized.

@dongahn

dongahn Mar 21, 2018

Author Contributor

How do you get the uid of the request?

This comment has been minimized.

@grondo

grondo Mar 21, 2018

Contributor

int flux_msg_get_userid (const flux_msg_t *msg, uint32_t *userid);

flux_log (ctx->h, LOG_INFO, "pending job (%"PRId64") removed.", jobid);
if (flux_respond_pack (h, msg, "{ }") < 0)
flux_log_error (h, "%s", __FUNCTION__);
return;

This comment has been minimized.

@grondo

grondo Mar 21, 2018

Contributor

Since you are already responding with an empty message here, it might be useful to include the jobid in the response, which might make handling asynchronous responses easier on the sched.cancel caller, e.g.

  if (flux_respond_pack (h, msg, "{s:I}", "jobid", jobid) < 0)
test ${state} = "complete" &&
flux module remove sched
'

This comment has been minimized.

@grondo

grondo Mar 21, 2018

Contributor

Should a test for cancel of non-existent jobid be added here as well to test the ENOENT case above?

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Mar 21, 2018

@grondo: Sound good across the board!. I will address those.

dongahn added some commits Mar 17, 2018

resource: fix compiler-caught integer comparison issues
Fix comparisons between integers of different types
as GCC 4.9.3 on a TOSS 3 system failed to compile the code
sched: Add sched.cancel support
Add support for cancelling pending jobs in response to
sched.cancel requests.

@dongahn dongahn force-pushed the dongahn:track_cancel branch from b633dba to 1898ee5 Mar 22, 2018

@dongahn dongahn referenced this pull request Mar 22, 2018

Closed

Refactor flux_lwj_t code #293

@dongahn

This comment has been minimized.

Copy link
Contributor Author

dongahn commented Mar 22, 2018

OK. Forced a push, I think this is ready to go in. @SteVwonder, I decided not to do flux_lwj_t refactoring prematurely at this round. I've created a new issue (#293) so we can revisit it later.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Mar 22, 2018

👍 LGTM. Thanks Dong!

@SteVwonder SteVwonder merged commit 822f739 into flux-framework:master Mar 22, 2018

2 of 4 checks passed

codecov/patch 65.38% of diff hit (target 74.19%)
Details
codecov/project 74.13% (-0.06%) compared to a4dae42
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
coverage/coveralls Coverage decreased (-0.08%) to 75.883%
Details

@grondo grondo referenced this pull request May 11, 2018

Closed

Need 0.5.0 Release #340

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.