Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue337 #362

Merged
merged 3 commits into from
Jul 18, 2018
Merged

Issue337 #362

merged 3 commits into from
Jul 18, 2018

Conversation

tpatki
Copy link
Member

@tpatki tpatki commented Jul 13, 2018

Added request/reponse functionality to update scheduler parameters at runtime. Added a test case for the same. Needs the flux wreck sched-params get/set subcommand to be present in flux-core (separate PR).
Also fixes #364.

@dongahn
Copy link
Member

dongahn commented Jul 13, 2018

@tpatki: great start!

It seems a few minor changes can help me review your PR. You may want to:

  1. Put all the C code changes within sched.c into the first commit
  2. Put all the tests into the second commit;
  3. And remove intermediate commits (you may find git rebase's squash useful)

Also for commit messages. The convention has been:

  • Put a particular component name that the commit affects as "component:" in the commit title
  • Then, have a short summary. Keep it short so that the fully title will show up full in the PR with no elision. "sched: Add support to update sched params at runtime". "test: Add cases for ..."
  • Add some details in the body of the commit. Typically, one uses imperative like "Add request handers for..." instead of "This adds request handers for..." in here. But we can be flexible in the commit body message format.

@dongahn
Copy link
Member

dongahn commented Jul 13, 2018

@tpatki: Sorry I just updated my comments above.

Copy link
Member

@SteVwonder SteVwonder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tpatki! Generally LGTM, just minor nits on the tests.

test_expect_success 'sched-params: getting sched params at runtime works' '
adjust_session_info 4 &&
flux hwloc reload ${excl_4N4B} &&
flux module load sched sched-once=true &&
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend explicitly setting the delay-sched and queue-depth arguments on load so that if the defaults ever change, that doesn't affect this test.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SteVwonder: Will do.

flux wreck sched-params set queue-depth=4 &&
timed_sync_wait_job 10 &&
verify_1N_nproc_sleep_jobs ${excl_4N4B_nc}
'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend adding a flux wreck sched-params get queue-depth here to ensure that the value was actually set correctly. Right now, I believe this test just ensures that Flux doesn't crash when you call flux wreck sched-params get.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you can avoid writing to a file too by just using test flux wreck sched-params get queue-depth -eq 4

Copy link
Member Author

@tpatki tpatki Jul 13, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, I only added support for flux wreck sched-params get, which returns all the scheduler parameters.

I didn't add support for flux wreck sched-params get queue-depth or flux wreck sched-params get queue-depth,delay-sched. This would require me to introduce redundant code (similar to sched_params_args, but one that "gets" instead of "sets"), and I didn't want to do that. I think this is fine as a first cut, let me know if you think otherwise.

To support something like flux wreck sched-params get queue-depth,delay-sched, we will
(1) Either have to change sched_params_args to have a flag for set/get and have the latter return the sched_params_t object, which can then be packed and sent back to the requester, or,
(2)Add similiar parsing code in either the subcommand in wreck or the callback in sched.c

I was trying to avoid both of those things.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would require me to introduce redundant code (similar to sched_params_args, but one that "gets" instead of "sets"), and I didn't want to do that. I think this is fine as a first cut, let me know if you think otherwise.

This makes sense. Thanks @tpatki!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use flux wreck sched-params get | grep queue-depth to simulate a "get" of a single value? It would be nice to verify the set worked in the tests.

sched/sched.c Outdated

flux_log (h, LOG_DEBUG, "return values of sched_params ");

msg2 = flux_event_encode ("sched.params.returned", NULL);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially, it seems like the only client that cares about a parameter query being answered is the client that made the original query. And that client is about to get an RPC response, so this event seems redundant.

Can you elaborate on the use-cases that you had in mind for this event?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a bit confused about this myself, and was following the convention used by the existing include/exclude/cancel callbacks.

This event (msg2) is used in the flux_send call that goes out to the broker, which is different than the flux_respond_pack which contains the data requested by the client. I wasn't sure why the other callbacks had it, and if the response to the client was different than the information sent out to the broker. Maybe @dongahn or @grondo have some clarity here?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of exclude and include request, an extra event needs to be generated so that the scheduler will get an resource event and invoke a scheduler loop. For this particular case, we probably won' t need an event. Sorry if it wasn't obvious in the code. (Should have had a comment.)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that makes sense. Thanks Dong! Yes it wasn't clear to me, and didn't want to break anything or cause a race condition. I'll remove the event/flux_send from both the callbacks I added (and add a comment to capture the above discussion as well).

@tpatki
Copy link
Member Author

tpatki commented Jul 13, 2018

@dongahn: Shall fix the PR per your suggestions soon (need to head out for a bit but will revisit later today evening).

@dongahn
Copy link
Member

dongahn commented Jul 13, 2018

Thanks @tpatki

@coveralls
Copy link

coveralls commented Jul 13, 2018

Coverage Status

Coverage decreased (-0.01%) to 75.408% when pulling 58de2de on tpatki:issue337 into 371ae9d on flux-framework:master.

@tpatki
Copy link
Member Author

tpatki commented Jul 13, 2018

@dongahn, @SteVwonder: I submitted a separate PR to flux-core for the new flux wreck sched-params subcommand, please take a look at that as well.

@codecov-io
Copy link

codecov-io commented Jul 14, 2018

Codecov Report

Merging #362 into master will decrease coverage by <.01%.
The diff coverage is 77.19%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #362      +/-   ##
==========================================
- Coverage   73.96%   73.96%   -0.01%     
==========================================
  Files          61       61              
  Lines       10211    10263      +52     
==========================================
+ Hits         7553     7591      +38     
- Misses       2658     2672      +14
Impacted Files Coverage Δ
sched/sched.c 73.4% <77.19%> (-0.02%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 371ae9d...58de2de. Read the comment docs.

@tpatki
Copy link
Member Author

tpatki commented Jul 14, 2018

@dongahn: updated as per your suggestion. Let me know if I missed anything. Thanks!

sched/sched.c Outdated

if (sprms)
rc = sched_params_args (sprms, &(ctx->arg.s_params));

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think calling sched_params_args () alone will lead to scheduling behavior changes for all parameters. queue-depth would be okay. But delay-sched would not. If the sched load option was delay-sched=false, and the user executes flux wreck sched-param set delay-sched=true to change it. This alone will not change the scheduler to start delay scheduling.

Generally speaking, as we will add more parameters, we can't expect only changing a field in the sched context will effect the requested behavior.

I think this can be addressed if we refactor adjust_for_sched_params () into effect_sched_params () Here, you can have a logic to dynamically adjust the behavior based on the given scheduling parameters.

I have to believe you can make it so that the function will work for both load time options as well as dynamically requested options.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted, I will work on fixing this and also on testing if the change took effect. I was thinking about this as well, thanks for the pointers!

sched/sched.c Outdated
if (flux_msg_get_userid (msg, &userid) < 0)
goto error;

flux_log (h, LOG_INFO, "sched_params change requested by user at runtime (%u).", userid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nit: Maybe (%u) can be next to user?

sched/sched.c Outdated
goto error;

flux_log (h, LOG_INFO, "sched_param values requested by user (%u).", userid);
if (flux_request_unpack (msg, NULL, "{}") < 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you want to have bulk get AND individual get?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only bulk get for now, lets discuss individual get in person. Take a look at the comment I added to Stephen's question above regarding that. I'm not sure how to avoid duplicate code there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. OK. Yeah I'm ok with bulk get only since this will have be refactored later anyway within flux sched command. Thanks @tpatki.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also think about how to refactor sched_params_args so we don't write duplicate code for the individual get. I'll add an issue corresponding to this once this PR is done.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dongahn, @SteVwonder: PRs now support individual get as well. Please see test cases.

#set -x

test_description='Test dynamic updation of queue depth.
First cut. Ensure jobs are correctly scheduled under different values of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove "First cut."

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor nit. I don't think you need "Frist cut" here.

timed_sync_wait_job 10 &&
verify_1N_nproc_sleep_jobs ${excl_4N4B_nc}
'
test_expect_success 'sched-params: unloaded sched module' '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great you have added many tests! You think it will be possible to test whether "set" has led to scheduler behavior changes as well?

@dongahn
Copy link
Member

dongahn commented Jul 14, 2018

@tpatki: Thank you for putting this together so quickly. I have just added some comments inline.

@tpatki
Copy link
Member Author

tpatki commented Jul 17, 2018

@dongahn, @SteVwonder, @grondo:

Please see the modified PR that addresses the changes requested. I have added a few test cases, and have manually verified each of the cases for adjust_for_sched_params through logging/stepping, but I can't ensure the correctness of delay_sched (I can only match existing workflow). If you can think of better test cases, please let me know.

@SteVwonder
Copy link
Member

LGTM. It just needs a rebase to master. Thanks, @tpatki! This was more work than I thought it would be, but this should come in handy.

@tpatki
Copy link
Member Author

tpatki commented Jul 17, 2018

@SteVwonder: rebased.

@dongahn
Copy link
Member

dongahn commented Jul 17, 2018

@tpatki and I looked at the code together a bit after our meeting this morning. There will be some minor changes.

@tpatki
Copy link
Member Author

tpatki commented Jul 17, 2018

@dongahn: updated as per our discussion.

sched/sched.c Outdated

/* If one of the watchers is not available, send error */
if ((!ctx->before && ctx->after) || (!ctx->after && ctx->before)) {
rc = EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for following up. I would actually go with:

rc = -1;
errno = EINVAL;

sched/sched.c Outdated
}
/* If one of the watchers is not available, send error */
if ((!ctx->before && ctx->after) || (!ctx->after && ctx->before)) {
rc = EINVAL;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for following up. I would actually go with:

rc = -1;
errno = EINVAL;

sched/sched.c Outdated
if (flux_msg_get_userid (msg, &userid) < 0)
goto error;

flux_log (h, LOG_INFO, "sched_params change requested by user (%u) at runtime.", userid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line seems long. Does it fit 80 characters? If not, I'd suggest to break the line into two.

sched/sched.c Outdated
if (flux_msg_get_userid (msg, &userid) < 0)
goto error;

flux_log (h, LOG_INFO, "sched_param values requested by user (%u).", userid);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as above.

@tpatki
Copy link
Member Author

tpatki commented Jul 17, 2018

@dongahn, @SteVwonder: Looks like my testcase fails with certain gcc/clang versions, not sure why. Will debug further in the evening/tmrw. Any thoughts on why this may be happening?

#set -x

test_description='Test dynamic updation of queue depth.
First cut. Ensure jobs are correctly scheduled under different values of
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very minor nit. I don't think you need "Frist cut" here.

echo ${excl_4N4B_nc}
'

test_expect_success 'sched-params: getting sched params (no args) at runtime works' '
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any way we can try to shorten the test messages here and below to fit into 80 characters. I don't think it is required but it would be generally good for readability.

@dongahn
Copy link
Member

dongahn commented Jul 17, 2018

Looks like my testcase fails with certain gcc/clang versions, not sure why. Will debug further in the evening/tmrw.

Sounds good. I also added some minor comments above. I believe we are close. Thanks!

@dongahn
Copy link
Member

dongahn commented Jul 17, 2018 via email

@tpatki
Copy link
Member Author

tpatki commented Jul 18, 2018

@dongahn, @SteVwonder, @grondo:

I think this is ready to go, please let me know if I missed anything. It also fixes #364, I had to add "-libevent-dev" to .travis.yml in sched after the pmix update in flux-core.

@dongahn
Copy link
Member

dongahn commented Jul 18, 2018

@tpatki: This looks great. Merging.

@dongahn dongahn merged commit e90be79 into flux-framework:master Jul 18, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sched's Travis build fails after pmix update
6 participants