Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix buffer overflow when handling slot type in a jobspec #548

Merged
merged 4 commits into from Dec 5, 2019

Conversation

@milroy
Copy link
Member

milroy commented Nov 27, 2019

This PR integrates the patch @dongahn created to address the segfault @cmisale reported in issue #509.

  • Add edge evaluator public method that returns the qualified granules size

  • Add corresponding scoring API public method

  • Add constraint check against qualified granules

This PR should be merged before merging #546, so the flux-sched 0.8.0 release contains the fix.

@milroy milroy requested a review from dongahn Nov 27, 2019
@milroy milroy changed the title resource: fix buffer overflow from issue #509 fix buffer overflow when handling slot type in a jobspec Nov 27, 2019
@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Nov 27, 2019

Codecov Report

Merging #548 into master will increase coverage by 0.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #548      +/-   ##
==========================================
+ Coverage    75.8%   75.83%   +0.03%     
==========================================
  Files          70       70              
  Lines        6992     7001       +9     
==========================================
+ Hits         5300     5309       +9     
  Misses       1692     1692
Impacted Files Coverage Δ
resource/evaluators/edge_eval_api.hpp 80.76% <ø> (ø) ⬆️
resource/evaluators/scoring_api.hpp 78.94% <ø> (ø) ⬆️
resource/evaluators/scoring_api.cpp 70.08% <100%> (+1.05%) ⬆️
resource/evaluators/edge_eval_api.cpp 63.85% <100%> (+0.89%) ⬆️
resource/traversers/dfu_impl.cpp 85.15% <100%> (+0.1%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update af4447f...7818411. Read the comment docs.

@SteVwonder

This comment has been minimized.

Copy link
Member

SteVwonder commented Nov 27, 2019

Thanks @milroy for the patch. Do you mind adding a test for this edge case? Just want to make sure the fix doesn’t get reverted down the road. Based on Dong’s comment here (#509 (comment)), the maximum number of these jobspecs (#509 (comment)) schedulable per node is 2 since it is memory bound and there are only two memory chunks in the resource graph.

@milroy

This comment has been minimized.

Copy link
Member Author

milroy commented Nov 27, 2019

Thanks @milroy for the patch. Do you mind adding a test for this edge case? Just want to make sure the fix doesn’t get reverted down the road.

That's a good idea @SteVwonder. I'll get to work on a test.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 27, 2019

BTW, you can preserve authorship of someone else's work by committing it with the git commit --author= option.

@milroy milroy force-pushed the milroy:issue509-fix branch from 19cfa65 to 30b2384 Nov 27, 2019
@milroy

This comment has been minimized.

Copy link
Member Author

milroy commented Nov 27, 2019

BTW, you can preserve authorship of someone else's work by committing it with the git commit --author= option.

Thanks for the recommendation @grondo; I've amended the commit to correct the authorship.

@grondo

This comment has been minimized.

Copy link
Contributor

grondo commented Nov 27, 2019

Cool, I can help review and merge this once you have the test case written.

@milroy milroy requested a review from grondo Nov 27, 2019
@milroy

This comment has been minimized.

Copy link
Member Author

milroy commented Nov 27, 2019

I'll let you know, @grondo, when I have a completed test and the PR's ready for your review.

@cmisale

This comment has been minimized.

Copy link

cmisale commented Nov 30, 2019

I have tested the patch and it resolves the segmentation fault.
I am attaching the GRUG I generate using Kubernetes' information format about the underlying system, and a simple job yaml file I am using for testing, in case you want to try with kube-flux-like example.

runtimegrug.graphml.txt
job.yaml.txt

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 3, 2019

How did I miss this... @milroy: I believe I had some test cases for this. Let me look it up.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 4, 2019

I think this would be the best to merge before a new tag. Will locate my test cases today.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 4, 2019

@milroy: I now have a test case and will push my changes to your PR branch directly.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 4, 2019

@milroy: OK, I pushed my tests into this. Please review them. Once you are okay with that, I will merge this PR in.

@milroy

This comment has been minimized.

Copy link
Member Author

milroy commented Dec 5, 2019

Thanks @dongahn! I'm updating the branch with changes merged in PR #543.

@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 5, 2019

@milroy: Our workflow has been to rebase to "upstream/master" so that you won't get a new commit (0adeef4). I can post a new PR if you want.

Copying from Dong's comment:
We treat slot in a different way than any
other resource type because it is the only
non-physical resource type. How we handle
is this is : 1) perform a DFV subtree
walk for a slot type, 2) divide the
resource set discovered from this DFV
walk equally according to the slot shape,
3) check if the given number of slots can
be satisfied. There turns out to be a bug
in 2).

For example, if your jobspec is
socket[1]->slot[1]->core[2] and your
machine is socket[1]->core[8], then at the
end of the DFV walk on the subtree rooted
at a socket vertex, we will give you core: 8.
Then, we divide 8 by 2 which means that we
can have 4 slots. So clearly this satisfies
slot[2].

Now, there is an issue in the current code.
The code doesn't factor into account the
granularity in which a resource pool vertex
is modeled/constructed for this.

Say, each memory pool granularity is 64GB and
your machine has 4 of those. In terms of quantity,
your machines has socket[1]->memory[256].
But in terms of scheduleable units you only
have 4 (i.e., 4 x 64GB). Now if you jobspec
is socket[1]->memory[32], we will currently
say you have 8 slots when in fact you only
can fit 4 slots because of this granularity
constraint! This leads to a buffer overflow.

These changes fix the resulting segfault.
@milroy milroy force-pushed the milroy:issue509-fix branch from 0adeef4 to 1ea3bce Dec 5, 2019
dongahn added 3 commits Dec 4, 2019
Add them in preparation for the upcoming
t3017-resource-granule.t sharness test.
Add two subtests within t3017-resource-granule.t that checks
the correctness of matching when each memory resource vertice
is modeled as a pool (of size 2GBs).

Even though each jobspec requests 1GB of memory, because
the finest granularity of memory to be scheduled is 2GBs,
you can schedule only 4 jobspecs.
@dongahn

This comment has been minimized.

Copy link
Contributor

dongahn commented Dec 5, 2019

@milroy: Our workflow has been to rebase to "upstream/master" so that you won't get a new commit (0adeef4). I can post a new PR if you want.

@milroy and I corrected the issue in the current PR because it has all the history. I pushed the same test codes that @milroy already reviewed. So once Travis turns green, I will merge.

@dongahn dongahn merged commit 0e1c65e into flux-framework:master Dec 5, 2019
3 checks passed
3 checks passed
codecov/patch 100% of diff hit (target 75.8%)
Details
codecov/project 75.83% (+0.03%) compared to af4447f
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@milroy milroy mentioned this pull request Dec 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.