Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvs-watch: Fix race condition on getroot continuation #1878

Merged
merged 2 commits into from Dec 18, 2018

Conversation

Projects
None yet
3 participants
@chu11
Copy link
Contributor

chu11 commented Dec 17, 2018

In #1876, a race was found in which a namespace monitor was destroyed before a continuation was called, leading to a use after free error.

@garlick, I was never able to reproduce the segfault you show in the ticket. So I'm just assuming this use-after-free is what causes your segfault but just somehow doesn't show up for me. If you can, double check it solves the segfault.

@codecov-io

This comment has been minimized.

Copy link

codecov-io commented Dec 18, 2018

Codecov Report

Merging #1878 into master will decrease coverage by 0.02%.
The diff coverage is 92.85%.

@@            Coverage Diff            @@
##           master   #1878      +/-   ##
=========================================
- Coverage   80.12%   80.1%   -0.03%     
=========================================
  Files         197     197              
  Lines       35008   35015       +7     
=========================================
- Hits        28050   28048       -2     
- Misses       6958    6967       +9
Impacted Files Coverage Δ
src/modules/kvs-watch/kvs-watch.c 79.01% <92.85%> (+0.33%) ⬆️
src/common/libflux/response.c 81.48% <0%> (-1.24%) ⬇️
src/modules/connector-local/local.c 73.77% <0%> (-1.04%) ⬇️
src/modules/kvs/kvs.c 66.35% <0%> (-0.15%) ⬇️
src/common/libflux/message.c 81.51% <0%> (+0.24%) ⬆️
@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

Destroying the getroot future when a response is pending is going to leak a matchtag. It's not the worst thing in the world, but one can imagine an unlikely scenario where this repeats over and over and the handle eventually runs out.

Would it be better to add a usecount to the namespace, and have it be +1 while the getroot is pending?

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2018

Would it be better to add a usecount to the namespace, and have it be +1 while the getroot is pending?

We can certainly do that. I believe this destroy future pattern is also used with lookups. If a cancel comes in, and responses are pending (on the lookups list), will that leak matchtags too?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

If a cancel comes in, and responses are pending (on the lookups list), will that leak matchtags too?

Yeah I guess so - anytime an RPC is outstanding and the future is destroyed before it is fulfilled, then the matchtag is not freed back to the pool, because it could be assigned to another RPC, and then the two responses could get confused.

Apologies for missing that earlier.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2018

I don't think this is too hard to solve by using some reference counts and some flags (i.e. a flag if a namespace is deleted). Should we make this a separate issue?

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

I'd lean towards solving it that way here, but however you prefer is OK with me.

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2018

Thinking about it a bit, I'll redo the getroot case in this PR to do it correctly and fix this issue, but I'll issue a new PR for the lookup case.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

Sounds good.

@chu11 chu11 force-pushed the chu11:issue1876 branch from ad49af6 to 1dd8b0c Dec 18, 2018

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2018

the fix was more obvious than I realized. Just don't zhash_delete() if the getroot has not completed. zhash_delete() in the getroot continuation.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

Great!

Does this comment need to be updated now that the continuation is called prior to namespace being destroyed? (or am I misunderstanding?)

        /* store future in namespace, so that if namespace is
         * destroyed, continuation will not be called */
modules/kvs-watch: Fix race on getroot call
Fix race condition in which a watcher could be destroyed before
the initial getroot call is completed.  In this situation, a
namespace being monitored could be destroyed and then used
afterwards.

Fixes #1876

@chu11 chu11 force-pushed the chu11:issue1876 branch from 1dd8b0c to 9c98bec Dec 18, 2018

@chu11

This comment has been minimized.

Copy link
Contributor Author

chu11 commented Dec 18, 2018

ahh good catch, pushed with a tweak.

@garlick

This comment has been minimized.

Copy link
Member

garlick commented Dec 18, 2018

Restarted one builder that seems to have failed in early setup (before flux was involved).

@garlick garlick merged commit 04de39e into flux-framework:master Dec 18, 2018

3 checks passed

codecov/patch 92.85% of diff hit (target 80.12%)
Details
codecov/project Absolute coverage decreased by -0.02% but relative coverage increased by +12.73% compared to d462fe4
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.