Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disable live module, require JSON payloads to be JSON objects #640

Merged
merged 20 commits into from Apr 13, 2016

Conversation

garlick
Copy link
Member

@garlick garlick commented Apr 12, 2016

The main point of this PR is to comment out the live module in rc1, rc3 as proposed in #638. Note that flux up still works, as it already contained fallback code to return all nodes up if the KVS key containing the live module's monitoring results is not found.

Not loading the live module broke some tests which send ping requests to the live module. When updating those tests to ping the broker rather than a module, I hit the segfault discussed in #181, where a test was attempting to elicit EPROTO by sending a json array as payload to a ping request. This would trigger a segfault in shortjson.h without workaround code that was added to the module ping handler but not the broker one.

It seemed like a good idea to tie up this loose end, so the few places where arrays were being sent as payloads were fixed to encapsulate those arrays in JSON objects, and the low level message payload get/set functions were updated to return EPROTO/EINVAL if the payload does not begin with the { character.

Some minor RFC updates will follow.

@garlick garlick added the review label Apr 12, 2016
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.08%) to 74.36% when pulling 8545943 on garlick:liveness into bc9221c on flux-framework:master.

@garlick
Copy link
Member Author

garlick commented Apr 12, 2016

See flux-framework/rfc#40

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.04%) to 74.402% when pulling 5d4a90d on garlick:liveness into bc9221c on flux-framework:master.

@garlick
Copy link
Member Author

garlick commented Apr 12, 2016

Note that flux up still works, as it already contained fallback code to return all nodes up if the KVS key containing the live module's monitoring results is not found.

Oops, something not working right here. Fix coming.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.06%) to 74.382% when pulling 4c2d6ff on garlick:liveness into bc9221c on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented Apr 12, 2016

Great, looks good! Might be cool to add a fuzz test that sends garbage or noise as rpc and ensure you get EPROTO/EINVAL. If you already added that, sorry that I missed it.

@garlick
Copy link
Member Author

garlick commented Apr 12, 2016

True, OK, let me see if I can come up with something.

@garlick
Copy link
Member Author

garlick commented Apr 13, 2016

Given that messages are built with constructors that enforce the internal structure, it's a bit hard to make a fuzz test that's useful here. I think actually I'll just add simple tests for non-conformant JSON payloads.

Fuzzing the local:// connector might be useful to find bugs in the czmq message encode/decode functions, but that seems like an activity for another time.

Before I scrap the fast "AES counter mode seeded by /dev/urandom" utility function that I added (borrowed from scrub) to generate random blobs in support of this though, maybe I'd better check with @grondo to see if he had other ideas?

@grondo
Copy link
Contributor

grondo commented Apr 13, 2016

Given that messages are built with constructors that enforce the internal structure, it's a bit hard to make a fuzz test that's useful here.

I think manually constructing some malformed JSON is probably fine at this point. I guess I was mainly thinking about bugs sending truncated objects (starts with a { but otherwise not a valid json object) or eventually to protect against intentional misuse.

Could you build such a test using lower level calls? I guess I'd say it isn't worth too much time right now, as long as we are ttesting the case being fixed (valid JSON but not a valid object) and perhaps a couple other corner cases. Eventually I suppose there could be entire hardening testsuite just on this subject, applied to all services...

Encapsulate what was a bare JSON array in the lsmod response
in a JSON object.
Encapsulate what was a bare JSON array in the live.hello
response in a JSON object.
When setting a JSON payload, return EINVAL if the payload
does not begin with '{'.

When getting a JSON payload, return EPROTO if the payload
does nto begin with '{'.

Fixes flux-framework#181.
When attempting to send an illegal JSON payload, the
encoding will fail with EINVAL, and the message will
not be sent.  If it were sent, decoding would get EPROTO,
but we no longer allow things to get that far.
As discussed in flux-framework#638, the live module
- cannot function without pgm event distribution (default: off)
- cannot improve reliability without higher level "shrink" operation
Therefore, let's disable it by default.
Remove residual code from API conversion that was doing
an extra JSON encode/decode when generating the lsmod response
message.
If live module has not updated the KVS, list nodes as up
rather than "unknown".
Drop the FOREACH_ZHASH_KEYS macro, as it has no users, and
inherently leaks memory since zhash_keys() allocates a list
of keys without retaining a pointer to it.
In verlay_lspeer_encode() and overlay_mcast_child(), use
the FOREACH_ZHASH iterestor rather than allocate a list
of keys on each call.
Like the TBON, send keepalives on the ring overlay when
they are idle.

Track idle time in the overlay heartbeat handler and
log peers (at LOG_CRIT) that are idle for >= 3 heartbeats.
We see this message routinely, so eliminate redundant info.
Verify that flux_msg_set_payload_json() fails with EINVAL,
and flux_msg_get_payload_json() fails with EPROTO if payload
is not a JSON object.
@garlick
Copy link
Member Author

garlick commented Apr 13, 2016

OK, I just rebased and added some simple tests for the low level message functions for payload get/set.

Also, now that live is not being loaded by default, I added some LOG_CRIT messages that pop up if a TBON child (or upstream ring peer) doesn't send a keeplive for 3 or more heartbeats. And since the ring overlay wasn't sending keepalives, I started sending them.

There's some other minor cleanup as well.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 74.357% when pulling 350c8ad on garlick:liveness into b06b8d3 on flux-framework:master.

@grondo
Copy link
Contributor

grondo commented Apr 13, 2016

Great, is this one ready now?

@garlick
Copy link
Member Author

garlick commented Apr 13, 2016

I think so.

@grondo grondo merged commit 60c84fb into flux-framework:master Apr 13, 2016
@grondo grondo removed the review label Apr 13, 2016
@garlick garlick deleted the liveness branch April 13, 2016 17:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants