Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mgr/dashboard: Allow disabling redirection on standby Dashboards #29088

Merged
merged 1 commit into from Aug 2, 2019

Conversation

@votdev
Copy link
Contributor

commented Jul 17, 2019

SSL-enabled dashboard does not play nicely with a frontend HAproxy. To fix that issue there are two new configuration options:

  1. Disable redirection on standby managers. A HTTP error (500) will be returned instead of a redirection.
    ceph config set mgr mgr/dashboard/standby_behaviour "error"

  2. Configure the HTTP error status code.
    ceph config set mgr mgr/dashboard/standby_error_status_code 503

Fixes: https://tracker.ceph.com/issues/24662

Signed-off-by: Volker Theile vtheile@suse.com

  • References tracker ticket
  • Updates documentation if necessary
  • Includes tests for new functionality or reproducer for bug

@votdev votdev requested a review from rjfd Jul 17, 2019

@votdev votdev force-pushed the votdev:issue_24662 branch 2 times, most recently from cb3b8d5 to f08cea7 Jul 17, 2019

@rjfd

This comment has been minimized.

Copy link
Contributor

commented Jul 17, 2019

@votdev can you also add to the ceph-dashboard documentation the instructions to configure HAproxy, with an example of the HAproxy configuration, and using the commands implemented in this PR?

@votdev votdev marked this pull request as ready for review Jul 18, 2019

@votdev votdev force-pushed the votdev:issue_24662 branch from f08cea7 to 483e283 Jul 18, 2019

@votdev votdev force-pushed the votdev:issue_24662 branch from 483e283 to 8d7802b Jul 18, 2019

@sebastian-philipp

This comment has been minimized.

Copy link
Member

commented Jul 18, 2019

Relates to rook/rook#3076

@sebastian-philipp sebastian-philipp referenced this pull request Jul 18, 2019

Open

Allow configuration of two Ceph mgr daemons #3076

0 of 5 tasks complete
@epuertat
Copy link
Contributor

left a comment

That's nice work! Just tested in my environment and tried a few improvements. My comments left there.

@@ -20,6 +20,8 @@ class Options(object):
"""
ENABLE_BROWSABLE_API = (True, bool)
REST_REQUESTS_TIMEOUT = (45, int)
STANDBY_REDIRECT_ENABLED = (True, bool)
STANDBY_REDIRECT_URI = ('', str)

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 18, 2019

Contributor

Is it not better to keep this to MODULE_OPTIONS as Options()? Settings was ok, but it hasn't been updated since a while, so, for example, the type is not enforced by the Mgr, but only within dashboard. And you cannot add description or other fields.

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

You are right, that's much better. Thus it is possible to configure the behavior per node (which is useless finally, but who knows, users are imaginative :-)).

This comment has been minimized.

Copy link
@rjfd

rjfd Jul 19, 2019

Contributor

Settings was ok, but it hasn't been updated since a while

Maybe we should revisit it and improve the settings module. (Not in this PR of course)

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

I like the new approach using MODULE_OPTIONS because it can be configured via common Ceph commands.

return template.format(delay=5)
# Return 'Not Found', so proxies can mark this
# node as 'DOWN'.
raise cherrypy.HTTPError(404)

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 18, 2019

Contributor

Don't think 4xx is ok here. 4xx errors are client side and cacheable:

The 4xx (Client Error) class of status code indicates that the client
seems to have erred. Except when responding to a HEAD request, the
server SHOULD send a representation containing an explanation of the
error situation, and whether it is a temporary or permanent
condition. These status codes are applicable to any request method.
User agents SHOULD display any included representation to the user.

I think a 5xx would be the proper way to behave here:

Suggested change
raise cherrypy.HTTPError(404)
raise cherrypy.HTTPError(500, message="Keep on looking")

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

Done

http-check expect status 200
server x <HOST>:<PORT> check-ssl check verify none
server y <HOST>:<PORT> check-ssl check verify none
server z <HOST>:<PORT> check-ssl check verify none

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 18, 2019

Contributor

Taking your example as input and modifying it:

backend dashboard_back_ssl
    mode tcp
    balance roundrobin
    option httpchk GET /
    http-check expect status 200
    server x 127.0.0.1:11000 check-ssl check verify none
    server y 127.0.0.1:13000 check-ssl check verify none
    server z 127.0.0.1:15000 check-ssl check verify none

And changing in dashboard the response from 300 to a 500, it works:

curl -vk https://172.20.0.2:443
# OK

# Let's force fail-over
bin/ceph mgr fail x

# HAproxy detects failure and new avail node and switches backend
[WARNING] 198/174415 (8148) : Server dashboard_back_ssl/y is UP, reason: Layer7 check passed, code: 200, info: "HTTP status check returned code <3C>200<3E>", check duration: 14ms. 2 active and 0 backup servers online. 0 sessions requeued, 0 total in queue.

curl -vk https://172.20.0.2:443
# OK

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

I think the balance algorithm doesn't care in our scenario because only one node is active, so HAProxy can only use this and does not have to select one out of multiple nodes where the balance algorithm takes action.

I will remove the balance and stick-table lines from the example config, they do not make sense in our special scenario.

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 19, 2019

Contributor

Well, don't worry, the available checked nodes prevail over everything else, so the roundrobin with a single available dashboard is the same as no balancing.

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 19, 2019

Contributor

BTW, I'm wondering if:

    http-check expect rstatus ^2

As technically REST APIs may return other 20x codes (201, 202, 204) when creating or deleting resources.

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

But in this case the health check does not check the REST API, but the WebUI which should return a 200 if it works.

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 19, 2019

Contributor

You're completely right! As we are using SSL passthrough, HAProxy will never see the encrypted HTTP response codes.

if active_uri:
module.log.info("Redirecting to active '%s'", active_uri)
raise cherrypy.HTTPRedirect(active_uri)
if Settings.STANDBY_REDIRECT_ENABLED:

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 18, 2019

Contributor

I'd just a this new option to set the new HTTP code. If not defined, the legacy behaviour should be in place.

This comment has been minimized.

Copy link
@votdev

votdev Jul 19, 2019

Author Contributor

I'd just a this new option to set the new HTTP code.

??? Don't understand that :-)

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 19, 2019

Contributor

Sorry, I wasn't exactly brilliant there.

I just re-read the ceph tracker issue, and tried to understand which use-case we are trying to deal with here:

  • Legacy behaviour: ceph-dashboard with no proxy. Stand-by 3xx redirects to active dashboard. If for some reason that info is not available, it self-refreshes after 5 secs.
  • NEW: HAproxy: while this could potentially require dashboard to be aware of the virtual service address (VIP:VPT):
    1. We barely use explicit service addresses: perhaps only the 3xx Redirect to the active mgr, and the Grafana API URL.
    2. The 3xx can be made customizable, so dashboard returns the HTTP status code required for the HAProxy to mark it as unavailable. For example, 503 - Service Unavailable includes a Retry-After header that might be aligned with HAProxy check interval.
    3. While I think the above fulfills the use-case, for any edge case a network admin could work around that with DNS aliases (e.g.: an internal DNS resolves {x,y,z}.mgr resolve to the internal IP addresses, while the external one aliases those to the Reverse Proxy address).

So, that said, rather than adding 2 new settings and 2 new possible states (404 and custom url), why not just adding a new one (e.g. dashboard-standby-http-response: {300,400,500} or dashboard-standby-behaviour: {redirect,error}), by default 301/307 (whichever the Redirect is), and it can be optionally set to 404, 5xx, etc.

For planning dashboard integrations to 3rd parties I think it's best to stick to the existing use-cases than to fulfill all possible (but not current) use-cases (e.g.: general reverse proxy).

This comment has been minimized.

Copy link
@votdev

votdev Jul 22, 2019

Author Contributor

Done

@votdev votdev force-pushed the votdev:issue_24662 branch 4 times, most recently from 7113ef7 to c98bd04 Jul 19, 2019

@epuertat
Copy link
Contributor

left a comment

Looks great! I have a minor concern about the localized option for the error code, but everything else is perfect to me.

</html>
"""
return template.format(delay=5)
status = module.get_localized_module_option('standby_error_status_code', 500)

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 22, 2019

Contributor

Mmm, this is the kind of thing I would't allow to be localized. Don't you think it'd cause a tricky behaviour for some nodes to redirect with 3xx while others fail with 5xx?

This comment has been minimized.

Copy link
@votdev

votdev Jul 23, 2019

Author Contributor

Indeed, an admin is able to do weird things with this, but maybe it needs to be necessary in some cases. The get_localized_module_option falls back to the global setting if there is no localized one. The documentation never talks about the localized configuration capability, so we should be worried about weird usage. But IMO we should implement this feature using the localized API to be able to configure it the weird way if it's necessary.

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 23, 2019

Contributor

Mmm. Have to disagree here. Every new configuration option increases the degrees of freedom of a system, which involves a exponential increase in the end-to-end cases to be tested. The systems should be as flexible as strictly required and nothing more. In the end, this moves design/coding decision-making into customer's configuration management issues (a related paper from NetApp folks: "Hey, you have given me too many knobs").

Compare test cases:

  • 1 new option standby_behaviour=["redirect"|"error"]. 2 tests:
    1. standby_behaviour="redirect"
    2. standby_behaviour "error"

To:

  • 2 new localized options standby_behaviour=["redirect"|"error"] + standby_error_status_code=[300,500]' for 3-node mgr cluster. 3 ^ (1+2) = 27 tests:
    1. standby_behaviour[mgr.x]="redirect", standby_behaviour[mgr.y]="redirect", standby_behaviour[mgr.z]="redirect"
    2. standby_behaviour[mgr.x]="redirect", standby_behaviour[mgr.y]="redirect", standby_behaviour[mgr.z]="error" AND:
      1. standby_error_status_code[mgr.z]=300
      2. standby_error_status_code[mgr.z]=500
    3. standby_behaviour[mgr.x]="redirect", standby_behaviour[mgr.y]="error", standby_behaviour[mgr.z]="error" AND:
      1. standby_error_status_code[mgr.z]=300
        1. standby_error_status_code[mgr.y]=300
        2. standby_error_status_code[mgr.y]=500
      2. standby_error_status_code[mgr.z]=500
        ...

This comment has been minimized.

Copy link
@votdev

votdev Jul 24, 2019

Author Contributor

I can not share your opinion because i've had many situations in my developer life where such 'too much knobs' saved the life of my company out in the field at the customer. But i will adapt the code to get this PR merged soon.

This comment has been minimized.

Copy link
@epuertat

epuertat Jul 24, 2019

Contributor

That's great, @votdev, but can you please tell me how having different ceph-mgr instances returning for example: a 418 ("I'm a teapot"), 402 ("Payment required"), 505 ("HTTP version not supported") or 599 (??), all of them values that we accept here (and which we should never return even on error), are somehow going to help any customer?

I mean, we shouldn't do things "just in case". This PR has an issue ticket attached which clearly states the scope of changes: "support HAProxy with SSL-enabled dashboard". Can that be just handled with a single config option to choose between a 302 ("Found") or a 503 ("Service not available")? I think so. Then, IMHO everything else is added complexity with zero-value.

SSL-enabled dashboard does not play nicely with a frontend HAproxy. T…
…o fix that issue there are two new configuration options:

1. Disable redirection on standby managers. A HTTP error (500) will be returned instead of a redirection.
   $ ceph config set mgr mgr/dashboard/standby_behaviour "error"

2. Configure the HTTP error status code.
   $ ceph config set mgr mgr/dashboard/standby_error_status_code 503

Signed-off-by: Volker Theile <vtheile@suse.com>

@votdev votdev force-pushed the votdev:issue_24662 branch from c98bd04 to 5bb9869 Jul 24, 2019

@rjfd rjfd added the needs-qa label Aug 1, 2019

@callithea

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

QA run failed:
http://pulpito.ceph.com/laura-2019-08-01_21:23:14-rados:mgr-wip-lpaduano-testing-29088-29046-distro-basic-smithi/
due to:
Test failure: test_get (tasks.mgr.dashboard.test_mgr_module.MgrModuleTelemetryTest)
See: https://tracker.ceph.com/issues/41055

@callithea

This comment has been minimized.

Copy link
Member

commented Aug 2, 2019

As @rjfd mentioned in the stand-up all relevant dashboard related QA tests passed and the failure is not related to the PR itself, will remove the QA labels.

@rjfd rjfd merged commit 1fb9c64 into ceph:master Aug 2, 2019

5 of 6 checks passed

make check (arm64) make check failed
Details
Docs: build check OK - docs built
Details
Signed-off-by all commits in this PR are signed
Details
Unmodified Submodules submodules for project are unmodified
Details
ceph dashboard tests ceph dashboard tests succeeded
Details
make check make check succeeded
Details

@votdev votdev deleted the votdev:issue_24662 branch Aug 6, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants
You can’t perform that action at this time.