New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Druid Doctor #8672
Druid Doctor #8672
Conversation
I am somewhat illiterate when it comes to modern UI frameworks and practices, so couldn't review the code but 👍 |
@himanshug you (and anyone) can always help by providing me with more trouble shooting tests to encode! |
@vogievetsky can we adjust the color scheme to more accurately reflect the severity of an issue and minimize false scares? |
@fjy are you reacting to something? Adjust if from what to what? Right now suggestions will show up in a neutral (dark) color and issues will show up in blueprint's warning style. You will get a red error if the tests early terminated due to an unrecoverable state (like in the screenshot I posted where the cluster is not even running). |
This is great, but as a followup It would be great if most of this is implemented on the router node with an endpoint , so web console would just call the endpoint to get results of checkup. advantage of that approach: same code at router could enable periodic checkups in background and automatically generate alerts for the user for many bad states proactively e.g. historical capacity is 90% utilized, more than x tasks are in pending state etc etc ... kind of things that we manually setup alerts for as operator of Druid cluster using other mechanisms. Building it inside Druid would help any user get a lot of that for free. Not saying that we build the "alert" mechanism itself (e.g. sending mail etc) .. but use existing emitter based Alert mechanism. makes sense ? |
@himanshug I 100% agree and I know @gianm will agree with you also. In fact @gianm was advocating for the checks should live in Java and I was like: "do I look like James Gosling to you?" But in all seriousness I think you should look at this PR as a working prototype. I want to verify/prove that a 1 button troubleshooting wizard would be a useful addition to the community. I also wanted to have a platform to go around the Druid devs and support ppl that I know to solicit check from them. I think if this proves to be useful then the next step would be to move the majority of the tests into Java and expose them as an endpoint. I would make an issue to track that as soon as this PR is merged. |
Can I suggest using something like
https://metrics.dropwizard.io/3.1.0/manual/healthchecks/?
…On Tue, Oct 15, 2019 at 6:47 PM Vadim Ogievetsky ***@***.***> wrote:
@himanshug <https://github.com/himanshug> I 100% agree and I know @gianm
<https://github.com/gianm> will agree with you also. In fact @gianm
<https://github.com/gianm> was advocating for the checks should live in
Java and I was like: "do I look like James Gosling to you?"
But in all seriousness I think you should look at this PR as a working
prototype. I want to verify/prove that a 1 button troubleshooting wizard
would be a useful addition to the community. I also wanted to have a
platform to go around the Druid devs and support ppl that I know to solicit
check from them.
I think if this proves to be useful then the next step would be to move
the majority of the tests into Java and expose them as an endpoint. I would
make an issue to track that as soon as this PR is merged.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#8672>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAPSYCSFGRL2UJIRLB5I2VTQOZXDBANCNFSM4JAWQPHA>
.
--
Jad Naous
Imply | VP R&D
650-521-3425
jad.naous@imply.io
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the concept but have some message and behavior suggestions. I didn't review everything, but I looked at the first couple chunks of tests. If people like the vibe I am going for with these comments, then some similar ones might apply to the other chunks of tests.
terminateChecks: () => { | ||
if (!this.mounted) return; | ||
this.setState({ | ||
earlyTermination: `${check.name} early terminated the check suite`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line made the screenshot confusing to me (I had a moment like, what does "Verify own status early terminated" mean? We want to verify that the status has early terminated?).
Maybe be more explicit that the check name is part of the message. Also, what does it mean when a check early terminated the check suite? Is it because the check found a problem? Or because it couldn't run at all? (This should be obvious through the message)
addToDiagnoses({ | ||
type: 'issue', | ||
check: check.name, | ||
message: `${check.name} encountered an unhandled exception`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What might cause this? (Anyone that sees it will be left scratching their heads, maybe we can help them out a bit)
let terminateChecks = false; | ||
|
||
// Slow down a bit so that the user can read the test name | ||
await delay(500); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this so they appear slowly and look like they are really working hard?
If so, lol.
Also, I dunno, maybe don't do it. Assuming it's milliseconds, 500ms is a long time.
|
||
if (!this.mounted) return; | ||
try { | ||
await check.check({ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does anything get printed for checks that find no issues? Should it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for each test but if nothing finds anything to report you get a sweet message.
'druid.zk.service.host', | ||
]; | ||
|
||
const RUNTIME_PROPERTIES_ALL_NODES_SHOULD_AGREE_ON: string[] = ['java.version']; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could be too aggressive. It's fine if different servers have different java versions, and would be normal if you're rolling a java upgrade through your servers, or if you just simply aren't fastidious about keeping them all at exactly the same version.
try { | ||
properties = (await axios.get(`/status/properties`)).data; | ||
} catch (e) { | ||
controls.addIssue('Did not get a /status/properties response, something must be broken.'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"from the Router"?
Also, "something must be broken" isn't adding much. If we have any speculation about why this might happen, add it here, otherwise I think we don't have much to say.
Btw, I'd suggest checking for an unauthorized response and translate that into something like "You are not authorized to access the XXX API." (for every case where the response is generated). This way it'll degrade nicely for secure clusters.
// Check that the management proxy is on, it really should be for someone to access the console in the first place but everything could happen | ||
if (properties['druid.router.managementProxy.enabled'] !== 'true') { | ||
controls.addIssue( | ||
`The router's "druid.router.managementProxy.enabled" is not reported as "true" that is unusual.`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would go with different language, like telling the user that we recommend setting this property in order for Coordinator / Overlord APIs to be accessible through the Router (and therefore the web console).
The general idea is we should give people some hints about why each check is there and what they should do about them if issues are found.
!properties['java.runtime.version'].startsWith('1.8') | ||
) { | ||
controls.addSuggestion( | ||
`It looks like are running Java ${properties['java.runtime.version']}, Druid only officially supports Java 1.8.x`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
java.specification.version
may be easier to parse. (startsWith 1.8 is a little brittle, what happens when 1.80 is out 🙂)
A period seems more grammatical to me here (vs. a comma).
try { | ||
coordinatorStatus = (await axios.get(`/proxy/coordinator/status`)).data; | ||
} catch (e) { | ||
controls.addIssue('Did not get a /status response from the coordinator, is it running?'); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm I think it'd be nice to be consistent about the format of the checks. I'd suggest something like: Potential issue. Optional clarification of why it may or may not be an issue. Suggestion, if any.
In this case that'd be:
Did not get a /status response from the Coordinator. Try confirming that it is running and accessible.
Btw, these error message should be variable based on the specific error. 404 is different from 400/401/500 etc.
Btw also, please be consistent about capitalization (IMO I like capitalizing the process types).
|
||
if (myStatus.version !== coordinatorStatus.version) { | ||
controls.addSuggestion( | ||
`It looks like the Router and Coordinator nodes are on different versions of Druid, are you in the middle of a rolling upgrade?`, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Putting into the format I suggested above, this would be:
It looks like the Router and Coordinator nodes are running different versions of Druid. This may indicate a problem if you are not in the middle of a rolling upgrade.
Great feedback! Thank you! |
It looks like something small but useful. We recently started using dropwizard in a couple other places and it has been nice. |
@vogievetsky can you resolve the merge conflicts? |
Adding a testing and troubleshooting framework to the console.
The Druid Doctor is a dialog that runs a series of tests that test the capabilities of a cluster via the HTTP APIs available via the router node. The aim is to be able to identify common cluster and data misconfigurations and present helpful suggestions to the user.
There are currently 9 tests:
In the future I would also like to add a test for task slots and pending tasks