
Druid Doctor #8672

Merged
merged 9 commits into apache:master from doctor on Oct 23, 2019

Conversation

vogievetsky
Contributor

[screenshot: the Druid Doctor dialog]

Adding a testing and troubleshooting framework to the console.

The Druid Doctor is a dialog that runs a series of checks against the cluster via the HTTP APIs available through the Router process. The aim is to identify common cluster and data misconfigurations and present helpful suggestions to the user.

There are currently 9 tests:

  • Verify own status
  • Verify own runtime properties
  • Verify the Coordinator and Overlord status
  • Verify the Coordinator and Overlord runtime properties
  • Verify that the sampler works
  • Verify that SQL works
  • Verify that there are historical nodes
  • Verify that the historicals are not overfilled
  • Look for time chunks that could benefit from compaction

In the future I would also like to add tests for task slots and pending tasks.
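For readers skimming the thread, the overall shape described above (a list of named checks that report issues and suggestions through a controls object) might be sketched roughly like this. This is an illustrative sketch, not the PR's actual code; `DoctorCheck`, `CheckControls`, and `runChecks` are assumed names:

```typescript
interface DoctorDiagnosis {
  type: 'suggestion' | 'issue';
  check: string;
  message: string;
}

interface CheckControls {
  addSuggestion(message: string): void;
  addIssue(message: string): void;
  terminateChecks(): void;
}

interface DoctorCheck {
  name: string;
  check(controls: CheckControls): Promise<void>;
}

// Run the checks in order, collecting diagnoses; a check can abort the
// rest of the suite via terminateChecks, and an unhandled exception in a
// check is itself reported as an issue.
async function runChecks(checks: DoctorCheck[]): Promise<DoctorDiagnosis[]> {
  const diagnoses: DoctorDiagnosis[] = [];
  let terminated = false;
  for (const check of checks) {
    if (terminated) break;
    const controls: CheckControls = {
      addSuggestion: message => {
        diagnoses.push({ type: 'suggestion', check: check.name, message });
      },
      addIssue: message => {
        diagnoses.push({ type: 'issue', check: check.name, message });
      },
      terminateChecks: () => {
        terminated = true;
      },
    };
    try {
      await check.check(controls);
    } catch {
      controls.addIssue(`${check.name} encountered an unhandled exception`);
    }
  }
  return diagnoses;
}
```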

@himanshug
Contributor

I am somewhat illiterate when it comes to modern UI frameworks and practices, so couldn't review the code but 👍

@vogievetsky
Contributor Author

@himanshug you (and anyone) can always help by providing me with more troubleshooting checks to encode!

@fjy
Contributor

fjy commented Oct 15, 2019

@vogievetsky can we adjust the color scheme to more accurately reflect the severity of an issue and minimize false scares?

@vogievetsky
Contributor Author

@fjy are you reacting to something? Adjust it from what to what? Right now suggestions show up in a neutral (dark) color and issues show up in Blueprint's warning style. You will only get a red error if the tests terminated early due to an unrecoverable state (like in the screenshot I posted, where the cluster is not even running).

@himanshug
Contributor

himanshug commented Oct 15, 2019

This is great, but as a follow-up it would be great if most of this were implemented on the Router behind an endpoint, so the web console would just call the endpoint to get the results of a checkup.

The advantage of that approach: the same code on the Router could enable periodic checkups in the background and proactively generate alerts for many bad states, e.g. Historical capacity is 90% utilized, more than X tasks are in a pending state, etc. These are the kinds of things we manually set up alerts for as operators of a Druid cluster using other mechanisms. Building it inside Druid would give any user a lot of that for free.

Not saying that we build the "alert" mechanism itself (e.g. sending mail), but we could use the existing emitter-based Alert mechanism. Also, the Router endpoint could just return the checkup results from the previous run, making the web console experience very fast.

Makes sense?

@vogievetsky
Contributor Author

@himanshug I 100% agree, and I know @gianm will agree with you too. In fact @gianm was advocating that the checks should live in Java, and I was like: "do I look like James Gosling to you?"

But in all seriousness, I think you should look at this PR as a working prototype. I want to verify that a one-button troubleshooting wizard would be a useful addition for the community. I also wanted to have a platform to go around to the Druid devs and support people that I know and solicit checks from them.

If this proves to be useful, I think the next step would be to move the majority of the checks into Java and expose them as an endpoint. I will make an issue to track that as soon as this PR is merged.

@jnaous
Contributor

jnaous commented Oct 16, 2019 via email

Contributor

@gianm left a comment

I like the concept but have some message and behavior suggestions. I didn't review everything, but I looked at the first couple chunks of tests. If people like the vibe I am going for with these comments, then some similar ones might apply to the other chunks of tests.

terminateChecks: () => {
  if (!this.mounted) return;
  this.setState({
    earlyTermination: `${check.name} early terminated the check suite`,
Contributor

This line made the screenshot confusing to me (I had a moment like, what does "Verify own status early terminated" mean? We want to verify that the status has early terminated?).

Maybe be more explicit that the check name is part of the message. Also, what does it mean when a check early terminated the check suite? Is it because the check found a problem? Or because it couldn't run at all? (This should be obvious through the message)

addToDiagnoses({
  type: 'issue',
  check: check.name,
  message: `${check.name} encountered an unhandled exception`,
Contributor

What might cause this? (Anyone that sees it will be left scratching their heads, maybe we can help them out a bit)

let terminateChecks = false;

// Slow down a bit so that the user can read the test name
await delay(500);
Contributor

Is this so they appear slowly and look like they are really working hard?

If so, lol.

Also, I dunno, maybe don't do it. Assuming it's milliseconds, 500ms is a long time.
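For context, the `delay` in question is presumably a promise-based sleep along these lines (a sketch; the console's actual helper may differ):

```typescript
// Resolve after the given number of milliseconds, so check loops can
// `await delay(500)` between tests.
function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```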


if (!this.mounted) return;
try {
  await check.check({
Contributor

Does anything get printed for checks that find no issues? Should it?

Contributor Author

Not for each test, but if no check finds anything to report you get a sweet message.

'druid.zk.service.host',
];

const RUNTIME_PROPERTIES_ALL_NODES_SHOULD_AGREE_ON: string[] = ['java.version'];
Contributor

This could be too aggressive. It's fine if different servers have different java versions, and would be normal if you're rolling a java upgrade through your servers, or if you just simply aren't fastidious about keeping them all at exactly the same version.

try {
  properties = (await axios.get(`/status/properties`)).data;
} catch (e) {
  controls.addIssue('Did not get a /status/properties response, something must be broken.');
Contributor

"from the Router"?

Also, "something must be broken" isn't adding much. If we have any speculation about why this might happen, add it here, otherwise I think we don't have much to say.

Btw, I'd suggest checking for an unauthorized response and translate that into something like "You are not authorized to access the XXX API." (for every case where the response is generated). This way it'll degrade nicely for secure clusters.
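The unauthorized-response suggestion above could look something like this (a hypothetical helper, not the PR's code; `describeApiFailure` is an assumed name):

```typescript
// Map the HTTP status of a failed request to a clearer message, so secure
// clusters degrade nicely instead of reporting a generic failure.
function describeApiFailure(api: string, status?: number): string {
  if (status === 401 || status === 403) {
    return `You are not authorized to access the ${api} API.`;
  }
  if (status === 404) {
    return `The ${api} API was not found; the target process may not be running.`;
  }
  return `Did not get a response from the ${api} API.`;
}
```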

// Check that the management proxy is on, it really should be for someone to access the console in the first place but everything could happen
if (properties['druid.router.managementProxy.enabled'] !== 'true') {
  controls.addIssue(
    `The router's "druid.router.managementProxy.enabled" is not reported as "true" that is unusual.`,
Contributor

I would go with different language, like telling the user that we recommend setting this property in order for Coordinator / Overlord APIs to be accessible through the Router (and therefore the web console).

The general idea is we should give people some hints about why each check is there and what they should do about them if issues are found.

!properties['java.runtime.version'].startsWith('1.8')
) {
controls.addSuggestion(
`It looks like are running Java ${properties['java.runtime.version']}, Druid only officially supports Java 1.8.x`,
Contributor

java.specification.version may be easier to parse. (startsWith 1.8 is a little brittle, what happens when 1.80 is out 🙂)

A period seems more grammatical to me here (vs. a comma).
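The suggested alternative might look like this sketch (hypothetical, with an assumed function name): compare `java.specification.version` exactly instead of prefix-matching `java.runtime.version`. Java 8 reports a specification version of "1.8"; Java 9 and later report a bare major number ("9", "11", ...).

```typescript
// Exact comparison avoids the prefix-match pitfall where a hypothetical
// "1.80" would incorrectly pass a startsWith('1.8') test.
function isOfficiallySupportedJava(specVersion: string): boolean {
  return specVersion === '1.8';
}
```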

try {
  coordinatorStatus = (await axios.get(`/proxy/coordinator/status`)).data;
} catch (e) {
  controls.addIssue('Did not get a /status response from the coordinator, is it running?');
Contributor

Hmm I think it'd be nice to be consistent about the format of the checks. I'd suggest something like: Potential issue. Optional clarification of why it may or may not be an issue. Suggestion, if any.

In this case that'd be:

Did not get a /status response from the Coordinator. Try confirming that it is running and accessible.

Btw, these error messages should vary based on the specific error. A 404 is different from a 400/401/500, etc.

Btw also, please be consistent about capitalization (IMO I like capitalizing the process types).
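A small helper could enforce the message format suggested above (potential issue, optional clarification, optional suggestion). This is a hypothetical sketch, not part of the PR:

```typescript
// Join the parts of a diagnosis message, skipping any that are omitted.
function formatDiagnosis(
  issue: string,
  clarification?: string,
  suggestion?: string,
): string {
  return [issue, clarification, suggestion].filter(Boolean).join(' ');
}
```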

if (myStatus.version !== coordinatorStatus.version) {
  controls.addSuggestion(
    `It looks like the Router and Coordinator nodes are on different versions of Druid, are you in the middle of a rolling upgrade?`,
Contributor

Putting into the format I suggested above, this would be:

It looks like the Router and Coordinator nodes are running different versions of Druid. This may indicate a problem if you are not in the middle of a rolling upgrade.

@vogievetsky
Contributor Author

Great feedback! Thank you!

@gianm
Contributor

gianm commented Oct 16, 2019

Can I suggest using something like https://metrics.dropwizard.io/3.1.0/manual/healthchecks/?

It looks like something small but useful. We recently started using dropwizard in a couple other places and it has been nice.

@fjy
Contributor

fjy commented Oct 23, 2019

@vogievetsky can you resolve the merge conflicts?

@fjy fjy merged commit d9c9aef into apache:master Oct 23, 2019
@fjy fjy deleted the doctor branch October 23, 2019 23:50
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019