
Druid Doctor #8672

Merged
merged 9 commits into apache:master from doctor on Oct 23, 2019

Conversation

vogievetsky
Contributor

[screenshot: the Druid Doctor dialog]

Adding a testing and troubleshooting framework to the console.

The Druid Doctor is a dialog that runs a series of checks against the cluster via the HTTP APIs available through the Router process. The aim is to identify common cluster and data misconfigurations and present helpful suggestions to the user.

There are currently 9 tests:

  • Verify own status
  • Verify own runtime properties
  • Verify the Coordinator and Overlord status
  • Verify the Coordinator and Overlord runtime properties
  • Verify that the sampler works
  • Verify that SQL works
  • Verify that there are historical nodes
  • Verify that the historicals are not overfilled
  • Look for time chunks that could benefit from compaction

In the future I would also like to add tests for task slots and pending tasks.
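For readers skimming the thread, the overall shape described above (a list of named checks that report issues and suggestions through a controls object) might be sketched roughly like this. This is an illustrative sketch, not the PR's actual code; `DoctorCheck`, `CheckControls`, and `runChecks` are assumed names:

```typescript
interface DoctorDiagnosis {
  type: 'suggestion' | 'issue';
  check: string;
  message: string;
}

interface CheckControls {
  addSuggestion(message: string): void;
  addIssue(message: string): void;
  terminateChecks(): void;
}

interface DoctorCheck {
  name: string;
  check(controls: CheckControls): Promise<void>;
}

// Run the checks in order, collecting diagnoses; a check can abort the
// rest of the suite via terminateChecks, and an unhandled exception in a
// check is itself reported as an issue.
async function runChecks(checks: DoctorCheck[]): Promise<DoctorDiagnosis[]> {
  const diagnoses: DoctorDiagnosis[] = [];
  let terminated = false;
  for (const check of checks) {
    if (terminated) break;
    const controls: CheckControls = {
      addSuggestion: message => {
        diagnoses.push({ type: 'suggestion', check: check.name, message });
      },
      addIssue: message => {
        diagnoses.push({ type: 'issue', check: check.name, message });
      },
      terminateChecks: () => {
        terminated = true;
      },
    };
    try {
      await check.check(controls);
    } catch {
      controls.addIssue(`${check.name} encountered an unhandled exception`);
    }
  }
  return diagnoses;
}
```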

@himanshug
Contributor

I am somewhat illiterate when it comes to modern UI frameworks and practices, so couldn't review the code but 👍

@vogievetsky
Contributor Author

@himanshug you (and anyone) can always help by providing me with more troubleshooting checks to encode!

@fjy
Contributor

fjy commented Oct 15, 2019

@vogievetsky can we adjust the color scheme to more accurately reflect the severity of an issue and minimize false scares?

@vogievetsky
Contributor Author

@fjy are you reacting to something? Adjust it from what to what? Right now suggestions show up in a neutral (dark) color and issues show up in Blueprint's warning style. You will only get a red error if the tests terminated early due to an unrecoverable state (like in the screenshot I posted, where the cluster is not even running).

@himanshug
Contributor

himanshug commented Oct 15, 2019

This is great, but as a follow-up it would be great if most of this were implemented on the Router behind an endpoint, so the web console would just call the endpoint to get the results of a checkup.

The advantage of that approach: the same code on the Router could enable periodic checkups in the background and proactively generate alerts for many bad states, e.g. Historical capacity is 90% utilized, more than X tasks are in a pending state, etc. These are the kinds of things we manually set up alerts for as operators of a Druid cluster using other mechanisms. Building it inside Druid would give any user a lot of that for free.

Not saying that we build the "alert" mechanism itself (e.g. sending mail), but we could use the existing emitter-based Alert mechanism. Also, the Router endpoint could just return the checkup results from the previous run, making the web console experience very fast.

Makes sense?

@vogievetsky
Contributor Author

@himanshug I 100% agree, and I know @gianm will agree with you too. In fact @gianm was advocating that the checks should live in Java, and I was like: "do I look like James Gosling to you?"

But in all seriousness, I think you should look at this PR as a working prototype. I want to verify that a one-button troubleshooting wizard would be a useful addition for the community. I also wanted to have a platform to go around to the Druid devs and support people that I know and solicit checks from them.

If this proves to be useful, I think the next step would be to move the majority of the checks into Java and expose them as an endpoint. I will make an issue to track that as soon as this PR is merged.

@jnaous
Contributor

jnaous commented Oct 16, 2019 via email

Contributor

@gianm left a comment

I like the concept but have some message and behavior suggestions. I didn't review everything, but I looked at the first couple chunks of tests. If people like the vibe I am going for with these comments, then some similar ones might apply to the other chunks of tests.

terminateChecks: () => {
  if (!this.mounted) return;
  this.setState({
    earlyTermination: `${check.name} early terminated the check suite`,
Contributor

This line made the screenshot confusing to me (I had a moment like, what does "Verify own status early terminated" mean? We want to verify that the status has early terminated?).

Maybe be more explicit that the check name is part of the message. Also, what does it mean when a check early terminated the check suite? Is it because the check found a problem? Or because it couldn't run at all? (This should be obvious through the message)

addToDiagnoses({
  type: 'issue',
  check: check.name,
  message: `${check.name} encountered an unhandled exception`,
Contributor

What might cause this? (Anyone that sees it will be left scratching their heads, maybe we can help them out a bit)

let terminateChecks = false;

// Slow down a bit so that the user can read the test name
await delay(500);
Contributor

Is this so they appear slowly and look like they are really working hard?

If so, lol.

Also, I dunno, maybe don't do it. Assuming it's milliseconds, 500ms is a long time.
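For context, the `delay` in question is presumably a promise-based sleep along these lines (a sketch; the console's actual helper may differ):

```typescript
// Resolve after the given number of milliseconds, so check loops can
// `await delay(500)` between tests.
function delay(ms: number): Promise<void> {
  return new Promise(resolve => setTimeout(resolve, ms));
}
```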


if (!this.mounted) return;
try {
  await check.check({
Contributor

Does anything get printed for checks that find no issues? Should it?

Contributor Author

Not for each test, but if no check finds anything to report you get a sweet message.

'druid.zk.service.host',
];

const RUNTIME_PROPERTIES_ALL_NODES_SHOULD_AGREE_ON: string[] = ['java.version'];
Contributor

This could be too aggressive. It's fine if different servers have different java versions, and would be normal if you're rolling a java upgrade through your servers, or if you just simply aren't fastidious about keeping them all at exactly the same version.

try {
  properties = (await axios.get(`/status/properties`)).data;
} catch (e) {
  controls.addIssue('Did not get a /status/properties response, something must be broken.');
Contributor

"from the Router"?

Also, "something must be broken" isn't adding much. If we have any speculation about why this might happen, add it here, otherwise I think we don't have much to say.

Btw, I'd suggest checking for an unauthorized response and translate that into something like "You are not authorized to access the XXX API." (for every case where the response is generated). This way it'll degrade nicely for secure clusters.
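The unauthorized-response suggestion above could look something like this (a hypothetical helper, not the PR's code; `describeApiFailure` is an assumed name):

```typescript
// Map the HTTP status of a failed request to a clearer message, so secure
// clusters degrade nicely instead of reporting a generic failure.
function describeApiFailure(api: string, status?: number): string {
  if (status === 401 || status === 403) {
    return `You are not authorized to access the ${api} API.`;
  }
  if (status === 404) {
    return `The ${api} API was not found; the target process may not be running.`;
  }
  return `Did not get a response from the ${api} API.`;
}
```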

// Check that the management proxy is on, it really should be for someone to access the console in the first place but everything could happen
if (properties['druid.router.managementProxy.enabled'] !== 'true') {
  controls.addIssue(
    `The router's "druid.router.managementProxy.enabled" is not reported as "true" that is unusual.`,
Contributor

I would go with different language, like telling the user that we recommend setting this property in order for Coordinator / Overlord APIs to be accessible through the Router (and therefore the web console).

The general idea is we should give people some hints about why each check is there and what they should do about them if issues are found.

!properties['java.runtime.version'].startsWith('1.8')
) {
controls.addSuggestion(
`It looks like are running Java ${properties['java.runtime.version']}, Druid only officially supports Java 1.8.x`,
Contributor

java.specification.version may be easier to parse. (startsWith 1.8 is a little brittle, what happens when 1.80 is out 🙂)

A period seems more grammatical to me here (vs. a comma).
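The suggested alternative might look like this sketch (hypothetical, with an assumed function name): compare `java.specification.version` exactly instead of prefix-matching `java.runtime.version`. Java 8 reports a specification version of "1.8"; Java 9 and later report a bare major number ("9", "11", ...).

```typescript
// Exact comparison avoids the prefix-match pitfall where a hypothetical
// "1.80" would incorrectly pass a startsWith('1.8') test.
function isOfficiallySupportedJava(specVersion: string): boolean {
  return specVersion === '1.8';
}
```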

try {
  coordinatorStatus = (await axios.get(`/proxy/coordinator/status`)).data;
} catch (e) {
  controls.addIssue('Did not get a /status response from the coordinator, is it running?');
Contributor

Hmm I think it'd be nice to be consistent about the format of the checks. I'd suggest something like: Potential issue. Optional clarification of why it may or may not be an issue. Suggestion, if any.

In this case that'd be:

Did not get a /status response from the Coordinator. Try confirming that it is running and accessible.

Btw, these error messages should vary based on the specific error. A 404 is different from a 400/401/500, etc.

Btw also, please be consistent about capitalization (IMO I like capitalizing the process types).
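A small helper could enforce the message format suggested above (potential issue, optional clarification, optional suggestion). This is a hypothetical sketch, not part of the PR:

```typescript
// Join the parts of a diagnosis message, skipping any that are omitted.
function formatDiagnosis(
  issue: string,
  clarification?: string,
  suggestion?: string,
): string {
  return [issue, clarification, suggestion].filter(Boolean).join(' ');
}
```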

if (myStatus.version !== coordinatorStatus.version) {
  controls.addSuggestion(
    `It looks like the Router and Coordinator nodes are on different versions of Druid, are you in the middle of a rolling upgrade?`,
Contributor

Putting into the format I suggested above, this would be:

It looks like the Router and Coordinator nodes are running different versions of Druid. This may indicate a problem if you are not in the middle of a rolling upgrade.

@vogievetsky
Contributor Author

Great feedback! Thank you!

@gianm
Contributor

gianm commented Oct 16, 2019

Can I suggest using something like https://metrics.dropwizard.io/3.1.0/manual/healthchecks/?

It looks like something small but useful. We recently started using dropwizard in a couple other places and it has been nice.

@fjy
Contributor

fjy commented Oct 23, 2019

@vogievetsky can you resolve the merge conflicts?

@fjy fjy merged commit d9c9aef into apache:master Oct 23, 2019
@fjy fjy deleted the doctor branch October 23, 2019 23:50
@jon-wei jon-wei added this to the 0.17.0 milestone Dec 17, 2019