This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1249 Improve Metron MPack service checks #799

Closed
wants to merge 3 commits

Conversation

nickwallen
Contributor

@nickwallen nickwallen commented Oct 13, 2017

This PR enhances the Metron 'Service Check' functionality in the MPack. The Service Check is an easy way for a user to know if their Metron cluster is healthy. The Service Check can be run manually from within Ambari. It is also automatically executed after kerberization.

In the current version of the Service Check, healthy means the Parser and Indexing topologies are running. This PR enhances that to validate all of the install actions that occur across each of the Metron services. These checks include the following.

  • Kafka topics, user permissions, group permissions
  • HBase tables, column families, and user permissions
  • HDFS resources like the grok patterns and geo database
  • Ensures all Metron topologies are running
  • Ensures the web-based resources are responding
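
Structurally, these checks lend themselves to an aggregate-and-report pattern: run every probe, collect the failures, and surface them all at once rather than stopping at the first error. A minimal stand-alone sketch of that pattern (function names here are illustrative, not the MPack's actual API):

```python
def run_service_check(checks):
    """Run each (name, check) pair, collecting failure messages instead
    of stopping at the first error, so the log shows every problem."""
    failures = []
    for name, check in checks:
        try:
            check()
        except Exception as e:
            failures.append("{0}: {1}".format(name, e))
    return failures
```

In a real Service Check each callable would wrap a Kafka, HBase, HDFS, Storm, or HTTP probe; here a check is simply any callable that raises on failure.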

I added considerable logging so that if a check does fail, a user will have a reasonable chance to understand why. Ambari doesn't give me an easy way to tell a user "hey, this is the problem!", so the user still has to go through the output of the Service Check in the Operations Panel to know why the Service Check failed.

Testing

  • I manually tested each of the checks by, for example, deleting a Kafka topic then running the Service Check. This can be repeated for all of the different types of checks that I outlined above.
  • I tested the Service Check on a fresh deployment of Full Dev.
  • I then kerberized Full Dev and again validated the Service Check

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  • Have you written or updated unit tests and or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

    :param env: Environment
    """
    Logger.info("Checking for Geo database")
    metron_service.check_hdfs_file_exists(self.__params, self.__params.geoip_hdfs_dir + "/GeoLite2-City.mmdb.gz")
Member

@cestella cestella Oct 16, 2017


You know, honestly, the better approach here, unfortunately, is to pull the filename from the global config and ensure that the file exists. What if the user renames the GeoLite database to something other than GeoLite2-City.mmdb.gz? Can we make that a follow-on JIRA at least?

Contributor Author


I agree. I don't like hard-coding "GeoLite2-City.mmdb.gz" here. I just didn't see anywhere in the Mpack where we had that value parameterized already.

Where in the code do we have it parameterized? I'd love to fix this.

Contributor Author


Ok, I see it now under the global properties key "geo.hdfs.file". There is nothing in the Ambari MPack for it, which might complicate using it. I am just thinking through whether we would need to first introduce it as an Ambari-managed configuration value.
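
For the follow-on, honoring that key could look something like this. A sketch only: the key name "geo.hdfs.file" comes from the discussion above, but whether its value is a full path or just a filename is not settled in this thread; the sketch treats it as a filename, with the currently hard-coded name as the fallback.

```python
# Default mirrors the filename currently hard-coded in the service check.
DEFAULT_GEO_FILE = "GeoLite2-City.mmdb.gz"

def geo_hdfs_path(global_config, geoip_hdfs_dir):
    """Prefer the 'geo.hdfs.file' entry from the global config,
    falling back to the default filename when the key is absent."""
    filename = global_config.get("geo.hdfs.file", DEFAULT_GEO_FILE)
    return geoip_hdfs_dir.rstrip("/") + "/" + filename
```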

Member


Yeah, we just have the HDFS directory, which is what you're using. For this PR, I think it's good. I'd hate to make the perfect the enemy of the good.

Going forward, what would be cool is if we could execute a Stellar script via the REPL and, if it fails, fail the service check. Since the REPL can interact with global config parameters and can validate HDFS, that would be a clean way to do this.

    for topic in topics:
        Logger.info("Checking existence of Kafka topic '{0}'".format(topic))
        try:
            Execute(
Member

@cestella cestella Oct 16, 2017


Any chance we could make a function rather than cutting and pasting? Something like:

def exec(cmd, user_as, fail_msg):
    try:
        Execute(cmd,
                tries=3,
                try_sleep=5,
                logoutput=True,
                path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                user=user_as)
    except:
        raise Fail(fail_msg)
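
A stand-alone version of that wrap-and-retry pattern can be sketched without Ambari at all; here `Fail` is stubbed in place of resource_management's exception, and the retry loop stands in for what `Execute(tries=3, ...)` does internally (names are illustrative):

```python
class Fail(Exception):
    """Stand-in for resource_management's Fail exception."""

def execute_with_retry(run, fail_msg, tries=3):
    """Call `run` up to `tries` times; if every attempt raises,
    surface a single Fail carrying a descriptive message."""
    for attempt in range(tries):
        try:
            return run()
        except Exception:
            if attempt == tries - 1:
                raise Fail(fail_msg)
```

The point of the refactor is exactly this: the retry policy and failure message live in one place, and each call site passes only the command and its context.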

Contributor Author


I refactored this a bit and added some pydocs.

@cestella
Member

I really like this. Modulo a couple of very minor nits, I'm +1 by inspection.

@ottobackwards
Contributor

If my understanding is correct, this will work for all parser topologies configured through Ambari to start automatically, but not for all parser topologies that may in fact be running, since we still have the disconnect between the Management UI and Ambari. That is to say, topologies started through the Management UI will not be tracked by this check.

@cestella
Member

Yes, that's right. In my opinion, though, we should probably stop managing parsers in Ambari and just focus on the Management UI, so there is one place to manage all parsers.

@ottobackwards
Contributor

Untangling Ambari and ZooKeeper would be required if we want (as I think we need to) to be able to manage the parsers from the UI/REST but also have the Ambari service management still work (restart all affected services). I'm not sure if it is more than having Ambari read and write the all_parsers list to ZooKeeper or not.

@ottobackwards
Contributor

But I am taking what you are saying about managing parsers to mean 'managing configured sensors is not done in Ambari, but managing the services is'....

@cestella
Member

Really, what I'd like to see is a status on the parsers in the Management UI that indicates those parsers should be running all the time, and a REST call to validate whether they are running. I'd then expect Ambari to make the REST call and fail the service check if any of the installed sensors that are marked as running aren't running.

What we have now is that list in Ambari, with Ambari managing starting and stopping them. I'd prefer to delegate the starting and stopping of sensors to the Management UI and have Ambari JUST interact with the REST API to indicate whether the current state is nominal. That, however, is really more of a topic for a discuss thread.
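
The compliance half of that idea is simple to sketch independently of the REST plumbing. The field names below ("name", "shouldRun", "status") are hypothetical, not an existing Metron REST schema:

```python
def sensors_out_of_compliance(sensors):
    """Return the names of sensors marked as always-running that are
    not actually running; a non-empty result would fail the check."""
    return [s["name"] for s in sensors
            if s.get("shouldRun") and s.get("status") != "RUNNING"]
```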

@nickwallen
Contributor Author

@ottobackwards Do you have any concerns that need to be addressed in this PR?

@ottobackwards
Contributor

@nickwallen no, the problem already existed before this PR.

@nickwallen
Contributor Author

@ottobackwards Thanks. Your points definitely warrant a discussion. I just wanted to make sure we're green-lighted on this PR.
