This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1249 Improve Metron MPack service checks #799

Closed
wants to merge 3 commits

Conversation

nickwallen
Contributor

@nickwallen nickwallen commented Oct 13, 2017

This PR enhances the Metron 'Service Check' functionality in the MPack. The Service Check is an easy way for a user to know if their Metron cluster is healthy. The Service Check can be run manually from within Ambari. It is also automatically executed after kerberization.

In the current version of the Service Check, healthy means the Parser and Indexing topologies are running. This PR enhances that to validate all of the install actions that occur across each of the Metron services. These checks include the following.

  • Kafka topics, user permissions, group permissions
  • HBase tables, column families, and user permissions
  • HDFS resources like the grok patterns and geo database
  • Ensures all Metron topologies are running
  • Ensures the web-based resources are responding
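
Structurally, these checks lend themselves to an aggregate-and-report pattern: run every probe, collect the failures, and surface them all at once rather than stopping at the first error. A minimal stand-alone sketch of that pattern (function names here are illustrative, not the MPack's actual API):

```python
def run_service_check(checks):
    """Run each (name, check) pair, collecting failure messages instead
    of stopping at the first error, so the log shows every problem."""
    failures = []
    for name, check in checks:
        try:
            check()
        except Exception as e:
            failures.append("{0}: {1}".format(name, e))
    return failures
```

In a real Service Check each callable would wrap a Kafka, HBase, HDFS, Storm, or HTTP probe; here a check is simply any callable that raises on failure.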

I added considerable logging so that if a check does fail, a user will have a reasonable chance to understand why. Ambari doesn't give me an easy way to tell a user "hey, this is the problem!", so the user still has to go through the output of the Service Check in the Operations Panel to know why the Service Check failed.

Testing

  • I manually tested each of the checks by, for example, deleting a Kafka topic then running the Service Check. This can be repeated for all of the different types of checks that I outlined above.
  • I tested the Service Check on a fresh deployment of Full Dev.
  • I then kerberized Full Dev and again validated the Service Check

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  • Have you written or updated unit tests and or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

    :param env: Environment
    """
    Logger.info("Checking for Geo database")
    metron_service.check_hdfs_file_exists(self.__params, self.__params.geoip_hdfs_dir + "/GeoLite2-City.mmdb.gz")
Member

@cestella cestella Oct 16, 2017


You know, honestly, the better approach here, unfortunately, is to pull the filename from the global config and ensure that the file exists. What if the user renames the GeoLite database to something other than GeoLite2-City.mmdb.gz? Can we make that a follow-on JIRA at least?

Contributor Author


I agree. I don't like hard-coding "GeoLite2-City.mmdb.gz" here. I just didn't see anywhere in the Mpack where we had that value parameterized already.

Where in the code do we have it parameterized? I'd love to fix this.

Contributor Author


Ok, I see it now under the global properties key "geo.hdfs.file". There is nothing in the Ambari MPack for it, which might complicate using it. I am just thinking through whether we would need to first introduce it as an Ambari-managed configuration value.
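
For the follow-on, honoring that key could look something like this. A sketch only: the key name "geo.hdfs.file" comes from the discussion above, but whether its value is a full path or just a filename is not settled in this thread; the sketch treats it as a filename, with the currently hard-coded name as the fallback.

```python
# Default mirrors the filename currently hard-coded in the service check.
DEFAULT_GEO_FILE = "GeoLite2-City.mmdb.gz"

def geo_hdfs_path(global_config, geoip_hdfs_dir):
    """Prefer the 'geo.hdfs.file' entry from the global config,
    falling back to the default filename when the key is absent."""
    filename = global_config.get("geo.hdfs.file", DEFAULT_GEO_FILE)
    return geoip_hdfs_dir.rstrip("/") + "/" + filename
```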

Member


Yeah, we just have the HDFS directory, which is what you're using. For this PR, I think it's good. I'd hate to make the perfect the enemy of the good.

Going forward, what would be cool is if we could execute a Stellar script via the REPL and, if it fails, fail the service check. Since the REPL can interact with global config parameters and can validate HDFS, that would be a clean way to do this.

    for topic in topics:
        Logger.info("Checking existence of Kafka topic '{0}'".format(topic))
        try:
            Execute(
Member

@cestella cestella Oct 16, 2017


Any chance we could make a function rather than cutting and pasting? Something like:

def exec(cmd, user_as, fail_msg):
    try:
        Execute(cmd,
                tries=3,
                try_sleep=5,
                logoutput=True,
                path='/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin',
                user=user_as)
    except:
        raise Fail(fail_msg)
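
A stand-alone version of that wrap-and-retry pattern can be sketched without Ambari at all; here `Fail` is stubbed in place of resource_management's exception, and the retry loop stands in for what `Execute(tries=3, ...)` does internally (names are illustrative):

```python
class Fail(Exception):
    """Stand-in for resource_management's Fail exception."""

def execute_with_retry(run, fail_msg, tries=3):
    """Call `run` up to `tries` times; if every attempt raises,
    surface a single Fail carrying a descriptive message."""
    for attempt in range(tries):
        try:
            return run()
        except Exception:
            if attempt == tries - 1:
                raise Fail(fail_msg)
```

The point of the refactor is exactly this: the retry policy and failure message live in one place, and each call site passes only the command and its context.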

Contributor Author


I refactored this a bit and added some pydocs.

@cestella
Member

I really like this. Modulo a couple of very minor nits, I'm +1 by inspection.

@ottobackwards
Contributor

If my understanding is correct, this will work for all parser topologies configured through Ambari to start automatically, but not for all parser topologies that may in fact be running, since we still have the disconnect between the Management UI and Ambari. That is to say, topologies started through the Management UI will not be tracked by this check.

@cestella
Member

Yes, that's right. In my opinion, though, we should probably stop managing parsers in Ambari and just focus on the Management UI, so there is one place to manage all parsers.

@ottobackwards
Contributor

Untangling Ambari and ZooKeeper would be required if we want (as I think we need to) to be able to manage the parsers from the UI/REST but also have the Ambari service management still work (restart all affected services). I'm not sure if it is more than having Ambari read and write the all_parsers list to ZooKeeper or not.

@ottobackwards
Contributor

But I am taking what you are saying about managing parsers to mean 'managing configured sensors is not done in Ambari, but managing the services is'....

@cestella
Member

Really, what I'd like to see is a status on the parsers in the Management UI that indicates those parsers should be running all the time, and a REST call to validate whether they are running. I'd then expect Ambari to make the REST call and fail the service check if any of the installed sensors that are marked as running aren't running.

What we have now is that list in Ambari, with Ambari managing starting and stopping them. I'd prefer to delegate the starting and stopping of sensors to the Management UI and have Ambari JUST interact with the REST API to indicate whether the current state is nominal. That, however, is really more of a topic for a discuss thread.
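
The compliance half of that idea is simple to sketch independently of the REST plumbing. The field names below ("name", "shouldRun", "status") are hypothetical, not an existing Metron REST schema:

```python
def sensors_out_of_compliance(sensors):
    """Return the names of sensors marked as always-running that are
    not actually running; a non-empty result would fail the check."""
    return [s["name"] for s in sensors
            if s.get("shouldRun") and s.get("status") != "RUNNING"]
```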

@nickwallen
Contributor Author

@ottobackwards Do you have any concerns that need to be addressed in this PR?

@ottobackwards
Contributor

@nickwallen no, the problem already existed before this PR.

@nickwallen
Contributor Author

@ottobackwards Thanks. Your points definitely warrant a discussion. I just wanted to make sure we're green-lighted on this PR.
