This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-1348 Metron Service Checks Use Wrong Hostname #864

Closed
wants to merge 1 commit into from

@nickwallen nickwallen commented Dec 12, 2017

The Metron service check can often use the incorrect hostname when checking the Alerts UI, Management UI, and REST services. This results in a failed service check, even when the services are running successfully.

Ambari can run the service check on any node in the cluster, not just the node the service is actually running on. The service check code currently uses the hostname on which the service check is running. If the service is not actually installed on that host, the service check will incorrectly fail.

The service check code was updated to find the hostname where the service is installed and use that hostname.
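The lookup described above can be sketched in plain Python. This is only an illustration, not the code in the PR: `resolve_service_host` is a hypothetical helper that mimics the "use the registered host, else fall back to the local hostname" behaviour, and the sample data is made up.

```python
def resolve_service_host(cluster_host_info, key, local_hostname):
    """Return the first host registered under `key`, or the local host if none is listed."""
    hosts = cluster_host_info.get(key) or [local_hostname]
    return hosts[0]


# Hypothetical scenario: the service check runs on node2, but the Alerts UI lives on node1.
cluster_host_info = {"metron_alerts_ui_hosts": ["node1"]}

# Key present: the check targets the host where the service is installed.
alerts_ui_host = resolve_service_host(cluster_host_info, "metron_alerts_ui_hosts", "node2")

# Key absent: the check falls back to the host it is running on.
rest_host = resolve_service_host(cluster_host_info, "metron_rest_hosts", "node2")
```

Taking only the first entry assumes that reaching any one instance of the service is enough for the check to pass.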

Testing

This change was tested by deploying Metron on Full Dev and running Metron > Service Check in Ambari. The service check should complete successfully when the cluster is healthy. The fix has also been tested on a multi-node cluster in the same manner.

Pull Request Checklist

  • Is there a JIRA ticket associated with this PR? If not, one needs to be created at Metron Jira.
  • Does your PR title start with METRON-XXXX where XXXX is the JIRA number you are trying to resolve? Pay particular attention to the hyphen "-" character.
  • Has your PR been rebased against the latest commit within the target branch (typically master)?
  • Have you included steps to reproduce the behavior or problem that is being changed or addressed?
  • Have you included steps or a guide to how the change may be verified and tested manually?
  • Have you ensured that the full suite of tests and checks have been executed in the root metron folder via:
  • Have you written or updated unit tests and/or integration tests to verify your changes?
  • If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under ASF 2.0?
  • Have you verified the basic functionality of the build by building and running locally with Vagrant full-dev environment or the equivalent?

# UI
metron_management_ui_port = config['configurations']['metron-management-ui-env']['metron_management_ui_port']
# Alerts UI
metron_alerts_ui_host = default("/clusterHostInfo/metron_alerts_ui_hosts", [hostname])[0]
Contributor Author

I needed to somehow find the hosts running each service, and I knew that information was contained in this clusterHostInfo configuration. But it was really difficult to uncover what values Ambari keeps there. I have not been able to find any documentation on this.

I actually had to add some debug statements to a live instance of Ambari to find out what values are stored here and how they are named. Fun, fun.

For the record, here is what is stored in clusterHostInfo when spinning up the current state of Full Dev.

{  
   'snamenode_host':[  
      'node1'
   ],
   'metron_alerts_ui_hosts':[  
      'node1'
   ],
   'nm_hosts':[  
      'node1'
   ],
   'drpc_server_hosts':[  
      'node1'
   ],
   'ambari_server_use_ssl':[  
      'false'
   ],
   'all_ping_ports':[  
      '8670'
   ],
   'all_hosts':[  
      'node1'
   ],
   'rm_host':[  
      'node1'
   ],
   'kafka_broker_hosts':[  
      'node1'
   ],
   'slave_hosts':[  
      'node1'
   ],
   'metron_profiler_hosts':[  
      'node1'
   ],
   'storm_ui_server_hosts':[  
      'node1'
   ],
   'all_racks':[  
      '/default-rack'
   ],
   'all_ipv4_ips':[  
      '127.0.0.1'
   ],
   'app_timeline_server_hosts':[  
      'node1'
   ],
   'hs_host':[  
      'node1'
   ],
   'ambari_server_port':[  
      '8080'
   ],
   'metron_rest_hosts':[  
      'node1'
   ],
   'metron_management_ui_hosts':[  
      'node1'
   ],
   'es_master_hosts':[  
      'node1'
   ],
   'metron_parsers_hosts':[  
      'node1'
   ],
   'kibana_master_hosts':[  
      'node1'
   ],
   'metron_enrichment_master_hosts':[  
      'node1'
   ],
   'hbase_rs_hosts':[  
      'node1'
   ],
   'namenode_host':[  
      'node1'
   ],
   'nimbus_hosts':[  
      'node1'
   ],
   'hbase_master_hosts':[  
      'node1'
   ],
   'metron_indexing_hosts':[  
      'node1'
   ],
   'ambari_server_host':[  
      'node1'
   ],
   'zookeeper_hosts':[  
      'node1'
   ],
   'supervisor_hosts':[  
      'node1'
   ]
}
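A dump like the one above could be produced with a small debug helper along these lines. This is a sketch under assumptions: `format_cluster_host_info` is a hypothetical name, `config` is assumed to be a dictionary with a top-level `clusterHostInfo` entry, and the comment does not say where in Ambari the actual debug statements were placed.

```python
import json


def format_cluster_host_info(config):
    """Render the clusterHostInfo section of a config dictionary as sorted, indented text."""
    info = config.get("clusterHostInfo", {})
    return json.dumps(info, indent=3, sort_keys=True)


# Hypothetical, trimmed-down configuration for illustration only.
sample_config = {
    "clusterHostInfo": {
        "metron_rest_hosts": ["node1"],
        "all_hosts": ["node1"],
    }
}

dump = format_cluster_host_info(sample_config)
print(dump)
```

Sorting the keys makes it easier to compare dumps taken from different clusters.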

@ottobackwards

Ran up in full dev, works as described.
Nice Job.

+1

@anandsubbu

anandsubbu commented Dec 20, 2017

Hi @nickwallen, I tried this on a 12-node cluster. I validated that clusterHostInfo is populated properly for the alerts_ui, management_ui and rest hosts.

However, in my case it failed on the parser service check since the 'Metron Check' step landed on a host without Kafka broker installed.

Here's the error excerpt:

<snip>
2017-12-20 18:42:54,285 - Performing Parser service check
2017-12-20 18:42:54,285 - Checking for grok patterns in HDFS for Parsers
2017-12-20 18:42:54,285 - Checking HDFS; directory=/apps/metron/patterns user=metron
2017-12-20 18:42:54,285 - Execute['/usr/hdp/2.5.3.0-37/hadoop/bin/hdfs dfs -test -d /apps/metron/patterns'] {'logoutput': True, 'path': ['/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin'], 'tries': 3, 'user': 'metron', 'try_sleep': 5}
2017-12-20 18:42:56,822 - Checking Kafka topics for Parsers
2017-12-20 18:42:56,822 - Checking existence of Kafka topic 'bro'
2017-12-20 18:42:56,823 - Execute['/usr/hdp/current/kafka-broker/bin/kafka-topics.sh       --zookeeper metronc-1.openstacklocal:2181,metronc-11.openstacklocal:2181,metronc-10.openstacklocal:2181       --list |       awk 'BEGIN {cnt=0;} /bro/ {cnt++} END {if (cnt > 0) {exit 0} else {exit 1}}''] {'logoutput': True, 'path': ['/usr/sbin:/sbin:/usr/local/bin:/bin:/usr/bin'], 'tries': 3, 'user': 'kafka', 'try_sleep': 5}
-bash: /usr/hdp/current/kafka-broker/bin/kafka-topics.sh: No such file or directory
2017-12-20 18:42:56,900 - Retrying after 5 seconds. Reason: Execution of '/usr/hdp/current/kafka-broker/bin/kafka-topics.sh       --zookeeper metronc-1.openstacklocal:2181,metronc-11.openstacklocal:2181,metronc-10.openstacklocal:2181       --list |       awk 'BEGIN {cnt=0;} /bro/ {cnt++} END {if (cnt > 0) {exit 0} else {exit 1}}'' returned 1. -bash: /usr/hdp/current/kafka-broker/bin/kafka-topics.sh: No such file or directory
-bash: /usr/hdp/current/kafka-broker/bin/kafka-topics.sh: No such file or directory
2017-12-20 18:43:01,987 - Retrying after 5 seconds. Reason: Execution of '/usr/hdp/current/kafka-broker/bin/kafka-topics.sh       --zookeeper metronc-1.openstacklocal:2181,metronc-11.openstacklocal:2181,metronc-10.openstacklocal:2181       --list |       awk 'BEGIN {cnt=0;} /bro/ {cnt++} END {if (cnt > 0) {exit 0} else {exit 1}}'' returned 1. -bash: /usr/hdp/current/kafka-broker/bin/kafka-topics.sh: No such file or directory
-bash: /usr/hdp/current/kafka-broker/bin/kafka-topics.sh: No such file or directory

Command failed after 1 tries
<snip>

I noticed that clusterHostInfo does indeed contain a list of the kafka_broker_hosts (see attached clusterHostInfo-12-node.txt). Would it be possible to either a) force Ambari to run the Metron service check on one of the Kafka broker hosts, or b) run check_kafka_topics on a kafka_broker_host?

I am perfectly fine if you think the kafka_broker fix should be a different PR.
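As a rough illustration of the information available for such a fix, the sketch below tests whether the host running the check is listed in kafka_broker_hosts. It is not part of this PR and does not implement either option; `can_check_kafka_topics` and the host names are hypothetical.

```python
def can_check_kafka_topics(cluster_host_info, local_hostname):
    """Return True only if the local host is listed as a Kafka broker."""
    return local_hostname in cluster_host_info.get("kafka_broker_hosts", [])


# Hypothetical cluster where only two nodes run a Kafka broker.
cluster_host_info = {"kafka_broker_hosts": ["node3", "node4"]}

on_broker = can_check_kafka_topics(cluster_host_info, "node3")   # broker host
off_broker = can_check_kafka_topics(cluster_host_info, "node7")  # no broker installed
```

A check like this would only detect the situation. Running kafka-topics.sh on a different host, or steering Ambari's choice of host, would still need a separate mechanism.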

@nickwallen

Thanks for testing, @anandsubbu. I did not try to fix the 'Kafka not installed' issue with the service check. I am not yet sure how to fix that. I focused this PR on just fixing the bad host names.

@anandsubbu

I did another 12-node deployment on CentOS 7 with this PR (bypassing the Kafka issue by installing a Kafka broker on all nodes). The fix worked perfectly. Thanks much, @nickwallen!

+1 (non-binding)

@nickwallen

I appreciate the reviews @ottobackwards and @anandsubbu .

@asfgit asfgit closed this in 196da12 Dec 21, 2017
iraghumitra pushed a commit to iraghumitra/incubator-metron that referenced this pull request Feb 17, 2018
@nickwallen nickwallen deleted the METRON-1348 branch September 17, 2018 19:21