This repository has been archived by the owner on May 12, 2021. It is now read-only.

METRON-671: Refactor existing Ansible deployment to use Ambari MPack #436

Closed
dlyle65535 wants to merge 22 commits

@dlyle65535 commented Feb 2, 2017

Updated Update - I think this is ready for review.

Update - Documentation is pending, but this is pretty much ready to go and could use some eyes on it. The major change is that you'll need Docker (Docker for Mac on Macs) running for the build to complete, because the MPack requires that the RPMs be built.

To test, run Quick Dev or Full Dev or both.

I'll be working on the documentation in the next day or so.

This is the first set of changes to enable Ansible installation to use the Ambari MPack. It currently works (in my environment) with full-dev using sensor-stubs. It will not work with Quick-Dev or EC2 at this time. Update: All environments worked in my testing.

It's (well past - sorry) starting to get large, so I wanted to push it out for feedback while I'm working through issues with the distributed install.

Some points of interest:

  • As discussed, components (ES, Kibana, Metron Topologies) that are managed by Ambari are removed from Monit
  • I added a notion of 'required configurations' to the blueprint and ambari_cluster_state script so blueprint verification would pass.
  • I've refactored our Ansible host groupings (see inventory/full-dev-platform/hosts). The Metron group is now the host that we use as the Metron services master.
  • I cleaned up only the minimum set of issues that @mattf-horton found in METRON-609 that were required to make single-node deployment work with the MPack. It's my intention to work to get his changes incorporated on top of the final version of this changeset.
  • I removed the Solr role from Ansible. My current intention is to open a Jira to add a Solr Indexing component to the MPack.

Immediate next steps:

  • Get Full-Dev working.
  • Get the EC2 distributed install working.
  • Get Quick-Dev with enrichment replacement working.
  • Adjust readmes and docs where required.
  • Document test plan

I also hope to receive and respond to feedback.

@@ -1,35 +0,0 @@
Kibana 4
Contributor

Is this documented anywhere else?

Contributor Author

Is what documented? I just removed the role as it's deprecated with the MPack.

Contributor

The file I commented on was the readme that documented how to add or modify the Kibana page template.

Contributor Author

Ah, it is, but not well. I'll make sure I get it as part of this effort.

@@ -15,4 +15,4 @@
# limitations under the License.
#
---
Contributor

Is there a reason not to go to the latest ( _121 ) if we are going to update?

Contributor Author

I used Ambari's default so I don't have to worry about introducing any incompatibility.

@ottobackwards
Contributor

How are you running the playbook? I cannot get it to execute

@dlyle65535
Contributor Author

I am, for both full dev and ec2. What are you trying to do?

@ottobackwards
Contributor

ottobackwards commented Mar 2, 2017

I want to run the playbook to just build the RPMs, not deploy. So just metron_build.
I need to test the POM and a new task, then test RPM build changes.

@dlyle65535
Contributor Author

Yes, that works on my rig. Can you tell me a bit more about what you're experiencing?

@ottobackwards
Contributor

ottobackwards commented Mar 2, 2017

Can you share your command line? Do you run it from /playbooks? Do you -i an inventory? Did you copy or create an ansible.cfg?

@dlyle65535
Contributor Author

Sure. It's vagrant up or ./run.sh. :) I run it as part of running full dev or ec2.

What error are you getting?

Btw, to only build the RPMs, you may still be able to do an mvn package -DskipTests -Pbuild-rpms from maven-deploy.

@ottobackwards
Contributor

I need to run the playbook though.
It doesn't match any hosts

@ottobackwards
Contributor

ansible-playbook -v playbooks/metron_build.yml
Using /Users/ottofowler/src/apache/forks/incubator-metron/metron-deployment/ansible.cfg as config file
[WARNING]: provided hosts list is empty, only localhost is available


< PLAY >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

skipping: no hosts matched


< PLAY RECAP >

    \   ^__^
     \  (oo)\_______
        (__)\       )\/\
            ||----w |
            ||     ||

@ottobackwards
Contributor

Do I have to create an inventory with my local machine name?

@dlyle65535
Contributor Author

Yeah, that seems like a good thing to try.

@ottobackwards
Contributor

ansible-playbook -v -i "localhost," -c local playbooks/metron_build.yml

now it is running and i'll see about the errors
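
For readers wondering why the command above works: a value passed to -i that contains a comma is treated by Ansible as a list of hosts rather than a path to an inventory file, and -c local runs the tasks on the current machine instead of over SSH. A minimal Python sketch of a wrapper that builds the same argument list follows; the function name local_playbook_command is hypothetical and nothing like it exists in this PR.

```python
def local_playbook_command(playbook, verbose=True):
    """Build the argv for running a playbook against the local machine only.

    Hypothetical helper for illustration; it only assembles the arguments
    shown in the comment above and does not run anything.
    """
    cmd = ["ansible-playbook"]
    if verbose:
        cmd.append("-v")
    # The trailing comma makes Ansible read the value as a host list
    # instead of a path to an inventory file.
    cmd += ["-i", "localhost,"]
    # "-c local" executes tasks directly on this machine, not over SSH.
    cmd += ["-c", "local", playbook]
    return cmd


# Assemble the command for the build playbook discussed in this thread.
build_cmd = local_playbook_command("playbooks/metron_build.yml")
print(" ".join(build_cmd))
```

The resulting list could be handed to subprocess.run from the metron-deployment directory, assuming ansible-playbook is on the PATH.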

@ottobackwards
Contributor

Maybe we should add that line to the doc or create a script??

@dlyle65535
Contributor Author

No. It's not meant for that. If you want to build standalone, Maven is the supported way.

@ottobackwards
Contributor

ok - when you review what I'm doing in the playbook/role we can talk about alternatives

@dlyle65535
Contributor Author

Sounds good. The important thing to remember is the overall goal of reducing reliance on Ansible. I had to add a build task because installation will fail without it, but I fully expect that task to disappear sooner rather than later, so adding additional dependencies on it will need to be scrutinized carefully.

@ottobackwards
Contributor

ottobackwards commented Mar 2, 2017

Well, if we don't want to put deployment things with the src, then Ansible is a more flexible and easier-to-use tool for certain tasks too. But we will talk about it if I get it working.

@ottobackwards
Contributor

ottobackwards commented Mar 4, 2017

@dlyle65535, after fetching and merging your latest, I can no longer vagrant up full_dev

==> node1: Adding box 'new_base' (v0) for provider: virtualbox
node1: Downloading: /Users/dml/projects/metron-dlyle/metron-deployment/packer-build/builds/base-centos-6.7-2.1.20170303223924.git.33abe8cf13c347a2dfdece145a7b8c17f2a423c0_dirty.virtualbox.box
An error occurred while downloading the remote file. The error
message, if any, is reproduced below. Please fix this error and try
again.

Couldn't open file /Users/dml/projects/metron-dlyle/metron-deployment/packer-build/builds/base-centos-6.7-2.1.20170303223924.git.33abe8cf13c347a2dfdece145a7b8c17f2a423c0_dirty.virtualbox.box
➜ full-dev-platform git:(METRON-258-RPM)

@nickwallen
Contributor

I have been able to launch "Quick Dev" with deployment report. Thanks for the fix @dlyle65535

I have been fighting a bit with the AWS deployment. I ran into two issues.

(1) On one pass, the setup of Ambari seemed to fail, but the deployment continued, which caused it to fail later on in the deployment. To fix this, I manually logged into the host and ran the Ambari setup, then re-ran the deployment, which addressed the problem.

I am almost certain that I have seen this prior to the work in this PR.

$ ./run.sh
...

TASK [ambari_master : Setup ambari server] *************************************
...

"Successfully downloaded JDK distribution to /var/lib/ambari-server/resources/jdk-8u77-linux-x64.tar.gz", "Installing JDK to /usr/jdk64/", "Successfully installed JDK to /usr/jdk64/", "Downloading JCE Policy archive from http://public-repo-1.hortonworks.com/ARTIFACTS/jce_policy-8.zip to /var/lib/ambari-server/resources/jce_policy-8.zip", "", "Successfully downloaded JCE Policy archive to /var/lib/ambari-server/resources/jce_policy-8.zip", "Installing JCE policy...", "Completing setup...", "Configuring database...", "Enter advanced database configuration [y/n] (n)? ", "Configuring database...", "Default properties detected. Using built-in database.", "Configuring ambari database...", "Checking PostgreSQL...", "Running initdb: This may take up to a minute.", "Initializing database: [  OK  ]", "", "About to start PostgreSQL", "Configuring local database...", "Connecting to local database...connection timed out...retrying (1)", "Connecting to local database...connection timed out...retrying (2)", "Connecting to local database...unable to connect to database", "ERROR: could not change directory to \"/home/centos\"", "psql: FATAL:  the database system is starting up", "", "ERROR: Exiting with exit code 2. ", "REASON: Running database init script failed. Exiting."], "warnings": []}

$ ./run.sh
...

TASK [ambari_config : check if ambari-server is up on ec2-52-37-229-181.us-west-2.compute.amazonaws.com:8080] ***
fatal: [ec2-52-37-229-181.us-west-2.compute.amazonaws.com]: FAILED! => {"changed": false, "elapsed": 300, "failed": true, "msg": "Timeout when waiting for ec2-52-37-229-181.us-west-2.compute.amazonaws.com:8080"}
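
The second failure above is a port check that gave up after 300 seconds because nothing was listening on 8080. As a rough illustration of what such a check does, here is a minimal Python sketch. It is not the Ansible implementation, and the name wait_for_port and its parameters are hypothetical.

```python
import socket
import time


def wait_for_port(host, port, timeout=300.0, interval=1.0):
    """Return True once a TCP connection to host:port succeeds.

    Returns False if no connection succeeds before `timeout` seconds
    have elapsed. Hypothetical helper, for illustration only.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # A successful connect means something is listening on the port.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            # Refused or timed out: give up if the deadline has passed,
            # otherwise pause briefly and try again.
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)
```

A check like this only reports that the server never came up. It cannot say why, which is why the earlier setup failure is the more useful place to stop.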

(2) The second issue was more unexpected. On all but one of the 10 AWS nodes, the deployment went smoothly. At some point during the deployment, Ansible could not talk to one node, but it continued on anyway. After the 9 were finished, Ambari showed all 10 nodes, but it showed that one node in yellow, indicating that it could not get a heartbeat.

After Ansible was done with the 9 nodes, it then seemed to almost start over on the last node. It went and rebuilt the source code, pushed out the RPMs, reinstalled the MPack, etc. That really confused the cluster and it has not processed any data.

I'm sure a little manual effort could fix up the cluster, but the behavior of Ansible was weird. Before, when I've worked with the AWS deployment, it would fail if any one node failed. Now it seems to retry failed nodes at a later point in time, which has some negative implications when we expect actions like the build, MPack install, etc. to occur only once.

Not sure what to make of this issue.

@dlyle65535
Contributor Author

Travis failure. Upsetting. This came in with my latest master merge. It doesn't fail locally with the Travis command line. Log below. It's in unrelated code. Any ideas?


T E S T S

Running org.apache.metron.common.stellar.maas.StellarMaaSIntegrationTest
Tests run: 5, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 6.359 sec - in org.apache.metron.common.stellar.maas.StellarMaaSIntegrationTest
Running org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest
2017-03-07 21:54:03 ERROR ServiceDiscoverer:171 - instance must be started before calling this method
java.lang.IllegalStateException: instance must be started before calling this method
at com.google.common.base.Preconditions.checkState(Preconditions.java:176)
at org.apache.curator.framework.imps.CuratorFrameworkImpl.getChildren(CuratorFrameworkImpl.java:379)
at org.apache.curator.x.discovery.details.ServiceDiscoveryImpl.queryForNames(ServiceDiscoveryImpl.java:276)
at org.apache.metron.maas.discovery.ServiceDiscoverer.updateState(ServiceDiscoverer.java:130)
at org.apache.metron.maas.discovery.ServiceDiscoverer.lambda$new$0(ServiceDiscoverer.java:94)
at org.apache.metron.maas.discovery.ServiceDiscoverer$$Lambda$4/1527152775.childEvent(Unknown Source)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:685)
at org.apache.curator.framework.recipes.cache.TreeCache$2.apply(TreeCache.java:679)
at org.apache.curator.framework.listen.ListenerContainer$1.run(ListenerContainer.java:92)
at com.google.common.util.concurrent.MoreExecutors$SameThreadExecutorService.execute(MoreExecutors.java:297)
at org.apache.curator.framework.listen.ListenerContainer.forEach(ListenerContainer.java:84)
at org.apache.curator.framework.recipes.cache.TreeCache.callListeners(TreeCache.java:678)
at org.apache.curator.framework.recipes.cache.TreeCache.access$1400(TreeCache.java:69)
at org.apache.curator.framework.recipes.cache.TreeCache$4.run(TreeCache.java:790)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Formatting using clusterid: testClusterID
Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 1.249 sec <<< FAILURE! - in org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest
org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest Time elapsed: 1.249 sec <<< ERROR!
java.lang.NoSuchMethodError: org.apache.hadoop.security.authentication.server.AuthenticationFilter.constructSecretProvider(Ljavax/servlet/ServletContext;Ljava/util/Properties;Z)Lorg/apache/hadoop/security/authentication/util/SignerSecretProvider;
at org.apache.hadoop.http.HttpServer2.constructSecretProvider(HttpServer2.java:447)
at org.apache.hadoop.http.HttpServer2.<init>(HttpServer2.java:340)
at org.apache.hadoop.http.HttpServer2.<init>(HttpServer2.java:114)
at org.apache.hadoop.http.HttpServer2$Builder.build(HttpServer2.java:290)
at org.apache.hadoop.hdfs.server.namenode.NameNodeHttpServer.start(NameNodeHttpServer.java:126)
at org.apache.hadoop.hdfs.server.namenode.NameNode.startHttpServer(NameNode.java:752)
at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:638)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:811)
at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:795)
at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1488)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNode(MiniDFSCluster.java:1111)
at org.apache.hadoop.hdfs.MiniDFSCluster.createNameNodesAndSetConf(MiniDFSCluster.java:982)
at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:811)
at org.apache.hadoop.hdfs.MiniDFSCluster.<init>(MiniDFSCluster.java:471)
at org.apache.hadoop.hdfs.MiniDFSCluster$Builder.build(MiniDFSCluster.java:430)
at org.apache.metron.integration.components.MRComponent.start(MRComponent.java:58)
at org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest.setup(ClasspathFunctionResolverIntegrationTest.java:49)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:24)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest Time elapsed: 1.249 sec <<< ERROR!
java.lang.NullPointerException
at org.apache.metron.integration.components.MRComponent.stop(MRComponent.java:66)
at org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest.teardown(ClasspathFunctionResolverIntegrationTest.java:63)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:483)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:50)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:47)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:33)
at org.junit.runners.ParentRunner.run(ParentRunner.java:363)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:283)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:173)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:153)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:128)
at org.apache.maven.surefire.booter.ForkedBooter.invokeProviderInSameClassLoader(ForkedBooter.java:203)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:155)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:103)
Running org.apache.metron.common.cli.ConfigurationManagerIntegrationTest
Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.251 sec - in org.apache.metron.common.cli.ConfigurationManagerIntegrationTest
Results :
Tests in error:
org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest.org.apache.metron.common.dsl.functions.resolver.ClasspathFunctionResolverIntegrationTest
Run 1: ClasspathFunctionResolverIntegrationTest.setup:49 » NoSuchMethod org.apache.ha...
Run 2: ClasspathFunctionResolverIntegrationTest.teardown:63 » NullPointer
Tests run: 8, Failures: 0, Errors: 1, Skipped: 0

@dlyle65535
Contributor Author

@nickwallen - I had run EC2 testing a bunch and it worked at least as well after these changes as it did prior (sometimes AWS zigs when it should zag). But I have made quite a few changes since my last EC2 run. I'll spin it up and see if I get it too.

@dlyle65535
Contributor Author

@nickwallen - For your first issue, I think we're hitting a transient issue with Ambari where ambari-setup -s completes successfully but Ambari won't actually start. I'll see if I can get some better diagnosis.
On the second, did Ambari report that the install succeeded? Or did it fail and I didn't catch it?

In other news- good news, bad news. Good news: I am able to replicate the integration test failure by running them in my local environment. Bad news: it's not in the code I touched and I'm completely flummoxed. Help would be much appreciated. Once I can get past these EC2 issues, I can diff master, but like I said. Help? Appreciated.

@dlyle65535
Contributor Author

Good news, I think I found the issue with the failing tests. The dependencies that Maven reported as "duplicated" weren't. I've replaced them. Travis will tell.

@nickwallen - I did see the error you're talking about in your first point above. I think your memory is correct, it's one of those transients that we see sometimes. There's not much that can be done, but I am testing a patch that looks for "FATAL" in the ambari-setup stdout so at least we'll fail where the problem occurs.

@dlyle65535
Contributor Author

@nickwallen - I pushed up a changeset that will address both your points. For 1, I added a test to ambari-setup to fail if FATAL appears in stdout. Making sure the EC2 build doesn't run the quick_dev role addresses the second.
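
The changeset itself is not shown in this thread, so the following is only a sketch of the idea behind the FATAL check, written in Python under the assumption that the setup step's stdout is available as a list of lines. The function name setup_output_has_fatal is hypothetical. The sample lines are copied from the failed run quoted earlier in the conversation.

```python
def setup_output_has_fatal(stdout_lines, marker="FATAL"):
    """Return True if any captured stdout line contains the marker."""
    return any(marker in line for line in stdout_lines)


# Lines taken from the failed Ambari setup output quoted earlier.
sample = [
    "Connecting to local database...unable to connect to database",
    "psql: FATAL:  the database system is starting up",
    "ERROR: Exiting with exit code 2. ",
]

if setup_output_has_fatal(sample):
    # Failing here stops the deployment where the problem occurs,
    # instead of at a later port-wait timeout.
    print("Ambari setup reported FATAL; stopping the deployment here")
```

A plain substring match is deliberately crude. It trades some risk of false positives for failing at the step where the problem actually happened.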

@dlyle65535
Contributor Author

@justinleet - that last commit adds the quotation requirement to the tool tip.

@justinleet
Contributor

@dlyle65535 METRON-745 is in (as I'm sure you can tell from the conflict list). I already incorporated the Kibana map changes, so you should just be able to accept master's version.

@dlyle65535
Contributor Author

@justinleet - Accepted. Thanks.

@mmiklavc
Contributor

mmiklavc commented Mar 9, 2017

@dlyle65535 fyi I'm deploying to ec2 right now. I'll update shortly.

@mmiklavc
Contributor

mmiklavc commented Mar 9, 2017

@dlyle65535 is the test for EC2 to just verify everything spins up as normal? Any additional specific items to test or smoke test?

@dlyle65535
Contributor Author

@mmiklavc - Exactly. We wanted to confirm that what @nickwallen was seeing were environmental issues.

@mmiklavc
Contributor

mmiklavc commented Mar 9, 2017

Looks like everything came up correctly for me in AWS. So +1 to that part of it.

@dlyle65535
Contributor Author

Thanks @mmiklavc!

@nickwallen
Contributor

That's great, @mmiklavc. I am also a +1. I was able to test this successfully on Quick Dev and Full Dev.

@dlyle65535
Contributor Author

Thanks for all the help! I intend to merge this in tomorrow afternoon. @ottobackwards, @justinleet, I wanted to make sure you two were good. I also want to make sure there's no other feedback I've missed.

@ottobackwards
Contributor

I am +1. I have been working downstream and building off of this for a bit, and have been able to get quick and full up with everything started. I have some things that I would like to improve on, but like @dlyle65535 says, this is step 2 of many.

@justinleet
Contributor

I'm +1. I was just waiting for the EC2 component, but was able to get quick-dev, etc. spun up without issue.

@asfgit asfgit closed this in 68a334a Mar 10, 2017