Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible catch-22 problem when using automatic deployment #182

Closed
vegardx opened this issue Jul 25, 2016 · 26 comments

Comments

Projects
None yet
9 participants
@vegardx
Copy link

commented Jul 25, 2016

We're deploying search-guard-ssl (2.3.4.14) and search-guard-2 (2.3.4.3) to a few elasticsearch clusters (2.3.4), and stumbled upon a strange issue that I cannot see mentioned in the documentation or that anyone else has seem to be having issues with.

I haven't been able to completely verify that this is the problem, but it seems like if you install and configure search-guard-ssl and search-guard-2 at the same time you end up in a catch-22 problem where the cluster seems to be correctly configured, but when you try to initialize it the searchguard index is not replicated across the cluster. They are clearly communicating with each other, as the other nodes are notified about what configuration it should have loaded, see below.

However, if you uninstall search-guard-2 and restart the cluster, then let puppet or whatever reinstall the plugin again, the cluster becomes operational and the index is replicated across the entire cluster.

This was a real gotcha, as the error messages we got back was Search Guard not initialized (SG11) and Not yet initialized, which sort of makes sense, but all seemingly similar issues was solved by running the sgadmin tool, which we could successfully run, it was just not synced across the cluster.

The solution is naturally simple, just let search-guard-ssl be installed first, wait for the cluster to return to a stable state, then install search-guard-2. But I seems like the kind of things that could be solved in the plugins themselves.

Message you get on other nodes when applying configuration using sgadmin:
[2016-07-25 07:39:03,769][ERROR][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [test-apps-vpc-elasticsearch03c-node] Unable to load all configurations types. Loaded '[]' but should '[config, roles, rolesmapping, internalusers, actiongroups]'

@brdaugherty

This comment has been minimized.

Copy link

commented Jul 26, 2016

I believe I've may be seeing this as well on es/sg v2.3.3 using elastic/puppet-elasticsearch to install both SG-SSL and SG2. I've taken to re-running puppet a second time after a bit of a delay to get everything working. The uninitialized failure seems to only happen when the searchguard index is created the first time. Later restarts or changes to elasticsearch or searchguard plugins seem to always do the right thing. Maybe not relevant but nodes are in AWS and transport/node discovery on port 9300 is handled by the elasticsearch-cloud-aws plugin.

@ewolinetz

This comment has been minimized.

Copy link
Contributor

commented Jul 26, 2016

The documentation [1] actually states that searchguard-ssl should be running before searchguard-2 is installed:

"First you need to install and configure Search Guard SSL on all nodes in your cluster. Please note that while most of the configuration settings for Search Guard can be changed at runtime, configuration changes for the SSL/TLS layer require the restart of all participating nodes. So please make sure that Search Guard SSL is running properly before installing Search Guard, and pay attention to the different types of certificates you need for server- and client nodes."

[1] https://github.com/floragunncom/search-guard-docs/blob/master/installation.md

@brdaugherty

This comment has been minimized.

Copy link

commented Jul 26, 2016

Thanks for the explaination.

Would it be possible to narrow down which Search Guard SSL es index, file, etc that SG2 requires from SG SSL to successfully startup ?

This would make it possible to define a dependency in configuration management tools like Chef and Puppet... enabling them to start SG SSL, wait for it to come fully up and then start SG.

@vegardx

This comment has been minimized.

Copy link
Author

commented Jul 27, 2016

There is no reliable way to install both search-guard-ssl and search-guard-2 without adding a lot of logic around the deployment process to avoid running the commands on every run. Wouldn't it be better that search-guard-2 before fully initializing checked wether or not search-guard-ssl was actually running and configured, seeing as it's a required dependency? It makes very little sense to load search-guard-2 if that is not the case. Even more so if that blocks search-guard-ssl when done in parallel.

I did see the docs you are referring to, but it's was not entirely clear that you need to install, configure and restart the nodes first. I knew that search-guard-ssl was configured correctly, so I just imagined it would be able to install search-guard-2.

@ewolinetz

This comment has been minimized.

Copy link
Contributor

commented Jul 27, 2016

For what it's worth, for my system I'm installing Elasticsearch, Searchguard SSL, and Searchguard 2 (in that order) before I start up Elasticsearch for the first time and it looks to start up correctly. I then run the sg-admin tool after Elasticsearch has started up and consider it "ready" after that tool has completed successfully.

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Jul 27, 2016

To bring in some clarification here:

  • Do do not need install searchguard-ssl first and then start your nodes or something like this
  • The normal procedure is:
    • Shutdown all nodes if they are running
    • Install searchguard-ssl plugin and search-guard-2 plugin on every node
    • Make you configuration elasticsearch.yml on every node
    • Start up all nodes
    • Wait a minute (just to be sure cluster is established)
    • Run sgadmin.sh against one of the nodes
    • The "dynamic configuration (like roles, users, ...)" are immediately and instantly applied to all nodes without an restart
    • Done

To update the dynamic configuration (like roles, users, ...) just

  • Run sgadmin.sh against one of the nodes. The dynamic configuration is immediately and instantly applied to all nodes without an restart

If you need to update config properties in elasticsearch.yml then you have to make those changes on every node and then you need to restart every node cause elasticsearch.yml in only read once during node startup.

@vegardx Your

Unable to load all configurations types. Loaded '[]' but should '[config, roles, rolesmapping, internalusers, actiongroups]'
is strange, i will look into this

@ewolinetz The documentation you cited [1] is meant like that: If you misconfigure SSL then your cluster could not be established because the nodes can not talk to each other and then you have n clusters with a node count of one and of course then only the node you connect to with sgadmin will be properly initialized. So if you know for sure that you SSL stuff is working there is no need to install search-guard-ssl "first".

@john-bakker

This comment has been minimized.

Copy link

commented Aug 3, 2016

@vegardx did you have an update on this already ? having the same issue

@rocky4bmw

This comment has been minimized.

Copy link

commented Aug 4, 2016

Hi All,
Can anyone help me with the below issue I am facing

I used tarfile which was said to be the easiest way to installtion searchguard which also includes elasticsearch package. Please refer to below link
https://github.com/floragunncom/search-guard/wiki/Search-Guard-Bundle.

After extracting tarfile I have started elasticsearch using the steps as mentioned in link and later ran sgadmin.sh script and evrything was looking great.

Once I have installed Kibana and Logstash as root user , installed beats on client machine . Currently beats are able to push logs to logstash whereas logstash is not able to comminucate to elasticsearch.

When I found reason it says "Search Guard not initialized (SG11)" and searchguard index not healthy (timeout: truewhich is really strange.

Note: I deployed search guard tarfile using non rootuser.I also installed Kibana logastah and Elasticsearch with search guard on same server. beats on client machines.

Can anyone guide me here please as I am going crazy with this searchguard installation?

@brdaugherty

This comment has been minimized.

Copy link

commented Aug 4, 2016

I tried exactly what you did. What worked for me was to comment SG SSL, SG
out of ES configs, restart ES, uncomment SG SSL and restart ES, verify
TLS/HTTP encryption.. then finally uncomment SG, restart ES and finally
re-run sgadmin.sh. I gave up trying to get it working with both SG SSL and
SG plugins enabled and loaded on a new ES cluster. Appears there is some
dependency and possibly race condition between SG SSL / SG plugins and the
sgadmin.sh interface on the first install... at least in my experience.
Others experiences may differ.

Using puppet we try and tear down the entire stack and bring it up from
scratch regularly--new ES, SG SSL, SG, ES templates, indexes, etc.
Hopefully we can uncover what the missing piece or timing delay which
causes this one... it works beautifully apart from this.

On Thu, Aug 4, 2016 at 6:04 AM, rocky4bmw notifications@github.com wrote:

Hi All,
Can anyone help me with the below issue I am facing

I used tarfile which was said to be the easiest way to installtion
searchguard which also includes elasticsearch package. Please refer to
below link
https://github.com/floragunncom/search-guard/wiki/Search-Guard-Bundle.

After extracting tarfile I have started elasticsearch using the steps as
mentioned in link and later ran agadmin.sh script and evrything was looking
great.

Once I have installed Kibana and Logstash as root user , installed beats
on client machine . Currently beats are able to push logs to logstash
whereas logstash is not able t uso comminucate to elasticsearch.

When I found reason it says "Search Guard not initialized (SG11)" which is
really strange.

Note: I deployed search guard tarfile using non rootuser.I also installed
Kibana logastah and Elasticsearch with search guard on same server. beats
on client machines.

Can anyone guide me here please as I am going crazy with this searchguard
installation?


You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#182 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AOPoEzLdkVvVh8DhgXRreS64F7ixj5GGks5qcdVPgaJpZM4JUEuz
.

floragunncom added a commit that referenced this issue Aug 4, 2016

Catch "No Master Exception", This makes sure that Search Guard keeps …
…on trying to initialize itself.

Related to #178 #182
@floragunncom

This comment has been minimized.

Copy link
Owner

commented Aug 4, 2016

Maybe this will fix it (or at least make it more robust): fad8b4d

floragunncom added a commit that referenced this issue Aug 10, 2016

floragunncom added a commit that referenced this issue Aug 10, 2016

floragunncom added a commit that referenced this issue Aug 10, 2016

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Aug 10, 2016

Can you pls test the new 2.3.4.5 release, issue should be fixed with this version.

@Lamabu

This comment has been minimized.

Copy link

commented Aug 26, 2016

I get same problem with Search Guard bundle last version. it gives to me Unable to load all configurations types. Loaded '[]' but should '[config, roles, rolesmapping, internalusers, actiongroups].
It seems that version 2.3.4 for ES gives me the same error. Not even a clue what is all about.
Please help (running Centos 7)

@rocky4bmw

This comment has been minimized.

Copy link

commented Aug 26, 2016

@floragunncom I am using elasticsearch 2.3.4 with searchguard bundle. When I added newusers and roles
after that when I start sgadmin.sh to update roles and users it says cluster timedout. Can you please let us know how can I fix this issue?
Also if I want to change password of admin and other users how can I achieve it?

@john-bakker

This comment has been minimized.

Copy link

commented Aug 26, 2016

I stopped trying in my vagrant environment, switched to my live
environment, there it loaded without problems. I could easily connect
kibana as well, however when trying to connect logstash I was running in
lots of problems again.

I stopped trying to get search-guard up and running, I switched to elastic
shield and got everything running within a day in a multi server cluster
with kibana logstash and ldap all connected.

Op 26 aug. 2016 20:00 schreef "Laura" notifications@github.com:

I get same problem with search gurad bundle last version. it gives to me Unable
to load all configurations types. Loaded '[]' but should '[config, roles,
rolesmapping, internalusers, actiongroups].
Please help


You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
#182 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/ARwD0nYF5oZqGqYAvZCV1xZj-e3r_ZWuks5qjv-RgaJpZM4JUEuz
.

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Aug 26, 2016

@vegardx @ewolinetz @john-bakker @Lamabu @brdaugherty seems we found the issue but we cannot reproduce the issue locally so we kindly ask you if you can test https://oss.sonatype.org/content/repositories/snapshots/com/floragunn/search-guard-2/2.3.5.6-SNAPSHOT/search-guard-2-2.3.5.6-20160826.200736-47.zip and report back if the problem persists or is gone. It was likely a nasty threading issue. Thanks in advance.

@Lamabu

This comment has been minimized.

@Lamabu

This comment has been minimized.

Copy link

commented Aug 29, 2016

Seems that version http://downloads.floragunn.com/elasticsearch-866a7b29-59d7-444c-833c-f10c42d533e9.tar.gz passes the running of sgadmin.sh successfullly

@modax

This comment has been minimized.

Copy link

commented Sep 1, 2016

@vegardx colleague here. @floragunncom I have tried search-guard-2-2.3.5.6-20160831.101820-74.zip (it is master branch, isn't it?) and it still doesn't work, i.e. consistently does not work. So I looked at the code.

First of all, some context. I'm testing with:

  • ES 2.3.5
  • com.floragunn/search-guard-ssl/2.3.5.15
  • com.floragunn/search-guard/ (v2.3.3.2-47-g25142b8 as of writing)
  • Clean data state.

No other plugins or anything.

  • Single node.
  • Single processor.
  • elasticsearch.yaml as follows:
processors: 1
cluster: 
  name: sgtest
network: 
  bind_host: 
    - _local_
    - _eth0_
  publish_host: _eth0_
node: 
  name: first
path: 
  data: /usr/share/elasticsearch/data/node
  logs: /var/log/elasticsearch/node
searchguard: 
  authcz: 
    admin_dn: 
            - CN=admin
  ssl: 
    http: 
      enabled: true
      keystore_filepath: keystore.jks
      keystore_password: changeit
      truststore_filepath: truststore.jks
      truststore_password: changeit
    transport: 
      enabled: true
      enforce_hostname_verification: false
      keystore_filepath: keystore.jks
      keystore_password: changeit
      truststore_filepath: truststore.jks
      truststore_password: changeit
  • Pay attention to "processors: 1" is there. This is my default case but your test machines are probably multicore so this setting emulates my case.
  • Start ES.
  • Run sgadmin:
Contacting elasticsearch cluster 'sgtest' and wait for YELLOW clusterstate ...
Clustername: sgtest
Clusterstate: GREEN
Number of nodes: 1
Number of data nodes: 1
searchguard index does not exists, attempt to create it ... done
Populate config from /etc/elasticsearch/node/searchguard/
Will update 'config' with /etc/elasticsearch/node/searchguard/sg_config.yml
   SUCC: Configuration for 'config' created or updated
Will update 'roles' with /etc/elasticsearch/node/searchguard/sg_roles.yml
   SUCC: Configuration for 'roles' created or updated
Will update 'rolesmapping' with /etc/elasticsearch/node/searchguard/sg_roles_mapping.yml
   SUCC: Configuration for 'rolesmapping' created or updated
Will update 'internalusers' with /etc/elasticsearch/node/searchguard/sg_internal_users.yml
   SUCC: Configuration for 'internalusers' created or updated
Will update 'actiongroups' with /etc/elasticsearch/node/searchguard/sg_action_groups.yml
   SUCC: Configuration for 'actiongroups' created or updated
FAIL: Expected 5 config types for node O8lLSUZETm6IyWjPU2KoAA but got only []
Done with failures

sgadmin hangs just there until it times out later. Node stays in not initialized state.

If I set "processors: 2" and restart with clean state, sgadmin completes with success. This is because you have:

clusterService.addLifecycleListener(new LifecycleListener() {

            @Override
            public void afterStart() {
threadPool.executor(ThreadPool.Names.GET).execute(new Runnable() {

in TransportConfigUpdateAction class while "get" thread pool is defined as follows in ES ThreadPool class:

add(defaultExecutorTypeSettings, new ExecutorSettingsBuilder(Names.GET).size(availableProcessors).queueSize(1000));

So maximum one GET thread is available, that is already used by TransportConfigUpdateAction initialization code above waiting forever for multiGet in ConfigurationLoader.load() (which is async). MultiGet has no chance of completing because there is no thread available (it is in queue) for it to happen. So we have a guaranteed deadlock here.

Stacktrace bits mentioning com.floragunn (sgadmin is in progress on the node)

"elasticsearch[admessg-admin-vpc-elasticsearch01a-node][management][T#2]" #43 daemon prio=5 os_prio=0 tid=0x00007f7124150000 nid=0x3925 waiting on condition [0x00007f710c71d000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000008ccd3af8> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)                         
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)                                                                                   
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)                      
        at com.floragunn.searchguard.configuration.ConfigurationLoader.load(ConfigurationLoader.java:180) 
        at com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction.nodeOperation(TransportConfigUpdateAction.java:200)                                                                            
        at com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction.nodeOperation(TransportConfigUpdateAction.java:55)
        at org.elasticsearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:92) 
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:228)
        at org.elasticsearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:224)                                                                          
        at com.floragunn.searchguard.ssl.transport.SearchGuardSSLTransportService.messageReceivedDecorate(SearchGuardSSLTransportService.java:165)
        at com.floragunn.searchguard.transport.SearchGuardTransportService.messageReceivedDecorate(SearchGuardTransportService.java:190)
        at com.floragunn.searchguard.ssl.transport.SearchGuardSSLTransportService$Interceptor.messageReceived(SearchGuardSSLTransportService.java:104)
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:75)                                                                                                
        at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:376)                
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)                
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)                
        at java.lang.Thread.run(Thread.java:745)                                                          

"elasticsearch[admessg-admin-vpc-elasticsearch01a-node][get][T#1]" #26 daemon prio=5 os_prio=0 tid=0x00007f7141217000 nid=0x3913 waiting on condition [0x00007f710e34c000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x000000008b0e5e08> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)                                                                                   
        at java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at com.floragunn.searchguard.configuration.ConfigurationLoader.load(ConfigurationLoader.java:180) 
        at com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction$1$1.run(TransportConfigUpdateAction.java:111)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)                
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

When stopping node, the following appears in the logs (as multiget gets its chance but it is too late):

[2016-09-01 11:16:55,309][INFO ][node                     ] [sgtest] started
[2016-09-01 11:16:55,337][INFO ][gateway                  ] [sgtest] recovered [0] indices into cluster_state
[2016-09-01 11:17:18,789][INFO ][cluster.metadata         ] [sgtest] [searchguard] creating index, cause [api], templates [], shards [1]/[0], mappings []
[2016-09-01 11:17:19,016][DEBUG][com.floragunn.searchguard.configuration.SearchGuardIndexSearcherWrapperModule] FLS/DLS not enabled
[2016-09-01 11:17:19,712][INFO ][cluster.routing.allocation] [sgtest] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[searchguard][0]] ...]).
[2016-09-01 11:17:19,795][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] searchguard index exists
[2016-09-01 11:17:20,346][INFO ][cluster.metadata         ] [sgtest] [searchguard] create_mapping [config]
[2016-09-01 11:17:21,054][INFO ][cluster.metadata         ] [sgtest] [searchguard] create_mapping [roles]
[2016-09-01 11:17:21,221][INFO ][cluster.metadata         ] [sgtest] [searchguard] create_mapping [rolesmapping]
[2016-09-01 11:17:21,397][INFO ][cluster.metadata         ] [sgtest] [searchguard] create_mapping [internalusers]
[2016-09-01 11:17:21,614][INFO ][cluster.metadata         ] [sgtest] [searchguard] create_mapping [actiongroups]
[2016-09-01 11:17:21,802][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] searchguard index exists
[2016-09-01 11:19:19,861][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to null (null means timeout)
[2016-09-01 11:19:19,862][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to timeout
[2016-09-01 11:19:21,802][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to null (null means timeout)
[2016-09-01 11:19:21,802][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to timeout
[2016-09-01 11:21:19,862][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (2 object) due to null (null means timeout)
[2016-09-01 11:21:19,862][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (2 object) due to timeout
[2016-09-01 11:21:21,803][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (2 object) due to null (null means timeout)
[2016-09-01 11:21:21,803][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (2 object) due to timeout
[2016-09-01 11:23:19,862][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (3 object) due to null (null means timeout)
[2016-09-01 11:23:19,862][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (3 object) due to timeout
[2016-09-01 11:23:21,803][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (3 object) due to null (null means timeout)
[2016-09-01 11:23:21,803][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (3 object) due to timeout
[2016-09-01 11:25:19,863][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (4 object) due to null (null means timeout)
[2016-09-01 11:25:19,863][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (4 object) due to timeout
[2016-09-01 11:25:21,803][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (4 object) due to null (null means timeout)
[2016-09-01 11:25:21,803][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (4 object) due to timeout
[2016-09-01 11:27:19,863][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (5 object) due to null (null means timeout)
[2016-09-01 11:27:19,863][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (5 object) due to timeout
[2016-09-01 11:27:20,864][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] searchguard index exists
[2016-09-01 11:27:21,804][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (5 object) due to null (null means timeout)
[2016-09-01 11:27:21,804][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (5 object) due to timeout
[2016-09-01 11:27:21,804][ERROR][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [sgtest] Unable to load all configurations types. Loaded '[]' but should '[config, roles, rolesmapping, internalusers, actiongroups]' 
[2016-09-01 11:27:21,804][DEBUG][com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction] [sgtest] Retrieved config due to config update request and will now update config change listeners
[2016-09-01 11:27:23,902][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-01 11:27:28,567][ERROR][com.floragunn.searchguard.auth.BackendRegistry] Not yet initialized
[2016-09-01 11:29:20,865][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to null (null means timeout)
[2016-09-01 11:29:20,865][WARN ][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (first object) due to timeout
[2016-09-01 11:30:22,067][INFO ][node                     ] [sgtest] stopping ...
[2016-09-01 11:30:22,094][DEBUG][action.get               ] [sgtest] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@334dd681]
TransportException[transport stopped, action: indices:data/read/mget[shard][s]]
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-09-01 11:30:22,125][DEBUG][action.get               ] [sgtest] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@2d799de7]
TransportException[transport stopped, action: indices:data/read/mget[shard][s]]
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-09-01 11:30:22,126][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (2 object) due to failure config transport stopped, action: indices:data/read/mget[shard][s] (null means timeout)
[2016-09-01 11:30:22,126][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (3 object) due to failure roles transport stopped, action: indices:data/read/mget[shard][s] (null means timeout)
[2016-09-01 11:30:22,126][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (4 object) due to failure rolesmapping transport stopped, action: indices:data/read/mget[shard][s] (null means timeout)
[2016-09-01 11:30:22,126][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Cannot retrieve configuration (5 object) due to failure internalusers transport stopped, action: indices:data/read/mget[shard][s] (null means timeout)
[2016-09-01 11:30:22,112][DEBUG][action.get               ] [sgtest] null: failed to execute [org.elasticsearch.action.get.MultiGetShardRequest@2113b6b2]
TransportException[transport stopped, action: indices:data/read/mget[shard][s]]
        at org.elasticsearch.transport.TransportService$2.run(TransportService.java:206)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
[2016-09-01 11:30:22,153][INFO ][node                     ] [sgtest] stopped
[2016-09-01 11:30:22,153][INFO ][node                     ] [sgtest] closing ...
[2016-09-01 11:30:23,127][DEBUG][com.floragunn.searchguard.configuration.ConfigurationLoader] Unexpected exception while checking if searchguard index exists: EsRejectedExecutionException[rejected execution of org.elasticsearch.action.support.ThreadedActionListener$1@2967ef28 on EsThreadPoolExecutor[listener, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@13eaf203[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 12]]]
[2016-09-01 11:30:23,129][DEBUG][com.floragunn.searchguard.transport.SearchGuardTransportService] [sgtest] failed to notify response handler on rejection
EsRejectedExecutionException[rejected execution of org.elasticsearch.transport.TransportService$3@7bd232d9 on EsThreadPoolExecutor[generic, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@3fecad4a[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 121]]]
        at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:50)
        at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
        at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1369)
        at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:85)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:341)
        at com.floragunn.searchguard.transport.SearchGuardTransportService.sendRequest(SearchGuardTransportService.java:87)
        at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:299)
        at com.floragunn.searchguard.transport.SearchGuardTransportService.sendRequest(SearchGuardTransportService.java:79)
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.perform(TransportSingleShardAction.java:221)
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction$AsyncSingleAction.start(TransportSingleShardAction.java:178)
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:87)
        at org.elasticsearch.action.support.single.shard.TransportSingleShardAction.doExecute(TransportSingleShardAction.java:51)
        at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:172)
        at com.floragunn.searchguard.filter.SearchGuardFilter.apply(SearchGuardFilter.java:111)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:144)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
        at org.elasticsearch.action.get.TransportMultiGetAction.doExecute(TransportMultiGetAction.java:97)
        at org.elasticsearch.action.get.TransportMultiGetAction.doExecute(TransportMultiGetAction.java:42)
        at org.elasticsearch.action.support.TransportAction.doExecute(TransportAction.java:149)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:172)
        at com.floragunn.searchguard.filter.SearchGuardFilter.apply(SearchGuardFilter.java:111)
        at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:144)
        at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:85)
        at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:58)
        at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:359)
        at org.elasticsearch.client.support.AbstractClient.multiGet(AbstractClient.java:567)
        at com.floragunn.searchguard.configuration.ConfigurationLoader.load(ConfigurationLoader.java:113)
        at com.floragunn.searchguard.action.configupdate.TransportConfigUpdateAction$1$1.run(TransportConfigUpdateAction.java:120)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

I'm not sure but probably it is makes sense to use MANAGEMENT thread here, because it is the only one which does not depend on available processors:

       add(defaultExecutorTypeSettings, new ExecutorSettingsBuilder(Names.MANAGEMENT).size(5).keepAlive("5m"));

So lets do that, rebuild and test. Now it works with "processors: 1".

But this still does not explain why it sometimes works and sometimes does not for other people (the lockup I described above should be pretty much 100% reproducible with "processors: 1") . I used to be able to reproduce the issue even with MANAGEMENT thread fix quite reliably but now I does not happen (maybe ES used to run out of MANAGEMENT threads even). I don't know why but I will report back if I notice anything.

Could release a new version with a fix ASAP? So more testing could continue.

Just one more remark. I really do not quite like the code in ConfigurationLoader.load() which is basically a critical method (i.e. nothing works until it succeeds). You attempt to synchronize asynchonization (of multiGet) with BlockingQueue which has a huge potential of deadlocks as demonstrated above. Add that its usage on startup is looping forever until load() succeeds is simply a disaster to happen and you never know what may hang for a long time (e.g. due to unsatisfied quorum or something else) if not forever. What is more, I'm also not sure how long running ES thread jobs are supposed to be. I believe ES excepts them to be quick since thead sizes are small and queues are big. I suggest you rewrite the code with callbacks in a fully async manner rather than use a BlockingQueue. The code will be uglier but it will be more bullet proof. Even if you stay with current code, it seems like timeouts of 2 minutes for each config (10 minutes in total) are way too big.

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 1, 2016

@modax thanks for this great analysis !

The snapshot v2.3.3.2-47-g25142b8 was not build from master, it was build from a local branch which already contained some of your suggestions but you analysis helped to figure out another one. I created a new branch for this: https://github.com/floragunncom/search-guard/commits/issue-182 Can you build it yourself or should i publish a snapshot build?

Your right regarding the ConfigurationLoader.load() code, we will revise it and will commit it to the branch mentioned above. Would be great if you can revise it and/or help with a PR

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 1, 2016

@modax If you like we can have a Gitter

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 3, 2016

related #199

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 3, 2016

related #203

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 3, 2016

related #178

floragunncom added a commit that referenced this issue Sep 6, 2016

floragunncom added a commit that referenced this issue Sep 15, 2016

Merge branch 'master' into es-2.3.5
* master:
  Make org.ldaptive safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make org.ldaptive.SearchEntry safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make tribe nodes working (fix #47) add replica autoexpand to sgadmin add reload option to sgadmin
  Set index.auto_expand_replicas="0-all"
  fix concurrent modification exception
  active debug log for tests
  build all branches in circleci
  Fix #182
  Fix threading issue regarding initialization
  Merge pull request #197 from ewolinetz/configurable_sg_index
  too much logs for travis
  bump version to 6-SNAPSHOT
  Renamed test method to match test reason
  Allow sg admin actions via REST if client presents a client certificate and DN of certificate matches the configured admin DN's
  Catch EsRejectedExecutionException and logb them just on debug level. Do not rethrow interrupted exception.
  Make check for sg index async and improved logging
  Fix error handling and improved logging
  Invalidate cache only if config update succeeded
  Fix typo in log statement
  Updated vagrant demo to 2.3.5.5

floragunncom added a commit that referenced this issue Sep 15, 2016

Merge branch 'master' into es-2.3.3
* master:
  Make org.ldaptive safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make org.ldaptive.SearchEntry safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make tribe nodes working (fix #47) add replica autoexpand to sgadmin add reload option to sgadmin
  Set index.auto_expand_replicas="0-all"
  fix concurrent modification exception
  active debug log for tests
  build all branches in circleci
  Fix #182
  Fix threading issue regarding initialization
  Merge pull request #197 from ewolinetz/configurable_sg_index
  too much logs for travis
  bump version to 6-SNAPSHOT
  Renamed test method to match test reason
  Allow sg admin actions via REST if client presents a client certificate and DN of certificate matches the configured admin DN's
  Catch EsRejectedExecutionException and logb them just on debug level. Do not rethrow interrupted exception.
  Make check for sg index async and improved logging
  Fix error handling and improved logging
  Invalidate cache only if config update succeeded
  Fix typo in log statement
  Updated vagrant demo to 2.3.5.5

# Conflicts:
#	pom.xml

floragunncom added a commit that referenced this issue Sep 15, 2016

Merge branch 'master' into es-2.3.4
* master:
  Make org.ldaptive safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make org.ldaptive.SearchEntry safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make tribe nodes working (fix #47) add replica autoexpand to sgadmin add reload option to sgadmin
  Set index.auto_expand_replicas="0-all"
  fix concurrent modification exception
  active debug log for tests
  build all branches in circleci
  Fix #182
  Fix threading issue regarding initialization
  Merge pull request #197 from ewolinetz/configurable_sg_index
  too much logs for travis
  bump version to 6-SNAPSHOT
  Renamed test method to match test reason
  Allow sg admin actions via REST if client presents a client certificate and DN of certificate matches the configured admin DN's
  Catch EsRejectedExecutionException and logb them just on debug level. Do not rethrow interrupted exception.
  Make check for sg index async and improved logging
  Fix error handling and improved logging
  Invalidate cache only if config update succeeded
  Fix typo in log statement
  Updated vagrant demo to 2.3.5.5

# Conflicts:
#	pom.xml
@floragunncom

This comment has been minimized.

Copy link
Owner

commented Sep 22, 2016

fixed with version 6

floragunncom added a commit that referenced this issue Oct 2, 2016

Merge branch 'master' into ci-only_windows
* master: (32 commits)
  Added testcase for _bulk permissions
  moved registration of handlers to rest api module
  Deny access to an index with more than one filtered alias. We will support this later in conjunction with document level security (DLS). #190 Fixed an issue with permission evaluation when multiple indices were accessed
  fix #208, honor rest status in exception
  fix #207 - Setting an Authorization header in transport client requests
  Version bumped to 2.4.0.7-SNAPSHOT
  Bump ES to 2.4.0
  Make org.ldaptive safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Added constants for SG index types
  Make org.ldaptive.SearchEntry safe for de/serialization to cope with "ElasticsearchException with LDAP authentication" #206
  Make tribe nodes working (fix #47) add replica autoexpand to sgadmin add reload option to sgadmin
  Set index.auto_expand_replicas="0-all"
  fix concurrent modification exception
  active debug log for tests
  build all branches in circleci
  Fix #182
  Fix threading issue regarding initialization
  Merge pull request #197 from ewolinetz/configurable_sg_index
  too much logs for travis
  bump version to 6-SNAPSHOT
  ...

# Conflicts:
#	src/main/java/com/floragunn/searchguard/http/HTTPBasicAuthenticator.java
#	src/test/resources/log4j.properties
@kenhuang

This comment has been minimized.

Copy link

commented Oct 2, 2016

Confirm the latest bundle (ES2.4.0) fixed my issue.
https://floragunn.com/downloads/elasticsearch-76EDC0BB-9A7E-40BE-96B6-67872AB3D965.tar.gz

I was using(ES2.3.5) bundle:
https://floragunn.com/downloads/elasticsearch-866a7b29-59d7-444c-833c-f10c42d533e9.tar.gz

It works fine on a windows machine, but when I try using the same bundle on docker image with ubuntu with JDK8, sgadmin.sh hangs util time out:

Will connect to localhost:9300 ... done
Contacting elasticsearch cluster 'elasticsearch' and wait for YELLOW clusterstate ...
Clustername: elasticsearch
Clusterstate: GREEN
Number of nodes: 1
Number of data nodes: 1
searchguard index does not exists, attempt to create it ... done (with 0 replicas, auto expand replicas is off)
Populate config from /searchguard-client/plugins/search-guard-2/sgconfig
Will update 'config' with plugins/search-guard-2/sgconfig/sg_config.yml
SUCC: Configuration for 'config' created or updated
Will update 'roles' with plugins/search-guard-2/sgconfig/sg_roles.yml
SUCC: Configuration for 'roles' created or updated
Will update 'rolesmapping' with plugins/search-guard-2/sgconfig/sg_roles_mapping.yml
SUCC: Configuration for 'rolesmapping' created or updated
Will update 'internalusers' with plugins/search-guard-2/sgconfig/sg_internal_users.yml
SUCC: Configuration for 'internalusers' created or updated
Will update 'actiongroups' with plugins/search-guard-2/sgconfig/sg_action_groups.yml
SUCC: Configuration for 'actiongroups' created or updated
Done with success

then seeing time out error message in the ES Server logs:

Unable to load all configurations types. Loaded '[]' but should '[config, roles, rolesmapping, internalusers, actiongroups]'

@floragunncom will we get a updated bundle for the older version?

@floragunncom

This comment has been minimized.

Copy link
Owner

commented Oct 3, 2016

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.