Connection to AWS ElasticSearch will be lost after a certain period of time #15

Closed
darwin67 opened this issue Jun 30, 2016 · 20 comments

@darwin67
Contributor

darwin67 commented Jun 30, 2016

Hi,
I've recently been seeing errors like this in the logs:

2016-06-30 19:30:53 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-06-30 19:30:54 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f9e9deeaf04"
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/base.rb:249:in `perform_request'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-1.0.18/lib/elasticsearch/transport/client.rb:128:in `perform_request'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-api-1.0.18/lib/elasticsearch/api/actions/bulk.rb:90:in `bulk'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.5.0/lib/fluent/plugin/out_elasticsearch.rb:278:in `send'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.5.0/lib/fluent/plugin/out_elasticsearch.rb:271:in `write'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:345:in `write_chunk'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.20/lib/fluent/buffer.rb:324:in `pop'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.20/lib/fluent/output.rb:329:in `try_flush'
  2016-06-30 19:30:53 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.20/lib/fluent/output.rb:140:in `run'
2016-06-30 19:30:54 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-06-30 19:30:56 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f9e9deeaf04"
  2016-06-30 19:30:54 +0000 [warn]: suppressed same stacktrace

As far as I can tell, this plugin loses its connection to the AWS Elasticsearch service one to two days after td-agent is started or restarted.

There is a WebSocket connection duration limit listed here, but I'm not sure whether it's related.

My guess is that the connection is closed from the AWS side, but I couldn't find any documentation mentioning it. Any ideas why this is happening?

Also, is there already a solution to this problem (besides manually restarting td-agent)? I assume this plugin is being used elsewhere too.

@atomita
Owner

atomita commented Jul 1, 2016

Hi @darwin67 ,

Thank you for the report.

Maybe the credentials had expired.

I'm sorry, but it will take a long time to fix...

@therc

therc commented Jul 28, 2016

There's a similar issue with the plain elasticsearch plugin + aws-es-proxy... :(

@aerickson

I'm seeing this issue also. My credentials are OK. Logging works for a few days then dies.

2016-08-13 21:20:01 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:20:02 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-2.0.0/lib/elasticsearch/transport/transport/base.rb:249:in `perform_request'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-2.0.0/lib/elasticsearch/transport/transport/http/faraday.rb:20:in `perform_request'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-transport-2.0.0/lib/elasticsearch/transport/client.rb:128:in `perform_request'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/elasticsearch-api-2.0.0/lib/elasticsearch/api/actions/bulk.rb:93:in `bulk'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.5.0/lib/fluent/plugin/out_elasticsearch.rb:278:in `send'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluent-plugin-elasticsearch-1.5.0/lib/fluent/plugin/out_elasticsearch.rb:271:in `write'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.26/lib/fluent/buffer.rb:354:in `write_chunk'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.26/lib/fluent/buffer.rb:333:in `pop'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.26/lib/fluent/output.rb:338:in `try_flush'
  2016-08-13 21:20:01 +0000 [warn]: /opt/td-agent/embedded/lib/ruby/gems/2.1.0/gems/fluentd-0.12.26/lib/fluent/output.rb:149:in `run'
2016-08-13 21:20:02 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:20:04 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:02 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:20:04 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:20:08 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:04 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:20:08 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:20:16 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:08 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:20:16 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:20:33 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:16 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:20:33 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:21:02 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:20:33 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:21:02 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:22:14 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:21:02 +0000 [warn]: suppressed same stacktrace
2016-08-13 21:22:14 +0000 [warn]: temporarily failed to flush the buffer. next_retry=2016-08-13 21:24:25 +0000 error_class="Elasticsearch::Transport::Transport::Error" error="Cannot get new connection from pool." plugin_id="object:3f859d6ae4a8"
  2016-08-13 21:22:14 +0000 [warn]: suppressed same stacktrace

@vendrov

vendrov commented Sep 19, 2016

+1
The same problem here

@steynovich

The issue seems to be that AWS Elasticsearch is not 100% compatible with native ES when it comes to reloading connections to ES. Connections are reloaded every 10,000 requests by default. This can be useful when you have multiple hosts configured, but in the case of AWS ES there is only a single (HA) endpoint.

For the compatibility issue, see: https://forums.aws.amazon.com/thread.jspa?threadID=222600

In our case, preventing fluentd from reloading the host configuration seems to work as a workaround (add this to the output plugin section of the fluentd config):

# Prevent reloading connections to AWS ES
reload_on_failure false
reload_connections false

Note: reload_connections is false by default.

Prior to applying this workaround the connection was lost every ~3 hours (~10,800 seconds), which makes sense since we are flushing our data every 1s to ES.

I think this should be fixed in the Ruby Elasticsearch client, since it will affect not only Fluentd but potentially every Ruby/AWS ES implementation.
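
As a rough check of the interval mentioned above, here is a small Ruby sketch of the arithmetic. It assumes each flush produces exactly one bulk request and uses the 10,000-request reload default and the 1s flush interval quoted in this comment; the constant names are made up for illustration.

```ruby
# Assumed values taken from the comment above: connections are reloaded
# every 10,000 requests, and fluentd flushes once per second.
RELOAD_AFTER_REQUESTS = 10_000
FLUSH_INTERVAL_SECONDS = 1.0

# Assumption: one flush results in one bulk request to ES.
seconds_until_reload = RELOAD_AFTER_REQUESTS * FLUSH_INTERVAL_SECONDS
hours_until_reload = seconds_until_reload / 3600.0

puts format("%.0f s (about %.1f h)", seconds_until_reload, hours_until_reload)
# prints "10000 s (about 2.8 h)"
```

That comes out a little under three hours, which is in line with the roughly 3-hour interval reported above.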

@darwin67
Contributor Author

@steynovich Thank you for the research.
My current workaround is a cron job that restarts td-agent every day.
That works fine so far, but I'd like to try what you suggested too once I have the time.

@mpas

mpas commented Oct 26, 2016

+1 Experiencing the same issue

@steynovich We are experiencing the same issue. Can you please give some more info on where to place the workaround you mentioned?

@aerickson

@mpas Those options are part of the 'parent' plugin that this plugin uses.

https://github.com/uken/fluent-plugin-elasticsearch#reload_on_failure

The https://github.com/uken/fluent-plugin-elasticsearch#usage section shows where to put it (in the match block).

@tanaka-takayoshi

Hi, I'm afraid specifying "reload_connections false" won't work because of a type mismatch.

reload_connections false

The parent plugin handles the "reload_connections" option as a string.
https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb#L15
https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.0/lib/fluent/plugin/out_elasticsearch_dynamic.rb#L222

However, the Elasticsearch Ruby client handles "reload_connections" as a Fixnum or Boolean.
https://github.com/elastic/elasticsearch-ruby/blob/v2.0.0/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L51-L52
https://github.com/elastic/elasticsearch-ruby/blob/v2.0.0/elasticsearch-transport/lib/elasticsearch/transport/transport/base.rb#L70
elastic/elasticsearch-ruby#164

And the parent plugin sets "reload_connections" to true by default.
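
To illustrate the mismatch, here is a minimal Ruby sketch. It is a simplified model of the kind of check described above, not the actual code of either library; `reload_due?`, `coerce_reload_option`, and `DEFAULT_RELOAD_AFTER` are hypothetical names. The point is that in Ruby only `nil` and `false` are falsy, so the string "false" still enables reloading.

```ruby
# Simplified, hypothetical model of the reload check; not the client's real code.
DEFAULT_RELOAD_AFTER = 10_000

def reload_due?(reload_connections, request_count)
  # An Integer is used as the reload interval; anything else falls back to the default.
  reload_after = reload_connections.is_a?(Integer) ? reload_connections : DEFAULT_RELOAD_AFTER
  # Only nil and false are falsy in Ruby, so the String "false" passes this test.
  reload_connections && (request_count % reload_after).zero? ? true : false
end

# Hypothetical helper: convert a config string into a real Boolean or Integer.
def coerce_reload_option(raw)
  case raw
  when true, false, Integer then raw
  when /\Atrue\z/i then true
  when /\Afalse\z/i then false
  when /\A[1-9]\d*\z/ then raw.to_i
  else raise ArgumentError, "unsupported value: #{raw.inspect}"
  end
end

puts reload_due?("false", 10_000)                        # String is truthy: reload happens
puts reload_due?(coerce_reload_option("false"), 10_000)  # real false: no reload
```

A fix along these lines would need to do the conversion before the value reaches the client, which is where a config string would otherwise be passed through unchanged.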

@mpas

mpas commented Nov 30, 2016 via email

@aerickson

I don't think there's a fix yet. The current workaround is to restart fluentd regularly (we use monit). :(

@tanaka-takayoshi

After looking into this issue, I found that it doesn't happen if you don't use dynamic configuration:
https://github.com/uken/fluent-plugin-elasticsearch/blob/master/lib/fluent/plugin/out_elasticsearch.rb#L41-L42

Also, I think we must modify the parent plugin's code. I forked the plugin and made some modifications:
tanaka-takayoshi/fluent-plugin-elasticsearch@646fe26
I'm now testing this code and will report back with the results.

@tanaka-takayoshi

After running for 24 hours, there are no connection errors, though I usually get errors after 16 hours without this fix. I'll make a PR to the parent plugin repository.

@tanaka-takayoshi

The upstream PR has been merged and released. Is there any chance of pulling the new version into this plugin?

@vendrov

vendrov commented Jan 9, 2017

@tanaka-takayoshi If the repo owner isn't available, would you mind forking it and creating another plugin? This issue is extremely critical, and it has been waiting more than 26 days for a response from the owner.

@tanaka-takayoshi

@vendrov It would be best if the repo owner released a new version to fix this issue. I don't think any code change is necessary, because the gem refers to the latest version of the parent plugin when it is built.
https://github.com/atomita/fluent-plugin-aws-elasticsearch-service/blob/master/fluent-plugin-aws-elasticsearch-service.gemspec#L27

However, if the repo owner does not respond, I can do it.
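
For anyone unfamiliar with how a gemspec constraint picks up new releases, here is a small sketch using RubyGems' own `Gem::Requirement` and `Gem::Version` classes. The `~> 1.0` constraint is an assumption for illustration (check the linked gemspec line for the real one), and the 1.9.x version numbers are simply ones that appear in links in this thread, not a statement about which release contains the fix.

```ruby
require "rubygems"

# Assumed constraint for illustration; the real one is in the gemspec linked above.
requirement = Gem::Requirement.new("~> 1.0")

# 1.9.0 and 1.9.3 appear in links in this thread; 2.0.0 is a hypothetical
# future major version of the parent plugin.
%w[1.9.0 1.9.3 2.0.0].each do |v|
  ok = requirement.satisfied_by?(Gem::Version.new(v))
  puts "#{v}: #{ok ? 'allowed' : 'excluded'}"
end
```

Under a constraint like this, a newer 1.x release of the parent plugin is accepted without touching the gemspec, which is why only a rebuild and release would be needed.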

@tanaka-takayoshi

@vendrov I forked the repo and uploaded the gem. Could you test it? I don't know much about RubyGems versioning.
https://rubygems.org/gems/fluent-plugin-aws-elasticsearch-service-hotfix

@atomita I'll take down my hotfix gem once you release a new version. There's no need to update any files; you only need to build the gem again.

@malford

malford commented Apr 27, 2017

@tanaka-takayoshi With your hotfix plugin, is it still necessary to specify reload_connections false in the config?

@tanaka-takayoshi

@malford Yes, you may have to specify reload_connections false, because I intended to inherit the parent plugin's settings and it's true by default.
https://github.com/uken/fluent-plugin-elasticsearch/blob/v1.9.3/lib/fluent/plugin/out_elasticsearch.rb#L41

@darwin67
Contributor Author

I haven't taken a look at this for a while.
It looks like it's solved now, so I'm closing this issue.

9 participants