Repository-hdfs plugin not always closing tcp connexions #220

Closed
jubagarie opened this Issue Jun 19, 2014 · 15 comments

jubagarie commented Jun 19, 2014

I'm using the "repository-hdfs" plugin to store snapshots on HDFS with Elasticsearch 1.1.1. It seems that Elasticsearch doesn't properly close the TCP connections after a snapshot is created.

The result for me was "too many open files" errors in the Elasticsearch logs. Using the "lsof" command, I found more than 50k TCP connections in the CLOSE_WAIT state, each holding a file descriptor.

Member

costin commented Jun 19, 2014

What version of the repository-hdfs plugin are you using? Any information on where the TCP connections point to? Also what version of hadoop are you using?

Thanks

costin added the v2.0.1 label Jun 19, 2014

Author

jubagarie commented Jun 19, 2014

I'm using:

  • hadoop 2.0.0 (cdh 4.1.2)
  • repository-hdfs 2.0.0-light
  • Debian Wheezy

The TCP connections point to the nodes of my Hadoop cluster. Here is an extract of the "lsof" output for the Elasticsearch process:

java    68073 elasticsearch 6747u  IPv4           15823974       0t0      TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6748u  IPv4           15824908       0t0      TCP es08:57651->cdh4worker05:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6749u  IPv4           15818656       0t0      TCP es08:54883->cdh4worker12:50010 (CLOSE_WAIT)

I noticed that even a few hours after the last snapshot, the connections are still not closed.

Thanks for your help
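For anyone trying to quantify the leak, the lsof extract above can be summarized per remote endpoint. The following is a minimal sketch, not part of the plugin: the function name `count_close_wait` and the regular expression are illustrative assumptions, and the sample text is simply the three lines quoted above. Port 50010 in that output is the default DataNode data-transfer port in Hadoop 2.x, which suggests the leaked sockets are block read/write streams.

```python
import re
from collections import Counter

# Matches the NAME column of an lsof TCP line, for example:
#   TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
TCP_LINE = re.compile(r"TCP\s+(\S+):(\d+)->(\S+):(\d+)\s+\((\w+)\)")


def count_close_wait(lsof_output):
    """Count CLOSE_WAIT connections per remote (host, port) pair."""
    counts = Counter()
    for line in lsof_output.splitlines():
        match = TCP_LINE.search(line)
        if match and match.group(5) == "CLOSE_WAIT":
            remote = (match.group(3), int(match.group(4)))
            counts[remote] += 1
    return counts


# The three lines quoted in this comment, used as sample input.
SAMPLE = """\
java    68073 elasticsearch 6747u  IPv4           15823974       0t0      TCP es08:57227->cdh4worker04:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6748u  IPv4           15824908       0t0      TCP es08:57651->cdh4worker05:50010 (CLOSE_WAIT)
java    68073 elasticsearch 6749u  IPv4           15818656       0t0      TCP es08:54883->cdh4worker12:50010 (CLOSE_WAIT)
"""

for (host, port), n in sorted(count_close_wait(SAMPLE).items()):
    print(f"{host}:{port} -> {n} connection(s) in CLOSE_WAIT")
```

In practice the input would be the saved output of lsof for the Elasticsearch process rather than the embedded sample.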

Member

costin commented Jun 19, 2014

I've done some quick searches and it looks like this is likely caused by Hadoop itself. For example, see this thread and this issue.
I'll try to rework the code so that the filesystem instance is disposed and created per action but if the connections are leaking, this will not help much...
http://archive.cloudera.com/cdh4/cdh/4/mr1-0.20.2+1215.releasenotes.html

Member

costin commented Jun 19, 2014

@jubagarie Can you try a quick fix? Hopefully it forces Hadoop to close the connections. After you do the backup, can you try unregistering the repository and see whether it has any effect on the number of open connections?
Unregistering causes the plugin to close the underlying HDFS FileSystem, which is the only fix we can apply on our side.

Let me know how it goes - thanks!
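The unregister/register cycle suggested above goes through the Elasticsearch snapshot REST API (`DELETE` and `PUT` on `/_snapshot/<name>`). Below is a minimal sketch that only builds the two requests. The base URL, the `uri` and `path` values, and the helper names are assumptions for illustration; the repository name `hdfs` is the one that appears in the logs later in this thread.

```python
import json
import urllib.request


def unregister_request(base_url, repo):
    """Build the request that unregisters a snapshot repository."""
    return urllib.request.Request(f"{base_url}/_snapshot/{repo}", method="DELETE")


def register_request(base_url, repo, settings):
    """Build the request that registers an HDFS snapshot repository."""
    body = json.dumps({"type": "hdfs", "settings": settings}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/_snapshot/{repo}",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Hypothetical values: adjust to the actual cluster before use.
BASE_URL = "http://localhost:9200"
SETTINGS = {"uri": "hdfs://namenode:8020/", "path": "es/snapshots"}

delete_req = unregister_request(BASE_URL, "hdfs")
put_req = register_request(BASE_URL, "hdfs", SETTINGS)
print(delete_req.get_method(), delete_req.full_url)
print(put_req.get_method(), put_req.full_url)

# To actually send them against a running node:
#   urllib.request.urlopen(delete_req)
#   urllib.request.urlopen(put_req)
```

Comparing the CLOSE_WAIT count from lsof before and after the two calls shows whether closing the FileSystem released any sockets.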

Author

jubagarie commented Jun 19, 2014

Unfortunately, unregistering the repository and registering it again doesn't affect the number of connections.

I can also add that my Hadoop servers run hadoop-hdfs from CDH 4.1.2, while my Elasticsearch servers use CDH 4.6.0. Both of them seem to include the patch created for the issue you found.

Thanks

Member

costin commented Jun 19, 2014

Hmm, I'm afraid I'm not sure what else can be done on this front. Can you confirm in the ES logs that the file-system is created again when you register it again after unregistering it?
As I've mentioned the only thing we can do is close the FS faster than we are currently doing it. We are not opening or closing connections manually, rather we are just clients of Hadoop's FileSystem interface...
I'm wondering whether using concurrent_streams of 1 has any impact on the number of opened connections and thus potentially slows the growth down...

The only other thing I can think of is to restart the node, which is clearly not ideal...
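To try the `concurrent_streams` idea mentioned above, the repository would be registered with that setting lowered to 1. This is a small sketch of the registration body only. The `uri` and `path` values and the helper name are hypothetical, and whether the setting actually slows the leak is untested here.

```python
import json


def hdfs_repository_body(uri, path, concurrent_streams=1):
    """Build the JSON body for registering an HDFS snapshot repository."""
    return {
        "type": "hdfs",
        "settings": {
            "uri": uri,    # hypothetical NameNode URI
            "path": path,  # hypothetical path inside HDFS
            "concurrent_streams": concurrent_streams,
        },
    }


body = hdfs_repository_body("hdfs://namenode:8020/", "es/snapshots")
print(json.dumps(body, indent=2))
```

The resulting JSON is what would be sent in a `PUT` to `/_snapshot/<name>` when registering the repository.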

Author

jubagarie commented Jun 20, 2014

Here are the ES logs from when I unregistered and re-registered the repository, so that part seems fine:

[2014-06-19 16:17:07,732][INFO ][repositories             ] [Turner Century] delete repository [hdfs]
[2014-06-19 16:31:56,739][INFO ][repositories             ] [Turner Century] put repository [hdfs]

I will dig into the Hadoop side.

Thanks for your time!

Member

costin commented Jun 20, 2014

As a work-around, you could potentially add some firewall rules to kill connections in CLOSE_WAIT state, idle for more than X minutes.

Member

costin commented Aug 14, 2014

Closing this as won't fix since there's not much we can do, unfortunately...

costin closed this Aug 14, 2014

costin added the wontfix label Aug 14, 2014

Member

costin commented Dec 2, 2014

Since this bug in Hadoop seems to keep occurring, some pointers in the docs on how to try to fix it would help.

costin reopened this Dec 2, 2014


bflad commented Jan 15, 2015

Hmmm. As a point of reference, Elasticsearch 1.1.1 + CDH 5.2.1 (Hadoop 2.5.x) here, and I don't see any CLOSE_WAIT connections in lsof after taking HDFS snapshots.

Aside: is there a recommended way to set up the light plugin with Hadoop jars (like CDH) at startup? Is ES_CLASSPATH the way to go? It might be worth a mention in the documentation, especially since with the files provided for CentOS/RHEL it's not obvious: the variable is not in the sysconfig file or the init script. I found it in the Elasticsearch shell script itself.


bflad commented Jan 15, 2015

Oh wait, totally lying, the connections were on the master nodes:

java      13452 elasticsearch  211u     IPv6            7964297      0t0        TCP esmaster01.example.com:58187->cdh02.example.com:50010 (CLOSE_WAIT)
java      13452 elasticsearch  212u     IPv6            7964298      0t0        TCP esmaster01.example.com:52732->cdh03.example.com:50010 (CLOSE_WAIT)
java      13452 elasticsearch  213u     IPv6            7964299      0t0        TCP esmaster01.example.com:57888->cdh01.example.com:50010 (CLOSE_WAIT)

So yes, still an issue!

Member

costin commented Jan 15, 2015

@bflad Sorry to hear that.
As for the classpath, ES_CLASSPATH definitely works. I'll update the readme to make this clearer.

costin added a commit that referenced this issue Jan 19, 2015

Member

costin commented Jun 17, 2015

@bflad @jubagarie you might want to try 2.1.0.rc1 since it improves the creation and closing of the FileSystem.

Member

costin commented Oct 29, 2015

As there hasn't been any update, I'm closing the issue.

costin closed this Oct 29, 2015
