TCP connect timeout errors #183

brentonannan · 2018-01-10T05:55:00Z

We're having issues with connections to our prod cluster. Sometimes a query will work, but more often than not, it will throw:

** (stop) exited in: :gen_server.call(#PID<6382.29775.0>, {:checkout, #Reference<6382.893348796.102498306.96870>, true}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib) gen_server.erl:214: :gen_server.call/3
    src/poolboy.erl:55: :poolboy.checkout/3
    lib/db_connection/poolboy.ex:41: DBConnection.Poolboy.checkout/2
    lib/db_connection.ex:920: DBConnection.checkout/2
    lib/db_connection.ex:742: DBConnection.run/3
    lib/db_connection.ex:1133: DBConnection.run_meter/3
    lib/db_connection.ex:636: DBConnection.execute/4
    lib/mongo.ex:431: Mongo.kill_cursors/3

when called directly from the server. We're not clear on what process it is saying is not alive.

Our logs show tcp connect errors that look like:

04:05:52.177 [error] Mongo.Protocol (#PID<0.388.0>) failed to connect: ** (Mongo.Error) tcp connect: unknown POSIX error - :timeout
04:05:52.270 [error] GenServer #PID<0.388.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: []
State: Mongo.Protocol
04:05:52.271 [error] GenServer #PID<0.375.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: {:EXIT, #PID<0.348.0>, {:timeout, {:gen_server, :call, [#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000]}}}
State: {:state, #PID<0.376.0>, [#PID<0.386.0>, #PID<0.385.0>, #PID<0.384.0>, #PID<0.383.0>, #PID<0.382.0>, #PID<0.381.0>, #PID<0.380.0>, #PID<0.379.0>, #PID<0.378.0>, #PID<0.377.0>], {[], []}, #Reference<0.893348796.90570754.33113>, 10, 0, 0, :fifo}

and repeat indefinitely.

We're starting mongo in our supervision tree like:

...
  {Mongo, mongo()}
...

  defp mongo do
    config = Application.get_env(:blah, :mongo)

    if Keyword.has_key?(config, :seeds) do
      Keyword.update!(config, :seeds, fn seeds -> String.split(seeds, ",") end)
    else
      config
    end
  end

With config:

config :blah,
  mongo: [
    name: :blah_db,
    seeds: System.get_env("MONGODB_HOSTS"),
    database: System.get_env("MONGODB_NAME"),
    username: System.get_env("MONGODB_USERNAME"),
    password: System.get_env("MONGODB_PASSWORD"),
    pool: DBConnection.Poolboy,
  ]

We are not having any problems on our staging server which is using a standalone server.

We're using the current release 0.4.3.

Please help!

brentonannan · 2018-01-10T05:56:07Z

Of course, we're happy to provide any other info needed, but at this point we're not sure how to find more useful info.

brentonannan · 2018-01-10T08:15:35Z

More info: one of our infra folks has pointed out that we have a backup server in our replica set, which is set unreadable and isolated from network ingress; this could explain the tcp connect errors.

We are thinking it could be the case that this failure is causing the entire pool to shut down, giving the error we get when we try to query from the server.

Does this sound like a likely cause for our problems? If so, is there some way we can exclude a host from the topology?

ankhers · 2018-01-10T15:51:34Z

The backup server that cannot be reached is most likely the cause of the tcp connect errors. However, that should not take down the entire application. It should just keep trying to make a connection to that server forever (we should probably change that behaviour in some way).

As for the checkout issue, I am really unsure why that is happening. I will try to come up with some tests for you to run in order to figure out this issue.

ankhers · 2018-01-17T02:55:06Z

Would you be able to test my exclude_hosts branch? It allows you to specify an :excluded_hosts key in the Mongo.start_link/1 function. It expects a list, similar to the :seeds key. If this fixes your crash, I can look more into why it is crashing, though I only expect it to prevent the tcp connect errors.
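
As a rough sketch of how that might be passed, assuming the branch accepts the option alongside the usual start options: the host names below are placeholders, the "host:port" format is assumed from the comparison with :seeds, and :excluded_hosts exists only on that branch, not in a release.

```elixir
# Hypothetical options for the exclude_hosts branch.
# Host names are placeholders, not real servers.
opts = [
  name: :blah_db,
  seeds: ["db1.example.com:27017", "db2.example.com:27017"],
  # Assumed format: a list of "host:port" strings, like :seeds.
  excluded_hosts: ["backup.example.com:27017"],
  pool: DBConnection.Poolboy
]

# In a supervision tree: {Mongo, opts}
# Or started directly:   {:ok, _pid} = Mongo.start_link(opts)
```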

brentonannan · 2018-01-17T06:09:05Z

We most definitely will give your branch a go, thanks @ankhers. We've managed to get a local setup that replicates the problem with a bit of finagling of mongos inside docker composes, so we'll test your branch against that as soon as we're able.

ankhers · 2018-01-17T17:58:24Z

Is there any chance you can share those docker files so that I could do some additional testing?

brentonannan · 2018-01-18T11:44:58Z

One of the devs on my team got this sorted (the docker networking setup, not the issue as a whole); I haven't looked at his notes yet, but I'll ask if he can comment on here with some details for you (AFAIK it was a bit of a pain to replicate, I think he ended up going with hacking /etc/hosts in the client container to simulate a broken route).

vgunawan · 2018-01-21T23:07:33Z

Sorry for taking so long to reply, but I have written a gist on how to run the containers, with a sample docker-compose.yml file.

https://gist.github.com/vgunawan/4588cf26ca84ba57e39008d94dae032a

Follow the instructions.md file; it should give you some idea of what to do.

ankhers · 2018-02-08T18:23:01Z

Did you end up testing that branch with your setup?

pbrudnick · 2018-04-11T19:54:27Z

I'm having the same issue, @ankhers. I've been investigating and debugging the code, so here is some more info:

This happens to me when using a remote public URI. Once the replica set is updated, my original host is removed from the Topology state and the servers resolve to internal IPs (Amazon in my case); the monitors map is then updated to those keys (hosts and arbiters) as well.
I could see it in the update_rs_from_primary/1 function inside topology_description.ex.

Regarding this problem, I think the mongo spec explains the problem and the expected behaviour here:
https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#clients-use-the-hostnames-listed-in-the-replica-set-config-not-the-seed-list

Does that sound right?

PS: I tested with the exclude_hosts branch. The error doesn't happen, but it doesn't work either: if I exclude the hosts, my original public host is removed as well with your fix (from reading the reconcile_servers/1 code).

redink · 2018-04-27T20:12:27Z

I had the same issue @ankhers, and I think I resolved it by changing the options for timeout and pool_timeout.

There are two (actually three) timeout values:

timeout
used for both the connect timeout and the TCP receive timeout for now
pool_timeout
used when checking out a connection from a pool
connect_timeout_ms
defined but not used

The problem is that if connecting to MongoDB times out (the timeout value is 5000), checking out a connection from the pool will time out too (its timeout value is also 5000). A checkout timeout exits the process (https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L54). After several such crashes, the application will crash because of the supervision tree's restart strategy. The Elixir node may then go down because of the application restart strategy.

So we can change the timeouts for connecting to MongoDB and for checking out a connection from the pool, like this:

timeout: 5000,
pool_timeout: 8000

Actually, we should use connect_timeout_ms, but this option is only defined, not used.
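
To make the workaround concrete, here is a minimal sketch of a full option list. The name and pool are borrowed from the original report and the values are only illustrative; the idea, following the reasoning above, is to keep pool_timeout larger than timeout.

```elixir
# Sketch only: illustrative values, not recommended defaults.
mongo_opts = [
  name: :blah_db,
  pool: DBConnection.Poolboy,
  # Connect and TCP receive timeout, in milliseconds.
  timeout: 5_000,
  # Pool checkout timeout, in milliseconds. Kept larger than :timeout
  # so a connect attempt can fail before the checkout call gives up.
  pool_timeout: 8_000
]

# In a supervision tree: {Mongo, mongo_opts}
```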

ankhers · 2018-05-21T22:46:00Z

Just to keep on top of this: we added the connect_timeout_ms option in 0.4.6. Unfortunately, I think it should have just been called connect_timeout, but that can be fixed later.

costa · 2018-06-05T11:08:03Z

Hey, not 100% sure, but having hit a similar error, I must report that in my case it had everything to do with internal (GCE) domain name resolution: replacing local names with IPs (which I never had to do with components on other platforms) totally solved my "timeout" problem.

P.S. I spoke too soon about "totally solving" the problem. I really have no idea; it "just works" now (except when it doesn't, and fails with a timeout).

pbrudnick · 2018-12-04T17:29:39Z

In my case I solved the issue by passing this option explicitly:
type: :single

Should we document the options better?

I was connecting remotely to a MongoDB on AWS, and it was resolving to the internal IPs and giving errors. I just wanted to connect to my single server's public DNS name.
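
For anyone else trying this, a minimal sketch of where the option might go. The host name is a placeholder, and the described effect of :single is an assumption based on the behaviour reported in this thread.

```elixir
# Sketch only: "mongo.example.com" stands in for the public DNS name.
opts = [
  name: :blah_db,
  seeds: ["mongo.example.com:27017"],
  # Assumed effect: treat the seed as a single server instead of
  # switching to the hosts listed in the replica set config.
  type: :single
]

# In a supervision tree: {Mongo, opts}
```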

piyushcoader · 2019-06-23T06:23:26Z

I am getting the same error after upgrading to v5.0.x; the previous version was working fine.

ankhers · 2019-06-23T13:26:16Z

Would anyone be willing to give me access to a database that is having this issue? I do not think there is much I can do if I am unable to see the issue firsthand.

ankhers · 2019-06-25T14:49:47Z

I'm just going to leave this here. Someone asked a question on the Elixir Forum about this issue. I'm adding it here in case we are able to track it down from that.

ankhers added the Kind:Bug label Jan 10, 2018
