TCP connect timeout errors #183

Open
brentonannan opened this issue Jan 10, 2018 · 17 comments

@brentonannan

brentonannan commented Jan 10, 2018

We're having issues with connections to our prod cluster. Sometimes a query will work, but more often than not, it will throw:

** (stop) exited in: :gen_server.call(#PID<6382.29775.0>, {:checkout, #Reference<6382.893348796.102498306.96870>, true}, 5000)
    ** (EXIT) no process: the process is not alive or there's no process currently associated with the given name, possibly because its application isn't started
    (stdlib) gen_server.erl:214: :gen_server.call/3
    src/poolboy.erl:55: :poolboy.checkout/3
    lib/db_connection/poolboy.ex:41: DBConnection.Poolboy.checkout/2
    lib/db_connection.ex:920: DBConnection.checkout/2
    lib/db_connection.ex:742: DBConnection.run/3
    lib/db_connection.ex:1133: DBConnection.run_meter/3
    lib/db_connection.ex:636: DBConnection.execute/4
    lib/mongo.ex:431: Mongo.kill_cursors/3

when called directly from the server. We're not clear on what process it is saying is not alive.

Our logs show tcp connect errors that look like:

04:05:52.177 [error] Mongo.Protocol (#PID<0.388.0>) failed to connect: ** (Mongo.Error) tcp connect: unknown POSIX error - :timeout
04:05:52.270 [error] GenServer #PID<0.388.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: []
State: Mongo.Protocol
04:05:52.271 [error] GenServer #PID<0.375.0> terminating
** (stop) exited in: :gen_server.call(#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000)
** (EXIT) time out
Last message: {:EXIT, #PID<0.348.0>, {:timeout, {:gen_server, :call, [#PID<0.388.0>, {:checkout, #Reference<0.893348796.90439682.33221>, true, 5000}, 5000]}}}
State: {:state, #PID<0.376.0>, [#PID<0.386.0>, #PID<0.385.0>, #PID<0.384.0>, #PID<0.383.0>, #PID<0.382.0>, #PID<0.381.0>, #PID<0.380.0>, #PID<0.379.0>, #PID<0.378.0>, #PID<0.377.0>], {[], []}, #Reference<0.893348796.90570754.33113>, 10, 0, 0, :fifo}

and repeat indefinitely.

We're starting mongo in our supervision tree like:

...
  {Mongo, mongo()}
...

  defp mongo do
    config = Application.get_env(:blah, :mongo)

    if Keyword.has_key?(config, :seeds) do
      Keyword.update!(config, :seeds, fn seeds -> String.split(seeds, ",") end)
    else
      config
    end
  end

With config:

config :blah,
  mongo: [
    name: :blah_db,
    seeds: System.get_env("MONGODB_HOSTS"),
    database: System.get_env("MONGODB_NAME"),
    username: System.get_env("MONGODB_USERNAME"),
    password: System.get_env("MONGODB_PASSWORD"),
    pool: DBConnection.Poolboy,
  ]

We are not having any problems in our staging environment, which uses a standalone server.

We're using the current release 0.4.3.

Please help!

@brentonannan
Author

brentonannan commented Jan 10, 2018

Of course, we're happy to provide any other info needed, but at this point we're not sure how to find more useful info.

@brentonannan
Author

More info: one of our infra folks has pointed out that we have a backup server in our replica set which is set unreadable and isolated from network ingress; this could explain the tcp connect errors.

We are thinking it could be the case that this failure is causing the entire pool to shut down, giving the error we get when we try to query from the server.

Does this sound like a likely cause for our problems? If so, is there some way we can exclude a host from the topology?

@ankhers
Collaborator

ankhers commented Jan 10, 2018

The backup server that cannot be reached is most likely the cause of the tcp connect errors. However, that should not take down the entire application. It should just keep trying to make a connection to that server forever (we should probably change that behaviour in some way).

As for the checkout issue, I am really unsure why that is happening. I will try to come up with some tests for you to run in order to figure out this issue.

@ankhers
Collaborator

ankhers commented Jan 17, 2018

Would you be able to test my exclude_hosts branch? It allows you to specify an :excluded_hosts key in the Mongo.start_link/1 function. It expects a list, similar to the :seeds key. If this fixes your crash, I can look more into why it is crashing, though I only expect it to prevent the tcp connect errors.
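
For readers following along, passing the new key might look roughly like the sketch below. Only the `:excluded_hosts` key and its list shape (similar to `:seeds`) come from the comment above. The host names, database name, and the assumption that entries use the same `"host:port"` strings as `:seeds` are placeholders for illustration, and the key only exists on the exclude_hosts branch.

```elixir
# Sketch only: :excluded_hosts is the key described for the exclude_hosts branch.
# All host names and the database name are placeholders, not real values.
{:ok, _conn} =
  Mongo.start_link(
    name: :blah_db,
    database: "my_database",
    # Hosts the driver may use to discover the replica set.
    seeds: ["db1.example.com:27017", "db2.example.com:27017"],
    # Assumed format: same "host:port" strings as :seeds.
    # Here it names the backup member that is isolated from network ingress.
    excluded_hosts: ["backup.example.com:27017"]
  )
```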

@brentonannan
Author

We most definitely will give your branch a go, thanks @ankhers. We've managed to get a local setup that replicates the problem with a bit of finagling of mongos inside docker composes, so we'll test your branch against that as soon as we're able.

@ankhers
Collaborator

ankhers commented Jan 17, 2018

Is there any chance you can share those docker files so that I could do some additional testing?

@brentonannan
Author

brentonannan commented Jan 18, 2018

One of the devs on my team got this sorted (the docker networking setup, not the issue as a whole); I haven't looked at his notes yet, but I'll ask if he can comment on here with some details for you (AFAIK it was a bit of a pain to replicate, I think he ended up going with hacking /etc/hosts in the client container to simulate a broken route).

@vgunawan

Sorry for taking so long to reply. I have written a gist on how to run the containers, with a sample docker-compose.yml file.

https://gist.github.com/vgunawan/4588cf26ca84ba57e39008d94dae032a

Follow the instructions.md file; it should give you some idea of what to do.

@ankhers
Collaborator

ankhers commented Feb 8, 2018

Did you end up testing that branch with your setup?

@pbrudnick

I'm having the same issue, @ankhers. I've been investigating and debugging the code, so here is some more info:

This happens to me when using a remote public URI: once the replica set is updated, my original host is removed from the Topology state and the servers resolve to internal IPs (Amazon in my case). The monitors map is then updated to those keys (hosts and arbiters) as well.
I could see this in the update_rs_from_primary/1 function inside topology_description.ex.

Regarding this problem, I think the mongo spec explains the problem and the expected behaviour here:
https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#clients-use-the-hostnames-listed-in-the-replica-set-config-not-the-seed-list

Does that sound right?

PS: I tested with the exclude_hosts branch. The error doesn't happen, but it doesn't work for me: with your fix, if I exclude the hosts, my original public host is removed as well (going by the reconcile_servers/1 code).

@redink
Contributor

redink commented Apr 27, 2018

I had the same issue @ankhers, and I think I resolved it by changing the options for timeout and pool_timeout.

There are two (actually three) timeout values:

  • timeout
    currently used as both the connect timeout and the TCP receive timeout
  • pool_timeout
    used when checking out a connection from the pool
  • connect_timeout_ms
    defined but not used

The problem is that if connecting to MongoDB times out (the timeout value is 5000), checking out a connection from the pool times out too (its timeout value is also 5000). A checkout timeout exits the process (https://github.com/elixir-ecto/db_connection/blob/master/lib/db_connection/connection.ex#L54). After several such crashes, the application crashes because of the supervision tree's restart strategy, and the Elixir node may then go down because of the application restart strategy.

So we can use different timeouts for connecting to MongoDB and for checking out a connection from the pool, like this:

timeout: 5000,
pool_timeout: 8000

Ideally we would use connect_timeout_ms, but this option is only defined, not used.
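
The interaction described above, where the caller's checkout timeout expires while the worker is still blocked on a connect attempt, can be sketched in plain Elixir. This is a scaled-down model under stated assumptions, not db_connection's or the driver's actual code: `SlowWorker` and `TimeoutDemo` are hypothetical names, and `Process.sleep/1` stands in for a slow TCP connect.

```elixir
defmodule SlowWorker do
  # Hypothetical stand-in for a pool worker that is busy connecting.
  use GenServer

  # Started unlinked so the demo caller is never affected by the worker's exit.
  def start(delay_ms), do: GenServer.start(__MODULE__, delay_ms)

  @impl true
  def init(delay_ms), do: {:ok, delay_ms}

  @impl true
  def handle_call(:checkout, _from, delay_ms) do
    # Simulates a connect attempt that blocks for delay_ms before replying.
    Process.sleep(delay_ms)
    {:reply, :ok, delay_ms}
  end
end

defmodule TimeoutDemo do
  # Returns :ok if the worker replies in time, or {:exit, :timeout}
  # if GenServer.call/3 gives up first.
  def checkout(delay_ms, call_timeout_ms) do
    {:ok, pid} = SlowWorker.start(delay_ms)

    try do
      GenServer.call(pid, :checkout, call_timeout_ms)
    catch
      # GenServer.call/3 exits the caller with {:timeout, _} when the timeout expires.
      :exit, {:timeout, _} -> {:exit, :timeout}
    after
      # Clean up the demo worker whether or not the call succeeded.
      Process.exit(pid, :kill)
    end
  end
end

# Caller gives up before the "connect" finishes:
# TimeoutDemo.checkout(200, 50)  #=> {:exit, :timeout}
# Caller waits longer than the "connect" takes:
# TimeoutDemo.checkout(50, 200)  #=> :ok
```

In the real pool the exit is not caught, which is why the calling process goes down. Giving the checkout a larger budget than the connect attempt, as with `timeout: 5000` and `pool_timeout: 8000`, corresponds to the second call.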

@ankhers
Collaborator

ankhers commented May 21, 2018

Just to keep on top of this: we added the connect_timeout_ms option in 0.4.6. Unfortunately, I think it should have just been called connect_timeout, but that can be fixed later.

@costa

costa commented Jun 5, 2018

Hey, I'm not 100% sure, but I'm having a similar error, and I must report that in my case it had everything to do with internal (GCE) domain name resolution: replacing local names with IPs (which I never had to do with components on other platforms) totally solved my "timeout" problem.

P.S. I spoke too soon about "totally solving" the problem. I really have no idea; it "just works" now (except when it doesn't, and fails with a timeout).

@pbrudnick

In my case I solved the issue by passing this option explicitly:
type: :single

Should we document the options better?

I was connecting remotely to a MongoDB instance on AWS, and the driver was resolving the internal IPs and giving errors; I just wanted to connect to my single server's public DNS name.
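
As a rough illustration of that workaround, the option could be passed alongside the other start-up options. Only `type: :single` comes from the comment above; the host and database names are placeholders, and passing the server through `:seeds` is an assumption based on the configuration shown earlier in this thread.

```elixir
# Sketch only: `type: :single` is the option reported to work above.
# The host and database names are placeholders, not real values.
{:ok, _conn} =
  Mongo.start_link(
    database: "my_database",
    # Assumed: the single public endpoint is given the same way as a seed list.
    seeds: ["public-host.example.com:27017"],
    # Treat the endpoint as one standalone server rather than discovering
    # replica set members, which may resolve to internal addresses.
    type: :single
  )
```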

@piyushcoader

I am getting the same error after upgrading to v5.0.x; the previous version was working fine.

@ankhers
Collaborator

ankhers commented Jun 23, 2019

Would anyone be willing to give me access to a database that is having this issue? I do not think there is much I can do if I am unable to see the issue firsthand.

@ankhers
Collaborator

ankhers commented Jun 25, 2019

I'm just going to leave this here. Someone asked a question on the Elixir forum about this issue. I'm adding it here in case we are able to track it down from that.
