
Latency spikes after updating to Ecto 3 #2888

Closed
take-five opened this issue Jan 21, 2019 · 34 comments

@take-five

Precheck

  • Do not use the issues tracker for help or support requests (try Elixir Forum, Stack Overflow, IRC or mailing lists, etc).
  • For proposing a new feature, please start a discussion on elixir-ecto.
  • For bugs, do a quick search and make sure the bug has not yet been reported.
  • Finally, be nice and have fun!

Environment

  • Elixir version (elixir -v):
Elixir 1.7.4
Erlang/OTP 21 [erts-10.2.2] [source] [64-bit] [smp:4:4] [ds:4:4:10] [async-threads:1] [hipe]
  • Database and version (PostgreSQL 9.4, MongoDB 3.2, etc.): PostgreSQL 9.5
  • Ecto version (mix deps): 3.0.6
  • Database adapter and version (mix deps): postgrex 0.14.1
  • Operating system: Alpine Linux v3.7
  • DB pool size: 15

Current behavior

After updating one of our applications to Ecto 3 we're observing unexpected latency spikes on random queries.

In the application, I created a subscription to Ecto events via Telemetry which sends data for each query to Datadog. Here's what we see all the time (this graph shows max query time):
[screenshot: max query time graph]
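
For reference, the subscription is essentially a Telemetry handler attached to the repo's query event. Below is a simplified sketch, not our exact Datadog reporter: the module and event names are illustrative, the attach/handler signature depends on the telemetry version in use, and as far as I can tell the timings arrive in the event metadata in native time units, so they need converting to milliseconds.

defmodule App.RepoInstrumenter do
  require Logger

  def attach do
    :telemetry.attach("app-repo-query", [:app, :repo, :query], &__MODULE__.handle_event/4, nil)
  end

  def handle_event([:app, :repo, :query], _latency, metadata, _config) do
    queue_time = to_ms(metadata[:queue_time])
    decode_time = to_ms(metadata[:decode_time])
    query_time = to_ms(metadata[:query_time])

    # In the real handler these values go to Datadog; logging keeps the sketch simple.
    Logger.info("queue_time=#{queue_time} decode_time=#{decode_time} query_time=#{query_time}")
  end

  defp to_ms(nil), do: 0
  defp to_ms(native), do: System.convert_time_unit(native, :native, :millisecond)
end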

Here's the graph from another application for which we also updated Ecto to 3.0:
[screenshot: max query time graph for the second application]
You can see that the max execution time jumped on Tuesday the 15th, which correlates with when the changes were deployed.

I also set up both the DB and the application to log queries that take longer than 100 ms. I can see these slow queries in the application logs, but not in the DB logs.

Also note that the application is mostly read-only and the latency spikes happen even with SELECT queries (which rules out lock contention on the DB side). The queries are very simple and usually fast, and the tables have all the necessary indexes.

Expected behavior

Query latency is stable, without spikes.

@josevalim
Member

Hi @take-five!

Thank you for the report. We need a bit more information:

  1. for those logged queries, can you also log the queue_time, decode_time and query_time? Those are all in the telemetry event metadata.

  2. Furthermore, do you have logs for the full life cycle? For example, if this is a web app, do you have logs for the whole request/response too? It would be good to know if those spikes are also translating to the whole request/response.

  3. Can you please let us know if you changed anything else during the deployment or was it only the Ecto version?

  4. Did you configure your Ecto pool before? If so, with which parameters? Did you configure it now? Or is everything running "stock"?

  5. Finally, can you check protocols are consolidated in production? You can log the value of Protocol.consolidated?(Enumerable) when your app starts.

We will be glad to look into this.

@take-five
Author

for those logged queries, can you also log the queue_time, decode_time and query_time? Those are all in the telemetry event metadata.

Yes, all values are logged; you can see them in this screenshot (units are milliseconds):
[screenshot: logged queries with queue_time, decode_time and query_time]

Furthermore, do you have logs for the full life cycle? For example, if this is a web app, do you have logs for the whole request/response too? It would be good to know if those spikes are also translating to the whole request/response.

Here's an example of the logs for one of those requests:
[screenshot: request logs for one of the slow requests]

Can you please let us know if you changed anything else during the deployment or was it only the Ecto version?

It was only the Ecto, Postgrex and DBConnection versions, plus fixing some deprecation warnings about Ecto.Repo.update_all with returning: true.

Did you configure your Ecto pool before? If so, with which parameters? Did you configure it now? Or is everything running "stock"?

We configure it in config/config.exs:

config :app, App.Repo,
  ...
  pool_size: 15

We used this setting before upgrading to Ecto 3 and didn't change it.
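
For reference, if we were to tune the Ecto 3 / DBConnection 2 queueing options on top of pool_size, it would look roughly like the sketch below; the values are illustrative (as far as I know they are around the defaults), we only set pool_size explicitly.

config :app, App.Repo,
  pool_size: 15,
  # how long a checkout may queue before the pool is considered overloaded (ms)
  queue_target: 50,
  # window over which queue_target is evaluated before requests start being dropped (ms)
  queue_interval: 1_000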

Finally, can you check protocols are consolidated in production?

Yes, they are. I SSH-ed into the running app in production and executed the command you suggested. It returns true.

@josevalim
Member

@take-five perfect, thanks for the quick follow up!

@josevalim
Member

Hi @take-five. We discussed this internally and unfortunately nothing came to mind as the root cause of this issue, so the next step would be to isolate it further - which we know can be very hard.

It is also very interesting that it happens at a certain frequency and always around the same value. Do you have anything periodically being executed in your app?

Other than that, can you reproduce it in development?

@take-five
Author

take-five commented Jan 22, 2019

Hi @josevalim, thanks for the follow-up.

It is also very interesting that it happens at a certain frequency and always around the same value

Actually, I couldn't find any regularity in these spikes; they seem very random, as you can see in the first graph in the opening post. The second graph covers a wider time range and does have some regularity, but that's related to how traffic volume is distributed over the hours of the day. The second graph is there to show how things changed after upgrading to Ecto 3.

Do you have anything periodically being executed in your app?

One app has periodically executing tasks, the other doesn't. Given that performance changed in both apps after upgrading to Ecto 3, I don't think it's related.

Other than that, can you reproduce it in development?

To some extent, but I'm not sure if my method is correct.

  1. I made the application log queries that take longer than 10 ms.
  2. I turned on logging of all queries in Postgres.
  3. I ran a very cheap query, which normally takes a fraction of a millisecond to complete, 100k times. Among these 100k queries I consistently get 5-15 outliers that take more than 10 ms to complete.

This is what I see in application log:

queue_time=0 decode_time=0 query_time=14
queue_time=0 decode_time=0 query_time=10
queue_time=0 decode_time=0 query_time=17
queue_time=0 decode_time=0 query_time=21
queue_time=0 decode_time=0 query_time=13
queue_time=0 decode_time=0 query_time=32

This is what I see in the DB log:

$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 6.105 ms
[unknown] LOG:  duration: 1.356 ms
[unknown] LOG:  duration: 4.686 ms
[unknown] LOG:  duration: 3.169 ms
[unknown] LOG:  duration: 1.116 ms
[unknown] LOG:  duration: 2.075 ms
[unknown] LOG:  duration: 1.416 ms
[unknown] LOG:  duration: 4.896 ms
[unknown] LOG:  duration: 2.460 ms
[unknown] LOG:  duration: 7.358 ms
[unknown] LOG:  duration: 2.143 ms
[unknown] LOG:  duration: 3.013 ms
[unknown] LOG:  duration: 2.189 ms
[unknown] LOG:  duration: 7.270 ms
[unknown] LOG:  duration: 3.284 ms
[unknown] LOG:  duration: 4.395 ms
[unknown] LOG:  duration: 4.016 ms
[unknown] LOG:  duration: 1.784 ms
[unknown] LOG:  duration: 1.773 ms
[unknown] LOG:  duration: 2.780 ms

As you can see, all queries took less than 8 ms to complete in the DB, nothing close to the 32 ms seen from the application's perspective 😃

In development I run the same PostgreSQL version as in production, and it runs on the same machine, so there's no network latency involved. Could it be that we have long GC pauses in the connection processes?
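
One thing I could try from a remote console to check the GC theory is the VM's system monitor; a rough sketch (the 10 ms threshold is arbitrary):

# Ask the VM to report any process that spends more than 10 ms in a single
# garbage collection run or is scheduled out for more than 10 ms.
:erlang.system_monitor(self(), [{:long_gc, 10}, {:long_schedule, 10}])

receive do
  {:monitor, pid, event, info} ->
    IO.inspect({event, pid, Process.info(pid, [:registered_name, :current_function]), info})
after
  60_000 -> :nothing_observed_in_60s
end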

@take-five
Author

take-five commented Jan 22, 2019

After we reverted one of our apps to Ecto 2 in production, the graph looks stable again:
[graph: max query time after reverting to Ecto 2]

And the 95th percentile looks like this (the big hump is a consequence of the restart, I guess):
[graph: 95th percentile query latency after the revert]

So it's not only about latency spikes, but also a general performance degradation (even if it's not that big in absolute numbers).

@josevalim, let me know if I can provide any additional information

@josevalim
Member

@take-five if you take the example app in ecto's repository, add some data to it, and run a bunch of simple queries, can you reproduce the issue?

@josevalim
Member

To be clear, by "reproduce the issue" I mean reproduce, in development, the results you posted in your comment above.

@take-five
Author

take-five commented Jan 22, 2019

@josevalim I ran the same test on the example app. I had to make some modifications, e.g. add instrumentation code and change the default logger level to INFO (by the way, mix.exs in the example app is missing the ecto_sql dependency).

I uploaded the instrumentation code here: https://gist.github.com/take-five/f62ba41adfb5c81d12f6c090283427f1
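
The missing dependency fix amounts to roughly this in the example app's mix.exs (versions are illustrative):

defp deps do
  [
    # ecto_sql was the missing piece; postgrex is the adapter the example uses
    {:ecto_sql, "~> 3.0"},
    {:postgrex, "~> 0.14"}
  ]
end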

I ran the test twice, here's the result from the second run:

iex(2)> (1..100_000) |> Enum.each(fn _ -> Friends.Repo.get!(Friends.Person, 1) end)
19:44:03.657 [warn]  query_time=10
19:44:13.606 [warn]  query_time=11
19:44:21.507 [warn]  query_time=17
19:44:21.708 [warn]  query_time=21
19:44:22.354 [warn]  query_time=10

This time there are fewer outliers, but I guess that's because the table has only 3 rows and 4 columns.

And here are the results from the DB log:

$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 2.042 ms
[unknown] LOG:  duration: 4.174 ms

@take-five
Author

take-five commented Jan 22, 2019

I also ran the same test on Ecto 2.2 and the results are slightly worse, but this time there are slow queries in the DB log:

iex(3)> (1..100_000) |> Enum.each(fn _ -> Friends.Repo.get!(Friends.Person, 1) end)

20:11:16.174 [warn]  query_time=18
20:11:20.515 [warn]  query_time=14
20:11:20.670 [warn]  query_time=47
20:11:30.422 [warn]  query_time=10
20:11:30.433 [warn]  query_time=10
20:11:31.200 [warn]  query_time=14
20:11:41.660 [warn]  query_time=11
$ grep duration /usr/local/var/log/postgres.log | grep -v "duration: 0"
[unknown] LOG:  duration: 13.873 ms
[unknown] LOG:  duration: 39.876 ms
[unknown] LOG:  duration: 2.380 ms
[unknown] LOG:  duration: 14.381 ms

It makes me think that my test method wasn't correct.

@josevalim
Member

Thanks @take-five. Yes, I would guess those are about garbage collection, especially because they are running in the same process. We can try increasing the amount of data and see if they happen more often, as that would make it clearer that it is indeed caused by garbage collection.

In this case, query_time is very poorly named, because it is not the time of the query but rather how long we spent with the database connection.

I asked earlier about the whole request/response time. Does it become worse on average with Ecto 3.0? Or is it roughly the same? If it is the same, then it could mean that for some reason garbage collection is being triggered more often inside Ecto, while before it would happen outside. However, if you are seeing increases or spikes in the request/response time, then it is definitely something to worry about.

@take-five
Author

I asked earlier about the whole request/response time. Does it become worse on average with Ecto 3.0?

Yes, it became worse; we've been observing the same spikes on the application response time graphs since we deployed Ecto 3.0.

[graph: application response time showing the same spikes]

@josevalim
Member

Thanks @take-five. Unfortunately I am not sure what the next steps are. Do you have any suggestions?

One option is to increase the amount of data in the example app and see if that reproduces it, but that is just a guess.

@fishcakez do you have any suggestions?

@take-five
Author

Out of the 3 dependencies that have been upgraded (ecto, db_connection, postgrex), which one do you think is most likely the culprit?

@josevalim
Member

@take-five in a way it doesn't matter, because they have to be upgraded together as we worked on them together. But if I had to guess, ecto or db_connection; the work on postgrex was minimal.

@take-five
Author

Do you have any suggestions on how to approach this problem? Maybe some ideas about how to catch and profile these outliers on a live system?

@josevalim
Member

@take-five the only next suggestion I have is to profile some of those queries using mix profile.eprof or mix profile.cprof. Are you deploying using Mix or Distillery?
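
If Mix is not available inside the release, the same kind of measurement can be taken from a remote console with OTP's :eprof directly; a rough sketch, where the repo and query are placeholders:

# Rough equivalent of mix profile.eprof from a remote console, using OTP's
# :eprof (shipped in the :tools application). MyApp.Repo/MyApp.Person are placeholders.
:eprof.start()
:eprof.profile([], fn -> MyApp.Repo.get!(MyApp.Person, 1) end)
:eprof.analyze()
:eprof.stop()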

@take-five
Author

Hi @josevalim,

I deployed observer_cli to our production system, compared the picture for the Ecto 2 and Ecto 3 applications, and found a strange pattern.

Here's what we can see in the Ecto 3 application:

|205|<0.1838.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               386947| 0        |erlang:hibernate/3               |
|206|<0.1837.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               373573| 0        |erlang:hibernate/3               |
|207|<0.1836.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               383517| 0        |erlang:hibernate/3               |
|208|<0.1835.0>  |    3.1797 KB |Elixir.DBConnection.Connection:init/1 |               373575| 0        |erlang:hibernate/3               |

And here's what we can see in the Ecto 2 application:

|29 |<0.908.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108717| 0        |gen_server:loop/7                |
|30 |<0.906.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108716| 0        |gen_server:loop/7                |
|31 |<0.905.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108712| 0        |gen_server:loop/7                |
|32 |<0.904.0>   |   16.6289 KB |Elixir.DBConnection.Connection:init/1 |               108720| 0        |gen_server:loop/7                |

I assume this line puts the connection process into hibernation. Could it be the culprit?
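
As I understand it, the pattern is roughly what a plain GenServer does when it returns :hibernate from a callback (illustration only, not the actual db_connection code): the heap gets compacted, which would explain the much smaller memory of the Ecto 3 connections above, and the next message pays the cost of waking the process up.

defmodule HibernateExample do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}, :hibernate}

  @impl true
  # Hibernate again after every call; the process stays in erlang:hibernate/3
  # until the next message arrives.
  def handle_call(:ping, _from, state), do: {:reply, :pong, state, :hibernate}
end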

@josevalim
Member

Interesting. Could you try a fork of db_connection without that hibernate line and see if it improves things? Maybe @fishcakez has some other ideas too.

@take-five
Author

Yeah, I'll experiment with db_connection this week and let you know about the results.

@take-five
Author

I've let our application run with the patched db_connection for a bit, but there's no difference - we're still experiencing the same spikes.

@josevalim
Member

josevalim commented Feb 8, 2019 via email

@fishcakez
Member

Hibernation is out of band of the normal request path; it's only called after a connect and after a ping. It is a good use of hibernate because those processes aren't used unless we see a disconnect, an idle connection, or a stop. Idle connections do not occur concurrently unless the ping time is very, very big (on the order of seconds, which should not happen in a healthy state).
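
For reference, the ping cadence mentioned above is driven by the pool's idle interval, which can be passed through the repo configuration if it ever needs tuning. The value below is illustrative; the default is on the order of a second.

config :app, App.Repo,
  pool_size: 15,
  # how often an idle connection is pinged (ms); hibernation happens after each ping
  idle_interval: 1_000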

@take-five
Author

Hi @josevalim, sorry for the slow response; I finally managed to get back to this problem and run some experiments on a live system.

I added the :tools application and the ExProf package to our system and ran some benchmarks from the console on the production instance. Then I ran the same query against an empty table 2000 times and collected traces for the 3 slowest of them.

I uploaded eprof reports here: https://gist.github.com/take-five/a8d5897dee3adf3617fc9f40ed4f1e60

Unfortunately, I couldn't catch any long-running queries from the console. Running queries under eprof takes quite a lot of time; it slows down execution significantly (by two orders of magnitude). Also, out of these 3 traces, the bottom 2 look like they were caused by preemption or garbage collection.

@josevalim
Member

Thanks @take-five! Can you please share how and which command was run? From this code snippet, it seems that the eprof report was "corrupted" with evaluation information. Maybe we need to run this slightly differently.

@take-five
Author

take-five commented Feb 22, 2019

Sure, this is what I ran on the server (sorry for the spaghetti-ish code):

import ExProf.Macro
operator = OperatorAuth.Repo.get!(OperatorAuth.Operator, "aa8ab5f1-779d-475e-a1a2-750a375c5109")

capture_io = fn fun ->
  original_gl = Process.group_leader()
  {:ok, capture_gl} = StringIO.open("", capture_prompt: true)

  try do
    Process.group_leader(self(), capture_gl)
    fun.()
  after
    #StringIO.close(capture_gl)
    Process.group_leader(self(), original_gl)
  end
end

profile! = fn threshold_ms ->
  {records, {total_time, result}} = capture_io.(fn ->
    profile do
      :timer.tc(fn -> OperatorAuth.OperatorRepository.allowed_login_from?(operator, "127.0.0.1") end)
    end
  end)

  call_time = Enum.reduce(records, 0, &(&1.time + &2)) / 1000

  if call_time > threshold_ms do
    {:slow, total_time / 1000, call_time, records}
  else
    {:fast, call_time, records}
  end
end

crunch! = fn threshold_ms, iterations ->
  started_at = System.monotonic_time()

  {_, _, records} = Enum.reduce_while(1..iterations, {0, 0, [nil, nil, nil]}, fn n, {sum, longest, [records1, records2, records3]} ->
    case profile!.(threshold_ms) do
      {:slow, total_time, call_time, records} ->
        {:halt, {sum, call_time, records}}

      {:fast, call_time, new_records} ->
        records =
          if call_time > longest do
            [new_records, records1, records2]
          else
            [records1, records2, records3]
          end

        if rem(n, 50) == 0 do
          elapsed = System.convert_time_unit(System.monotonic_time() - started_at, :native, :second)

          IO.puts "#{n} samples analyzed, call time avg = #{sum / 50} ms; " <>
                  "call time max = #{max(call_time, longest)} ms; " <>
                  "time elapsed = #{elapsed} s"
          {:cont, {0, max(call_time, longest), records}}
        else
          {:cont, {sum + call_time, max(call_time, longest), records}}
        end
    end
  end)

  records = Enum.reject(records, &is_nil/1)

  IO.puts "TOP #{length(records)} traces"
  IO.puts "-----------------------------"
  Enum.each(records, fn r ->
    ExProf.Analyzer.print(r)
    IO.puts "----"
    IO.puts "call_time: #{Enum.reduce(r, 0, &(&1.time + &2)) / 1000}"
    IO.puts "----"
    IO.puts ""
  end)
end

crunch!.(50, 1000)

I ran this code from the iex console on a live system (operator_auth/bin/operator_auth remote_console)

@josevalim
Member

josevalim commented Feb 22, 2019 via email

@take-five
Author

Did you run this on IEx or on a separate file?

I ran this in IEx (operator_auth/bin/operator_auth remote_console)

@josevalim
Member

josevalim commented Feb 22, 2019 via email

@take-five
Author

Hi @josevalim,

I compiled everything as a module and ran the benchmark again. Again, I couldn't get any really slow traces; here are the 3 slowest ones: https://gist.github.com/take-five/75455509720a70fbc3c92572ab77692e

@josevalim
Member

Thank you! It is so weird that so much time is spent on ETS. Quick question: how did you install and compile Erlang?

@take-five
Author

We use this Dockerfile to build the production image:

FROM elixir:1.7.4-alpine AS build

ENV MIX_ENV=prod

RUN apk add --no-cache git make gcc libc-dev

# Install Hex+Rebar
RUN mix local.hex --force && \
    mix local.rebar --force

WORKDIR /app

# Cache elixir deps
COPY mix.exs mix.lock ./

RUN mix deps.get --only prod
COPY config ./config
RUN mix deps.compile

COPY . .
RUN mix release --env=prod

# Runtime container
FROM alpine:3.9

RUN apk --no-cache add bash openssl

ENV MIX_ENV=prod REPLACE_OS_VARS=true

COPY --from=build /app/rel/operator_auth /operator_auth

CMD ["./operator_auth/bin/operator_auth", "foreground"]

We use the ERTS included in the release.

Here's our rel/vm.args.eex file:

## Name of the node
-name <%= release_name %>@127.0.0.1

## Cookie for distributed erlang
-setcookie <%= release.profile.cookie %>

## Enable kernel poll and a few async threads
+K true
+A 100

## Increase number of concurrent ports/sockets
-env ERL_MAX_PORTS 4096

# Enable SMP automatically based on availability
-smp auto

@take-five
Author

Quick update.

I've taken a look at our graphs and it turns out that we don't have this issue anymore. The latency spikes disappeared after our infrastructure team performed a failover exercise and migrated everything from AWS availability zone us-east-1b to us-east-1a. We don't really know what happened, but now everything works normally.

Here's how the graph looks; you can clearly see the point when the failover happened.
[graph: query latency before and after the AZ failover]

So probably Ecto is not to blame, although it's still strange that Ecto 2 worked fine (or seemed to work fine) in AZ us-east-1b.

@michalmuskala
Member

Since I don't think there's anything actionable in this issue right now, I'm going to close it. If the observed behaviour is replicated again, we can reopen then.
