Latency spikes after updating to Ecto 3 #2888
Comments
Hi @take-five! Thank you for the report. We need a bit more information:
We will be glad to look into this. |
@take-five perfect, thanks for the quick follow up! |
Hi @take-five. We discussed this internally and unfortunately nothing came to mind as the root cause of this issue, so the next step would be to isolate it further - which we know can be very hard. It is also very interesting that it happens at a certain frequency and always around the same value. Do you have anything periodically being executed in your app? Other than that, can you reproduce it in development? |
Hi @josevalim, thanks for the follow-up.
Actually, I couldn't find any regularity in these spikes; it seems very random, as you can see on the first graph in the opening post. The second graph represents a wider range and has some regularity, but that's related to how traffic volume is distributed over the daily hours. The second graph is here to show how things changed after upgrading to Ecto 3.
One app has periodically executing tasks, the other doesn't. Given that performance changed in both apps after upgrading to Ecto 3, I don't think it's related.
To some extent, but I'm not sure if my method is correct.
This is what I see in application log:
This is what I see in the DB log:
As you can see, all queries took less than 8 ms to complete in the DB, let alone the 32 ms seen from the application's perspective 😃 In development I run the same PostgreSQL version as in production, and it runs on the same machine, so there's no network latency involved. Could it be that we have long GC pauses on the connection process? |
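A minimal sketch of one way to check that GC theory (GcWatcher is a made-up name and the 10 ms threshold is arbitrary): ask the VM's system monitor to report any process whose garbage-collection pause exceeds the threshold.

# Report any process on this node that spends more than 10 ms in a single GC run.
defmodule GcWatcher do
  def start do
    spawn(fn ->
      :erlang.system_monitor(self(), [{:long_gc, 10}])
      loop()
    end)
  end

  defp loop do
    receive do
      {:monitor, pid, :long_gc, info} ->
        IO.inspect({pid, info}, label: "long_gc")
    end

    loop()
  end
end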
After we reverted one of our apps to Ecto 2 in production, the graph looks stable again: And the 95th percentile looks like this (this big hump is a consequence of the restart, I guess): So it's not only about latency spikes, but also a general performance degradation (even if it's not that big in absolute numbers). @josevalim, let me know if I can provide any additional information |
@take-five if you take the example app in |
To be clear, by reproduce the issue, I mean, reproduce the results you posted in your comment above in development. |
@josevalim I ran the same test on the example app. I had to make some modifications, e.g. add instrumentation code and change the default logger level to INFO (by the way, I uploaded the instrumentation code here: https://gist.github.com/take-five/f62ba41adfb5c81d12f6c090283427f1). I ran the test twice, here's the result from the second run:
This time there are fewer outliers, but I guess that's because the table has only 3 rows and only 4 columns. And here are the results from the DB log:
|
I also ran the same test on Ecto 2.2 and the results are slightly worse, but this time there are slow queries in the DB log
It makes me think that the test wasn't correct |
Thanks @take-five. Yes, I would guess those are about garbage collection, especially because they are running in the same process. We can try increasing the amount of data and see if they happen more often, as that would make it clearer it is indeed caused by garbage collection. In this case, I asked earlier about the whole request/response time. Does it become worse on average with Ecto 3.0? Or is it roughly the same? If it is the same, then it could mean that for some reason garbage collection is being triggered more often inside Ecto, while before it would happen outside. However, if you are seeing increases or spikes in the request/response time, then it is definitely something to worry about. |
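A rough way to probe that (MyApp.Repo and MyApp.User are placeholder names, not from this thread): compare the calling process's GC counters immediately before and after a single query.

# If the minor GC counter jumps across the call, collections are happening
# inside the Repo call in the calling process.
{:garbage_collection, before_gc} = Process.info(self(), :garbage_collection)
MyApp.Repo.all(MyApp.User)
{:garbage_collection, after_gc} = Process.info(self(), :garbage_collection)
IO.inspect({before_gc[:minor_gcs], after_gc[:minor_gcs]}, label: "minor_gcs before/after")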
Thanks @take-five. Unfortunately I am not sure what the next steps are. Do you have any suggestions? One option is to increase the amount of data in the example app and see if it can reproduce it, but that is just a guess. @fishcakez do you have any suggestions? |
Out of the 3 dependencies that have been upgraded (ecto, db_connection, postgrex), which one do you think is most likely the culprit? |
@take-five in a way it doesn't matter because they have to be upgraded together, as we worked on them together. But if I had to guess, ecto or db_connection; the work on postgrex was minimal. |
Do you have any suggestions on how to approach this problem? Maybe some ideas about how to catch and profile these outliers on a live system? |
@take-five the only next suggestion I have is to profile some of those queries using |
Hi @josevalim, I deployed Here's what we can see in the Ecto 3 application:
And here's what we can see in the Ecto 2 application:
I assume this line puts the connection process into hibernate mode. Could it be the culprit? |
Interesting. Could you try a fork of |
Yeah, I'll experiment with |
I've let our application run with the patched |
If you start the console in production and run some of those queries, can you reproduce those values from the console? If so, one option is to add :mix and :tools to your release and invoke this function: https://hexdocs.pm/mix/master/Mix.Tasks.Profile.Eprof.html#profile/2
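For reference, a rough sketch of what that could look like from a remote console (MyApp.Repo and the example query are placeholders; it assumes :mix and :tools are indeed in the release):

import Ecto.Query, only: [from: 2]

# Profile a single, representative query; the eprof report is printed to the console.
Mix.Tasks.Profile.Eprof.profile(fn ->
  MyApp.Repo.all(from u in "users", select: u.id, limit: 10)
end, [])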
|
Hibernation is out of band of the normal request path; it's only called after a connect and after a ping. It is a good use of hibernate because those processes aren't used unless we see a disconnect/idle connection or stop. Idle connections do not occur concurrently unless the ping time is very, very big (on the order of seconds, which should not happen in a healthy state). |
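For context, a generic illustration of the hibernate pattern being discussed (this is not db_connection's actual code): returning :hibernate from a GenServer callback triggers a full-sweep garbage collection and shrinks the process heap, so only the next message pays a small wake-up cost.

defmodule Example.IdleWorker do
  use GenServer

  def start_link(opts \\ []), do: GenServer.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok), do: {:ok, %{}}

  @impl true
  def handle_call(:ping, _from, state) do
    # Hibernate after an infrequent, out-of-band message such as a ping.
    {:reply, :pong, state, :hibernate}
  end
end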
Hi @josevalim, sorry for the late response; I finally managed to get back to this problem and run some experiments on a live system. I added I uploaded Unfortunately, I couldn't catch any long-running queries from the console. Running queries with |
Thanks @take-five! Can you please share how and which command was run? From this code snippet, it seems that the eprof report was "corrupted" with evaluation information. Maybe we need to run this slightly differently. |
Sure, this is what I ran on the server (sorry for spaghetti-ish code):

import ExProf.Macro
operator = OperatorAuth.Repo.get!(OperatorAuth.Operator, "aa8ab5f1-779d-475e-a1a2-750a375c5109")
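# Redirect this process's output to a StringIO while the profile runs, so
# ExProf's printed report doesn't flood the console.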
capture_io = fn fun ->
original_gl = Process.group_leader()
{:ok, capture_gl} = StringIO.open("", capture_prompt: true)
try do
Process.group_leader(self(), capture_gl)
fun.()
after
#StringIO.close(capture_gl)
Process.group_leader(self(), original_gl)
end
end
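# Profile one invocation of the target query under ExProf and classify the run
# as slow or fast against the given threshold.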
profile! = fn threshold_ms ->
{records, {total_time, result}} = capture_io.(fn ->
profile do
:timer.tc(fn -> OperatorAuth.OperatorRepository.allowed_login_from?(operator, "127.0.0.1") end)
end
end)
call_time = Enum.reduce(records, 0, &(&1.time + &2)) / 1000
if call_time > threshold_ms do
{:slow, total_time / 1000, call_time, records}
else
{:fast, call_time, records}
end
end
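# Repeat the profiled call up to `iterations` times, keep the slowest traces,
# stop early if a run exceeds the threshold, and print the kept traces.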
crunch! = fn threshold_ms, iterations ->
started_at = System.monotonic_time()
{_, _, records} = Enum.reduce_while(1..iterations, {0, 0, [nil, nil, nil]}, fn n, {sum, longest, [records1, records2, records3]} ->
case profile!.(threshold_ms) do
{:slow, total_time, call_time, records} ->
{:halt, {sum, call_time, records}}
{:fast, call_time, new_records} ->
records =
if call_time > longest do
[new_records, records1, records2]
else
[records1, records2, records3]
end
if rem(n, 50) == 0 do
elapsed = System.convert_time_unit(System.monotonic_time() - started_at, :native, :second)
IO.puts "#{n} samples analyzed, call time avg = #{sum / 50} ms; " <>
"call time max = #{max(call_time, longest)} ms; " <>
"time elapsed = #{elapsed} s"
{:cont, {0, max(call_time, longest), records}}
else
{:cont, {sum + call_time, max(call_time, longest), records}}
end
end
end)
records = Enum.reject(records, &is_nil/1)
IO.puts "TOP #{length(records)} traces"
IO.puts "-----------------------------"
Enum.each(records, fn r ->
ExProf.Analyzer.print(r)
IO.puts "----"
IO.puts "call_time: #{Enum.reduce(r, 0, &(&1.time + &2)) / 1000}"
IO.puts "----"
IO.puts ""
end)
end
crunch!.(50, 1000)

I ran this code from the |
Did you run this in IEx or from a separate file?
|
I ran this in IEx ( |
That evaluates the code and that interferes with the profiling. You can try to put all of it inside a module:

defmodule Foo do
  def run do
    ...
  end
end

And invoke Foo.run. That should generate more pristine results. :)
If you can give it a try, it would be much appreciated, thanks!
|
Hi @josevalim, I compiled everything as a module and ran the benchmark again. Again, I couldn't get any really slow traces; here are the 3 slowest ones: https://gist.github.com/take-five/75455509720a70fbc3c92572ab77692e |
Thank you! It is so weird that so much time is spent on ETS. Quick question: how did you install and compile Erlang? |
We use this Dockerfile to build the production image:
We use the ERTS included in the release. Here's ours
|
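For what it's worth, a quick way to confirm which OTP/ERTS the running release actually ships (standard :erlang calls, nothing specific to this app):

# Run in a remote console attached to the release.
IO.puts("OTP release: " <> to_string(:erlang.system_info(:otp_release)))
IO.puts("ERTS version: " <> to_string(:erlang.system_info(:version)))
IO.puts(:erlang.system_info(:system_version))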
Since I don't think there's anything actionable in this issue right now, I'm going to close it. If the observed behaviour is replicated again, we can reopen then. |
Precheck
Environment
Current behavior
After updating one of our applications to Ecto 3 we're observing unexpected latency spikes on random queries.
In the application I created a subscription to Ecto events via Telemetry which emits data for each query to Datadog. Here's what we can see all the time (this graph shows max query time):
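(The subscription itself might look roughly like the sketch below; the event name, repo prefix, and Statix-based reporter are assumptions rather than the app's actual code, and it assumes a telemetry version that passes a measurements map.)

defmodule MyApp.RepoInstrumenter do
  # Attach once at application start.
  def attach do
    :telemetry.attach(
      "ecto-query-timing",
      [:my_app, :repo, :query],
      &__MODULE__.handle_event/4,
      nil
    )
  end

  def handle_event([:my_app, :repo, :query], measurements, metadata, _config) do
    # Ecto 3 reports timings in native units; convert before shipping to Datadog.
    ms = System.convert_time_unit(measurements[:query_time] || 0, :native, :millisecond)
    MyApp.Statix.histogram("ecto.query_time", ms, tags: ["source:#{metadata[:source]}"])
  end
end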
Here's the graph from another application for which we also updated Ecto to 3.0
You can see that the max execution time jumped on Tuesday the 15th, which correlates with the time the changes were deployed.
I also set up both the DB and the application to log queries that took longer than 100 ms. I can see these slow queries in the application logs, but not in the DB logs.
Also note that the application is mostly read-only and the latency spikes happen even with "SELECT" queries (this rules out lock contention on the DB side). The queries are very simple and usually fast, and the tables have all the necessary indexes.
Expected behavior
Query latency is stable, without spikes.