Analytics: refactoring for speed improvements #3072
Conversation
```ruby
# cache all stats in the date range for the requested user or organization
cache_key = "analytics-for-dates-#{start_date}-#{end_date}-#{user_or_org.class.name}-#{user_or_org.id}"

Rails.cache.fetch(cache_key, expires_in: 7.days) do
```
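A minimal sketch of the idea behind this change, using a plain Hash in place of `Rails.cache` (the key format mirrors the one above; the user class and the computed value are illustrative stand-ins, not the actual DEV code):

```ruby
require "date"

# Sketch of "one cache entry per whole date range": the first request for a
# range computes and stores the result; repeats of the same range hit the cache.
CACHE = {}

def fetch_range_stats(user_id, start_date, end_date)
  key = "analytics-for-dates-#{start_date}-#{end_date}-User-#{user_id}"
  CACHE[key] ||= begin
    # Stand-in for the single GROUP BY query: one entry per day in the range.
    (Date.parse(start_date)..Date.parse(end_date)).map { |d| [d.to_s, 0] }.to_h
  end
end

stats = fetch_range_stats(42, "2019-04-01", "2019-04-07")
```

Because the key embeds both dates, asking for "the past seven days" tomorrow produces a different key, so the stale entry is simply never read again and expires on its own.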
Since there's now only one cache entry per range, I kept the expiry at the upper bound, 7 days, but we can probably lower it to 3 if we see that API clients aren't going to keep asking for the same range over and over.
What would happen if someone requested the past seven days? That would cache the most recent day for a long time, right?
I changed the cache key to include both the start and end date, so if the past seven days are requested twice on the same day, the second request comes back from the cache. The next day the cache key changes. I believe it's a good compromise, but I'm not 100% sure it's the right one.
This would help dashboards with fixed dates without killing performance by caching each day separately.
I think we can revisit the caching policy in the future because it really depends on how all of this behaves in production once we have enough data to understand which caching strategy would work best.
Caching is hard, no doubt about that ehhe
The description and the work done are truly impressive.
Wow, amazing work @rhymes.
All of this looks awesome! 😁
```ruby
def change
  # concurrent creation avoids downtime during production deploy because
  # indexes get created in the background
```
This is really great to know!
Yeah, it's a neat feature.
With 9.6 they went beyond that introducing basic parallelism features (some queries are transparently run in parallel by the DB) and expanded them in versions 10 and 11 (more types of parallel queries but also parallel joins and parallel creation of other types of indexes, you can also set the number of workers and other things). More details in my changelog issue #2801
Let's do this
What type of PR is this? (check all applicable)
Description
This refactoring began when I started taking a look at how the whole analytics worked in the Pro dashboard and noticed we could compute it faster. I've also noticed that sometimes it takes multiple refreshes in production for the dashboard to show up, due to performance issues.
This PR contains two things: a refactoring of the logic in the `AnalyticsController` for the API and speed improvements for the analytics themselves.

Solutions taken into account
I've talked about possible improvements in #2311, and the solutions envisioned were:
So, this PR makes better use of SQL (from 64 queries down to 4 for the grouping by day) and adds indexes on columns that are filtered on repeatedly (the indexes are added concurrently by PostgreSQL via the Rails migration, so as not to disrupt normal production traffic).
Walkthrough of the steps taken and improvements
First things first, I had to set a baseline. You can't make things "faster" without knowing how slow or fast they are. So I wrote a script that uses Ruby's Benchmark module to get a ballpark of what was going on:
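The original script isn't shown here; a minimal sketch of the measurement technique, with a hypothetical `compute_analytics` standing in for the real `AnalyticsService` calls:

```ruby
require "benchmark"

# Hypothetical stand-in for the real analytics computation: any deterministic
# workload works for demonstrating how the timings were collected.
def compute_analytics(days)
  (1..days).map { |d| d * d }.sum
end

ranges = {
  "last week"           => 7,
  "last month"          => 30,
  "since the beginning" => 365,
}

results = {}
Benchmark.bm(20) do |x|
  ranges.each do |label, days|
    x.report(label) { results[label] = compute_analytics(days) }
  end
end
```

`Benchmark.bm` prints the user, system, total and real times per labeled block, which is exactly the four-column format of the tables below.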
This was the result on my meager development instance (not too much data in the DB):
since the beginning refers to `2019-04-01`, which is the date DEV released Analytics.

Step 1 - refactor totals
After refactoring the `AnalyticsService.totals` calculations I benchmarked this timing:

a 1.17x speed increase over the master branch. Not much, but it's a start, and there were fewer queries all around.
Step 2 - refactor grouping by day
This took considerably more time to achieve, but after playing around a bit with PostgreSQL and its query plans, the number of queries went down from 64 to 4 (one per metric).
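To illustrate the shape of such a query (table and column names here are assumptions for the example, not DEV's actual schema): a single `GROUP BY` on the truncated date returns one row per day, replacing one query per day per metric.

```ruby
# Illustrative only: builds the kind of per-day aggregation query that
# replaces N per-day queries. Not the actual DEV implementation.
def daily_counts_sql(table:, user_id:, start_date:, end_date:)
  <<~SQL
    SELECT date_trunc('day', created_at) AS day, COUNT(*) AS total
    FROM #{table}
    WHERE user_id = #{user_id.to_i}
      AND created_at BETWEEN '#{start_date}' AND '#{end_date}'
    GROUP BY day
    ORDER BY day
  SQL
end

sql = daily_counts_sql(table: "page_views", user_id: 42,
                       start_date: "2019-04-01", end_date: "2019-04-07")
puts sql
```

With one such query per metric, four metrics cost four round trips to the database regardless of how wide the date range is.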
These are the times measured:
```
                               user       system     total      real
last week                      3.164215   0.148648   3.312863  ( 4.885400)
last month                     5.232762   0.173115   5.405877  ( 7.152623)
since the beginning            7.509549   0.204578   7.714127  ( 9.507002)
date too much in the past     39.811000   0.820973  40.631973  (44.212027)
```
last week: 8.91x speed increase over master branch
last month: 23.69x speed increase over master branch
since the beginning: 5.87x speed increase over master branch
date too much in the past is just me going back to `2018-01-01` to check how slow it would be with such a wide range.

The speed increase is definitely significant, but there was something bothering me. That's when I thought about caching.
What's the deal with caching in the context of analytics?
Caching of analytics in the master branch is done once per day in the date range, which can potentially result in tens or even hundreds of network calls to the caching servers. Yes, caching helps, because (if I'm not mistaken) DEV is using a globally distributed pool of memcached servers, so the data is likely to be near you. BUT going back and forth to the caching servers might be slower than asking the DB directly if the range is big enough: the round trips, added to the caching servers' response time, might cost more than having the web server fetch the data from the DB (which is likely near the server) and send it back to the client.
I still think caching is important, but caching each day separately might be counterproductive. Since the data is now loaded in bulk from the DB (`GROUP BY` returns one row per day in the same query), it might make sense to cache the whole result set instead of each row separately.

Step 3 - what happens if we remove all cache calls
```
                               user       system     total      real
last week                      2.369785   0.201472   2.571257  ( 3.973177)
last month                     2.793212   0.148250   2.941462  ( 4.483640)
since the beginning            3.373822   0.158784   3.532606  ( 5.202688)
date too much in the past      5.493960   0.209581   5.703541  ( 7.480122)
```
last week: 11.48x speed increase over master branch
last month: 43.52x speed increase over master branch
since the beginning: 67.57x speed increase over master branch
As I suspected, the bigger the date range, the more counterproductive N cache calls become. Given that my development machine runs totally different hardware and software than the DEV production site, the times will differ and the speedups are likely lower in production, but this confirms my reasoning from the previous section.
Step 4 - add indexes
One thing that happens with analytics over many rows is that indexes start to matter (not so much in my development tests). Keep in mind that PostgreSQL can still decide not to use an index if the data set isn't big enough but, overall, they are a good idea.
What indexes? I took each query run by `totals` and `grouped_by_day` and checked the `WHERE`, `GROUP BY` and `FILTER` conditions. These are the fields I decided to add indexes on:

- `articles.published` (this is also going to benefit the entire website, since published articles is a super common query)
- `comments.created_at`
- `comments.score`
- `follows.created_at`
- `page_views.created_at`
- `reactions.created_at`
- `reactions.points`
Since DEV has been in production for a long time and those tables contain lots and lots of data (reactions and page views especially, I reckon), it's a good idea to (again :P) use the full power of PostgreSQL and add the indexes concurrently. That means PostgreSQL builds the indexes in the background without blocking normal writes to those tables. Indexes are normally built while holding a lock on the entire table, which can slow down the website during a migration. You can read more about all the details in the PostgreSQL documentation on the subject.
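A minimal sketch of what one such migration might look like, using one of the columns from the list above (the actual migrations in this PR may differ):

```ruby
# Sketch of a Rails migration adding an index concurrently.
# disable_ddl_transaction! is required because PostgreSQL's
# CREATE INDEX CONCURRENTLY cannot run inside a transaction block.
class AddIndexToPageViewsCreatedAt < ActiveRecord::Migration[5.2]
  disable_ddl_transaction!

  def change
    add_index :page_views, :created_at, algorithm: :concurrently
  end
end
```

The trade-off: the concurrent build is slower overall and must scan the table twice, but reads and writes to the table keep flowing while it runs.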
So, what are the benchmarked timings?
```
                               user       system     total      real
totals                         1.678849   0.194811   1.873660  ( 2.974526)
last week                      2.410151   0.109961   2.520112  ( 3.714314)
last month                     4.619584   0.132775   4.752359  ( 6.109809)
since the beginning            7.117115   0.159913   7.277028  ( 8.774819)
date too much in the past     36.347774   0.402848  36.750622  (38.826439)
```
totals: 1.02x speed increase over no indexes version
last week: 1.31x speed increase over no indexes version
last month: 1.13x speed increase over no indexes version
since the beginning: 1.06x speed increase over no indexes version
The reason these are only slightly faster than the version with no indexes is that some indexes were already in place (those on the id/type columns and the user id) and that I don't have a big enough data set, which, as previously said, can lead PostgreSQL to choose a sequential scan or similar operations.
Step 5 - with indexes and no cache calls
Since the per-day cache calls were still in place, I tried removing them:
```
                               user       system     total      real
last week                      1.870252   0.116198   1.986450  ( 3.161371)
last month                     2.725406   0.144293   2.869699  ( 4.278060)
since the beginning            2.924799   0.146779   3.071578  ( 4.523648)
date too much in the past      5.456498   0.215746   5.672244  ( 7.229783)
```
Again, with no cache calls the calculations are significantly faster, but the same considerations made above apply to this example.
last week: 1.29x speed increase over no cache and no indexes version
last month: 1.02x speed increase over no cache and no indexes version
since the beginning: 1.14x speed increase over no cache and no indexes version
Step 6 (and final step) - with indexes and one cache call
```
                               user       system     total      real
last week                      2.259850   0.134709   2.394559  ( 3.935191)
last month                     3.120629   0.161101   3.281730  ( 4.906886)
since the beginning            3.539331   0.181215   3.720546  ( 5.575235)
date too much in the past      8.353001   0.302818   8.655819  (11.406922)
```
If compared with the version with indexes and one cache call per day, the biggest difference shows up when the date range becomes large: 3 months in the past is 1.95x faster and a year and a half in the past is 4.24x faster.
If compared with the current master:
last week: 12.34x speed increase
last month: 39x speed increase
since the beginning: 64.12x speed increase
I hope this walkthrough was helpful 😁
Related Tickets & Documents
Related to #2311
Added to documentation?