
Analytics: refactoring for speed improvements #3072

Merged · 20 commits · Jun 10, 2019

Conversation

@rhymes (Contributor) commented Jun 7, 2019

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Documentation Update

Description

This refactoring began when I started taking a look at how the whole analytics feature worked in the Dashboard pro and noticed we could compute it faster. I've also noticed that in production it sometimes requires multiple refreshes to show up, due to performance issues.

This PR contains two things: a refactoring of the logic in the AnalyticsController for the API and speed improvements for the analytics themselves.

Solutions considered

I've talked about possible improvements in #2311, and the solutions envisioned were:

  • limiting the date range (not implemented here)
  • adding indexes to columns that are part of the filtering/where conditions (part of this PR)
  • using grouping in SQL (part of this PR)
  • pre-computing data in batches (not implemented here)
  • low-level caching (this is already in production, but as we can already see there are issues with caching each date separately. Each date means a separate network call to the caching server, which means an average of 30 HTTP calls for a month range and possibly hundreds for "infinity". I've changed it to a single cache call per range, but we'll talk more about caching later)

So, this PR contains a better usage of SQL (from 64 queries to 4 in the grouping by day) and adds indexes on the columns that are going to be filtered on repeatedly (the indexes are created concurrently by PostgreSQL in the Rails migration so as not to disrupt the normal production flow).

Walkthrough of the steps taken and the improvements

First things first, I had to set a baseline. You can't make things "faster" without knowing how slow or fast they are. So I wrote a script that uses Ruby's benchmark module to get a ballpark of what was going on:

require 'benchmark'

# disable logging to avoid measuring that
ActiveRecord::Base.logger = nil

user = User.find(11)

iterations = 1_000

Benchmark.bm do |bm|
  bm.report("totals") do
    as = AnalyticsService.new(user)
    iterations.times { as.totals }
  end
  bm.report("last week") do
    as = AnalyticsService.new(user, start_date: 1.week.ago.to_date.iso8601)
    iterations.times { as.grouped_by_day }
  end
  bm.report("last month") do
    as = AnalyticsService.new(user, start_date: 1.month.ago.to_date.iso8601)
    iterations.times { as.grouped_by_day }
  end
  bm.report("since the beginning") do
    as = AnalyticsService.new(user, start_date: "2019-04-01")
    iterations.times { as.grouped_by_day }
  end
  bm.report("date too much in the past") do
    as = AnalyticsService.new(user, start_date: "2018-01-01")
    iterations.times { as.grouped_by_day }
  end
end
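As an aside, a script like this can be run inside the full Rails environment with rails runner (the file path below is just an example):

bin/rails runner script/benchmark_analytics.rb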

This was the result on my meager development instance (not too much data in the DB):

                     user        system     total         real
totals               2.028109    0.231059   2.259168      (  3.594266)
last week            27.779555   1.734052   29.513607     ( 45.470222)
last month           120.701257  7.255624   127.956881    (206.690019)
since the beginning  225.478188  13.058602  238.536790    (384.704629)

since the beginning refers to 2019-04-01, which is the date DEV released Analytics.

Step 1 - refactor totals

After refactoring the AnalyticsService.totals calculations, I benchmarked this timing:

       user     system      total        real
totals  1.787045   0.128882   1.915927 (  2.999562)

a 1.17x speed increase over the master branch. Not much, but it's a start, and there were fewer queries all around.
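For context, the gist of this kind of refactoring is pushing the counting into PostgreSQL instead of loading rows and reducing them in Ruby. A minimal, hypothetical sketch (illustrative names, not the actual AnalyticsService code):

# one aggregate query instead of loading rows and summing in Ruby;
# article_ids is assumed to be computed earlier
row = Reaction.where(reactable_type: "Article", reactable_id: article_ids)
              .select("COUNT(*) AS total, COALESCE(SUM(points), 0) AS total_points")
              .take

row.total        # number of reactions
row.total_points # sum of their points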

Step 2 - refactor grouping by day

This took considerably more time to achieve, but after playing around a bit with PostgreSQL and its query plans, the number of queries went down from 64 to 4 (one per metric).
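The core idea is to let PostgreSQL's GROUP BY return one row per day, so a whole range comes back in a single round trip, and to use the FILTER clause so one query can compute several conditional aggregates at once. A sketch of the shape of these queries (illustrative names, not the PR's exact code):

# one query yields a hash with one entry per day in the range
page_views_per_day = PageView.where(article_id: article_ids)
                             .where("created_at >= ?", start_date)
                             .group("DATE(created_at)")
                             .count

# PostgreSQL's FILTER clause computes conditional aggregates in one pass
rows = Reaction.where(reactable_type: "Article", reactable_id: article_ids)
               .group("DATE(reactions.created_at)")
               .select("DATE(reactions.created_at) AS day,
                        COUNT(*) AS total,
                        COUNT(*) FILTER (WHERE points > 0) AS positive")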

These are the times measured:

                          user        system      total       real
last week                 3.164215    0.148648    3.312863    (  4.885400)
last month                5.232762    0.173115    5.405877    (  7.152623)
since the beginning       7.509549    0.204578    7.714127    (  9.507002)
date too much in the past 39.811000   0.820973    40.631973   ( 44.212027)

last week: 8.91x speed increase over master branch
last month: 23.69x speed increase over master branch
since the beginning: 5.87x speed increase over master branch

date too much in the past is just me going back to 2018-01-01 to check how slow it would be with such a wide range.

The speed increase is definitely significant, but there was something bothering me. That's when I thought about caching.

What's the deal with caching in the context of analytics?

Caching of analytics in the master branch is done once per day in the date range, which can potentially result in tens or even hundreds of network calls to the caching servers. Yes, caching helps because (if I'm not mistaken) DEV is using a globally distributed memcached pool of servers, so the data is likely to be near you, BUT going back and forth to the caching servers might be slower than asking the DB directly if the range is big enough. Basically, the round trips to the caching server might add up to more time than having the web server get the data from the DB (which is likely near the server) and send it back to the client.

I still think caching is important, but caching each day separately might be counterproductive. Since the data is now loaded in bulk from the DB (GROUP BY returns one row per day in the same query), it might make sense to cache the whole result set instead of each row separately.
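To make the trade-off concrete, here's an illustrative before/after sketch (simplified cache keys and hypothetical compute_* helpers, not the PR's exact code):

# before: one cache round trip per day in the range
(start_date..end_date).each do |day|
  stats[day] = Rails.cache.fetch("analytics-#{day}-#{user.id}") do
    compute_stats_for(day) # hypothetical helper, one query per metric per day
  end
end

# after: a single round trip caches the whole result set for the range
stats = Rails.cache.fetch("analytics-#{start_date}-#{end_date}-#{user.id}") do
  compute_stats_for_range(start_date, end_date) # hypothetical helper, GROUP BY day
end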

Step 3 - what happens if we remove all cache calls

                            user        system     total        real
last week                   2.369785    0.201472   2.571257     (  3.973177)
last month                  2.793212    0.148250   2.941462     (  4.483640)
since the beginning         3.373822    0.158784   3.532606     (  5.202688)
date too much in the past   5.493960    0.209581   5.703541     (  7.480122)

last week: 11.48x speed increase over master branch
last month: 43.52x speed increase over master branch
since the beginning: 67.57x speed increase over master branch

As I suspected, the bigger the date range, the more counterproductive N cache calls are. Given that I'm in development, with totally different hardware and software than what's running the DEV production site, times are going to be different and speed increases likely lower, but it verifies my reasoning from the previous section.

Step 4 - add indexes

One thing that happens with analytics on many rows is that indexes start to matter (not so much in my development tests). Keep in mind that PostgreSQL can still decide not to use indexes if the data set is not big enough, but overall, they are a good idea.

What indexes? I took each query run by totals and grouped_by_day and checked WHERE, GROUP BY and FILTER conditions. These are the fields I decided to add indexes on:

  • articles.published (this is also going to benefit the entire website, since published articles is a super common query)
  • comments.created_at
  • comments.score
  • follows.created_at
  • page_views.created_at
  • reactions.created_at
  • reactions.points

Since DEV has been in production for a long time and those tables contain lots and lots of data (reactions and page views especially, I reckon), it's a good idea to (again :P) use PostgreSQL's full power and add the indexes concurrently. That means PostgreSQL builds the indexes in the background, without stopping the normal write workflow to those tables. Indexes are normally built while locking the entire table, which can slow down the website during a migration. You can read more about all the details in the PostgreSQL documentation on the subject.
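In Rails terms, the migration looks something like this sketch (the class name and the pair of indexes shown are illustrative; the real migration adds all the indexes listed above). Note that CREATE INDEX CONCURRENTLY can't run inside a transaction, so the migration has to opt out of the DDL transaction Rails normally wraps it in:

class AddAnalyticsIndexes < ActiveRecord::Migration[5.2]
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block
  disable_ddl_transaction!

  def change
    # built in the background, without locking writes to the tables
    add_index :articles, :published, algorithm: :concurrently
    add_index :page_views, :created_at, algorithm: :concurrently
  end
end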

So, what are the benchmarked timings?

                          user        system      total       real
totals                    1.678849    0.194811    1.873660    (  2.974526)
last week                 2.410151    0.109961    2.520112    (  3.714314)
last month                4.619584    0.132775    4.752359    (  6.109809)
since the beginning       7.117115    0.159913    7.277028    (  8.774819)
date too much in the past 36.347774   0.402848    36.750622   ( 38.826439)

totals: 1.02x speed increase over no indexes version
last week: 1.31x speed increase over no indexes version
last month: 1.13x speed increase over no indexes version
since the beginning: 1.06x speed increase over no indexes version

The reason they are only slightly faster than the version with no indexes is that some indexes were already in place (those on the id/type columns and the user id) and that I don't have a big enough data set, which, as previously said, can make PostgreSQL opt for a sequential scan or similar operations.

Step 5 - with indexes and no cache calls

Since the cache calls per day were still in place, I tried without them:

                            user        system      total       real
last week                   1.870252    0.116198    1.986450    (  3.161371)
last month                  2.725406    0.144293    2.869699    (  4.278060)
since the beginning         2.924799    0.146779    3.071578    (  4.523648)
date too much in the past   5.456498    0.215746    5.672244    (  7.229783)

Again, with no cache calls the calculations are significantly faster, but the same caveats made above apply to this example as well.

last week: 1.29x speed increase over no cache and no indexes version
last month: 1.02x speed increase over no cache and no indexes version
since the beginning: 1.14x speed increase over no cache and no indexes version

Step 6 (and final step) - with indexes and one cache call

                            user       system     total     real
last week                   2.259850   0.134709   2.394559  (  3.935191)
last month                  3.120629   0.161101   3.281730  (  4.906886)
since the beginning         3.539331   0.181215   3.720546  (  5.575235)
date too much in the past   8.353001   0.302818   8.655819  ( 11.406922)

Compared with the version with indexes and one cache call per day, the biggest difference shows up when the date range gets large: 3 months in the past is 1.95x faster and a year and a half in the past is 4.24x faster.

If compared with the current master:

last week: 12.34x speed increase
last month: 39x speed increase
since the beginning: 64.12x speed increase

I hope this walkthrough was helpful 😁

Related Tickets & Documents

Related to #2311

Added to documentation?

  • docs.dev.to
  • readme
  • no documentation needed

# cache all stats in the date range for the requested user or organization
cache_key = "analytics-for-dates-#{start_date}-#{end_date}-#{user_or_org.class.name}-#{user_or_org.id}"

Rails.cache.fetch(cache_key, expires_in: 7.days) do
@rhymes (Contributor, Author) commented Jun 7, 2019:

Since there's now only one cache entry per range, I kept the expiration at the upper bound, 7 days, but we can probably lower it to 3 if we see that API clients aren't going to keep asking for the same range over and over.

A reviewer (Contributor) replied:
What would happen if someone requested the past seven days? That would cache the most recent day for a long time right?

@rhymes (Contributor, Author) replied:

I changed the cache key to include both start and end date, so if the past seven days are asked twice the same day the second time they come back from the cache. The next day the cache key changes. I believe it's a good compromise but I'm not 100% sure it's the right compromise.

This would help dashboards with fixed dates without killing performance by caching each day separately.

I think we can revisit the caching policy in the future because it really depends on how all of this behaves in production once we have enough data to understand which caching strategy would work best.

Caching is hard, no doubt about that hehe

@rhymes changed the title from "[WIP] Analytics: refactoring for speed improvements" to "Analytics: refactoring for speed improvements" on Jun 7, 2019
@rhymes requested review from Zhao-Andy, lightalloy, and benhalpern, and removed the requests for Zhao-Andy and lightalloy on June 7, 2019 15:19
@lightalloy (Contributor) commented:

The description and the work done are truly impressive.

@maestromac (Contributor) commented:

Wow, amazing work @rhymes.

@Zhao-Andy (Contributor) left a review:

All of this looks awesome! 😁


def change
  # concurrent creation avoids downtime during production deploy because
  # indexes get created in the background
A reviewer (Contributor) commented:

This is really great to know!

@rhymes (Contributor, Author) replied:

Yeah, it's a neat feature.

With 9.6 they went beyond that, introducing basic parallelism features (some queries are transparently run in parallel by the DB) and expanding them in versions 10 and 11 (more types of parallel queries, but also parallel joins and parallel creation of other types of indexes; you can also set the number of workers, among other things). More details in my changelog issue #2801.

@rhymes requested a review from lightalloy on June 10, 2019 11:53
@benhalpern (Contributor) left a review:

Let's do this

@benhalpern merged commit a8b7f6e into forem:master on Jun 10, 2019
@rhymes deleted the rhymes/faster-analytics branch on June 10, 2019 13:29
@rhymes removed the request for review from lightalloy on June 10, 2019 13:32