
Analytics: refactoring for speed improvements #3072

Merged · 20 commits · Jun 10, 2019

Conversation

@rhymes (Contributor) commented Jun 7, 2019

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Documentation Update

Description

This refactoring began when I started taking a look at how the whole analytics feature worked in the Dashboard pro and noticed we could compute it faster. I've also noticed that in production it sometimes requires multiple refreshes to show up, due to performance issues.

This PR contains two things: a refactoring of the logic in the AnalyticsController for the API and speed improvements for the analytics themselves.

Solutions considered

I've talked about possible improvements in #2311, and the solutions envisioned were:

  • limiting the date range (not implemented here)
  • adding indexes to columns that are part of the filtering/where conditions (part of this PR)
  • using grouping in SQL (part of this PR)
  • pre-computing data in batches (not implemented here)
  • low-level caching (this is already in production, but as we can already see there are issues with caching each date separately. Each date means a separate network call to the caching server, which means an average of 30 HTTP calls for a month range and possibly hundreds for "infinity". I've changed it to a single cache call per range, but we'll talk more about caching later)

So, this PR contains a better usage of SQL (from 64 queries to 4 in the grouping by day) and adds indexes on the columns that are going to be filtered on repeatedly (the indexes are created concurrently by PostgreSQL in the Rails migration so as not to disrupt the normal production flow).

Walkthrough of the steps taken and the improvements

First things first, I had to set a baseline. You can't make things "faster" without knowing how slow or fast they are. So I wrote a script that uses Ruby's benchmark module to get a ballpark of what was going on:

require 'benchmark'

# disable logging to avoid measuring that
ActiveRecord::Base.logger = nil

user = User.find(11)

iterations = 1_000

Benchmark.bm do |bm|
  bm.report("totals") do
    as = AnalyticsService.new(user)
    iterations.times { as.totals }
  end
  bm.report("last week") do
    as = AnalyticsService.new(user, start_date: 1.week.ago.to_date.iso8601)
    iterations.times { as.grouped_by_day }
  end
  bm.report("last month") do
    as = AnalyticsService.new(user, start_date: 1.month.ago.to_date.iso8601)
    iterations.times { as.grouped_by_day }
  end
  bm.report("since the beginning") do
    as = AnalyticsService.new(user, start_date: "2019-04-01")
    iterations.times { as.grouped_by_day }
  end
  bm.report("date too much in the past") do
    as = AnalyticsService.new(user, start_date: "2018-01-01")
    iterations.times { as.grouped_by_day }
  end
end
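As an aside, a script like this can be run inside the full Rails environment with rails runner (the file path below is just an example):

bin/rails runner script/benchmark_analytics.rb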

This was the result on my meager development instance (not too much data in the DB):

                     user        system     total         real
totals               2.028109    0.231059   2.259168      (  3.594266)
last week            27.779555   1.734052   29.513607     ( 45.470222)
last month           120.701257  7.255624   127.956881    (206.690019)
since the beginning  225.478188  13.058602  238.536790    (384.704629)

since the beginning refers to 2019-04-01, which is the date DEV released Analytics.

Step 1 - refactor totals

After refactoring the AnalyticsService.totals calculations, I benchmarked this timing:

       user     system      total        real
totals  1.787045   0.128882   1.915927 (  2.999562)

a 1.17x speed increase over the master branch. Not much, but it's a start, and there were fewer queries all around.
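For context, the gist of this kind of refactoring is pushing the counting into PostgreSQL instead of loading rows and reducing them in Ruby. A minimal, hypothetical sketch (illustrative names, not the actual AnalyticsService code):

# one aggregate query instead of loading rows and summing in Ruby;
# article_ids is assumed to be computed earlier
row = Reaction.where(reactable_type: "Article", reactable_id: article_ids)
              .select("COUNT(*) AS total, COALESCE(SUM(points), 0) AS total_points")
              .take

row.total        # number of reactions
row.total_points # sum of their points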

Step 2 - refactor grouping by day

This took considerably more time to achieve, but after playing around a bit with PostgreSQL and its query plans, the number of queries went down from 64 to 4 (one per metric).
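The core idea is to let PostgreSQL's GROUP BY return one row per day, so a whole range comes back in a single round trip, and to use the FILTER clause so one query can compute several conditional aggregates at once. A sketch of the shape of these queries (illustrative names, not the PR's exact code):

# one query yields a hash with one entry per day in the range
page_views_per_day = PageView.where(article_id: article_ids)
                             .where("created_at >= ?", start_date)
                             .group("DATE(created_at)")
                             .count

# PostgreSQL's FILTER clause computes conditional aggregates in one pass
rows = Reaction.where(reactable_type: "Article", reactable_id: article_ids)
               .group("DATE(reactions.created_at)")
               .select("DATE(reactions.created_at) AS day,
                        COUNT(*) AS total,
                        COUNT(*) FILTER (WHERE points > 0) AS positive")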

These are the times measured:

                          user        system      total       real
last week                 3.164215    0.148648    3.312863    (  4.885400)
last month                5.232762    0.173115    5.405877    (  7.152623)
since the beginning       7.509549    0.204578    7.714127    (  9.507002)
date too much in the past 39.811000   0.820973    40.631973   ( 44.212027)

last week: 8.91x speed increase over master branch
last month: 23.69x speed increase over master branch
since the beginning: 5.87x speed increase over master branch

date too much in the past is just me going back to 2018-01-01 to check how slow it would be with such a wide range.

The speed increase is definitely significant, but there was something bothering me. That's when I thought about caching.

What's the deal with caching in the context of analytics?

Caching of analytics in the master branch is done once per day in the date range, which can potentially result in tens or even hundreds of network calls to the caching servers. Yes, caching helps because (if I'm not mistaken) DEV is using a globally distributed memcached pool of servers, so the data is likely to be near you, BUT going back and forth to the caching servers might be slower than asking the DB directly if the range is big enough. Basically, the round trips to the caching server might add up to more time than having the web server get the data from the DB (which is likely near the server) and send it back to the client.

I still think caching is important, but caching each day separately might be counterproductive. Since the data is now loaded in bulk from the DB (GROUP BY returns one row per day in the same query), it might make sense to cache the whole result set instead of each row separately.
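To make the trade-off concrete, here's an illustrative before/after sketch (simplified cache keys and hypothetical compute_* helpers, not the PR's exact code):

# before: one cache round trip per day in the range
(start_date..end_date).each do |day|
  stats[day] = Rails.cache.fetch("analytics-#{day}-#{user.id}") do
    compute_stats_for(day) # hypothetical helper, one query per metric per day
  end
end

# after: a single round trip caches the whole result set for the range
stats = Rails.cache.fetch("analytics-#{start_date}-#{end_date}-#{user.id}") do
  compute_stats_for_range(start_date, end_date) # hypothetical helper, GROUP BY day
end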

Step 3 - what happens if we remove all cache calls

                            user        system     total        real
last week                   2.369785    0.201472   2.571257     (  3.973177)
last month                  2.793212    0.148250   2.941462     (  4.483640)
since the beginning         3.373822    0.158784   3.532606     (  5.202688)
date too much in the past   5.493960    0.209581   5.703541     (  7.480122)

last week: 11.48x speed increase over master branch
last month: 43.52x speed increase over master branch
since the beginning: 67.57x speed increase over master branch

As I suspected, the bigger the date range, the more counterproductive N cache calls are. Given that I'm in development, with totally different hardware and software than what's running the DEV production site, times are going to be different and speed increases likely lower, but it verifies my reasoning from the previous section.

Step 4 - add indexes

One thing that happens with analytics on many rows is that indexes start to matter (not so much in my development tests). Keep in mind that PostgreSQL can still decide not to use indexes if the data set is not big enough, but overall, they are a good idea.

What indexes? I took each query run by totals and grouped_by_day and checked WHERE, GROUP BY and FILTER conditions. These are the fields I decided to add indexes on:

  • articles.published (this is also going to benefit the entire website, since published articles is a super common query)
  • comments.created_at
  • comments.score
  • follows.created_at
  • page_views.created_at
  • reactions.created_at
  • reactions.points

Since DEV has been in production for a long time and those tables contain lots and lots of data (reactions and page views especially, I reckon), it's a good idea to (again :P) use PostgreSQL's full power and add the indexes concurrently. That means PostgreSQL builds the indexes in the background, without stopping the normal write workflow to those tables. Indexes are normally built while locking the entire table, which can slow down the website during a migration. You can read more about all the details in the PostgreSQL documentation on the subject.
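In Rails terms, the migration looks something like this sketch (the class name and the pair of indexes shown are illustrative; the real migration adds all the indexes listed above). Note that CREATE INDEX CONCURRENTLY can't run inside a transaction, so the migration has to opt out of the DDL transaction Rails normally wraps it in:

class AddAnalyticsIndexes < ActiveRecord::Migration[5.2]
  # CREATE INDEX CONCURRENTLY cannot run inside a transaction block
  disable_ddl_transaction!

  def change
    # built in the background, without locking writes to the tables
    add_index :articles, :published, algorithm: :concurrently
    add_index :page_views, :created_at, algorithm: :concurrently
  end
end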

So, what are the benchmarked timings?

                          user        system      total       real
totals                    1.678849    0.194811    1.873660    (  2.974526)
last week                 2.410151    0.109961    2.520112    (  3.714314)
last month                4.619584    0.132775    4.752359    (  6.109809)
since the beginning       7.117115    0.159913    7.277028    (  8.774819)
date too much in the past 36.347774   0.402848    36.750622   ( 38.826439)

totals: 1.02x speed increase over no indexes version
last week: 1.31x speed increase over no indexes version
last month: 1.13x speed increase over no indexes version
since the beginning: 1.06x speed increase over no indexes version

The reason they are only slightly faster than the version with no indexes is that some indexes were already in place (those on the id/type columns and the user id) and that I don't have a big enough data set, which, as previously said, can make PostgreSQL opt for a sequential scan or similar operations.

Step 5 - with indexes and no cache calls

Since the cache calls per day were still in place, I tried without them:

                            user        system      total       real
last week                   1.870252    0.116198    1.986450    (  3.161371)
last month                  2.725406    0.144293    2.869699    (  4.278060)
since the beginning         2.924799    0.146779    3.071578    (  4.523648)
date too much in the past   5.456498    0.215746    5.672244    (  7.229783)

Again, with no cache calls the calculations are significantly faster, but the same caveats made above apply to this example as well.

last week: 1.29x speed increase over no cache and no indexes version
last month: 1.02x speed increase over no cache and no indexes version
since the beginning: 1.14x speed increase over no cache and no indexes version

Step 6 (and final step) - with indexes and one cache call

                            user       system     total     real
last week                   2.259850   0.134709   2.394559  (  3.935191)
last month                  3.120629   0.161101   3.281730  (  4.906886)
since the beginning         3.539331   0.181215   3.720546  (  5.575235)
date too much in the past   8.353001   0.302818   8.655819  ( 11.406922)

Compared with the version with indexes and one cache call per day, the biggest difference shows up when the date range gets large: 3 months in the past is 1.95x faster and a year and a half in the past is 4.24x faster.

If compared with the current master:

last week: 12.34x speed increase
last month: 39x speed increase
since the beginning: 64.12x speed increase

I hope this walkthrough was helpful 😁

Related Tickets & Documents

Related to #2311

Added to documentation?

  • docs.dev.to
  • readme
  • no documentation needed

# cache all stats in the date range for the requested user or organization
cache_key = "analytics-for-dates-#{start_date}-#{end_date}-#{user_or_org.class.name}-#{user_or_org.id}"

Rails.cache.fetch(cache_key, expires_in: 7.days) do
@rhymes (Contributor, Author) commented Jun 7, 2019:

Since there's now only one cache entry per range, I kept the expiration at the upper bound, 7 days, but we can probably lower it to 3 if we see that API clients aren't going to keep asking for the same range over and over.

A reviewer (Contributor) replied:
What would happen if someone requested the past seven days? That would cache the most recent day for a long time right?

@rhymes (Contributor, Author) replied:

I changed the cache key to include both start and end date, so if the past seven days are asked twice the same day the second time they come back from the cache. The next day the cache key changes. I believe it's a good compromise but I'm not 100% sure it's the right compromise.

This would help dashboards with fixed dates without killing performance by caching each day separately.

I think we can revisit the caching policy in the future because it really depends on how all of this behaves in production once we have enough data to understand which caching strategy would work best.

Caching is hard, no doubt about that hehe

@rhymes changed the title from "[WIP] Analytics: refactoring for speed improvements" to "Analytics: refactoring for speed improvements" on Jun 7, 2019
@rhymes requested review from Zhao-Andy, lightalloy, and benhalpern, and removed the requests for Zhao-Andy and lightalloy on June 7, 2019 15:19
@lightalloy (Contributor) commented:

The description and the work done are truly impressive.

@maestromac (Contributor) commented:

Wow, amazing work @rhymes.

@Zhao-Andy (Contributor) left a review:

All of this looks awesome! 😁


def change
  # concurrent creation avoids downtime during production deploy because
  # indexes get created in the background
A reviewer (Contributor) commented:

This is really great to know!

@rhymes (Contributor, Author) replied:

Yeah, it's a neat feature.

With 9.6 they went beyond that, introducing basic parallelism features (some queries are transparently run in parallel by the DB) and expanding them in versions 10 and 11 (more types of parallel queries, but also parallel joins and parallel creation of other types of indexes; you can also set the number of workers, among other things). More details in my changelog issue #2801.

@rhymes requested a review from lightalloy on June 10, 2019 11:53
@benhalpern (Contributor) left a review:

Let's do this

@benhalpern merged commit a8b7f6e into forem:master on Jun 10, 2019
@rhymes deleted the rhymes/faster-analytics branch on June 10, 2019 13:29
@rhymes removed the request for review from lightalloy on June 10, 2019 13:32