Don’t call RANDOM() on large tables #647

Merged
merged 1 commit into master from schneems/no-more-random on Nov 21, 2017

@schneems (Member) commented Nov 21, 2017

We’re spending close to 40% of our query time executing queries with `RANDOM()` in them:

```
$ heroku pg:outliers
 14:52:35.890252 | 19.9%          | 186,846    | 02:38:39.448613 | SELECT  "repos".* FROM "repos" WHERE (repos.id not in (?,?)) ORDER BY random() LIMIT $1
 08:59:35.017667 | 12.1%          | 2,532,339  | 00:01:13.506894 | SELECT  "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT $1
```

Instead we can figure out the largest possible ID, generate a random ID between 1 and that number, then find the first record whose ID is greater than or equal to it.
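
A minimal single-statement sketch of that idea (the PR itself may well pick the random ID in application code; the table and column names come from the queries above):

```sql
-- Pin down a random id once in the CTE, then seek forward to the first
-- matching row at or after it. The CTE matters: random() is volatile, so
-- putting it straight into the WHERE clause would re-evaluate it per row.
WITH bound AS (
  SELECT floor(random() * max(id) + 1)::bigint AS rand_id FROM users
)
SELECT "users".*
FROM "users", bound
WHERE "users"."github_access_token" IS NOT NULL
  AND "users".id >= bound.rand_id
LIMIT 1;
```

One caveat of the approach: if the random bound lands past the last matching row, this returns nothing, so a caller would need a retry or fallback, whereas `ORDER BY RANDOM()` always finds a row if one exists.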

Here’s the current query with RANDOM():

```
=> EXPLAIN ANALYZE SELECT  "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) ORDER BY RANDOM() LIMIT 1;

                                                      QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1471.00..1471.01 rows=1 width=2098) (actual time=12.747..12.748 rows=1 loops=1)
   ->  Sort  (cost=1471.00..1475.24 rows=8464 width=2098) (actual time=12.745..12.745 rows=1 loops=1)
         Sort Key: (random())
         Sort Method: top-N heapsort  Memory: 26kB
         ->  Seq Scan on users  (cost=0.00..1462.54 rows=8464 width=2098) (actual time=0.013..7.327 rows=8726 loops=1)
               Filter: (github_access_token IS NOT NULL)
               Rows Removed by Filter: 13510
 Total runtime: 12.811 ms
(8 rows)
```

12 ms. To return a single row, the planner seq-scans the whole table and runs a top-N heapsort over all 8,726 matching rows, so this cost keeps growing as the table does.

Here’s the same query with >=:

```
=> EXPLAIN ANALYZE SELECT  "users".* FROM "users" WHERE ("users"."github_access_token" IS NOT NULL) AND id >= 55 LIMIT 1;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.17 rows=1 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
   ->  Seq Scan on users  (cost=0.00..1469.36 rows=8459 width=2098) (actual time=0.009..0.009 rows=1 loops=1)
         Filter: ((github_access_token IS NOT NULL) AND (id >= 55))
 Total runtime: 0.039 ms
```

Much better: 0.039 ms. The scan can stop at the very first row that satisfies the filter instead of sorting everything.
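
The same rewrite should apply to the even bigger repos outlier above. A sketch under the same assumptions, with made-up IDs standing in for the query’s two bind parameters:

```sql
WITH bound AS (
  SELECT floor(random() * max(id) + 1)::bigint AS rand_id FROM repos
)
SELECT "repos".*
FROM "repos", bound
WHERE repos.id NOT IN (101, 202)  -- hypothetical stand-ins for the (?,?) bind params
  AND repos.id >= bound.rand_id
LIMIT 1;
```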

@schneems schneems temporarily deployed to codetriage-staging-pr-647 Nov 21, 2017

@schneems schneems force-pushed the schneems/no-more-random branch from 2ab33ba to 28a326e Nov 21, 2017


@schneems schneems merged commit c2c7eec into master Nov 21, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed