# Lecture 08: Query Optimization I

## New IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, please remember to drop the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you didn't load it in with a previous lecture, load in the `imdb_perf_lecture` database.

In [1]:
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/

Archive:  ../lec07/data/imdb_perf_lecture.zip


In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE


In [3]:
%reload_ext sql

In [4]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

# Demo

---

Example 1

In [5]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.033..61.825 rows=845888 loops=1)
Planning Time: 0.157 ms
Execution Time: 85.537 ms


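Notice `ANALYZE`:
* actual time = time (in ms) until the operator returns its first row .. time until it returns its last row; rows = number of rows actually produced
* loops = number of times the operator is executed

Notice `EXPLAIN`:
* cost = estimated startup cost .. estimated total cost, in the planner's own cost units (not milliseconds)
* rows = estimated number of output tuples
* width = estimated average size (in bytes) of output tuples from that operator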
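
To make the fields of a plan line concrete, here is a minimal sketch that pulls them out of the text printed above. The regular expression and the function name `parse_plan_line` are illustrative assumptions, not part of PostgreSQL or of the `sql` magic; they only cover the plain-text format shown in this notebook.

```python
import re

# Matches the "(cost=...)" part of a plan node, plus the optional
# "(actual time=...)" part that only EXPLAIN ANALYZE prints.
NODE_RE = re.compile(
    r"\(cost=(?P<startup_cost>\d+\.\d+)\.\.(?P<total_cost>\d+\.\d+) "
    r"rows=(?P<est_rows>\d+) width=(?P<width>\d+)\)"
    r"(?: \(actual time=(?P<first_row_ms>\d+\.\d+)\.\.(?P<last_row_ms>\d+\.\d+) "
    r"rows=(?P<actual_rows>\d+) loops=(?P<loops>\d+)\))?"
)

def parse_plan_line(line):
    """Return the numeric fields of one plan node, or None if the line has none."""
    match = NODE_RE.search(line)
    if match is None:
        return None
    fields = {}
    for name, text in match.groupdict().items():
        if text is None:
            continue  # field absent, e.g. no "actual" part in plain EXPLAIN
        fields[name] = float(text) if "." in text else int(text)
    return fields

line = ("Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) "
        "(actual time=0.033..61.825 rows=845888 loops=1)")
print(parse_plan_line(line))
```

Lines such as `Planning Time: 0.157 ms` carry no node fields, so the sketch returns `None` for them.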

---

Example 2: just planning (no execution)

In [6]:
%sql EXPLAIN SELECT * FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36)
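
As a rough check on where the total cost of 13684.88 could come from, the sketch below assumes PostgreSQL's default settings (`seq_page_cost = 1.0`, `cpu_tuple_cost = 0.01`), under which an unfiltered sequential scan is costed as pages read plus a per-row CPU charge. The page count is inferred by working backwards from the reported cost; the output above does not state it, and a server with tuned settings would give different numbers.

```python
# Assumed PostgreSQL defaults (not shown anywhere in the lecture output).
SEQ_PAGE_COST = 1.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(pages, rows):
    """Estimated total cost of a sequential scan with no WHERE clause."""
    return pages * SEQ_PAGE_COST + rows * CPU_TUPLE_COST

def implied_pages(total_cost, rows):
    """Work backwards from a reported cost to the table's page count."""
    return (total_cost - rows * CPU_TUPLE_COST) / SEQ_PAGE_COST

pages = round(implied_pages(13684.88, 845888))
print(pages, seq_scan_cost(pages, 845888))
```

Under these assumptions the table occupies roughly 5226 pages, and most of the estimated cost comes from the per-row CPU charge rather than from page reads.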


---

Example 3

In [7]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.041..74.812 rows=845888 loops=1)
Planning Time: 0.040 ms
Execution Time: 98.669 ms


Notice:
* width from 36 down to 4!
* still 845k output tuples
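
A quick back-of-the-envelope calculation shows what the narrower width means for the estimated volume of output. This is only arithmetic on the numbers in the two plans; the helper name is made up for illustration.

```python
def estimated_output_bytes(rows, width):
    """Rows times the planner's estimated average tuple width (bytes)."""
    return rows * width

full_row = estimated_output_bytes(845888, 36)  # SELECT *
id_only = estimated_output_bytes(845888, 4)    # SELECT id
print(full_row, id_only, full_row / id_only)
```

The output shrinks by a factor of nine, but the scan still visits every row of the table, which is consistent with the two execution times being in the same range.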

---

Example 4:

In [8]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor WHERE id > 4000000;

QUERY PLAN
Bitmap Heap Scan on actor (cost=5285.64..14036.18 rows=281963 width=4) (actual time=24.923..80.432 rows=444781 loops=1)
Recheck Cond: (id > 4000000)
Heap Blocks: exact=3088
-> Bitmap Index Scan on actor_pkey (cost=0.00..5215.15 rows=281963 width=0) (actual time=24.542..24.543 rows=444781 loops=1)
Index Cond: (id > 4000000)
Planning Time: 0.066 ms
Execution Time: 93.329 ms


Notice:
* output tuples now reduced to 444k
* planning has an imperfect estimate (281963 rows) vs. actual (444781 rows)
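
The gap between estimated and actual rows can be quantified with a simple ratio. The sketch below also checks one possible explanation: the estimate of 281963 rows is what the planner would produce with a fallback selectivity of 1/3 for a range predicate, which it uses when no column statistics are available. That explanation is an inference from the numbers, not something the output states.

```python
TABLE_ROWS = 845888
ESTIMATED = 281963   # rows= in the plan
ACTUAL = 444781      # rows= in the "actual" part

def q_error(estimated, actual):
    """How many times off the estimate is, regardless of direction."""
    return max(estimated / actual, actual / estimated)

# Assumed fallback selectivity for <, > when statistics are missing.
DEFAULT_RANGE_SELECTIVITY = 1.0 / 3.0
fallback_estimate = round(TABLE_ROWS * DEFAULT_RANGE_SELECTIVITY)

print(q_error(ESTIMATED, ACTUAL), fallback_estimate)
```

If the fallback is indeed what happened here, running `ANALYZE` on the table would be expected to bring the estimate closer to the actual count.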

---

Example 5:

In [9]:
%sql EXPLAIN ANALYZE SELECT id, name FROM Actor WHERE id > 4000000;

QUERY PLAN
Bitmap Heap Scan on actor (cost=5285.64..14036.18 rows=281963 width=36) (actual time=16.971..53.383 rows=444781 loops=1)
Recheck Cond: (id > 4000000)
Heap Blocks: exact=3088
-> Bitmap Index Scan on actor_pkey (cost=0.00..5215.15 rows=281963 width=0) (actual time=16.603..16.603 rows=444781 loops=1)
Index Cond: (id > 4000000)
Planning Time: 0.051 ms
Execution Time: 66.063 ms


Notice:
* width back up to 36 (because of name)
* the plan shape is the same as in Example 4: a Bitmap Index Scan on `actor_pkey` feeding a Bitmap Heap Scan

# Moar queries if time permits! 

In [10]:
%%sql
EXPLAIN ANALYZE -- 1
SELECT id
FROM Actor
WHERE id > 4000000 AND name='Tom Hanks';

QUERY PLAN
Gather (cost=1000.00..11653.80 rows=1410 width=4) (actual time=26.184..29.006 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=588 width=4) (actual time=23.644..23.645 rows=0 loops=3)
Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
Rows Removed by Filter: 281963
Planning Time: 0.077 ms
Execution Time: 29.024 ms


Query 1: Compared to the previous query, which had no name condition in the WHERE clause, this query uses a parallel sequential scan, likely because the predicate is a bit more expensive to check, so parallelism in processing the tuples is helpful.

Note also zero output tuples: no row matches both conditions.

In [11]:
%%sql
EXPLAIN ANALYZE -- 2
SELECT id
FROM Actor
WHERE id < 4000000 AND name='Tom Hanks';

QUERY PLAN
Gather (cost=1000.00..11653.80 rows=1410 width=4) (actual time=0.215..28.760 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=588 width=4) (actual time=15.044..23.673 rows=0 loops=3)
Filter: ((id < 4000000) AND (name = 'Tom Hanks'::text))
Rows Removed by Filter: 281962
Planning Time: 0.078 ms
Execution Time: 28.780 ms


Query 2: Compared to the previous query, we flipped the id condition, and now we've found a match for Tom Hanks! Otherwise the plan is quite similar. Note that we still need to go through all the tuples.
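
To see that both parallel plans really touch the whole table, the per-loop numbers can be scaled back up. In `EXPLAIN ANALYZE` output, `rows` and `Rows Removed by Filter` on a node with `loops=3` are averages per loop (here the leader plus two workers), so totals are recovered by multiplying by `loops`. The helper below is a made-up illustration of that arithmetic.

```python
TABLE_ROWS = 845888

def total_rows_examined(removed_per_loop, loops, rows_returned):
    """Scale the per-loop average back up and add the rows that passed the filter."""
    return removed_per_loop * loops + rows_returned

query_1 = total_rows_examined(281963, 3, 0)  # id > 4000000, no match
query_2 = total_rows_examined(281962, 3, 1)  # id < 4000000, one match
print(query_1, query_2, TABLE_ROWS)
```

Both totals land within a row or two of the table size; the small difference comes from the per-loop averages being rounded. The same rounding explains why the Parallel Seq Scan in Query 2 reports `rows=0` while the Gather node above it reports `rows=1`.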

In [12]:
%%sql
EXPLAIN ANALYZE -- 3
SELECT id
FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.043..72.572 rows=845888 loops=1)
Planning Time: 0.039 ms
Execution Time: 96.430 ms


Query 3: Resetting to an old query for a second without conditions...

In [13]:
%%sql
EXPLAIN ANALYZE -- 4
SELECT id
FROM Actor
LIMIT 10;

QUERY PLAN
Limit (cost=0.00..0.16 rows=10 width=4) (actual time=0.018..0.020 rows=10 loops=1)
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.017..0.018 rows=10 loops=1)
Planning Time: 0.055 ms
Execution Time: 0.029 ms


Query 4: By adding a LIMIT, we see significantly reduced execution time (0.029 ms vs. 96.430 ms), because the sequential scan stops after producing 10 rows. Note also the Limit node above the Seq Scan node.
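
The Limit node's estimated cost of 0.16 can be reproduced by assuming the planner charges for only the fraction of the child scan it expects to consume. The sketch below applies that linear interpolation to the numbers in the plan; the function name is illustrative, and real plans with more complex children may not follow it this simply.

```python
def limit_cost(child_startup, child_total, child_rows, limit):
    """Startup cost plus the share of the child's run cost needed for `limit` rows."""
    fraction = min(limit, child_rows) / child_rows
    return child_startup + (child_total - child_startup) * fraction

cost = limit_cost(0.00, 13684.88, 845888, 10)
print(round(cost, 2))
```

Because a sequential scan has no startup cost and can hand over rows as soon as it reads them, fetching 10 of 845888 rows is estimated at a tiny fraction of the full scan.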

Next, we change a SqlMagic setting so that a longer query plan can be displayed in full...

In [14]:
%config SqlMagic.displaylimit = None

In [15]:
%%sql
EXPLAIN ANALYZE -- 5
SELECT id
FROM Actor
ORDER BY name
LIMIT 10;

QUERY PLAN
Limit (cost=17366.94..17368.11 rows=10 width=36) (actual time=74.626..77.175 rows=10 loops=1)
-> Gather Merge (cost=17366.94..99611.72 rows=704906 width=36) (actual time=74.624..77.172 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=16366.92..17248.05 rows=352453 width=36) (actual time=72.135..72.137 rows=8 loops=3)
Sort Key: name
Sort Method: top-N heapsort Memory: 26kB
Worker 0: Sort Method: top-N heapsort Memory: 25kB
Worker 1: Sort Method: top-N heapsort Memory: 25kB
-> Parallel Seq Scan on actor (cost=0.00..8750.53 rows=352453 width=36) (actual time=0.025..40.377 rows=281963 loops=3)


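Query 5: Whoa! Sorts are expensive! Even with LIMIT 10, every row has to be scanned and compared before the first result can be returned (about 77 ms here vs. 0.029 ms for Query 4). The top-N heapsort at least keeps memory small, since each worker only holds on to its current best candidates.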
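
The idea behind a top-N sort can be sketched in a few lines: scan every row once, but keep only the best `limit` candidates seen so far instead of sorting the whole input. The example uses Python's `heapq.nsmallest` as a stand-in for that technique; it is not PostgreSQL's implementation, and the rows are made-up sample data rather than contents of the Actor table.

```python
import heapq

# Made-up (id, name) rows standing in for the Actor table.
actors = [
    (3, "Meryl Streep"),
    (1, "Tom Hanks"),
    (4, "Al Pacino"),
    (2, "Viola Davis"),
    (5, "Cate Blanchett"),
]

def order_by_limit(rows, key, limit):
    """ORDER BY key LIMIT limit, keeping at most `limit` candidates at a time."""
    return heapq.nsmallest(limit, rows, key=key)

first_two = order_by_limit(actors, key=lambda row: row[1], limit=2)
ids = [row[0] for row in first_two]
print(ids)
```

Every input row is still examined once, which is why the query cannot finish as early as the plain `LIMIT 10` in Query 4, but the memory needed depends on the limit rather than on the table size.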