# Lecture 08: Query Optimization I

## New IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the cells below, remember to drop the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you did not already load the `imdb_perf_lecture` database in a previous lecture, load it now.

In [1]:
!unzip -u ../lec07/data/imdb_perf_lecture.zip -d ../lec07/data/

Archive:  ../lec07/data/imdb_perf_lecture.zip


In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lec07/data/imdb_perf_lecture.sql

DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:33: ERROR:  role "yanlisa" does not exist
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:45: ERROR:  role "yanlisa" does not exist
CREATE TABLE
psql:../lec07/data/imdb_perf_lecture.sql:59: ERROR:  role "yanlisa" does not exist
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE


In [3]:
%reload_ext sql



In [4]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

# Demo

---

Example 1

In [5]:
%sql EXPLAIN ANALYZE SELECT * FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) (actual time=0.044..54.361 rows=845888 loops=1)
Planning Time: 0.145 ms
Execution Time: 78.471 ms


Notice `ANALYZE`:
* `actual time` = startup time..total time for the operator (in ms); `rows` = rows actually produced
* loops = number of times the operator is executed

Notice `EXPLAIN`:
* width = estimated average size (in bytes) of output tuples from that operator
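
To make these fields concrete, here is a minimal sketch that pulls them out of a single plan line. It assumes the text format shown in this notebook's output; `parse_plan_line` and the dictionary keys are hypothetical names chosen for illustration, not part of PostgreSQL or JupySQL.

```python
import re

# One decimal number such as 0.00 or 13684.88, as printed in plan lines.
NUM = r"\d+\.\d+"

# Planner estimates: printed by both EXPLAIN and EXPLAIN ANALYZE.
ESTIMATE_RE = re.compile(
    rf"\(cost=(?P<startup_cost>{NUM})\.\.(?P<total_cost>{NUM}) "
    r"rows=(?P<rows>\d+) width=(?P<width>\d+)\)"
)

# Runtime measurements: printed only by EXPLAIN ANALYZE.
ACTUAL_RE = re.compile(
    rf"\(actual time=(?P<startup_ms>{NUM})\.\.(?P<total_ms>{NUM}) "
    r"rows=(?P<rows>\d+) loops=(?P<loops>\d+)\)"
)


def parse_plan_line(line):
    """Return (estimates, actuals) for one plan line; actuals is None without ANALYZE."""
    est = ESTIMATE_RE.search(line)
    if est is None:
        raise ValueError("line has no (cost=... rows=... width=...) block")
    estimates = {
        "startup_cost": float(est["startup_cost"]),
        "total_cost": float(est["total_cost"]),
        "rows": int(est["rows"]),
        "width": int(est["width"]),
    }
    act = ACTUAL_RE.search(line)
    actuals = None
    if act is not None:
        actuals = {
            "startup_ms": float(act["startup_ms"]),
            "total_ms": float(act["total_ms"]),
            "rows": int(act["rows"]),
            "loops": int(act["loops"]),
        }
    return estimates, actuals


# The plan line from Example 1.
line = ("Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36) "
        "(actual time=0.044..54.361 rows=845888 loops=1)")
estimates, actuals = parse_plan_line(line)
print(estimates["width"], actuals["loops"])  # 36 1
```

Passing a plain `EXPLAIN` line (no `actual time` block) returns `None` for the second value, which mirrors the fact that only `ANALYZE` executes the query.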

---

Example 2: just planning (no execution)

In [6]:
%sql EXPLAIN SELECT * FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=36)


---

Example 3

In [7]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.073..70.980 rows=845888 loops=1)
Planning Time: 0.043 ms
Execution Time: 95.830 ms


Notice:
* width from 36 down to 4!
* still 845k output tuples
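
A rough way to see what the smaller width buys is to multiply estimated rows by width. The sketch below does that arithmetic with the numbers from Examples 1 and 3; `estimated_output_bytes` is a hypothetical helper, and rows times width is only an approximation of output size, not a figure PostgreSQL reports.

```python
def estimated_output_bytes(rows, width):
    """Rough output size of an operator: estimated rows times average row width in bytes."""
    return rows * width


select_star = estimated_output_bytes(rows=845888, width=36)  # Example 1: SELECT *
select_id = estimated_output_bytes(rows=845888, width=4)     # Example 3: SELECT id

print(select_star, select_id, select_star // select_id)  # 30451968 3383552 9
```

The estimated cost is the same in both plans (13684.88), presumably because the sequential scan still has to read the whole table either way; only the size of what flows out of the operator changes.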

---

Example 4:

In [8]:
%sql EXPLAIN ANALYZE SELECT id FROM Actor WHERE id > 4000000;

QUERY PLAN
Index Only Scan using actor_pkey on actor (cost=0.42..8034.78 rows=281963 width=4) (actual time=0.254..52.920 rows=444781 loops=1)
Index Cond: (id > 4000000)
Heap Fetches: 0
Planning Time: 0.095 ms
Execution Time: 65.784 ms


Notice:
* output tuples now reduced to 444k
* planning has imperfect estimate (281963) vs actual (444781)
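
One common way to quantify how far off a row estimate is, is the ratio between estimate and actual, taken in whichever direction is larger (often called the q-error). The sketch below applies it to the `rows=` values printed in the Example 4 plan; `q_error` is a hypothetical helper name, and this metric is brought in here for illustration rather than taken from the lecture.

```python
def q_error(estimated_rows, actual_rows):
    """Factor by which the estimate is off, in either direction (1.0 means perfect)."""
    if estimated_rows <= 0 or actual_rows <= 0:
        raise ValueError("row counts must be positive")
    return max(estimated_rows / actual_rows, actual_rows / estimated_rows)


# rows= values from the Index Only Scan line in Example 4.
example4 = q_error(estimated_rows=281963, actual_rows=444781)
print(f"Example 4 estimate is off by a factor of {example4:.2f}")
```

A value close to 1.0 means the planner's guess was good; larger values mean the planner was working from a misleading picture of the data.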

---

Example 5:

In [9]:
%sql EXPLAIN ANALYZE SELECT id, name FROM Actor WHERE id > 4000000;

QUERY PLAN
Seq Scan on actor (cost=0.00..15799.60 rows=446616 width=18) (actual time=0.211..62.859 rows=444781 loops=1)
Filter: (id > 4000000)
Rows Removed by Filter: 401107
Planning Time: 0.142 ms
Execution Time: 75.917 ms


Notice:
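* width goes up to 18 (because of name), still below the 36 of `SELECT *`
* the plan is a Seq Scan with a Filter rather than an Index Only Scan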
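
The filter numbers in this plan can be cross-checked: rows returned plus rows removed by the filter should add up to the table size. A minimal sketch of that arithmetic, with `scan_accounting` as a hypothetical helper name:

```python
def scan_accounting(rows_returned, rows_removed_by_filter):
    """Return (rows scanned, fraction of scanned rows that passed the filter)."""
    total = rows_returned + rows_removed_by_filter
    return total, rows_returned / total


# Values from the Example 5 plan.
total_scanned, selectivity = scan_accounting(444781, 401107)
print(total_scanned)         # 845888, the full Actor table
print(f"{selectivity:.1%}")  # 52.6%
```

So the scan visits all 845888 rows and keeps roughly half of them.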

# Matching Exercise

In [10]:
%%sql
EXPLAIN ANALYZE -- 1
SELECT id
FROM Actor
WHERE id > 4000000 AND name='Tom Hanks';

QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=24.223..26.357 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=21.729..21.730 rows=0 loops=3)
Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
Rows Removed by Filter: 281963
Planning Time: 0.097 ms
Execution Time: 26.374 ms


In [11]:
%%sql
EXPLAIN ANALYZE -- 2
SELECT id
FROM Actor
WHERE id < 4000000 AND name='Tom Hanks';

QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=0.305..26.222 rows=1 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on actor (cost=0.00..10512.80 rows=1 width=4) (actual time=13.627..21.490 rows=0 loops=3)
Filter: ((id < 4000000) AND (name = 'Tom Hanks'::text))
Rows Removed by Filter: 281962
Planning Time: 0.071 ms
Execution Time: 26.240 ms
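
When reading the parallel plans above, keep in mind that PostgreSQL reports per-loop averages for nodes with `loops` greater than 1. Here `loops=3` is consistent with the leader plus the two launched workers. The sketch below multiplies the per-loop `Rows Removed by Filter` back out; `approx_total` is a hypothetical helper, and the result is approximate because the plan prints rounded averages.

```python
TABLE_ROWS = 845888  # size of Actor, from the earlier Seq Scan plans


def approx_total(per_loop_value, loops):
    """Approximate total across all loops from a rounded per-loop average."""
    return per_loop_value * loops


removed_q1 = approx_total(per_loop_value=281963, loops=3)  # query 1
removed_q2 = approx_total(per_loop_value=281962, loops=3)  # query 2
print(removed_q1, removed_q2)  # 845889 845886
```

Both totals land within a few rows of the table size, which fits the reading that almost every row is scanned and rejected across the three processes.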


In [12]:
%%sql
EXPLAIN ANALYZE -- 3
SELECT id
FROM Actor;

QUERY PLAN
Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.086..69.342 rows=845888 loops=1)
Planning Time: 0.045 ms
Execution Time: 93.685 ms


In [13]:
%%sql
EXPLAIN ANALYZE -- 4
SELECT id
FROM Actor
LIMIT 10;

QUERY PLAN
Limit (cost=0.00..0.16 rows=10 width=4) (actual time=0.022..0.023 rows=10 loops=1)
-> Seq Scan on actor (cost=0.00..13684.88 rows=845888 width=4) (actual time=0.021..0.022 rows=10 loops=1)
Planning Time: 0.079 ms
Execution Time: 0.034 ms
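
The `Limit` node's total cost of 0.16 can be reproduced by scaling the child scan's cost by the fraction of rows the limit needs. The sketch below is a simplified model that matches the numbers shown, not necessarily PostgreSQL's exact formula; `limit_total_cost` is a hypothetical helper.

```python
def limit_total_cost(child_startup, child_total, child_rows, limit):
    """Startup cost plus the share of the child's run cost needed to produce `limit` rows."""
    fraction = min(limit / child_rows, 1.0)
    return child_startup + (child_total - child_startup) * fraction


# Child Seq Scan: cost=0.00..13684.88 rows=845888; LIMIT 10.
cost = limit_total_cost(0.00, 13684.88, 845888, 10)
print(round(cost, 2))  # 0.16
```

This also matches the measured behavior: the scan under the `Limit` reports `rows=10`, so it stopped after producing ten rows instead of reading the whole table.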


In [15]:
%config SqlMagic.displaylimit = None

In [16]:
%%sql
EXPLAIN ANALYZE -- 5
SELECT id
FROM Actor
ORDER BY name
LIMIT 10;

QUERY PLAN
Limit (cost=17366.94..17368.11 rows=10 width=18) (actual time=69.808..71.982 rows=10 loops=1)
-> Gather Merge (cost=17366.94..99611.72 rows=704906 width=18) (actual time=69.806..71.978 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=16366.92..17248.05 rows=352453 width=18) (actual time=67.204..67.206 rows=9 loops=3)
Sort Key: name
Sort Method: top-N heapsort Memory: 26kB
Worker 0: Sort Method: top-N heapsort Memory: 25kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on actor (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.020..19.115 rows=281963 loops=3)
