# Discussion 05 Notebook

This notebook is an accompaniment to the associated discussion worksheet handout.

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a fairly large database, so if you run the lines below, please remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you didn't load it in with a previous lecture, load in the `imdb_perf_lecture` database.

In [8]:
!unzip -u ../lecture/lec07/data/imdb_perf_lecture.zip -d ../lecture/lec07/data/

Archive:  ../lecture/lec07/data/imdb_perf_lecture.zip


In [1]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lecture/lec07/data/imdb_perf_lecture.sql

ERROR:  database "imdb_perf_lecture" is being accessed by other users
DETAIL:  There is 1 other session using the database.
ERROR:  database "imdb_perf_lecture" already exists
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
psql:../lecture/lec07/data/imdb_perf_lecture.sql:30: ERROR:  relation "actor" already exists
psql:../lecture/lec07/data/imdb_perf_lecture.sql:33: ERROR:  role "yanlisa" does not exist
psql:../lecture/lec07/data/imdb_perf_lecture.sql:42: ERROR:  relation "cast_info" already exists
psql:../lecture/lec07/data/imdb_perf_lecture.sql:45: ERROR:  role "yanlisa" does not exist
psql:../lecture/lec07/data/imdb_perf_lecture.sql:56: ERROR:  relation "movie" already exists
psql:../lecture/lec07/data/imdb_perf_lecture.sql:59: ERROR:  role "yanlisa" does not exist
psql:../lecture/lec07/data/imdb_perf_lecture.sql:845954: ERROR:  duplicate key value violates unique constraint "actor_pkey"
DETAIL:  Key (id)=(1) already exists.
CONTEXT:  COPY actor, li

In [4]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
import pandas as pd

Before starting this part, review the schema of the relations in the `imdb_perf_lecture` database. Here's the printout from `psql`:

```
imdb_perf_lecture=# \d actor
               Table "public.actor"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d movie
                    Table "public.movie"
     Column      |  Type   | Collation | Nullable | Default 
-----------------+---------+-----------+----------+---------
 id              | integer |           | not null | 
 title           | text    |           |          | 
 year            | integer |           |          | 
 runtime_minutes | integer |           |          | 
Indexes:
    "movie_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
```

# Question 1 Queries

This question looks at the impacts of **aggregation** and **sorting** on query performance.

## Question 1.1

In the cell below, write a query that returns the actor names and the number of times the corresponding name appears in the `Actor` relation.

In [6]:
%%sql
-- write your query here --

-- SOLUTION: delete the below before release --
SELECT COUNT(*), name
FROM actor
GROUP BY name;

count,name
1,Angelika Bender
1,Claire Hackett
1,Amparo Azócar
1,Buck Adams
1,Bob Brady
1,Brigitte Boore
1,Aydemir Akbas
1,Cecile Bonnel
1,Claudia Becker
2,Mark Bailey


## Question 1.2

Now, in the cell below, write a query that returns the actor IDs and the number of times the corresponding ID appears in the `Actor` relation.

In [9]:
%%sql
-- write your query here --

-- SOLUTION: delete the below before release --
SELECT COUNT(*), id
FROM actor
GROUP BY id;

count,id
1,1
1,2
1,3
1,4
1,5
1,6
1,7
1,8
1,9
1,10


## Question 1.3

Run `EXPLAIN ANALYZE` on your two queries above. See below for the full question.

If you're having trouble seeing the entirety of the query plan, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

In [12]:
# run this cell to remove 10-row limit on display
%config SqlMagic.displaylimit = 20

In [10]:
%%sql
-- write your EXPLAIN ANALYZE here --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE 
SELECT COUNT(*), name
FROM actor
GROUP BY name;

QUERY PLAN
HashAggregate (cost=67965.49..83560.81 rows=732363 width=22) (actual time=376.648..801.889 rows=804435 loops=1)
Group Key: name
Planned Partitions: 32 Batches: 33 Memory Usage: 4113kB Disk Usage: 30624kB
-> Seq Scan on actor (cost=0.00..13703.21 rows=847021 width=14) (actual time=0.015..60.077 rows=845888 loops=1)
Planning Time: 0.069 ms
Execution Time: 832.037 ms


In [11]:
%%sql
-- write your EXPLAIN ANALYZE here --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE 
SELECT COUNT(*), id
FROM actor
GROUP BY id;

QUERY PLAN
GroupAggregate (cost=0.42..34726.95 rows=847021 width=12) (actual time=0.050..260.063 rows=845888 loops=1)
Group Key: id
-> Index Only Scan using actor_pkey on actor (cost=0.42..22021.64 rows=847021 width=4) (actual time=0.042..80.571 rows=845888 loops=1)
Heap Fetches: 157
Planning Time: 0.097 ms
Execution Time: 284.734 ms


<br/><br/>

**(Question 1.3, continued)**
Why do you think the `name` query uses a Sequential Scan, whereas the `id` query uses an Index Only Scan?

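Because the only index on the table (`actor_pkey`) contains just the `id` values, the `id` query can be
answered from the index alone, and an Index Only Scan is less I/O-intensive than a Sequential Scan
when it is available. No index covers the `name` column, so a Sequential Scan of the table is the only
way to retrieve its data.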
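
As a rough illustration of this answer, the sketch below models the table as a list of rows (the "heap") and the primary-key index as a sorted list of `(id, row position)` pairs. This is a toy model, not PostgreSQL's actual storage format, and the names `heap`, `pkey_index`, and both functions are hypothetical. Counting by `id` only needs the index, while counting by `name` has to read every heap row.

```python
from collections import Counter

# Toy "heap": rows of the actor table in arbitrary physical order.
heap = [
    (3, "Mark Bailey"),
    (1, "Bob Brady"),
    (4, "Buck Adams"),
    (2, "Mark Bailey"),
]

# Toy primary-key index: (id, position in heap), kept sorted by id.
pkey_index = sorted((row[0], pos) for pos, row in enumerate(heap))


def count_by_id_index_only():
    """GROUP BY id: every needed value lives in the index, so skip the heap."""
    return Counter(key for key, _pos in pkey_index)


def count_by_name_seq_scan():
    """GROUP BY name: name is in no index, so scan every heap row."""
    return Counter(name for _id, name in heap)


id_counts = count_by_id_index_only()
name_counts = count_by_name_seq_scan()
print(sorted(id_counts.items()))
print(sorted(name_counts.items()))
```

In the real plans above, the same split shows up as `Index Only Scan using actor_pkey` for the `id` query and `Seq Scan on actor` for the `name` query.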

## Question 1.4

Now, write a command that creates an index `name_actor_index` on the `name` attribute of `actor`.

In [13]:
%%sql
-- write your CREATE INDEX command here --

-- SOLUTION: delete the below before release --
CREATE INDEX name_actor_index ON actor(name);

**(Question 1.4, continued)**
Now rerun your `EXPLAIN ANALYZE` of your Question 1.1 query on `name` by copying and pasting it into the cell below. See below for the discussion question.

In [14]:
%%sql
-- write your EXPLAIN ANALYZE here --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE 
SELECT COUNT(*), name
FROM actor
GROUP BY name;

QUERY PLAN
GroupAggregate (cost=0.42..37772.01 rows=731383 width=22) (actual time=0.060..323.392 rows=804435 loops=1)
Group Key: name
-> Index Only Scan using name_actor_index on actor (cost=0.42..26228.74 rows=845888 width=14) (actual time=0.052..105.058 rows=845888 loops=1)
Heap Fetches: 157
Planning Time: 0.238 ms
Execution Time: 346.601 ms


**(Question 1.4, continued)**
Why does the `name` query now use an `Index Only Scan`? What index is it using?

The query can now make use of the newly created `name_actor_index` index, since an Index Only Scan
incurs less I/O cost than a full Sequential Scan. Notice that this is only possible because we
already paid the I/O cost upfront when creating this index.


## Question 1.5

Now in the cell below, analyze the impacts of rewriting your query from Question 1.2 to return the entries sorted by ID. In other words, run an `EXPLAIN ANALYZE` on a query that returns the actor IDs (**sorted by lowest ID first**) and the number of times the corresponding ID appears in the `Actor` relation.

**Discuss**: Why does this query take so much time? What is the `ORDER BY` operation doing here?

You might be surprised to find out that this query did not take any longer than the previous query.
In fact, the query plan did not even include an additional sorting step. Why? Because the B+
tree index already provides ordering “for free”. When you traverse a B+ tree index to retrieve all
the ID values, you already get them back in a sorted order.
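
The sketch below illustrates this idea, with a plain sorted Python list standing in for the keys read off the index leaves (an assumption for illustration; a real B+ tree is a paged tree structure). Duplicate keys are included only to make the grouping visible, since `actor.id` itself is unique. The function name `group_aggregate` is hypothetical.

```python
from itertools import groupby

# Keys as they would come back from walking the index leaves left to right.
index_order_keys = [1, 2, 2, 5, 7, 7, 7, 9]


def group_aggregate(sorted_keys):
    """Streaming aggregation: relies on equal keys being adjacent in the input."""
    return [(key, sum(1 for _ in run)) for key, run in groupby(sorted_keys)]


result = group_aggregate(index_order_keys)
print(result)

# The groups come out in key order, so an explicit sort changes nothing.
print(result == sorted(result))
```

Because the input arrives sorted, the aggregate can emit each group as soon as the key changes, and the output order matches what `ORDER BY id` asks for without a separate sort step.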

In [16]:
%%sql
-- write your query here --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE
SELECT COUNT(*), id
FROM actor
GROUP BY id
ORDER BY id;

QUERY PLAN
GroupAggregate (cost=0.42..34699.60 rows=845888 width=12) (actual time=0.076..264.933 rows=845888 loops=1)
Group Key: id
-> Index Only Scan using actor_pkey on actor (cost=0.42..22011.28 rows=845888 width=4) (actual time=0.068..82.501 rows=845888 loops=1)
Heap Fetches: 157
Planning Time: 0.174 ms
Execution Time: 290.046 ms


<br/><br/><br/>

---

# Question 2

Before starting this question, we encourage you to delete any indexes you created above. It will help you compute consistent results with other classmates.

In [17]:
# just run this cell
%sql DROP INDEX name_actor_index;

RuntimeError: (psycopg2.errors.UndefinedObject) index "name_actor_index" does not exist

[SQL: DROP INDEX name_actor_index;]
(Background on this error at: https://sqlalche.me/e/20/f405)
If you need help solving this issue, send us a message: https://ploomber.io/community


## Question 2.1

Write a query that computes an inner join on `actor` and `cast_info` on the actor ID. Your query should return all attributes.

Once you're comfortable that the query is working as expected, run it through `EXPLAIN ANALYZE`. **Discuss**: What kind of join is the query optimizer performing? Why might this be the case?

Hash join. 

Here are some reasons for using hash join:
- Join condition: This is an equi-join, which is particularly suitable for hash join. Hash join
matches rows by hashing the join key, so it would not be usable for an inequality join.
- Table sizes: If one of the tables (`actor`) is substantially smaller than the other (`cast_info`),
it can be loaded into memory and used to build a hash table, which can then be used for
a more efficient join with the larger table. Here the smaller table `actor` (845,888 rows) is
used to build the hash table. The larger table `cast_info` (2,211,936 rows) is then scanned and
matched against this hash table.
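
The build-and-probe strategy described above can be sketched in a few lines of Python. This is a toy in-memory model, not PostgreSQL's implementation: it ignores the batching to disk visible in the plan, and the function name `hash_join` and the sample rows are made up for illustration.

```python
from collections import defaultdict

# (id, name) rows for the smaller table and (person_id, movie_id) for the larger.
actor = [(1, "Bob Brady"), (2, "Buck Adams"), (3, "Mark Bailey")]
cast_info = [(1, 10), (3, 10), (3, 11)]


def hash_join(build_rows, probe_rows):
    """Equi-join on the first field of each row."""
    # Build phase: hash the smaller input on its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[0]].append(row)

    # Probe phase: one pass over the larger input, one hash lookup per row.
    out = []
    for probe in probe_rows:
        for build in table.get(probe[0], []):
            out.append(build + probe)
    return out


joined = hash_join(actor, cast_info)
for row in joined:
    print(row)
```

Each input is read once, which is why the plan shows a single `Seq Scan` on each table feeding the `Hash Join`.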


In [19]:
%%sql
-- write your query here; edit to include EXPLAIN ANALYZE --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info
WHERE actor.id = cast_info.person_id;

QUERY PLAN
Hash Join (cost=29222.48..89171.29 rows=2211784 width=26) (actual time=189.485..1588.855 rows=2211936 loops=1)
Hash Cond: (cast_info.person_id = actor.id)
-> Seq Scan on cast_info (cost=0.00..31905.84 rows=2211784 width=8) (actual time=0.047..232.695 rows=2211936 loops=1)
-> Hash (cost=13691.88..13691.88 rows=845888 width=18) (actual time=189.258..189.260 rows=845888 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 3114kB
-> Seq Scan on actor (cost=0.00..13691.88 rows=845888 width=18) (actual time=0.004..57.367 rows=845888 loops=1)
Planning Time: 0.161 ms
Execution Time: 1655.870 ms


<br/>

---

Sometimes you may prefer to adjust the PostgreSQL settings to steer the query optimizer toward a specific plan for a query.
In the remainder of this question we will explore how to specify these settings. Note that tweaking settings for the sake of a single query is **not recommended** in practice, as the change affects all of your queries! However, if you know specific characteristics of how your database will be queried in general, then by all means, dive in to adjust these knobs.

* All runtime parameters for PostgreSQL are in one view, called `pg_settings` (Documentation 54.24 [link](https://www.postgresql.org/docs/current/view-pg-settings.html)).
* In particular, the Planner Method Configuration (Documentation 20.7.1 [link](https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-SEQ-PAGE-COST)) includes the parameter descriptions for the query optimizer.

We encourage you to keep these pages up as you explore the next activity.

## Question 2.2

Run the below query. Which settings are related to selecting the type of join that the query optimizer can select?

In [21]:
%%sql
SELECT name
FROM pg_settings
WHERE name LIKE 'enable_%';

name
enable_async_append
enable_bitmapscan
enable_gathermerge
enable_hashagg
enable_hashjoin
enable_incremental_sort
enable_indexonlyscan
enable_indexscan
enable_material
enable_memoize


We are interested in the following: `enable_hashjoin`, `enable_mergejoin`, `enable_nestloop`.

## Question 2.3

Let's suppose we turn off hash joins as an option for the query optimizer. The syntax is included for you below.

In [22]:
# just run this cell
%sql set enable_hashjoin=false;

Copy your `EXPLAIN ANALYZE` command from Question 2.1 and rerun it below.

**Discuss**: Recall that our initial query was performed using a hash join; which join does the query optimizer pick below? Why might this be preferred over the remaining alternatives?

Merge join. Here are some reasons for using merge join:
- Join condition: Merge joins are especially efficient when the join condition is an equi-join, and
when both tables can be efficiently accessed in sorted order of the join keys.
- Presence of Order and Indexes: There’s an index on `actor.id` (as shown by the Index Scan
using `actor_pkey`), which makes accessing rows from the `actor` table in sorted order efficient.
This favors a merge join.
- Table Sizes: Both `actor` and `cast_info` have a considerable number of rows. Nested loop joins
tend to be efficient when one of the tables is significantly smaller, allowing for rapid repeated
lookups or scans of that table. Given the sizes of the tables, a nested loop join might have
been seen as less efficient than a merge join.
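
The reasoning above can be made concrete with a small sort-merge sketch. It is a simplified model under stated assumptions: keys on the left input are unique (as with the primary key `actor.id`), everything fits in memory, and the function name `merge_join` and sample rows are hypothetical. The left input is assumed to arrive in index order, while the right input has to be sorted first, mirroring the `Sort` node over `cast_info` in the plan.

```python
def merge_join(left_sorted, right_sorted):
    """Join two inputs sorted on their first field; left keys must be unique."""
    out = []
    i = 0
    for right in right_sorted:
        # Advance the left cursor until its key is >= the current right key.
        while i < len(left_sorted) and left_sorted[i][0] < right[0]:
            i += 1
        if i == len(left_sorted):
            break  # left side exhausted, no further matches are possible
        if left_sorted[i][0] == right[0]:
            out.append(left_sorted[i] + right)
    return out


actor = [(1, "Bob Brady"), (2, "Buck Adams"), (3, "Mark Bailey")]  # index order
cast_info = [(3, 11), (1, 10), (3, 10)]  # heap order, not sorted

# The unsorted side must be sorted on the join key before merging.
cast_sorted = sorted(cast_info, key=lambda row: row[0])
merged = merge_join(actor, cast_sorted)
for row in merged:
    print(row)
```

Both cursors only move forward, so once the inputs are sorted the merge itself is a single pass; the sort of the larger table is where most of the extra cost comes from.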

In [24]:
%%sql
-- copy your EXPLAIN ANALYZE query from the previous part here --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info
WHERE actor.id = cast_info.person_id;

QUERY PLAN
Merge Join (cost=325485.10..405625.74 rows=2211784 width=26) (actual time=971.957..2027.917 rows=2211936 loops=1)
Merge Cond: (actor.id = cast_info.person_id)
-> Index Scan using actor_pkey on actor (cost=0.42..39336.91 rows=845888 width=18) (actual time=0.010..144.174 rows=845888 loops=1)
-> Materialize (cost=325472.25..336531.17 rows=2211784 width=8) (actual time=967.461..1449.327 rows=2211936 loops=1)
-> Sort (cost=325472.25..331001.71 rows=2211784 width=8) (actual time=967.450..1234.208 rows=2211936 loops=1)
Sort Key: cast_info.person_id
Sort Method: external merge Disk: 39000kB
-> Seq Scan on cast_info (cost=0.00..31905.84 rows=2211784 width=8) (actual time=0.062..166.354 rows=2211936 loops=1)
Planning Time: 0.164 ms
JIT:


## Question 2.4

Next, what if we turn off the option for using the join in Question 2.3? In the cell below, replace `# YOUR CODE HERE` with the one-line SQL magic that will set the corresponding `pg_settings` entry to `false`. Feel free to refer to the sql magic in Question 2.3 and the `pg_settings` list in Question 2.2 as needed.

In [26]:
# YOUR SQL line MAGIC HERE
%sql set enable_mergejoin=false; -- SOLUTION, delete before release

Copy your `EXPLAIN ANALYZE` command from the previous parts and rerun it below. Note the selected join, as well as the significantly longer execution time!
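
With hash and merge joins both disabled, the planner is left with a nested loop join, in which every outer row triggers a lookup on the inner side. The toy sketch below shows that shape, with a small cache in front of the lookup in the spirit of the `Memoize` node. It is only an assumption-laden illustration: a dictionary stands in for the inner-side lookup, `functools.lru_cache` stands in for the cache, and the names `probe_actor` and `nested_loop_join` are hypothetical.

```python
from functools import lru_cache

# Stand-in for the inner side: a lookup structure keyed on actor id.
actor_by_id = {1: "Bob Brady", 2: "Buck Adams", 3: "Mark Bailey"}
cast_info = [(3, 11), (1, 10), (3, 10), (3, 12)]  # (person_id, movie_id)


@lru_cache(maxsize=2)
def probe_actor(person_id):
    """One inner-side lookup; repeated keys are served from the cache."""
    return actor_by_id.get(person_id)


def nested_loop_join(outer_rows):
    """For each outer row, look up the matching inner row (if any)."""
    out = []
    for person_id, movie_id in outer_rows:
        name = probe_actor(person_id)
        if name is not None:
            out.append((person_id, name, person_id, movie_id))
    return out


joined = nested_loop_join(cast_info)
info = probe_actor.cache_info()
print(joined)
print("hits:", info.hits, "misses:", info.misses)
```

The work grows with the number of outer rows times the cost of each inner lookup, which is why this plan is so much slower on two large tables even with caching.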

In [27]:
%%sql
-- write your query here; edit to include EXPLAIN ANALYZE --

-- SOLUTION: delete the below before release --
EXPLAIN ANALYZE
SELECT *
FROM actor, cast_info
WHERE actor.id = cast_info.person_id;

QUERY PLAN
Gather (cost=1000.43..636421.72 rows=2211784 width=26) (actual time=270.688..8079.711 rows=2211936 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.43..414243.32 rows=921577 width=26) (actual time=336.058..6766.368 rows=737312 loops=3)
-> Parallel Seq Scan on cast_info (cost=0.00..19003.77 rows=921577 width=8) (actual time=0.043..146.181 rows=737312 loops=3)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.008..0.008 rows=1 loops=2211936)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 344449 Misses: 354417 Evictions: 319025 Overflows: 0 Memory Usage: 4097kB
Worker 0: Hits: 376596 Misses: 388414 Evictions: 353030 Overflows: 0 Memory Usage: 4097kB


## Question 2.5

**Cleanup**: Finally, reset the two settings you edited in Questions 2.3 and 2.4 in the cell below.

In [29]:
# YOUR SQL line MAGIC HERE
# YOUR SQL line MAGIC HERE
%sql set enable_hashjoin=true; -- SOLUTION, delete before release
%sql set enable_mergejoin=true; -- SOLUTION, delete before release