# Discussion 04 Notebook

This notebook accompanies the discussion worksheet handout.

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you did not already load the `imdb_perf_lecture` database in a previous lecture (Lecture 6), load it now.

In [4]:
# !ln -sf ../../lec/lec06/data .
# !unzip -u data/imdb_perf_lecture.zip -d data/

In [5]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f imdb_perf_lecture.sql

DROP DATABASE
CREATE DATABASE
psql: error: imdb_perf_lecture.sql: No such file or directory


In [14]:
# !psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
# !psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
# !psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql

In [4]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
import pandas as pd

Before starting this part, review the schema of the relations in the `imdb_perf_lecture` database. Here's the printout from `psql`:

```
imdb_perf_lecture=# \d actor
               Table "public.actor"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d movie
                    Table "public.movie"
     Column      |  Type   | Collation | Nullable | Default 
-----------------+---------+-----------+----------+---------
 id              | integer |           | not null | 
 title           | text    |           |          | 
 year            | integer |           |          | 
 runtime_minutes | integer |           |          | 
Indexes:
    "movie_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
```

# Section V 
## `EXPLAIN ANALYZE` with Joins

Consider the following schema:
- `actors(id, name)`
- `cast_info(person_id, movie_id)`

## Question 10

Write a query that computes an inner join on `actors` and `cast_info` on the actor ID. Your query should return all attributes.

In [3]:
%%sql
SELECT *
FROM actors AS a
INNER JOIN cast_info AS c
ON a.id = c.person_id;


id,name,person_id,movie_id
299757,Aage Fønss,299757,1115
736379,Jenny Roelsgaard,736379,1116
299757,Aage Fønss,299757,1116
735618,Cecilio Rodríguez de la Vega,735618,1184
381005,Aage Hertel,381005,1240
299757,Aage Fønss,299757,1277
736379,Jenny Roelsgaard,736379,1277
299757,Aage Fønss,299757,1440
254852,George D. Ellis,254852,1604
54884,Florence Barker,54884,1712


Once you're comfortable that the query is working as expected, run it through `EXPLAIN ANALYZE`. **Discuss**: What kind of join is the query optimizer performing? Why might this be the case?

In [4]:
%%sql
EXPLAIN ANALYZE 
SELECT *
FROM actors AS a
INNER JOIN cast_info AS c
ON a.id = c.person_id;


QUERY PLAN
Gather Merge (cost=160273.59..375337.94 rows=1843280 width=26) (actual time=2576.845..3580.578 rows=2211936 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=159273.57..161577.67 rows=921640 width=26) (actual time=2524.434..2699.209 rows=737312 loops=3)
Sort Key: actors.id
Sort Method: external merge Disk: 27384kB
Worker 0: Sort Method: external merge Disk: 27336kB
Worker 1: Sort Method: external merge Disk: 27360kB
-> Parallel Hash Join (cost=15222.20..45913.91 rows=921640 width=26) (actual time=684.880..1545.001 rows=737312 loops=3)
Hash Cond: (cast_info.person_id = actors.id)
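
To make the discussion concrete, here is a minimal in-memory sketch of the hash join idea in Python. It is not PostgreSQL's implementation: the function name `hash_join` and the choice of `actors` as the build side are assumptions for illustration, and the sample rows are copied from the Question 10 output above.

```python
from collections import defaultdict

def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Toy hash join: build a hash table on one input, probe it with the other."""
    # Build phase: index the (ideally smaller) input by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)
    # Probe phase: one hash lookup per row of the other input.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], []):
            joined.append({**match, **row})
    return joined

# A few rows copied from the Question 10 output.
actors = [
    {"id": 299757, "name": "Aage Fønss"},
    {"id": 736379, "name": "Jenny Roelsgaard"},
    {"id": 381005, "name": "Aage Hertel"},
]
cast_info = [
    {"person_id": 299757, "movie_id": 1115},
    {"person_id": 736379, "movie_id": 1116},
    {"person_id": 299757, "movie_id": 1116},
    {"person_id": 381005, "movie_id": 1240},
]

result = hash_join(actors, cast_info, "id", "person_id")
for r in result:
    print(r)
```

Each input is read once and neither needs to be sorted, which is one reason a hash join can be attractive for an equality join when no useful ordering is available on `cast_info.person_id`.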


<br/>

---

Sometimes you may want to adjust the PostgreSQL settings to steer the query optimizer toward a specific plan for a query.
In the remainder of this question we will explore how to specify these settings. Note that tweaking settings for the sake of a single query is **not recommended** in practice, as the change affects all of your queries! However, if you know specific characteristics of how your database will be queried in general, then by all means, dive in to adjust these knobs.

* All runtime parameters for PostgreSQL are in one view, called `pg_settings` (Documentation 54.24 [link](https://www.postgresql.org/docs/current/view-pg-settings.html)).
* In particular, the Planner Method Configuration (Documentation 20.7.1 [link](https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-SEQ-PAGE-COST)) includes the parameter descriptions for the query optimizer.

We encourage you to keep these pages up as you explore the next activity.

## Question 11

Run the query below. Which settings control the types of join that the query optimizer is allowed to choose?

In [9]:
%config SqlMagic.displaylimit = None

In [10]:
%%sql
SELECT name
FROM pg_settings
WHERE name LIKE 'enable_%';

name
enable_async_append
enable_bitmapscan
enable_gathermerge
enable_hashagg
enable_hashjoin
enable_incremental_sort
enable_indexonlyscan
enable_indexscan
enable_material
enable_memoize


## Question 12

Let's suppose we turn off hash join as an option for the query optimizer. The syntax is included for you below.

In [11]:
# just run this cell
%sql set enable_hashjoin=false;

Copy your `EXPLAIN ANALYZE` command from Question 10 and rerun it below.

**Discuss**: Recall that our initial query was performed using a hash join; which join does the query optimizer pick below? Why might this be preferred over the remaining alternatives?

In [12]:
%%sql
EXPLAIN ANALYZE 
SELECT *
FROM actors AS a
INNER JOIN cast_info AS c
ON a.id = c.person_id;


QUERY PLAN
Merge Join (cost=325499.41..405308.26 rows=2211936 width=26) (actual time=860.594..1934.699 rows=2211936 loops=1)
Merge Cond: (actors.id = cast_info.person_id)
-> Index Scan using actor_pkey on actors (cost=0.42..38992.95 rows=845888 width=18) (actual time=0.056..168.143 rows=845888 loops=1)
-> Materialize (cost=325497.89..336557.57 rows=2211936 width=8) (actual time=855.408..1341.910 rows=2211936 loops=1)
-> Sort (cost=325497.89..331027.73 rows=2211936 width=8) (actual time=855.401..1129.626 rows=2211936 loops=1)
Sort Key: cast_info.person_id
Sort Method: external merge Disk: 39000kB
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.027..147.011 rows=2211936 loops=1)
Planning Time: 0.136 ms
JIT:
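
The plan above reads `actors` in key order through its primary-key index and sorts `cast_info` before merging. A minimal Python sketch of that idea follows. The function name `merge_join` is hypothetical, and the sketch assumes the left input is already sorted with unique keys (as a primary key would guarantee); the sample rows are taken from the Question 10 output.

```python
def merge_join(left_sorted, right_rows, left_key, right_key):
    """Toy sort-merge join. Assumes left_sorted is ordered by left_key with unique keys."""
    # Sort step: the right input has no useful order, so sort it on the join key.
    right_sorted = sorted(right_rows, key=lambda r: r[right_key])
    joined = []
    i = 0  # cursor into the left input; it only ever moves forward
    for r in right_sorted:
        # Advance the left cursor until its key catches up with the right key.
        while i < len(left_sorted) and left_sorted[i][left_key] < r[right_key]:
            i += 1
        if i == len(left_sorted):
            break  # left input exhausted, no further matches are possible
        if left_sorted[i][left_key] == r[right_key]:
            joined.append({**left_sorted[i], **r})
    return joined

# Left side already in id order, standing in for the index scan on the primary key.
actors = [
    {"id": 54884, "name": "Florence Barker"},
    {"id": 254852, "name": "George D. Ellis"},
    {"id": 299757, "name": "Aage Fønss"},
    {"id": 381005, "name": "Aage Hertel"},
]
cast_info = [
    {"person_id": 299757, "movie_id": 1115},
    {"person_id": 254852, "movie_id": 1604},
    {"person_id": 54884, "movie_id": 1712},
    {"person_id": 299757, "movie_id": 1116},
]

merged = merge_join(actors, cast_info, "id", "person_id")
for r in merged:
    print(r)
```

After the sort, both inputs are consumed in a single forward pass, so the sort of `cast_info` is the main extra cost in this sketch.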


## Question 13

Next, what if we also turn off the join that the optimizer chose in Question 12? In the cell below, replace `# YOUR SQL line MAGIC HERE` with the one-line SQL magic that will set the corresponding `pg_settings` entry to `false`.

In [13]:
# YOUR SQL line MAGIC HERE
%sql set enable_mergejoin=false;

Copy your `EXPLAIN ANALYZE` command from the previous parts and rerun it below. Note the selected join, as well as the significantly longer execution time!

In [14]:
%%sql
-- write your query here; edit to include EXPLAIN ANALYZE --


QUERY PLAN
Gather (cost=1000.43..633539.36 rows=2211936 width=26) (actual time=241.271..6354.512 rows=2211936 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Nested Loop (cost=0.43..411345.76 rows=921640 width=26) (actual time=280.761..5121.940 rows=737312 loops=3)
-> Parallel Seq Scan on cast_info (cost=0.00..19004.40 rows=921640 width=8) (actual time=0.039..216.512 rows=737312 loops=3)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=0.006..0.006 rows=1 loops=2211936)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 353079 Misses: 360703 Evictions: 325310 Overflows: 0 Memory Usage: 4097kB
Worker 0: Hits: 370101 Misses: 384513 Evictions: 349123 Overflows: 0 Memory Usage: 4097kB
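
The `Memoize` node in this plan caches inner-side lookups by `cast_info.person_id`, and its `Hits`, `Misses`, and `Evictions` counters describe how well that cache worked. The Python sketch below imitates the idea with a small least-recently-used cache. It is a simplification, not PostgreSQL's code: the names `nested_loop_memoized` and `lookup_actor` are hypothetical, the cache is sized in entries rather than memory, and the dictionary lookup stands in for whatever inner scan the (truncated) plan uses.

```python
from collections import OrderedDict

def nested_loop_memoized(outer_rows, lookup_inner, outer_key, cache_size):
    """Toy nested loop join with a bounded LRU cache in front of the inner lookup."""
    cache = OrderedDict()
    stats = {"hits": 0, "misses": 0, "evictions": 0}
    joined = []
    for row in outer_rows:
        key = row[outer_key]
        if key in cache:
            stats["hits"] += 1
            cache.move_to_end(key)  # mark as most recently used
            matches = cache[key]
        else:
            stats["misses"] += 1
            matches = lookup_inner(key)  # the expensive inner-side lookup
            cache[key] = matches
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict the least recently used entry
                stats["evictions"] += 1
        for match in matches:
            joined.append({**match, **row})
    return joined, stats

# Rows copied from the Question 10 output.
actors = [
    {"id": 299757, "name": "Aage Fønss"},
    {"id": 736379, "name": "Jenny Roelsgaard"},
    {"id": 735618, "name": "Cecilio Rodríguez de la Vega"},
    {"id": 381005, "name": "Aage Hertel"},
]
actor_index = {a["id"]: [a] for a in actors}

def lookup_actor(person_id):
    # Stand-in for the inner-side scan (an assumption for this sketch).
    return actor_index.get(person_id, [])

person_ids = [299757, 736379, 299757, 735618, 381005, 299757, 736379, 299757]
cast_info = [{"person_id": p, "movie_id": m}
             for p, m in zip(person_ids, [1115, 1116, 1116, 1184, 1240, 1277, 1277, 1440])]

joined, stats = nested_loop_memoized(cast_info, lookup_actor, "person_id", cache_size=2)
print(stats)
```

With a cache this small, most lookups miss and entries are evicted before they can be reused, which mirrors the high miss and eviction counts reported in the plan above.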


## Question 14

**Cleanup**: Finally, reset the two settings you changed to re-enable the two join methods that you disabled in the previous questions.

In [21]:
# YOUR SQL line MAGIC HERE
%sql set enable_hashjoin=true;
%sql set enable_mergejoin=true;
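
To compare the three plans from Questions 10, 12, and 13 side by side, the sketch below pulls the end time out of the top line of each plan as pasted in this notebook. The helper name `total_ms` and the dictionary labels are hypothetical. The numbers come from single runs, and the top node's `actual time` is used only as a rough proxy because the `Execution Time` lines are cut off in the outputs above.

```python
import re

# Top plan lines copied verbatim from the three EXPLAIN ANALYZE outputs in this notebook.
plan_top_lines = {
    "default (hash join)":
        "Gather Merge (cost=160273.59..375337.94 rows=1843280 width=26) "
        "(actual time=2576.845..3580.578 rows=2211936 loops=1)",
    "hash join off (merge join)":
        "Merge Join (cost=325499.41..405308.26 rows=2211936 width=26) "
        "(actual time=860.594..1934.699 rows=2211936 loops=1)",
    "hash and merge join off (nested loop)":
        "Gather (cost=1000.43..633539.36 rows=2211936 width=26) "
        "(actual time=241.271..6354.512 rows=2211936 loops=1)",
}

# "actual time=<startup>..<total>" in milliseconds.
ACTUAL_TIME = re.compile(r"actual time=(\d+\.\d+)\.\.(\d+\.\d+)")

def total_ms(plan_line):
    """Return the time (ms) at which the node finished returning rows."""
    match = ACTUAL_TIME.search(plan_line)
    if match is None:
        raise ValueError("no 'actual time' found in plan line")
    return float(match.group(2))

totals = {label: total_ms(line) for label, line in plan_top_lines.items()}
for label, ms in sorted(totals.items(), key=lambda kv: kv[1]):
    print(f"{label}: {ms:.1f} ms")
```

In these particular runs the nested loop plan is the slowest of the three, consistent with the observation in Question 13.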