# Discussion 06 Notebook

This notebook is an accompaniment to the associated discussion worksheet handout.

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the cells below, please remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

If you did not already load it for a previous lecture, load the `imdb_perf_lecture` database now.

In [None]:
!unzip -u ../lecture/lec07/data/imdb_perf_lecture.zip -d ../lecture/lec07/data/

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f ../lecture/lec07/data/imdb_perf_lecture.sql

In [None]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
import pandas as pd

Before starting this part, review the schema of the relations in the `imdb_perf_lecture` database. Here's the printout from `psql`:

```
imdb_perf_lecture=# \d actor
               Table "public.actor"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actor(id)

imdb_perf_lecture=# \d movie
                    Table "public.movie"
     Column      |  Type   | Collation | Nullable | Default 
-----------------+---------+-----------+----------+---------
 id              | integer |           | not null | 
 title           | text    |           |          | 
 year            | integer |           |          | 
 runtime_minutes | integer |           |          | 
Indexes:
    "movie_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movie(id)
```

# Question 1 Queries

This question looks at the impacts of **aggregation** and **sorting** on query performance.
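
As a rough mental model for this question, a grouped count can be computed in two broad ways: build a hash table keyed on the grouping attribute, or read the values in sorted order and count runs of equal values. The Python sketch below is an illustration only. It is not PostgreSQL's implementation, and the `names` list is made-up sample data.

```python
from collections import Counter
from itertools import groupby

# Hypothetical sample data standing in for the `name` column of `actor`.
names = ["Ana", "Bo", "Ana", "Cy", "Bo", "Ana"]


def hash_aggregate(values):
    """Count occurrences with a hash table; input order does not matter."""
    return dict(Counter(values))


def sorted_aggregate(sorted_values):
    """Count occurrences by scanning runs of equal values.

    Only correct when the input is already sorted, for example when
    values are read in the order of an index.
    """
    return {key: sum(1 for _ in run) for key, run in groupby(sorted_values)}


hash_counts = hash_aggregate(names)
sorted_counts = sorted_aggregate(sorted(names))  # the explicit sort is the extra cost
print(hash_counts == sorted_counts)
```

Both strategies give the same counts. They differ in what they need from the input: the second one needs sorted values, which either come for free from an ordered source or have to be paid for with a sort.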

## Question 1.1

In the cell below, write a query that returns each actor name and the number of times that name appears in the `actor` relation.

In [None]:
%%sql
-- write your query here --


## Question 1.2

Now, in the cell below, write a query that returns each actor ID and the number of times that ID appears in the `actor` relation.

In [None]:
%%sql
-- write your query here --


## Question 1.3

Run `EXPLAIN ANALYZE` on your two queries above. See below for the full question.

If you're having trouble seeing the entirety of the query plan, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

In [None]:
# run this cell to raise the display limit from 10 rows to 20
%config SqlMagic.displaylimit = 20

In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


<br/><br/>

**(Question 1.3, continued)**
Why do you think the `name` query uses a Sequential Scan, whereas the `id` query uses an Index Only Scan?

## Question 1.4

Now, write a command that creates an index `name_actor_index` on the `name` attribute of `actor`.
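
For intuition, an index can be pictured as a sorted copy of the indexed column with pointers back to the table rows. The Python sketch below is a simplification: a real B-tree is a balanced tree, not a flat list, and `names`, `rows_with_name`, and `count_name` are hypothetical names used only for illustration.

```python
from bisect import bisect_left, bisect_right

# Hypothetical table: the list position plays the role of the row location.
names = ["Ana", "Bo", "Ana", "Cy", "Bo", "Ana"]

# The "index": (value, row position) pairs kept in sorted order.
index = sorted((name, row) for row, name in enumerate(names))
keys = [name for name, _ in index]


def rows_with_name(target):
    """Use binary search on the index to find matching row positions."""
    lo = bisect_left(keys, target)
    hi = bisect_right(keys, target)
    return [row for _, row in index[lo:hi]]


def count_name(target):
    """Answer a count from the index alone, without reading any table rows."""
    return bisect_right(keys, target) - bisect_left(keys, target)


print(rows_with_name("Bo"), count_name("Ana"))
```

Note that `count_name` never touches `names` itself: everything it needs is already stored in the sorted index.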

In [None]:
%%sql
-- write your CREATE INDEX command here --


**(Question 1.4, continued)**
Now rerun your `EXPLAIN ANALYZE` of your Question 1.1 query on `name` by copying and pasting it into the cell below. See below for the discussion question.

In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


**(Question 1.4, continued)**
Why does the `name` query now use an `Index Only Scan`? What index is it using?

## Question 1.5

Now, in the cell below, analyze the impact of rewriting your query from Question 1.2 to return the entries sorted by ID. In other words, run an `EXPLAIN ANALYZE` on a query that returns the actor IDs (**sorted by lowest ID first**) and the number of times each ID appears in the `actor` relation.

**Discuss**: Why does this query take so much time? What is the `ORDER BY` operation doing here?

In [None]:
%%sql
-- write your query here --


<br/><br/><br/>

---

# Question 2

Before starting this question, we encourage you to delete any indexes you created above. It will help you compute consistent results with other classmates.

In [None]:
# just run this cell
%sql DROP INDEX IF EXISTS name_actor_index;

## Question 2.1

Write a query that computes an inner join on `actor` and `cast_info` on the actor ID. Your query should return all attributes.

Once you're comfortable that the query is working as expected, run it through `EXPLAIN ANALYZE`. **Discuss**: What kind of join is the query optimizer performing? Why might this be the case?
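
To help read the join node in the plan, here is a minimal Python sketch of three classic join strategies (nested loop, hash, and sort-merge) applied to tiny made-up tables. It is a teaching sketch under simplifying assumptions, not how PostgreSQL implements them; the sample rows and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical rows: actor is (id, name); cast_info is (person_id, movie_id).
actor = [(1, "Ana"), (2, "Bo"), (3, "Cy")]
cast_info = [(2, 10), (1, 11), (2, 12), (4, 13)]


def nested_loop_join(left, right):
    """Compare every left row with every right row."""
    return [(l, r) for l in left for r in right if l[0] == r[0]]


def hash_join(left, right):
    """Build a hash table on the left key, then probe it with each right row."""
    table = defaultdict(list)
    for l in left:
        table[l[0]].append(l)
    return [(l, r) for r in right for l in table.get(r[0], [])]


def merge_join(left, right):
    """Sort both inputs on the key, then advance two cursors in step."""
    left, right = sorted(left), sorted(right)
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i][0] < right[j][0]:
            i += 1
        elif left[i][0] > right[j][0]:
            j += 1
        else:
            # Emit every right row in the run of equal keys for this left row.
            k = j
            while k < len(right) and right[k][0] == left[i][0]:
                out.append((left[i], right[k]))
                k += 1
            i += 1  # j stays put so a duplicate left key rescans the same run
    return out


results = [sorted(f(actor, cast_info)) for f in (nested_loop_join, hash_join, merge_join)]
print(results[0] == results[1] == results[2])
```

All three return the same matching pairs. They differ in how much work they do: the nested loop compares every pair of rows, the hash join makes one pass over each input plus the memory for the hash table, and the merge join needs both inputs in sorted order.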

In [None]:
%%sql
-- write your query here; edit to include EXPLAIN ANALYZE --


<br/>

---

Sometimes you may want to adjust PostgreSQL settings to steer the query optimizer toward a particular plan for a query.
In the remainder of this question we will explore how to specify these settings. Note that tweaking settings for the sake of a single query is **not recommended** in practice, because the change affects all subsequent queries in your session, not just the one you are tuning! However, if you know specific characteristics of how your database will be queried in general, then by all means, dive in to adjust these knobs.

* All runtime parameters for PostgreSQL are in one view, called `pg_settings` (Documentation 54.24 [link](https://www.postgresql.org/docs/current/view-pg-settings.html)).
* In particular, the Planner Method Configuration (Documentation 20.7.1 [link](https://www.postgresql.org/docs/current/runtime-config-query.html#GUC-SEQ-PAGE-COST)) includes the parameter descriptions for the query optimizer.

We encourage you to keep these pages up as you explore the next activity.

## Question 2.2

Run the query below. Which settings control the types of join that the query optimizer is allowed to choose?

In [None]:
%%sql
SELECT name
FROM pg_settings
WHERE name LIKE 'enable_%';

## Question 2.3

Let's suppose we turn off hash joins as an option for the query optimizer. The syntax is included for you below.

In [None]:
# just run this cell
%sql set enable_hashjoin=false;

Copy your `EXPLAIN ANALYZE` command from Question 2.1 and rerun it below.

**Discuss**: Recall that our initial query was performed using a hash join. Which join does the query optimizer pick below? Why might it be preferred over the remaining alternatives?

In [None]:
%%sql
-- copy your EXPLAIN ANALYZE query from the previous part here --


## Question 2.4

Next, what if we also turn off the join that the optimizer chose in Question 2.3? In the cell below, replace `# YOUR SQL line MAGIC HERE` with the one-line SQL magic that sets the corresponding `pg_settings` entry to `false`. Feel free to refer to the SQL magic in Question 2.3 and the `pg_settings` list in Question 2.2 as needed.

In [None]:
# YOUR SQL line MAGIC HERE
%sql set enable_mergejoin=false; -- SOLUTION, delete before release

Copy your `EXPLAIN ANALYZE` command from the previous parts and rerun it below. Note the selected join, as well as the significantly longer execution time!

In [None]:
%%sql
-- write your query here; edit to include EXPLAIN ANALYZE --


## Question 2.5

**Cleanup**: Finally, reset the two settings you edited in Questions 2.3 and 2.4 in the cell below.

In [None]:
# YOUR SQL line MAGIC HERE
# YOUR SQL line MAGIC HERE