# Discussion 03 Notebook

This notebook is an accompaniment to the associated discussion worksheet handout.

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, please remember to drop the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

We assume you have the associated lecture folder `lec06` pulled into your repo already. The below commands create a symbolic link (i.e., shortcut/redirect with `ln`) to this lecture data directory, allowing some space saving, and unzip the database file.

In [8]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f imdb_perf_lecture.sql

ERROR:  database "imdb_perf_lecture" is being accessed by other users
DETAIL:  There is 1 other session using the database.
ERROR:  database "imdb_perf_lecture" already exists
psql: error: imdb_perf_lecture.sql: No such file or directory


Before starting this part, review the schema of the relations in the `imdb_perf_lecture` database. Here's the printout from `psql`:

```
imdb_perf_lecture=# \d actors
               Table "public.actors"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)

imdb_perf_lecture=# \d movies
                   Table "public.movies"
     Column      |  Type   | Collation | Nullable | Default 
-----------------+---------+-----------+----------+---------
 id              | integer |           | not null | 
 title           | text    |           |          | 
 year            | integer |           |          | 
 runtime_minutes | integer |           |          | 
Indexes:
    "movie_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movies(id)

imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movies(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)

```

In [9]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
import pandas as pd


# IV. Query Performance

This question looks at the impacts of **aggregation** and **sorting** on query performance.

In [11]:
%%sql
SELECT * FROM actors;

RuntimeError: If using snippets, you may pass the --with argument explicitly.
For more details please refer: https://jupysql.ploomber.io/en/latest/compose.html#with-argument


Original error message from DB driver:
(psycopg2.errors.UndefinedTable) relation "actors" does not exist
LINE 1: SELECT * FROM actors;
                      ^

[SQL: SELECT * FROM actors;]
(Background on this error at: https://sqlalche.me/e/20/f405)



## Question 8

Write a query that returns the actor names and the number of times the corresponding name appears in the `actors` relation.

In [None]:
%%sql
-- write your query here --

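As a reminder of the counting pattern, here is a minimal sketch on a made-up table rather than on `actors`. It assumes Python's built-in `sqlite3` module as a stand-in for PostgreSQL; the `pets` table, its `species` column, and its rows are hypothetical and exist only for illustration.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")

# Hypothetical toy table; not part of the IMDB database.
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, species TEXT)")
conn.executemany(
    "INSERT INTO pets (species) VALUES (?)",
    [("cat",), ("dog",), ("cat",), ("bird",), ("cat",)],
)

# GROUP BY produces one output row per distinct species;
# COUNT(*) counts the rows that fall into each group.
species_counts = conn.execute(
    "SELECT species, COUNT(*) FROM pets GROUP BY species ORDER BY species"
).fetchall()

print(species_counts)  # [('bird', 1), ('cat', 3), ('dog', 1)]
```

The same grouping idea applies to any column, including one whose values are all unique.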

## Question 9

Write a query that returns the actor IDs and the number of times the corresponding ID appears in the `actors` relation.

In [None]:
%%sql
-- write your query here --


## Question 10

Run `EXPLAIN ANALYZE` on your two queries above. See below for the full question.

If you're having trouble seeing the entirety of the query plan, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

In [None]:
# run this cell to raise the display limit from 10 rows to 20
%config SqlMagic.displaylimit = 20


In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


<br/><br/>

**(Question, continued)**
Why do you think the `name` query uses a Sequential Scan, whereas the `id` query uses an Index Only Scan?

## Question 11

Write a command that creates an index `name_actor_index` on the `name` attribute of `actors`.

In [None]:
%%sql
-- write your query here --
-- CREATE INDEX name_actor_index on actors (name);

In [None]:
# DROP INDEX name_actor_index 

**(Question, continued)**
Rerun the `EXPLAIN ANALYZE` for your Question 8 query on `name` by copying and pasting it into the cell below. The discussion question follows the cell.

In [None]:
%%sql
-- write your EXPLAIN ANALYZE here --


**(Question, continued)**
Why does the `name` query now use an Index Only Scan? What index is it using?

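To see how adding an index can change a plan, the sketch below compares plans before and after creating an index on a toy table. It is only an analogy: it assumes Python's built-in `sqlite3` and SQLite's `EXPLAIN QUERY PLAN`, not PostgreSQL's `EXPLAIN ANALYZE`, and the `pets` table, the `species_index` name, and the `plan` helper are hypothetical.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, species TEXT)")
conn.executemany(
    "INSERT INTO pets (species) VALUES (?)",
    [("cat",), ("dog",), ("cat",), ("bird",)],
)


def plan(sql):
    """Return the textual steps of SQLite's query plan for `sql`."""
    # The last column of each EXPLAIN QUERY PLAN row is the description.
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


query = "SELECT species, COUNT(*) FROM pets GROUP BY species"

# Without an index on species, the planner has to read the table itself.
plan_before = plan(query)

# With an index on species, every column the query needs lives in the index.
conn.execute("CREATE INDEX species_index ON pets (species)")
plan_after = plan(query)

print("before:", plan_before)
print("after: ", plan_after)
```

SQLite's planner is not PostgreSQL's, so treat the printed plans as a rough illustration and compare them against your own `EXPLAIN ANALYZE` output.
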
## Question 12

Analyze the impact of sorting as follows. Rewrite your query from **Question 9** to return the entries sorted by ID. In other words, run an `EXPLAIN ANALYZE` on a query that returns the actor IDs (**sorted by lowest ID first**) and the number of times the corresponding ID appears in the `actors` relation.

**Discuss**: Do you expect this query to take more time? Why or why not?

In [None]:
%%sql
-- write your query here --

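One way to reason about the cost of sorting is to check whether the plan gains an explicit sort step. The sketch below does this on a toy table, again assuming Python's built-in `sqlite3` as a stand-in for PostgreSQL; the `pets` table and the `plan` helper are hypothetical, and PostgreSQL may choose a different plan on the real data.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pets (id INTEGER PRIMARY KEY, species TEXT)")
conn.executemany(
    "INSERT INTO pets (species) VALUES (?)",
    [("cat",), ("dog",), ("bird",)],
)


def plan(sql):
    """Return the textual steps of SQLite's query plan for `sql`."""
    return [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


grouped_sql = "SELECT id, COUNT(*) FROM pets GROUP BY id"
sorted_sql = grouped_sql + " ORDER BY id"

grouped_plan = plan(grouped_sql)
sorted_plan = plan(sorted_sql)

# If sorting needed extra work, SQLite would report a temporary B-tree step.
needs_extra_sort = any("TEMP B-TREE" in step for step in sorted_plan)

sorted_rows = conn.execute(sorted_sql).fetchall()

print("grouped:", grouped_plan)
print("sorted: ", sorted_plan)
print("extra sort step:", needs_extra_sort)
```

Here the rows are already stored in primary-key order, which is why the toy plan can skip a separate sort; check whether your PostgreSQL plan shows the same behavior.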