# Discussion 04 Notebook

This notebook accompanies the associated discussion worksheet handout.

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, remember to delete the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

We assume you have already pulled the associated lecture folder `lec06` into your repo. The commands below create a symbolic link (i.e., a shortcut/redirect made with `ln`) to this lecture data directory, which saves some space, and then unzip the database file.

In [None]:
!ln -sf ../../lec/lec06/data .
!unzip -u data/imdb_perf_lecture.zip -d data/

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql

Before starting this part, review the schema of the relations in the `imdb_perf_lecture` database. Here's the printout from `psql`:

```
imdb_perf_lecture=# \d actors
               Table "public.actors"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 id     | integer |           | not null | 
 name   | text    |           |          | 
Indexes:
    "actor_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)

imdb_perf_lecture=# \d movies
                   Table "public.movies"
     Column      |  Type   | Collation | Nullable | Default 
-----------------+---------+-----------+----------+---------
 id              | integer |           | not null | 
 title           | text    |           |          | 
 year            | integer |           |          | 
 runtime_minutes | integer |           |          | 
Indexes:
    "movie_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "cast_info" CONSTRAINT "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movies(id)

imdb_perf_lecture=# \d cast_info
               Table "public.cast_info"
  Column   |  Type   | Collation | Nullable | Default 
-----------+---------+-----------+----------+---------
 person_id | integer |           |          | 
 movie_id  | integer |           |          | 
Foreign-key constraints:
    "cast_info_movie_id_fkey" FOREIGN KEY (movie_id) REFERENCES movies(id)
    "cast_info_person_id_fkey" FOREIGN KEY (person_id) REFERENCES actors(id)

```

In [6]:
%reload_ext sql
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture
import pandas as pd


# II. Query Performance

This question looks at the impacts of **aggregation** and **sorting** on query performance.

## Question 3

Write a query that returns the actor names and the number of times the corresponding name appears in the `actors` relation.

In [9]:
%%sql
-- write your query here --


## Question 4

Write a query that returns the actor IDs and the number of times the corresponding ID appears in the `actors` relation.

In [None]:
%%sql
-- write your query here --


## Question 5

Run `EXPLAIN ANALYZE` on your two queries above. See below for the full question.

If you're having trouble seeing the entirety of the query plan, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

In [10]:
# run this cell to raise the default 10-row display limit to 20
%config SqlMagic.displaylimit = 20


In [13]:
%%sql
-- write your EXPLAIN ANALYZE here --



In [14]:
%%sql
-- write your EXPLAIN ANALYZE here --



<br/><br/>

**(Question, continued)**
Why do you think the `name` query uses a Sequential Scan, whereas the `id` query uses an Index Only Scan?

## Question 6

Write a command that creates an index `name_actor_index` on the `name` attribute of `actors`.

In [15]:
%%sql
-- write your query here --


**(Question, continued)**
Rerun your `EXPLAIN ANALYZE` of your Question 3 query on `name` by copying and pasting it into the cell below. See below for the discussion question.

In [20]:
%%sql
-- write your EXPLAIN ANALYZE here --


QUERY PLAN
GroupAggregate (cost=0.42..34669.07 rows=845888 width=12) (actual time=0.026..250.920 rows=845888 loops=1)
Group Key: id
-> Index Only Scan using actor_pkey on actors (cost=0.42..21980.74 rows=845888 width=4) (actual time=0.020..77.269 rows=845888 loops=1)
Heap Fetches: 0
Planning Time: 0.086 ms
Execution Time: 275.075 ms


**(Question, continued)**
Why does the `name` query now use an Index Only Scan? What index is it using?

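As a rough, self-contained analogy (not PostgreSQL itself), the sketch below uses Python's built-in `sqlite3` module on a tiny made-up `actors` table to show how creating an index on `name` can change the plan an engine reports for a grouped count, while the answer stays the same. SQLite's `EXPLAIN QUERY PLAN` and its planner differ from PostgreSQL's `EXPLAIN ANALYZE`, the toy rows are invented, and the exact plan text depends on the SQLite version, so treat this only as an illustration of the idea.

```python
import sqlite3

# Toy stand-in for the `actors` relation; these rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE actors (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO actors (id, name) VALUES (?, ?)",
    [(1, "Ann Lee"), (2, "Bo Chen"), (3, "Ann Lee"), (4, "Cy Diaz")],
)

QUERY = "SELECT name, COUNT(*) FROM actors GROUP BY name ORDER BY name"


def plan_details(connection, query):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return [row[-1] for row in rows]


plan_before = plan_details(conn, QUERY)
result_before = conn.execute(QUERY).fetchall()

# Same index name as in Question 6, created here on the toy SQLite table.
conn.execute("CREATE INDEX name_actor_index ON actors (name)")

plan_after = plan_details(conn, QUERY)
result_after = conn.execute(QUERY).fetchall()

print("plan before index:", plan_before)
print("plan after index: ", plan_after)
print("same answer:", result_before == result_after)
```

Comparing the two printed plans side by side is the SQLite counterpart of rerunning `EXPLAIN ANALYZE` before and after creating the index.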
## Question 7

Analyze the impact of sorting as follows. Rewrite your query from **Question 4** to return the entries sorted by ID. In other words, run an `EXPLAIN ANALYZE` on a query that returns the actor IDs (**sorted by lowest ID first**) and the number of times the corresponding ID appears in the `actors` relation.

**Discuss**: Do you expect this query to take more time? Why or why not?

In [19]:
%%sql
-- write your query here --


QUERY PLAN
GroupAggregate (cost=0.42..34669.07 rows=845888 width=12) (actual time=0.026..246.978 rows=845888 loops=1)
Group Key: id
-> Index Only Scan using actor_pkey on actors (cost=0.42..21980.74 rows=845888 width=4) (actual time=0.020..75.866 rows=845888 loops=1)
Heap Fetches: 0
Planning Time: 0.070 ms
Execution Time: 270.727 ms

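The plan above shows a `GroupAggregate` fed by an `Index Only Scan` on `actor_pkey`. A loose Python sketch of that idea: when the input already arrives in key order, groups can be counted in one streaming pass, and the output comes out ordered without a separate sort step. The function name and toy IDs are made up for illustration; this is not how PostgreSQL is implemented internally.

```python
from itertools import groupby


def streaming_group_count(sorted_keys):
    """Count occurrences per key in one pass; assumes the input is already sorted."""
    return [(key, sum(1 for _ in group)) for key, group in groupby(sorted_keys)]


# Hypothetical IDs as they might be read in index order (already ascending).
ids_in_index_order = [1, 2, 3, 5, 8]
counts = streaming_group_count(ids_in_index_order)
print(counts)  # [(1, 1), (2, 1), (3, 1), (5, 1), (8, 1)]

# The grouped output is already ordered by ID, so sorting it changes nothing.
print(counts == sorted(counts))  # True
```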

# III. Table Sampling

If you're having trouble seeing the entirety of the query plan, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

Consider the following query which randomly selects 10,000 rows from the `actors` table:
`SELECT *
FROM actors
ORDER BY RANDOM()
LIMIT 10000;`

In [74]:
# run this cell to change default 10-row limit on display
%config SqlMagic.displaylimit = 20

#### ORDER BY sampling

In [26]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors
ORDER BY RANDOM()
LIMIT 10000;

QUERY PLAN
Limit (cost=76228.62..76253.62 rows=10000 width=26) (actual time=226.827..228.484 rows=10000 loops=1)
-> Sort (cost=76228.62..78343.34 rows=845888 width=26) (actual time=226.825..227.945 rows=10000 loops=1)
Sort Key: (random())
Sort Method: top-N heapsort Memory: 2087kB
-> Seq Scan on actors (cost=0.00..15799.60 rows=845888 width=26) (actual time=0.066..88.679 rows=845888 loops=1)
Planning Time: 0.062 ms
Execution Time: 229.689 ms

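The plan above reports `Sort Method: top-N heapsort`: every row gets a random sort key, but only the 10,000 rows with the smallest keys need to be kept. Here is a minimal Python sketch of that idea using `heapq.nsmallest` over invented rows. The function name `order_by_random_limit` and the toy data are assumptions for illustration, not PostgreSQL's code.

```python
import heapq
import random


def order_by_random_limit(rows, limit, seed=None):
    """Tag each row with a random key and keep the `limit` rows with the smallest keys."""
    rng = random.Random(seed)
    keyed = ((rng.random(), row) for row in rows)  # one random key per row, like RANDOM()
    # nsmallest scans every row once while holding only about `limit` candidates.
    smallest = heapq.nsmallest(limit, keyed, key=lambda pair: pair[0])
    return [row for _, row in smallest]


toy_rows = [(i, f"actor_{i}") for i in range(1, 1001)]  # invented (id, name) rows
sample = order_by_random_limit(toy_rows, limit=10, seed=0)
print(len(sample))  # 10
```

Note that every input row is still visited, which matches the `Seq Scan` in the plan reading all 845,888 rows even though only 10,000 are returned.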

## Question 8

Write a query that uses `TABLESAMPLE BERNOULLI` to randomly select 10,000 rows from `actors`, on average. Hint: Use a subquery to compute the total rows in `actors`.

In [22]:
%%sql
-- write your query here --

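To build intuition for row-level (Bernoulli) sampling, here is a small Python simulation rather than a SQL answer: each row is kept independently with probability `target / total`, so the sample has 10,000 rows only on average. The row count 845,888 is taken from the query plans above; the function name and seeds are hypothetical.

```python
import random

TOTAL_ROWS = 845_888   # row count reported in the query plans above
TARGET_ROWS = 10_000


def bernoulli_sample_size(total_rows, fraction, seed):
    """Simulate keeping each row independently with probability `fraction`."""
    rng = random.Random(seed)
    return sum(1 for _ in range(total_rows) if rng.random() < fraction)


fraction = TARGET_ROWS / TOTAL_ROWS
print(f"sampling percentage: {100 * fraction:.4f}%")

sizes = [bernoulli_sample_size(TOTAL_ROWS, fraction, seed) for seed in range(3)]
print(sizes)  # each run lands near 10,000 but rarely on it exactly
```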


## Question 9

Write a query that uses `TABLESAMPLE SYSTEM` to randomly select 10,000 rows from `actors`, on average.
How does this page-level table sampling compare with the Bernoulli table sampling in the previous
question?

In [None]:
%%sql
-- write your query here --

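As a hedged illustration of the difference, the simulation below keeps or drops whole pages (in the spirit of `SYSTEM`) and compares the resulting sample sizes with row-by-row Bernoulli sampling. The value of `ROWS_PER_PAGE` is an assumption chosen for illustration, the function names are hypothetical, and the row count 845,888 comes from the query plans above.

```python
import random
import statistics

TOTAL_ROWS = 845_888   # row count reported in the query plans above
ROWS_PER_PAGE = 128    # assumption for illustration; real pages hold a varying number of rows
FRACTION = 10_000 / TOTAL_ROWS


def system_sample_size(total_rows, rows_per_page, fraction, rng):
    """Keep or drop whole pages; every row on a kept page is in the sample."""
    full_pages, leftover = divmod(total_rows, rows_per_page)
    page_sizes = [rows_per_page] * full_pages + ([leftover] if leftover else [])
    return sum(size for size in page_sizes if rng.random() < fraction)


def bernoulli_sample_size(total_rows, fraction, rng):
    """Keep or drop each row independently."""
    return sum(1 for _ in range(total_rows) if rng.random() < fraction)


rng = random.Random(0)
system_sizes = [system_sample_size(TOTAL_ROWS, ROWS_PER_PAGE, FRACTION, rng) for _ in range(10)]
bernoulli_sizes = [bernoulli_sample_size(TOTAL_ROWS, FRACTION, rng) for _ in range(10)]

print("SYSTEM    mean/stdev:", statistics.mean(system_sizes), statistics.stdev(system_sizes))
print("BERNOULLI mean/stdev:", statistics.mean(bernoulli_sizes), statistics.stdev(bernoulli_sizes))
```

Both methods average about 10,000 rows, but in this sketch the page-level sizes swing much more from run to run, and the sampled rows come in page-sized clusters rather than being spread evenly over the table.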

## Question 10

Run `EXPLAIN ANALYZE` on each of the three sampling methods above. Does this confirm your
understanding of the different methods?

In [9]:
%%sql
-- write your EXPLAIN ANALYZE here --



## Question 11

In the above queries, computing the total rows in `actors` for every table sample takes time. Use
`EXPLAIN ANALYZE` to compare this with using PostgreSQL's estimate of the number of records: <br>
`SELECT reltuples AS estimate FROM pg_class WHERE relname = 'actors';` <br>
Rewrite your table sampling queries above to use this row estimate, and note the speedup in performance.

In [None]:
%%sql
-- write your query here --

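One reason a row estimate can be good enough here: the sampling percentage only needs to be approximately right, since the sample size is random anyway. The arithmetic sketch below uses a hypothetical estimate that is 1% above the 845,888 rows reported in the plans above; the helper name `sampling_percentage` is made up for illustration.

```python
def sampling_percentage(target_rows, row_count):
    """Percentage of rows to sample so that about `target_rows` come back (capped at 100)."""
    return min(100.0, 100.0 * target_rows / row_count)


exact_count = 845_888                        # row count reported in the query plans above
estimated_count = round(exact_count * 1.01)  # hypothetical estimate, 1% too high

pct_exact = sampling_percentage(10_000, exact_count)
pct_estimate = sampling_percentage(10_000, estimated_count)

# Expected sample size if the estimate-based percentage is applied to the real table.
expected_rows = exact_count * pct_estimate / 100
print(f"{pct_exact:.4f}% vs {pct_estimate:.4f}%, expected rows ~ {expected_rows:.0f}")
```

Under this assumed 1% error, the expected sample shrinks by only about 100 rows, which is comparable to the run-to-run variation of Bernoulli sampling itself.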