# Lecture 08

In [15]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, remember to drop the `imdb_perf_lecture` database afterwards (see Cleanup at the end) to save space on your limited PostgreSQL server.

We assume you have already pulled the associated lecture folder `lec06` into your repo. The commands below create a symbolic link (i.e., a shortcut/redirect made with `ln`) to that lecture's data directory, which saves space by not copying the data, and then unzip the database file.

In [3]:
!ln -sf ../../lec/lec06/data .
!unzip -u data/imdb_perf_lecture.zip -d data/

Archive:  data/imdb_perf_lecture.zip


In [2]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql

NOTICE:  database "imdb_perf_lecture" does not exist, skipping
DROP DATABASE
CREATE DATABASE
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
CREATE TABLE
ALTER TABLE
COPY 845888
COPY 2211936
COPY 656453
ALTER TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE


## Start `jupysql`

In [4]:
%reload_ext sql



In [5]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

If you're having trouble seeing query plans in their entirety, run the following cell to raise the limit on displayed rows to 20. **Careful**: Do not set this to `None` and then run the actual queries; they return millions of rows and will crash your kernel!

In [10]:
# run this cell to raise the display limit from 10 rows to 20
%config SqlMagic.displaylimit = 20

# Matching




<div class="alert alert-success">
It is much easier to see query plans in <b>psql</b>!<br/>
<code>jupysql</code> dataframe visualization removes any whitespace.
</div>

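You can also run the following right after each query cell:
```
# `_` holds the output of the most recently executed cell,
# so this only works directly after the %%sql cell.
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
```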

In [8]:
%%sql
/* 1 */
EXPLAIN ANALYZE
SELECT id FROM actors
WHERE id > 4000000 AND
name='Tom Hanks';

QUERY PLAN
Gather (cost=1000.00..11512.90 rows=1 width=4) (actual time=814.540..816.267 rows=0 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on actors (cost=0.00..10512.80 rows=1 width=4) (actual time=775.270..775.271 rows=0 loops=3)
Filter: ((id > 4000000) AND (name = 'Tom Hanks'::text))
Rows Removed by Filter: 281963
Planning Time: 117.484 ms
Execution Time: 816.314 ms
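
Each plan node reports two pairs of numbers: `cost=startup..total` is the planner's estimate in its own arbitrary units, and `actual time=startup..total` is measured in milliseconds, averaged per loop. As a rough aid for comparing plans, here is a minimal sketch that pulls these numbers out of one plan line. The helper `parse_plan_line` is hypothetical (it is not part of `psql` or `jupysql`), and it assumes the text layout shown in the output above.

```python
import re

# Hypothetical helper for illustration; not part of psql or jupysql.
PLAN_NODE = re.compile(
    r"cost=(?P<startup_cost>\d+\.\d+)\.\.(?P<total_cost>\d+\.\d+) "
    r"rows=(?P<est_rows>\d+) width=\d+\) "
    r"\(actual time=(?P<startup_ms>\d+\.\d+)\.\.(?P<total_ms>\d+\.\d+) "
    r"rows=(?P<actual_rows>\d+) loops=(?P<loops>\d+)\)"
)

def parse_plan_line(line):
    """Return the estimated and measured numbers of one plan node, or None."""
    match = PLAN_NODE.search(line)
    if match is None:
        return None  # e.g. "Filter: ..." or "Workers Planned: 2" lines
    return {name: float(value) for name, value in match.groupdict().items()}

# The Gather node from query 1 above.
gather = parse_plan_line(
    "Gather (cost=1000.00..11512.90 rows=1 width=4) "
    "(actual time=814.540..816.267 rows=0 loops=1)"
)
print(gather["total_cost"], gather["total_ms"], gather["actual_rows"])
```

Comparing `est_rows` against `actual_rows` is a quick way to spot where the planner's estimate was off.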


In [11]:
%%sql
/* 2 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY name
LIMIT 10;

QUERY PLAN
Limit (cost=17366.94..17368.11 rows=10 width=18) (actual time=117.328..119.129 rows=10 loops=1)
-> Gather Merge (cost=17366.94..99611.72 rows=704906 width=18) (actual time=117.327..119.126 rows=10 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Sort (cost=16366.92..17248.05 rows=352453 width=18) (actual time=114.770..114.771 rows=7 loops=3)
Sort Key: name
Sort Method: top-N heapsort Memory: 26kB
Worker 0: Sort Method: top-N heapsort Memory: 26kB
Worker 1: Sort Method: top-N heapsort Memory: 26kB
-> Parallel Seq Scan on actors (cost=0.00..8750.53 rows=352453 width=18) (actual time=0.023..40.436 rows=281963 loops=3)


In [None]:
%%sql
/* 3 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY id
LIMIT 10;

## Two-table demo: LIMIT

Let's join two tables, `actors` and `cast_info`. The query planner selects a hash join:

In [17]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;

QUERY PLAN
Hash Join (cost=29215.48..89168.21 rows=2211936 width=26) (actual time=177.035..1342.758 rows=2211936 loops=1)
Hash Cond: (cast_info.person_id = actors.id)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.054..144.130 rows=2211936 loops=1)
-> Hash (cost=13684.88..13684.88 rows=845888 width=18) (actual time=176.783..176.784 rows=845888 loops=1)
Buckets: 65536 Batches: 16 Memory Usage: 3114kB
-> Seq Scan on actors (cost=0.00..13684.88 rows=845888 width=18) (actual time=0.019..59.147 rows=845888 loops=1)
Planning Time: 0.211 ms
Execution Time: 1407.911 ms


In [18]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

AttributeError: 'str' object has no attribute 'DataFrame'

<br/><br/>

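Below, we add `LIMIT`. Note the query planner switches to a nested loop join, using an index scan to match `cast_info.person_id` to the indexed attribute `actors.id`! The planner's estimated total cost drops from about 89,168 to 4.48, a reduction of more than 10,000x. In the runs shown here, the measured time drops from about 1,343 ms to about 129 ms, roughly 10x.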

In [19]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id
LIMIT 10;

QUERY PLAN
Limit (cost=0.43..4.48 rows=10 width=26) (actual time=37.726..128.934 rows=10 loops=1)
-> Nested Loop (cost=0.43..895035.54 rows=2211936 width=26) (actual time=37.725..128.928 rows=10 loops=1)
-> Seq Scan on cast_info (cost=0.00..31907.36 rows=2211936 width=8) (actual time=0.048..0.054 rows=10 loops=1)
-> Memoize (cost=0.43..0.47 rows=1 width=18) (actual time=12.884..12.884 rows=1 loops=10)
Cache Key: cast_info.person_id
Cache Mode: logical
Hits: 2 Misses: 8 Evictions: 0 Overflows: 0 Memory Usage: 1kB
-> Index Scan using actor_pkey on actors (cost=0.42..0.46 rows=1 width=18) (actual time=16.101..16.101 rows=1 loops=8)
Index Cond: (id = cast_info.person_id)
Planning Time: 0.155 ms
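
Why does `LIMIT` change the join choice? A hash join must read all of `actors` to build its hash table before it can emit a single row, whereas an index nested loop can stop as soon as it has produced 10 rows. The toy sketch below mimics both strategies with Python generators. The tables are made up, a `dict` stands in for the primary-key index, and a small cache plays the role of the `Memoize` node; none of this is PostgreSQL's actual code.

```python
from itertools import islice

# Made-up toy tables; the real ones hold 845,888 and 2,211,936 rows.
actors = [(i, f"actor_{i}") for i in range(1, 2501)]        # (id, name)
cast_info = [(k // 2 + 1, 100 + k) for k in range(5000)]    # (person_id, movie_id)

def hash_join(actors, cast_info, stats):
    """Build a hash table on every actors row, then probe it with cast_info."""
    table = {}
    for actor_id, name in actors:
        stats["actors_rows_read"] += 1
        table[actor_id] = name
    for person_id, movie_id in cast_info:
        if person_id in table:
            yield (person_id, table[person_id], movie_id)

def index_nested_loop(actor_index, cast_info, stats):
    """For each cast_info row, fetch the matching actor through an index."""
    cache = {}  # plays the role of the Memoize node
    for person_id, movie_id in cast_info:
        if person_id in cache:
            stats["cache_hits"] += 1
        else:
            stats["index_lookups"] += 1
            cache[person_id] = actor_index.get(person_id)
        name = cache[person_id]
        if name is not None:
            yield (person_id, name, movie_id)

# A dict stands in for the existing primary-key index on actors.id.
actor_index = dict(actors)

hash_stats = {"actors_rows_read": 0}
loop_stats = {"cache_hits": 0, "index_lookups": 0}
limit = 10
via_hash = list(islice(hash_join(actors, cast_info, hash_stats), limit))
via_loop = list(islice(index_nested_loop(actor_index, cast_info, loop_stats), limit))

assert via_hash == via_loop
print(hash_stats)  # every actors row is read before the first output row
print(loop_stats)  # only the first 10 cast_info rows are ever touched
```

Both strategies return the same 10 rows; the difference is how much input each one has to touch before it can stop.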


In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

## Two-table demo: Projection

<div class="alert alert-success">
It is much easier to see query plans in <b>psql</b>!<br/>
<code>jupysql</code> dataframe visualization removes any whitespace.
</div>

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>

The query below does not give as substantial a reduction, but it is still about a quarter speed-up.
* Notice that the projection was pushed down below the join, “at source”.
* If we waited until the join was done before projecting, it would be at least as expensive.
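
To make the "at source" idea concrete, here is a small sketch of projecting before versus after a join. The rows are made up, and `bio` and `role` are hypothetical extra columns added only so that the inputs are wider than what the query asks for; `project` and `hash_join` are illustrative helpers, not database internals.

```python
# Made-up rows; "bio" and "role" are hypothetical extra columns added only
# to make the rows wider than what the query needs.
actors = [{"id": i, "name": f"actor_{i}", "bio": "x" * 100} for i in range(1, 6)]
cast_info = [{"person_id": i % 5 + 1, "movie_id": 100 + i, "role": "r"} for i in range(10)]

def project(rows, columns):
    """Keep only the listed columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]

def hash_join(left, right, left_key, right_key):
    """Equi-join two lists of dict rows using a hash table on the left input."""
    table = {}
    for l_row in left:
        table.setdefault(l_row[left_key], []).append(l_row)
    return [
        {**l_row, **r_row}
        for r_row in right
        for l_row in table.get(r_row[right_key], [])
    ]

wanted = ["name", "movie_id"]

# Project last: the join carries every column of both inputs.
wide = hash_join(actors, cast_info, "id", "person_id")
late = project(wide, wanted)

# Project "at source": each scan keeps only the join key and requested columns.
narrow = hash_join(
    project(actors, ["id", "name"]),
    project(cast_info, ["person_id", "movie_id"]),
    "id",
    "person_id",
)
early = project(narrow, wanted)

assert early == late
print(len(wide[0]), len(narrow[0]))  # columns carried through the join
```

The final answers match, but the early projection moves narrower rows through the hash table and the join.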

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id AND actors.id > 4000000;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

## Three-way joins

In [None]:
%%sql 
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>
Below, note the predicate pushdown in the sequential scan on actors! Again, copy-paste into `psql` if you can't see the whitespace formatting.

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND name = 'Tom Hanks';

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
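
As a toy illustration of predicate pushdown, the sketch below filters on the actor name either after or before a three-way join and counts how many rows the join has to produce in each case. The tables and ids are tiny made-up data, and `join3`, `filter_after_join`, and `filter_before_join` are hypothetical helper names.

```python
# Tiny made-up tables with made-up ids, for illustration only.
actors = [(1, "Tom Hanks"), (2, "Meg Ryan"), (3, "Samuel L. Jackson")]
movies = [(10, "Big"), (11, "Sleepless in Seattle"), (12, "Snakes on a Plane")]
cast_info = [(1, 10), (1, 11), (2, 11), (3, 12)]  # (person_id, movie_id)

def join3(actors, cast_info, movies):
    """Join on actors.id = person_id and movies.id = movie_id."""
    titles = dict(movies)
    return [
        (name, titles[movie_id])
        for actor_id, name in actors
        for person_id, movie_id in cast_info
        if person_id == actor_id and movie_id in titles
    ]

def filter_after_join(name):
    """Join everything first, then keep only rows for the given actor."""
    joined = join3(actors, cast_info, movies)
    return len(joined), [row for row in joined if row[0] == name]

def filter_before_join(name):
    """Apply the predicate while scanning actors, then join what is left."""
    selected = [row for row in actors if row[1] == name]  # pushed-down predicate
    joined = join3(selected, cast_info, movies)
    return len(joined), joined

rows_joined_late, result_late = filter_after_join("Tom Hanks")
rows_joined_early, result_early = filter_before_join("Tom Hanks")

assert result_late == result_early
print(rows_joined_late, rows_joined_early)  # join output sizes before filtering
```

Pushing the filter into the scan shrinks the join input, so less work is spent on rows that would be thrown away anyway.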

<br/><br/>

Compare with the below predicate pushdown, where the filter is now on movie titles:

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND title LIKE 'Snakes on a Plane';

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

# Three-way joins with Indexes

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>
What if we dropped one of the indexes?

To do so, we must drop the primary key constraint on `actors.id`, which also drops the index backing it:

In [None]:
%sql ALTER TABLE actors DROP CONSTRAINT actor_pkey CASCADE;

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
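
Dropping the primary key constraint also drops the index that backed it, so a lookup by `actors.id` can no longer be an index probe. The sketch below contrasts the two kinds of lookup on made-up ids, using binary search over a sorted Python list as a simple stand-in for a B-tree probe. `seq_scan_lookup` and `index_lookup` are hypothetical names, and real index behavior is more involved than this.

```python
from bisect import bisect_left

# Made-up sorted ids standing in for actors.id.
ids = list(range(1, 100_001))

def seq_scan_lookup(ids, target):
    """Check rows one by one; return (position, rows_examined)."""
    for examined, value in enumerate(ids, start=1):
        if value == target:
            return examined - 1, examined
    return None, len(ids)

def index_lookup(ids, target):
    """Binary search over sorted ids; return the position or None."""
    pos = bisect_left(ids, target)
    if pos < len(ids) and ids[pos] == target:
        return pos
    return None

position, rows_examined = seq_scan_lookup(ids, 90_000)
assert position == index_lookup(ids, 90_000)
print(rows_examined)  # rows a sequential lookup had to examine
```

Binary search needs only about 17 comparisons for 100,000 sorted ids, which is why repeating a sequential lookup inside a nested loop becomes unattractive once the index is gone.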

<br/><br/>
What if we dropped both indexes?

In [None]:
%sql ALTER TABLE movies DROP CONSTRAINT movie_pkey CASCADE;

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

# Cleanup

We close the connection, then drop the database:

In [None]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'