# Lecture 08

In [None]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

## Load in the IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a pretty big database! If you run the lines below, please remember to drop the `imdb_perf_lecture` database afterwards (see Cleanup at the end) to save space on your limited PostgreSQL server.

We assume you have already pulled the associated lecture folder `lec06` into your repo. The commands below create a symbolic link (i.e., a shortcut/redirect made with `ln`) to this lecture data directory, which saves some space, and then unzip the database file.

In [None]:
!ln -sf ../../lec/lec06/data .
!unzip -u data/imdb_perf_lecture.zip -d data/

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql

## Start `jupysql`

In [None]:
%reload_ext sql

In [None]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

If you're having trouble seeing the entirety of query plans, you can run the following cell to set the limit on displayed rows to 20. **Careful**: Do not set this to `None` and run the actual queries; SQL will return millions of rows and crash your kernel!

In [None]:
# run this cell to raise the display limit from the default 10 rows to 20
%config SqlMagic.displaylimit = 20

# Matching




<div class="alert alert-success">
It is much easier to see query plans in <b>psql</b>!<br/>
<code>jupysql</code> dataframe visualization removes any whitespace.
</div>

You can also run the following after each query cell to print the plan as plain text:
```
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
```
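
To compare plans by number rather than by eye, the timing summary at the end of an `EXPLAIN ANALYZE` plan can be pulled out with a small helper. This is a minimal sketch: `plan_times` and `example_plan` are hypothetical names, and the plan text is made up for illustration rather than taken from the lecture database.

```python
import re

# Matches the summary lines PostgreSQL appends to EXPLAIN ANALYZE output,
# e.g. "Planning Time: 0.210 ms" or "Execution Time: 1534.882 ms".
_TIME_RE = re.compile(r"^\s*(Planning|Execution) time:\s*([\d.]+)\s*ms", re.IGNORECASE)

def plan_times(plan_lines):
    """Return the planning/execution times (in ms) found in a list of plan lines."""
    times = {}
    for line in plan_lines:
        match = _TIME_RE.match(line)
        if match:
            times[match.group(1).lower()] = float(match.group(2))
    return times

# Hypothetical plan text, invented for this example.
example_plan = [
    "Limit  (cost=0.00..1.00 rows=10 width=4) (actual time=0.010..0.020 rows=10 loops=1)",
    "  ->  Seq Scan on actors  (cost=0.00..100.00 rows=1000 width=4)",
    "Planning Time: 0.150 ms",
    "Execution Time: 0.045 ms",
]
print(plan_times(example_plan))  # -> {'planning': 0.15, 'execution': 0.045}
```

In the notebook, the plan lines would most likely come from the `QUERY PLAN` column of the result dataframe, e.g. `plan_times(result["QUERY PLAN"])`. Dividing the execution times of two plans gives the speedup factor.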

In [None]:
%%sql
/* 1 */
EXPLAIN ANALYZE
SELECT id FROM actors
WHERE id > 4000000 AND
name='Tom Hanks';

In [None]:
%%sql
/* 2 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY name
LIMIT 10;

In [None]:
%%sql
/* 3 */
EXPLAIN ANALYZE
SELECT id FROM actors
ORDER BY id
LIMIT 10;

## Two-table demo: LIMIT

Let's join two tables, `actors` and `cast_info`. The query planner selects a hash join:

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>

Below, we add `LIMIT`. Note that the query planner switches to a nested loop join, using an index scan to match `cast_info.person_id` against the indexed attribute `actors.id`. Because only 10 rows are needed, execution can stop as soon as they are found, which results in roughly a 10,000x speedup!

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info
WHERE actors.id = cast_info.person_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
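
Why does `LIMIT` change the preferred join? The sketch below is a toy model in plain Python, not PostgreSQL's implementation. It assumes `cast_info` is the outer side and that a dictionary can stand in for the existing primary-key index on `actors.id`. The tables, `rows_read` counter, and function names are all made up for illustration.

```python
from itertools import islice

# Toy tables, invented for this example (not the IMDB data).
actors = [{"id": i, "name": f"actor{i}"} for i in range(1000)]
cast_info = [{"person_id": i % 1000, "movie_id": i} for i in range(5000)]

rows_read = {"hash": 0, "nested_loop": 0}

def hash_join(inner, outer):
    """Build a hash table on ALL of `inner` first, then stream `outer` past it."""
    table = {}
    for row in inner:                       # blocking build phase
        rows_read["hash"] += 1
        table[row["id"]] = row
    for row in outer:                       # probe phase
        rows_read["hash"] += 1
        match = table.get(row["person_id"])
        if match is not None:
            yield {**match, **row}

def index_nested_loop_join(outer, index):
    """For each outer row, fetch its match through an index that already exists."""
    for row in outer:
        rows_read["nested_loop"] += 1
        match = index.get(row["person_id"])  # stand-in for an index scan on actors.id
        if match is not None:
            rows_read["nested_loop"] += 1
            yield {**match, **row}

# Stand-in for the primary-key index, which exists before the query runs.
actors_pk_index = {row["id"]: row for row in actors}

# LIMIT 10: stop pulling rows from the join after the first ten.
first_10_hash = list(islice(hash_join(actors, cast_info), 10))
first_10_nl = list(islice(index_nested_loop_join(cast_info, actors_pk_index), 10))

print(rows_read)  # -> {'hash': 1010, 'nested_loop': 20}
```

The hash join pays for its whole build phase before it can emit a single row, while the index nested loop join touches only the rows it needs. Without `LIMIT`, every row must be produced anyway, so the one-time build cost is amortized and the hash join tends to win.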

## Two-table demo: Projection

<div class="alert alert-success">
It is much easier to see query plans in <b>psql</b>!<br/>
<code>jupysql</code> dataframe visualization removes any whitespace.
</div>

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>

The query below does not give as substantial a reduction, but it is still a speed-up of about a quarter.
* Notice that the projection was pushed down below the join, “at the source”.
* If we waited until the join was done, the projection would be at least as expensive.

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT name, movie_id
FROM actors, cast_info
WHERE actors.id = cast_info.person_id AND actors.id > 4000000;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
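
The effect of pushing a projection below the join can be sketched in plain Python. This is a toy model under stated assumptions: the `extra_a`, `extra_b`, `role`, and `note` columns are hypothetical stand-ins for attributes the query never uses, and `project` and `join` are illustrative helpers, not planner internals.

```python
# Toy rows, invented for this example; the extra columns are hypothetical.
actors = [{"id": i, "name": f"actor{i}", "extra_a": "x", "extra_b": "y"} for i in range(3)]
cast_info = [{"person_id": i % 3, "movie_id": 100 + i, "role": "r", "note": "n"} for i in range(6)]

def project(rows, columns):
    """Keep only the named columns of each row."""
    return [{c: row[c] for c in columns} for row in rows]

def join(left, right):
    """Simple hash join on left.id = right.person_id."""
    table = {row["id"]: row for row in left}
    return [{**table[r["person_id"]], **r} for r in right if r["person_id"] in table]

# Late projection: join full-width rows, then discard the unused columns.
wide = join(actors, cast_info)
late = project(wide, ["name", "movie_id"])

# Early projection: keep only join keys and output columns at each scan.
narrow = join(
    project(actors, ["id", "name"]),
    project(cast_info, ["person_id", "movie_id"]),
)
early = project(narrow, ["name", "movie_id"])

assert early == late               # same answer either way
print(len(wide[0]), len(narrow[0]))  # -> 8 4
```

Both strategies return the same rows, but the early projection carries half as many fields per row through the join in this toy case. Narrower rows mean less memory for the hash table and less data to copy.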

## Three-way joins

In [None]:
%%sql 
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>
Below, note the predicate pushdown in the sequential scan on actors! Again, copy-paste into `psql` if you can't see the whitespace formatting.

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND name = 'Tom Hanks';

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)
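
Predicate pushdown can be illustrated the same way. The sketch below is a toy model with made-up data, where `join` is a hypothetical helper rather than anything from PostgreSQL. It compares filtering on `name` after the join with filtering `actors` first.

```python
# Toy tables, invented for this example (not the IMDB data).
actors = [{"id": i, "name": "Tom Hanks" if i == 7 else f"actor{i}"} for i in range(100)]
cast_info = [{"person_id": i % 100, "movie_id": i} for i in range(1000)]

def join(left, right):
    """Simple hash join on left.id = right.person_id."""
    table = {row["id"]: row for row in left}
    return [{**table[r["person_id"]], **r} for r in right if r["person_id"] in table]

# Filter applied AFTER the join: every cast_info row is joined first.
joined_all = join(actors, cast_info)
late = [row for row in joined_all if row["name"] == "Tom Hanks"]

# Filter pushed DOWN to the scan of actors: only matching actors enter the join.
tom = [row for row in actors if row["name"] == "Tom Hanks"]
early = join(tom, cast_info)

assert early == late                 # same answer either way
print(len(joined_all), len(early))   # -> 1000 10
```

Filtering first shrinks the join input from 100 actors to 1, so the join materializes 10 rows instead of 1000 before any are thrown away.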

<br/><br/>

Compare with the predicate pushdown below, where the filter is now on movie titles:

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
    AND title LIKE 'Snakes on a Plane';

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

# Three-way joins with Indexes

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>
What if we dropped one of the indexes?

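To do so, we must drop the primary key constraint on `actors.id`, since the index exists to enforce that constraint. `CASCADE` also drops any foreign key constraints that depend on it: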

In [None]:
%sql ALTER TABLE actors DROP CONSTRAINT actor_pkey CASCADE;

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

<br/><br/>
What if we dropped both indexes?

In [None]:
%sql ALTER TABLE movies DROP CONSTRAINT movie_pkey CASCADE;

In [None]:
%%sql
EXPLAIN ANALYZE
SELECT *
FROM actors, cast_info, movies
WHERE actors.id = cast_info.person_id
    AND movies.id = cast_info.movie_id
LIMIT 10;

In [None]:
result = _.DataFrame()
result.style.set_properties(**{'text-align': 'left'})
print(result)

# Cleanup

We close the connection, then drop the database:

In [None]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'