# Performance: Indexes and Scans

In [1]:
# Run this cell to set up imports
import numpy as np
import pandas as pd

In [2]:
%reload_ext sql

## New IMDB Performance database

This is a variation of the IMDB database with keys defined. Note that this is a fairly large database, so if you run the cells below, remember to drop the `imdb_perf_lecture` database afterwards to save space on your limited PostgreSQL server.

In [3]:
!unzip -u data/imdb_perf_lecture.zip -d data/

Archive:  data/imdb_perf_lecture.zip


In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'
!psql -h localhost -c 'CREATE DATABASE imdb_perf_lecture' 
!psql -h localhost -d imdb_perf_lecture -f data/imdb_perf_lecture.sql

In [None]:
%sql postgresql://127.0.0.1:5432/imdb_perf_lecture

## Display indexes

In [None]:
%sqlcmd tables

In [None]:
%sqlcmd columns -t actors

The psql meta-command `\d <relation>` lists the indexes defined on the `<relation>` table.

You can also look in the system view `pg_indexes` ([documentation 54.11](https://www.postgresql.org/docs/current/view-pg-indexes.html)):

In [None]:
%%sql
SELECT *
FROM pg_indexes
WHERE schemaname = 'public';

Read the `indexdef` as: the Actor relation has an index named `actor_pkey`, which is created on the attribute `id`. Here, `id` is also the **primary key** of the Actor relation, which is why it has an index. More on why primary keys automatically generate indexes in a bit.

# `EXPLAIN ANALYZE`

This query seems like it runs pretty quickly:

In [None]:
%%sql
SELECT * FROM Actors WHERE id = 23456;

The PostgreSQL command `EXPLAIN ANALYZE` actually executes a statement and displays its **execution plan** together with the actual run time statistics. This is useful for understanding what the query is really doing.

In [None]:
%%sql
EXPLAIN ANALYZE SELECT * FROM Actors WHERE id = 23456;

Try visualizing this on https://explain.dalibo.com/

<br/>

By contrast, the below query on `cast_info` runs quite slowly. Why?

In [None]:
%%sql
EXPLAIN ANALYZE SELECT * FROM Cast_info WHERE person_id = 23456;

<br/>

Explanation: `cast_info` does **not have an index** on `person_id`!
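
To make the difference concrete, here is a minimal Python sketch (not PostgreSQL's implementation) of the same lookup with and without an index. The toy `rows` list, the sorted `index` of `(key, position)` pairs, and the function names are all hypothetical stand-ins: the point is only that a scan without an index must examine every row, while a sorted index lets us jump straight to the matches.

```python
import bisect

# Toy "table": (person_id, movie_id) rows stored in no particular order.
# 7919 is coprime with 1000, so each person_id in 0..999 appears exactly once.
rows = [(n * 7919 % 1000, n) for n in range(1000)]

def seq_scan(rows, target):
    """No index: examine every row and keep the matches."""
    examined = 0
    matches = []
    for row in rows:
        examined += 1
        if row[0] == target:
            matches.append(row)
    return matches, examined

# Toy "index": (key, row position) pairs kept sorted by key.
index = sorted((row[0], pos) for pos, row in enumerate(rows))
keys = [key for key, _ in index]

def index_lookup(rows, index, keys, target):
    """With an index: binary-search the sorted keys, then fetch only matching rows."""
    lo = bisect.bisect_left(keys, target)
    hi = bisect.bisect_right(keys, target)
    matches = [rows[pos] for _, pos in index[lo:hi]]
    return matches, hi - lo  # number of rows fetched from the table

seq_matches, seq_examined = seq_scan(rows, 456)
idx_matches, idx_fetched = index_lookup(rows, index, keys, 456)
print(seq_examined, idx_fetched)
```

Both functions return the same rows; only the amount of work differs.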


# Creating new Indexes

In the Actors table, `name` is not a primary key. What kind of scan do you think the following query will produce?

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM Actors WHERE name = 'Tom Hanks';

We can manually create an index on any attribute, even one that is not a primary key. Below, we create a multicolumn (composite) index just to show you the syntax:

In [None]:
%sql CREATE INDEX nameIdIndex ON actors(name,id);

This makes our original query much faster:

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE name = 'Tom Hanks';

Why "Index Only" Scan? PostgreSQL's query planner correctly identified that there are only two attributes in the Actors table, and both are stored in the index. So we just need to search the index; we don't need to additionally fetch any records from the table.
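
The idea can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not how PostgreSQL stores indexes: the four-row `heap` and the `name_id_index` list are made up, and the sketch ignores the visibility checks that can still force a real index-only scan to visit the table.

```python
import bisect

# Hypothetical table contents: (id, name) rows in arbitrary physical order.
heap = [(3, "Ann Lee"), (1, "Tom Hanks"), (2, "Bo Chen"), (4, "Tom Hanks")]

# Composite index on (name, id): entries sorted by name, then by id.
name_id_index = sorted((name, id_) for id_, name in heap)

def index_only_lookup(index, name):
    """Answer the query from index entries alone; `heap` is never read."""
    # (name,) sorts just before every (name, id) entry with the same name,
    # so bisect_left lands on the first entry for this name.
    start = bisect.bisect_left(index, (name,))
    result = []
    for entry_name, entry_id in index[start:]:
        if entry_name != name:
            break  # entries for one name are contiguous in the sorted index
        result.append((entry_id, entry_name))
    return result

print(index_only_lookup(name_id_index, "Tom Hanks"))
```

Because every requested column is already present in the index entries, the lookup function never needs `heap` as an argument.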

# Exercise: Types of Scans

PostgreSQL's query planner automatically decides whether an index scan is worth it. Sometimes it chooses a sequential scan instead, or even a bitmap heap scan.

<br/>

The below exact match lookup produces an Index Scan:

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id = 23456;


This range lookup **also** produces an Index Scan:

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE 23456 <= id AND id < 23500;


However, the below range lookup produces a **Sequential scan**!

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456;

<br/>

And this other range lookup produces a **Bitmap Heap Scan**??

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE 5 <= id AND id < 23457;


* Index scan:
    * For each index key match, there is a page fetch.
    * If multiple index key matches all correspond to a single page, that single page may get fetched multiple times.
* Sequential scan:
    * Once each page is loaded in, all records on that page are scanned in sequence.
* Bitmap heap scan:
    * Pre-scans the index to identify the unique pages to visit, then sequentially scans that subset of pages.
    * More here: [stackoverflow](https://stackoverflow.com/questions/6592626/what-is-a-bitmap-heap-scan-in-a-query-plan)
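
The three strategies above can be compared with a small page-fetch simulation. This is a rough sketch under stated assumptions: a hypothetical table of 1000 rows with 10 rows per page, rows placed on pages in shuffled order, and no buffer cache (in a real system a repeated fetch of the same page may be served from memory).

```python
import random

ROWS_PER_PAGE = 10  # assumed page capacity for this toy table
random.seed(0)

# Physical layout: slot i of the table holds row id ids[i], in shuffled order,
# so the order on disk does not match the order in the index.
ids = list(range(1000))
random.shuffle(ids)
page_of = {row_id: slot // ROWS_PER_PAGE for slot, row_id in enumerate(ids)}
n_pages = len(ids) // ROWS_PER_PAGE

def index_scan_fetches(lo, hi):
    """One page fetch per matching index entry, visited in index (id) order."""
    return [page_of[i] for i in range(lo, hi)]

def bitmap_heap_scan_fetches(lo, hi):
    """Collect the distinct pages first, then visit each once in physical order."""
    return sorted({page_of[i] for i in range(lo, hi)})

def seq_scan_fetches():
    """Read every page of the table once, in physical order."""
    return list(range(n_pages))

print(len(index_scan_fetches(100, 400)),
      len(bitmap_heap_scan_fetches(100, 400)),
      len(seq_scan_fetches()))
```

For a wide range, the index scan issues one fetch per match and revisits pages, the bitmap heap scan touches each needed page once, and the sequential scan always reads the whole table.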
    
<br/><br/><br/><br/>

Takeaway:
* There is no guarantee that records on disk are sorted in the same way as records in the index.
* Therefore index lookups are effectively random lookups! Many random lookups are typically more expensive than many sequential lookups!
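
A toy cost comparison shows why the planner's choice flips as a range gets wider. The two constants are PostgreSQL's default values for the `seq_page_cost` and `random_page_cost` settings; everything else is a deliberate oversimplification (the real planner also accounts for CPU costs, index pages, physical correlation, and bitmap scans), and the 1000-page table is hypothetical.

```python
SEQ_PAGE_COST = 1.0     # PostgreSQL default for seq_page_cost
RANDOM_PAGE_COST = 4.0  # PostgreSQL default for random_page_cost

def seq_scan_cost(n_pages):
    """Read every page of the table once, sequentially."""
    return n_pages * SEQ_PAGE_COST

def index_scan_cost(n_matches):
    """Worst case: each matching row costs one random page fetch."""
    return n_matches * RANDOM_PAGE_COST

def cheaper_plan(n_pages, n_matches):
    """Pick the plan with the lower estimated cost in this toy model."""
    if index_scan_cost(n_matches) < seq_scan_cost(n_pages):
        return "index scan"
    return "sequential scan"

# Hypothetical 1000-page table: a narrow range vs. a wide range.
print(cheaper_plan(1000, 44))
print(cheaper_plan(1000, 500))
```

In this model the index scan wins only while the number of matches stays below a quarter of the page count, which is the intuition behind the planner switching to a sequential scan for wide ranges.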

<br/>
<br/>
Other range lookups for your practice:

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 AND id < 23500;

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 AND id < 23457;

In [None]:
%sql EXPLAIN ANALYZE SELECT * FROM actors WHERE id >= 23456 OR id < 23457;

# Cleanup

We drop the newly created index just to clean things up:

In [None]:
%sql DROP INDEX nameIdIndex;

And we close the connection, then drop the database:

In [None]:
%sql --close postgresql://127.0.0.1:5432/imdb_perf_lecture

In [None]:
!psql -h localhost -c 'DROP DATABASE IF EXISTS imdb_perf_lecture'