
# Lecture 5 - Optimizing - CS50 SQL (Harvard)


Let us open up this database called movies.db in SQLite.

In [1]:
# Let's import the required libraries
import sqlite3
import pandas as pd
import time

# Let's connect to the SQLite database used in CS50
conn = sqlite3.connect("movies.db")

In [2]:
df=pd.read_sql_query(
    """
    SELECT *
    FROM sqlite_master;
    """,
    conn
)
df

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movies,movies,2,"CREATE TABLE ""movies"" (\n ""id"" INTEGER,\n ..."
1,table,people,people,3,"CREATE TABLE ""people"" (\n ""id"" INTEGER,\n ..."
2,table,ratings,ratings,4,"CREATE TABLE ""ratings"" (\n ""id"" INTEGER,\n ..."
3,index,sqlite_autoindex_ratings_1,ratings,5,
4,table,stars,stars,6,"CREATE TABLE ""stars"" (\n ""movie_id"" INTEGER..."
5,index,sqlite_autoindex_stars_1,stars,7,
6,index,person_index,stars,25560,"CREATE INDEX ""person_index"" \nON ""stars"" (""per..."
7,index,name_index,people,29934,"CREATE INDEX ""name_index"" \nON ""people"" (""name"")"
8,index,recents,movies,37210,"CREATE INDEX ""recents"" ON ""movies"" (""titles"")\..."


In [3]:
for sql in df["sql"]:
    print(sql)

CREATE TABLE "movies" (
    "id" INTEGER,
    "title" TEXT NOT NULL,
    "year" NUMERIC,
    PRIMARY KEY("id")
)
CREATE TABLE "people" (
    "id" INTEGER,
    "name" TEXT NOT NULL,
    "birth" NUMERIC,
    PRIMARY KEY("id")
)
CREATE TABLE "ratings" (
    "id" INTEGER,
    "movie_id" INTEGER UNIQUE,
    "rating" REAL NOT NULL,
    "votes" INTEGER NOT NULL,
    PRIMARY KEY("id"),
    FOREIGN KEY("movie_id") REFERENCES "movies"("id")
)
None
CREATE TABLE "stars" (
    "movie_id" INTEGER,
    "person_id" INTEGER,
    PRIMARY KEY("movie_id", "person_id"),
    FOREIGN KEY("movie_id") REFERENCES "movies"("id"),
    FOREIGN KEY("person_id") REFERENCES "people"("id")
)
None


In [19]:
movies=pd.read_sql_query(
    """
    SELECT *
    FROM "movies"
    """,
    conn
)
movies

Unnamed: 0,id,title,year
0,11801,Tötet nicht mehr,2019
1,13274,Istoriya grazhdanskoy voyny,2021
2,15414,La tierra de los toros,2000
3,15724,Dama de noche,1993
4,31458,El huésped del sevillano,1970
...,...,...,...
419001,27581096,Helt Kul - 20km Hofman - 2011,2011
419002,27583866,All the Cool One's Men,1982
419003,27584383,Leaving California: The Untold Story,2023
419004,27585151,Bloodless Massacres,2020


In [21]:
print("number of rows in movies:", len(movies))

number of rows in movies: 419006


To find the information pertaining to the movie Cars, we would run the following query:

In [24]:
start = time.perf_counter()
pd.read_sql_query(
    """
    SELECT *
    FROM "movies"
    WHERE "title" = 'Cars';
    """,
    conn
)
end = time.perf_counter()
print(f"Execution time: {end - start:.6f} seconds")

Execution time: 0.272949 seconds


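The `sqlite3` command-line shell has a dot-command, `.timer on`, that reports how long each query takes. In this notebook we time queries with `time.perf_counter()` instead.

Without an index on `"title"`, SQLite has to scan every row of `movies` to answer this query. We can make the query more efficient than a scan: in the same way that textbooks often have an index, database tables can have an index as well.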

## CREATE INDEX ___ ON

An index is a structure used to speed up the retrieval of rows from a table.

In [33]:
conn.execute(
"""
CREATE INDEX IF NOT EXISTS "title_index"
ON "movies" ("title");
"""
)
conn.commit()


In [34]:
df=pd.read_sql_query(
    """
    SELECT *
    FROM sqlite_master;
    """,
    conn
)
df

Unnamed: 0,type,name,tbl_name,rootpage,sql
0,table,movies,movies,2,"CREATE TABLE ""movies"" (\n ""id"" INTEGER,\n ..."
1,table,people,people,3,"CREATE TABLE ""people"" (\n ""id"" INTEGER,\n ..."
2,table,ratings,ratings,4,"CREATE TABLE ""ratings"" (\n ""id"" INTEGER,\n ..."
3,index,sqlite_autoindex_ratings_1,ratings,5,
4,table,stars,stars,6,"CREATE TABLE ""stars"" (\n ""movie_id"" INTEGER..."
5,index,sqlite_autoindex_stars_1,stars,7,
6,index,title_index,movies,25560,"CREATE INDEX ""title_index""\nON ""movies"" (""title"")"


In [35]:
pd.read_sql_query(
"""
PRAGMA index_list("movies");
""",
conn
)


Unnamed: 0,seq,name,unique,origin,partial
0,0,title_index,0,c,0


## EXPLAIN QUERY PLAN

In [39]:
qp=pd.read_sql_query(
"""
EXPLAIN QUERY PLAN
    SELECT *
    FROM "movies"
    WHERE "title" = 'Cars';
""",
conn
)
qp


Unnamed: 0,id,parent,notused,detail
0,3,0,63,SEARCH movies USING INDEX title_index (title=?)


In [40]:
print(qp["detail"][0])

SEARCH movies USING INDEX title_index (title=?)
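
To see how the plan changes when an index is added, here is a minimal sketch on a throwaway in-memory database. The table copies the `movies` schema, but the rows, the `demo` connection, and the `plan` helper are made up for illustration and are not part of `movies.db`.

```python
import sqlite3

# Throwaway in-memory database; the table mirrors the "movies" schema above.
demo = sqlite3.connect(":memory:")
demo.execute(
    'CREATE TABLE "movies" ("id" INTEGER, "title" TEXT NOT NULL, '
    '"year" NUMERIC, PRIMARY KEY("id"))'
)
# Made-up rows, only there so the planner has something to work with.
demo.executemany(
    'INSERT INTO "movies" ("title", "year") VALUES (?, ?)',
    [(f"Movie {i}", 2000 + i % 24) for i in range(1000)],
)

def plan(connection, sql):
    """Return the "detail" column of EXPLAIN QUERY PLAN for a statement."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return [row[3] for row in rows]

query = """SELECT * FROM "movies" WHERE "title" = 'Movie 42'"""

plan_before = plan(demo, query)  # no index on "title" yet
demo.execute('CREATE INDEX "title_index" ON "movies" ("title")')
plan_after = plan(demo, query)   # same query, index now available

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan should describe a scan of the table and the second a search that names `title_index`; the exact wording of the detail text varies between SQLite versions.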


## DROP INDEX

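In [44]:
conn.execute(
"""
DROP INDEX IF EXISTS "title_index";
"""
)
conn.commit()

`IF EXISTS` keeps this cell from raising `OperationalError: no such index: title_index` when it is run a second time. Once the index is gone, the same query falls back to a full scan of `movies`; the plan printed below was captured while `title_index` still existed.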

In [46]:
q=pd.read_sql_query(
"""
EXPLAIN QUERY PLAN
    SELECT *
    FROM "movies"
    WHERE "title" = 'Cars';
""",
conn
)
print(q["detail"][0])

SEARCH movies USING INDEX title_index (title=?)
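
The effect of dropping an index can be checked in isolation. The sketch below uses an empty in-memory table (an assumption for illustration, not `movies.db`) to show that a plain `DROP INDEX` fails when the index is already gone, that `DROP INDEX IF EXISTS` does not, and that the query plan no longer mentions the index afterwards.

```python
import sqlite3

# Throwaway in-memory database with a simplified "movies" table.
demo = sqlite3.connect(":memory:")
demo.execute(
    'CREATE TABLE "movies" ("id" INTEGER PRIMARY KEY, "title" TEXT NOT NULL)'
)
demo.execute('CREATE INDEX "title_index" ON "movies" ("title")')

# The first drop succeeds.
demo.execute('DROP INDEX "title_index"')

# A second plain drop raises, because the index no longer exists.
try:
    demo.execute('DROP INDEX "title_index"')
    second_drop_error = None
except sqlite3.OperationalError as exc:
    second_drop_error = str(exc)

# The IF EXISTS form is safe to run any number of times.
demo.execute('DROP INDEX IF EXISTS "title_index"')

# With the index gone, the planner can only scan the table.
plan_after_drop = [
    row[3]
    for row in demo.execute(
        """EXPLAIN QUERY PLAN SELECT * FROM "movies" WHERE "title" = 'Cars'"""
    )
]

print(second_drop_error)
print(plan_after_drop)
```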


## Index across Multiple Tables

We would run the following query to find all the movies Tom Hanks starred in:

In [53]:
tom=pd.read_sql_query(
"""
SELECT "title" FROM "movies"
WHERE "id" IN (
    SELECT "movie_id" FROM "stars"
    WHERE "person_id" = (
        SELECT "id" FROM "people"
        WHERE "name" = 'Tom Hanks'
    )
);
""",
conn
)
tom

Unnamed: 0,title
0,Bachelor Party
1,Splash
2,The Man with One Red Shoe
3,Volunteers
4,Every Time We Say Goodbye
...,...
57,News of the World
58,A Man Called Otto
59,Served: Harvey Weinstein
60,Borat Subsequent Moviefilm


To understand what kind of index could help speed this query up, we can run `EXPLAIN QUERY PLAN`:

In [None]:
qp=pd.read_sql_query(
"""
EXPLAIN QUERY PLAN
SELECT "title" FROM "movies"
WHERE "id" IN (
    SELECT "movie_id" FROM "stars"
    WHERE "person_id" = (
        SELECT "id" FROM "people"
        WHERE "name" = 'Tom Hanks'
    )
);
""",
conn
)
for e in qp["detail"]:
    print(e)

SEARCH movies USING INTEGER PRIMARY KEY (rowid=?)
LIST SUBQUERY 2
SEARCH stars USING INDEX person_index (person_id=?)
SCALAR SUBQUERY 1
SEARCH people USING COVERING INDEX name_index (name=?)
CREATE BLOOM FILTER


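Without indexes on the `person_id` column of `stars` and the `name` column of `people`, this query requires two scans, of `people` and `stars`. The plan above already reports searches because `person_index` and `name_index` existed in this database when the cell was run.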


Let us create the two indexes to speed this query up.

In [56]:
conn.execute(
"""
CREATE INDEX IF NOT EXISTS "person_index"
ON "stars" ("person_id");
"""
)
conn.execute(
"""
CREATE INDEX IF NOT EXISTS "name_index"
ON "people" ("name");
"""
)
conn.commit()


Now we run `EXPLAIN QUERY PLAN` again with the same nested query. We can observe that every lookup is a search using an index or the primary key, and no table is scanned.

In [55]:
qp=pd.read_sql_query(
"""
EXPLAIN QUERY PLAN
SELECT "title" FROM "movies"
WHERE "id" IN (
    SELECT "movie_id" FROM "stars"
    WHERE "person_id" = (
        SELECT "id" FROM "people"
        WHERE "name" = 'Tom Hanks'
    )
);
""",
conn
)
for e in qp["detail"]:
    print(e)

SEARCH movies USING INTEGER PRIMARY KEY (rowid=?)
LIST SUBQUERY 2
SEARCH stars USING INDEX person_index (person_id=?)
SCALAR SUBQUERY 1
SEARCH people USING COVERING INDEX name_index (name=?)
CREATE BLOOM FILTER


The search on the table ``people`` uses something called a ``COVERING INDEX``.

## Covering index

A covering index means that all the information needed for the query can be found within the index itself.

To have our search on the table ``stars`` also use a covering index, we can add ``"movie_id"`` to the index we created for ``stars``.

This ensures that the information being looked up (movie ID) and the value being searched on (person ID) are both in the index, so SQLite never has to go back to the table itself.
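
As a minimal sketch of that change, the example below rebuilds `person_index` on a simplified in-memory copy of `stars`. The composite primary key is left out to keep the plans easy to read, and the rows, the person ID 158, and the `plan` helper are made up for illustration.

```python
import sqlite3

# Throwaway in-memory database with a simplified "stars" table.
demo = sqlite3.connect(":memory:")
demo.execute('CREATE TABLE "stars" ("movie_id" INTEGER, "person_id" INTEGER)')
# Made-up (movie_id, person_id) pairs.
demo.executemany(
    'INSERT INTO "stars" ("movie_id", "person_id") VALUES (?, ?)',
    [(movie, movie % 200) for movie in range(2000)],
)

def plan(connection, sql):
    """Return the "detail" column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]

query = 'SELECT "movie_id" FROM "stars" WHERE "person_id" = 158'

# Index on the searched column only: "movie_id" must be fetched from the table.
demo.execute('CREATE INDEX "person_index" ON "stars" ("person_id")')
plan_single = plan(demo, query)

# Rebuild the index with "movie_id" appended: the index alone answers the query.
demo.execute('DROP INDEX "person_index"')
demo.execute('CREATE INDEX "person_index" ON "stars" ("person_id", "movie_id")')
plan_covering = plan(demo, query)

print(plan_single)
print(plan_covering)
```

The column being searched on (`person_id`) comes first in the index so that SQLite can still seek to the matching entries; `movie_id` is appended only so its value can be read from the index.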

## Space and Time Trade-off

Space Trade-off:
Indexes occupy additional space in the database, so we trade storage for query speed. SQLite stores each index as a separate data structure called a B-tree (a balanced tree), which holds a sorted copy of the indexed column values.

Time Trade-off:
Indexes speed up reads (`SELECT`) but slow down writes (`INSERT`, `UPDATE`, `DELETE`)
because every affected index must be updated along with the table.
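
The space cost can be made visible by building the same table twice, once with an index and once without, and comparing file sizes. This is a rough sketch under assumed data: the `database_size` helper, the temporary file name `demo.db`, and the 20,000 generated rows are all illustrative, and the exact byte counts depend on the SQLite version and page size.

```python
import os
import sqlite3
import tempfile

def database_size(with_index):
    """Build a small "movies" table on disk and return the file size in bytes."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "demo.db")
        demo = sqlite3.connect(path)
        demo.execute(
            'CREATE TABLE "movies" '
            '("id" INTEGER PRIMARY KEY, "title" TEXT NOT NULL, "year" NUMERIC)'
        )
        if with_index:
            demo.execute('CREATE INDEX "title_index" ON "movies" ("title")')
        # Made-up rows; every insert also has to update the index if present.
        demo.executemany(
            'INSERT INTO "movies" ("title", "year") VALUES (?, ?)',
            ((f"Movie number {i}", 1970 + i % 50) for i in range(20000)),
        )
        demo.commit()
        demo.close()
        return os.path.getsize(path)

size_without_index = database_size(with_index=False)
size_with_index = database_size(with_index=True)

print("without index:", size_without_index, "bytes")
print("with index:   ", size_with_index, "bytes")
```

The indexed file should be noticeably larger, because the index stores its own copy of every title. The write-time cost is real as well, but it is harder to measure reliably on a table this small.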


## CREATE INDEX + WHERE - Partial Index

A partial index includes only a subset of the rows in a table, which saves space. It is especially useful when users frequently query only that subset of rows.

For example, let’s create a partial index for movies released in 2023.

In [58]:
conn.execute(
"""
CREATE INDEX IF NOT EXISTS "recents" ON "movies" ("title")
WHERE "year" = 2023;
"""
)
conn.commit()



We can check that searching for movies released in 2023 uses the new index.

In [59]:
qp=pd.read_sql_query(
"""
EXPLAIN QUERY PLAN
SELECT "title" 
FROM "movies"
WHERE "year" = 2023;
""",
conn
)
for e in qp["detail"]:
    print(e)

SCAN movies USING INDEX recents
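
A partial index only helps queries whose `WHERE` clause matches the condition the index was built with. The sketch below checks this on a throwaway in-memory table; the rows and the `plan` helper are made up for illustration, and the index is built on `"title"` with the same `"year" = 2023` condition as above.

```python
import sqlite3

# Throwaway in-memory database; the table mirrors the "movies" schema.
demo = sqlite3.connect(":memory:")
demo.execute(
    'CREATE TABLE "movies" '
    '("id" INTEGER PRIMARY KEY, "title" TEXT NOT NULL, "year" NUMERIC)'
)
# Made-up rows spread over the years 2014 to 2023.
demo.executemany(
    'INSERT INTO "movies" ("title", "year") VALUES (?, ?)',
    [(f"Movie {i}", 2014 + i % 10) for i in range(5000)],
)
demo.execute(
    'CREATE INDEX "recents" ON "movies" ("title") WHERE "year" = 2023'
)

def plan(connection, sql):
    """Return the "detail" column of EXPLAIN QUERY PLAN for a statement."""
    return [row[3] for row in connection.execute("EXPLAIN QUERY PLAN " + sql)]

# Same condition as the index: the planner can use "recents".
plan_2023 = plan(demo, 'SELECT "title" FROM "movies" WHERE "year" = 2023')

# Different year: rows from 2022 are not in the index, so it cannot be used.
plan_2022 = plan(demo, 'SELECT "title" FROM "movies" WHERE "year" = 2022')

print(plan_2023)
print(plan_2022)
```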


## Vacuum