We'll continue to work with factbook.db, a SQLite database that contains information about each country in the world. This database contains just the facts table.

Here are some of the columns:

* name -- the name of the country.
* area -- the total land and sea area of the country.
* population -- the population of the country.
* birth_rate -- the birth rate of the country.
* created_at -- the date the record was created.
* updated_at -- the date the record was updated.

To start, let's write and run a query that involves filtering on more than 1 column and use the EXPLAIN QUERY PLAN statement to understand what SQLite is doing to return the results. Our intuition suggests that SQLite will have to perform a full table scan. It will have to check if each row in the table meets the WHERE constraints since there are no indexes in the table to take advantage of.

In [5]:
import sqlite3 as sql
import pandas as pd
conn = sql.connect("factbook.db")

def run_query(q):
    return pd.read_sql_query(q, conn)

In [11]:
# Return the query plan for a query that returns all rows where population is greater than 1000000 and 
# where population_growth is less than 0.05.
# We're interested in all of the columns in the rows.
# Assign the query plan to query_plan_one and use print function to display the query plan.

# q = """ Explain query plan Select * From facts
#        Where population > 1000000 and population_growth < 0.05;"""
# run_query(q)

query_plan_one = conn.execute("""explain query plan select * from facts
                              where population > 1000000 and population_growth < 0.05;""").fetchall()
print(query_plan_one)

[(2, 0, 0, 'SCAN TABLE facts')]


As expected, SQLite had to perform a full table scan to access the data we asked for. Let's add indexes for both the population and population_growth columns to see how SQLite uses these indexes for returning the same query.

In [15]:
# Create an index called pop_idx for the population column in the facts table.
# Create an index called pop_growth_idx for the population_growth column in the facts table.

conn.execute("""CREATE INDEX IF NOT EXISTS pop_idx ON facts(population);""")
conn.execute("""CREATE INDEX IF NOT EXISTS pop_growth_idx ON facts(population_growth);""")
query_plan_two = conn.execute("""Explain query plan Select * from facts
                Where population > 1000000 and population_growth < 0.05;""").fetchall()
print(query_plan_two)

[(3, 0, 0, 'SEARCH TABLE facts USING INDEX pop_growth_idx (population_growth<?)')]


SQLite returns only a high-level query plan when we use the EXPLAIN QUERY PLAN statement in front of a query. This means that we'll often have to augment the returned query plan with our own understanding of the available indexes. In this case, the facts table has 2 indexes:

* one ordered by population called pop_idx,
* one ordered by population_growth, called pop_growth_idx.
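
Since the query plan doesn't list the indexes that exist, it can help to ask SQLite for them directly. The following is a minimal sketch that uses an in-memory stand-in for the facts table (the schema and the `demo` connection name are assumptions made for illustration, not the real factbook.db) together with the `PRAGMA index_list` and `PRAGMA index_info` statements:

```python
import sqlite3

# Stand-in for factbook.db: an in-memory table with only the columns used here (an assumption).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, "
    "population INTEGER, population_growth REAL)"
)
demo.execute("CREATE INDEX pop_idx ON facts(population)")
demo.execute("CREATE INDEX pop_growth_idx ON facts(population_growth)")

# PRAGMA index_list returns one row per index; the second field is the index name.
index_names = [row[1] for row in demo.execute("PRAGMA index_list('facts')")]

# PRAGMA index_info returns one row per indexed column; the third field is the column name.
index_columns = {
    name: [row[2] for row in demo.execute(f"PRAGMA index_info('{name}')")]
    for name in index_names
}
print(index_columns)
```

Running the same two PRAGMA statements against `conn` should list whatever indexes currently exist in factbook.db.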

SQLite struggles to take advantage of both indexes since each index is optimized for lookups on just that column. SQLite can use either index to quickly find the row id values where population is greater than 1000000 or where population_growth is less than 0.05, but not both at once. If SQLite uses pop_idx to return all of the row id values where population is greater than 1000000, it can't use those id values to search pop_growth_idx quickly, because that index is ordered by population_growth rather than by id.

If we look at the query plan, we can infer that SQLite first decided to use the pop_growth_idx index to return the id values for the rows where population_growth was less than 0.05. Then, SQLite used a binary search on the facts table to access the row at each id value, added that row to a temporary collection if the value for population was greater than 1000000, and returned the collection of rows.
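
To make these two steps concrete, here is a rough Python model of the plan. It is only a sketch: the rows are made up, the index is modeled as a sorted list of (population_growth, id) pairs, and a dictionary lookup stands in for the binary search SQLite performs on the table.

```python
from bisect import bisect_left

# Hypothetical rows standing in for the facts table: id -> (population, population_growth).
facts = {
    1: (500_000, 0.01),
    2: (2_000_000, 0.03),
    3: (5_000_000, 1.20),
    4: (8_000_000, 0.02),
    5: (900_000, 2.50),
}

# A single-column index modeled as (population_growth, id) pairs sorted by growth.
pop_growth_idx = sorted((growth, row_id) for row_id, (_, growth) in facts.items())

# Step 1: binary search for the boundary; entries before it have population_growth < 0.05.
end = bisect_left(pop_growth_idx, (0.05,))
candidate_ids = [row_id for _, row_id in pop_growth_idx[:end]]

# Step 2: look up each candidate row in the table and apply the population filter there.
matches = [row_id for row_id in candidate_ids if facts[row_id][0] > 1_000_000]
print(matches)
```

Note that the population check in step 2 needs one table lookup per candidate, even for rows that end up being discarded.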

We may be wondering why SQLite chose pop_growth_idx instead of pop_idx. When there are 2 possible indexes available, SQLite tries to estimate which index will result in better performance. To stay lightweight, however, SQLite has limited ability to estimate accurately. Unless statistics have been collected with the ANALYZE command, it falls back on rough default estimates, so its choice between two similar indexes can look arbitrary.
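
SQLite's ANALYZE command collects statistics about each index that the planner can use for its estimates. The sketch below runs it against an in-memory stand-in for the facts table filled with made-up rows (the data and the `demo` connection name are assumptions). Whether the chosen index changes afterwards depends on the data and the SQLite version, so we only print the plan rather than predict it.

```python
import sqlite3

# Stand-in for factbook.db with made-up rows (an assumption, not the real data).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, population INTEGER, population_growth REAL)"
)
demo.executemany(
    "INSERT INTO facts (population, population_growth) VALUES (?, ?)",
    [(i * 50_000, (i % 40) / 10) for i in range(1, 201)],
)
demo.execute("CREATE INDEX pop_idx ON facts(population)")
demo.execute("CREATE INDEX pop_growth_idx ON facts(population_growth)")

# ANALYZE gathers per-index statistics and stores them in the sqlite_stat1 table.
demo.execute("ANALYZE")
stats = demo.execute("SELECT idx, stat FROM sqlite_stat1 WHERE tbl = 'facts'").fetchall()
print(stats)

# The plan may or may not differ from the one seen without statistics.
plan = demo.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM facts "
    "WHERE population > 1000000 AND population_growth < 0.05"
).fetchall()
print(plan)
```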

In cases like this, we need to create a multi-column index that contains values from both of the columns we're filtering on. This way, both criteria in the WHERE clause can be evaluated in the index itself and the facts table will only be queried at the end, when we have the specific row id values.

While the single-column index pop_idx contains just its key column (population) and the row id (id), this multi-column index contains the population_growth column as well. SQLite can:

* use binary search to find the first row in this index where population is greater than 1000000,
* add the row to a temporary collection if population_growth is less than 0.05,
* advance to the next row (the index is ordered by population) and repeat the population_growth check,
* when the end of the index is reached, look up each row in facts using the id values from the temporary collection.

This way the facts table is only accessed at the end and the index is used to process the WHERE criteria.
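
The steps above can be sketched in plain Python. This is an illustrative model rather than SQLite's actual implementation: the rows are invented, and the multi-column index is represented as a list of (population, population_growth, id) tuples sorted by population first.

```python
from bisect import bisect_right

# Hypothetical rows standing in for the facts table: id -> full row.
facts = {
    1: {"name": "A", "population": 500_000, "population_growth": 0.01},
    2: {"name": "B", "population": 2_000_000, "population_growth": 0.03},
    3: {"name": "C", "population": 5_000_000, "population_growth": 1.20},
    4: {"name": "D", "population": 8_000_000, "population_growth": 0.02},
    5: {"name": "E", "population": 1_000_000, "population_growth": 0.01},
}

# Multi-column index modeled as (population, population_growth, id) tuples,
# sorted by population first.
pop_pop_growth_idx = sorted(
    (row["population"], row["population_growth"], row_id) for row_id, row in facts.items()
)

# Binary search to the first entry whose population is strictly greater than 1000000.
start = bisect_right(pop_pop_growth_idx, (1_000_000, float("inf"), float("inf")))

# Walk the rest of the index, checking population_growth from the index entry itself.
matching_ids = [row_id for _, growth, row_id in pop_pop_growth_idx[start:] if growth < 0.05]

# Only now touch the table, once per matching id.
rows = [facts[row_id] for row_id in matching_ids]
print(rows)
```

Unlike the single-index approach, rows that fail the population_growth check are discarded without ever being looked up in the table.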

When creating a multi-column index, we need to specify which of the columns we want as the primary key, that is, the column the index is ordered by first. With population as the primary key, SQLite can use binary search to quickly jump to the first row that matches a specific population value, but it can't jump directly to the first row that matches a specific population_growth value.

To create a multi-column index, we use the same CREATE INDEX syntax as before but specify 2 columns in the ON clause:

CREATE INDEX index_name ON table_name(column_name_1, column_name_2);

The important thing to know here is that the first column in the parentheses becomes the primary key for the index.
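
To see why the column order matters, we can compare the query plans for a filter on each column alone. The sketch below uses an in-memory stand-in for the facts table (the schema, the `demo` connection, and the `plan_details` helper are assumptions made for illustration). We'd expect the filter on population to search the index, while the filter on population_growth alone can't use it for a search.

```python
import sqlite3

# Stand-in for factbook.db: an in-memory table with only a few columns (an assumption).
demo = sqlite3.connect(":memory:")
demo.execute(
    "CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, "
    "population INTEGER, population_growth REAL)"
)
demo.execute("CREATE INDEX pop_pop_growth_idx ON facts(population, population_growth)")

def plan_details(query):
    """Return the human-readable detail field (the last one) of each query plan row."""
    return [row[-1] for row in demo.execute("EXPLAIN QUERY PLAN " + query)]

# Filter on the first column of the index.
by_population = plan_details("SELECT * FROM facts WHERE population > 1000000")

# Filter on the second column of the index only.
by_growth = plan_details("SELECT * FROM facts WHERE population_growth < 0.05")

print(by_population)
print(by_growth)
```

The exact wording of the plan text differs between SQLite versions, so the output may not match the plans shown elsewhere on this page character for character.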

In [22]:
# Create a multi-column index for population and population_growth named pop_pop_growth_idx with population as the primary key.
# Return the query plan for a query that returns all rows where population is greater than 1000000 and where population_growth is less than 0.05. We're interested in all of the columns in the rows.
# Assign the returned query plan to query_plan_three and use the print function to display it.

conn.execute("CREATE index if not exists pop_pop_growth_idx on facts(population, population_growth);")
query_plan_three = conn.execute("explain query plan select * from facts where population>1000000 and population_growth < 0.05;").fetchall()
print(query_plan_three)

[(3, 0, 0, 'SEARCH TABLE facts USING INDEX pop_pop_growth_idx (population>?)')]


This time, SQLite used the multi-column index pop_pop_growth_idx that we created instead of either pop_growth_idx or pop_idx. SQLite only needed to access the facts table to return the rest of the column values for the rows that met the WHERE criteria. This is only because pop_pop_growth_idx doesn't contain any column values other than population and population_growth.

What if we restricted the columns in the SELECT that we want returned to just population and population_growth? In this case, SQLite will not need to interact with the facts table since the pop_pop_growth_idx can service the query. When an index contains all of the information necessary to answer a query, it's called a **covering index**. Since the index covers for the actual table and can return the requested results to the query, SQLite doesn't need to query the actual table. This can be much more efficient.

In [20]:
# Return the query plan for a query that returns all rows where
# population is greater than 1000000 and 
# where population_growth is less than 0.05. 
# Select only the population and population_growth columns.

query_plan_four = conn.execute("""Explain query Plan Select  population, population_growth From Facts
                                  Where population >1000000 and population_growth<0.05;""").fetchall()
print(query_plan_four)

[(2, 0, 0, 'SEARCH TABLE Facts USING COVERING INDEX pop_pop_growth_idx (population>?)')]


Two things stand out in the query plan for the previous query:

* instead of USING INDEX the query plan says USING COVERING INDEX,
* the query plan still contains SEARCH TABLE facts as before.

Even though the query plan says SEARCH TABLE facts, which suggests a binary search on the facts table, this is misleading. SQLite was able to answer the query from the covering index alone.

Covering indexes don't apply just to multi-column indexes. If a query we write only touches a column in the database that we have a single-column index for, SQLite will use only the index to service the query.

In [21]:
# Return the query plan for a query that returns all rows 
# where population is greater than 1000000. 
# We're only interested in the population column.

query_plan_five = conn.execute("""Explain query Plan Select  population From Facts
                                  Where population >1000000;""").fetchall()
print(query_plan_five)

[(2, 0, 0, 'SEARCH TABLE Facts USING COVERING INDEX pop_idx (population>?)')]
