# Multi-column indexing

In the last mission, we explored how to **speed up `SELECT` queries that only filter on one column by creating an index for that column**. 

### In this mission, we'll explore how to create indexes for speeding up queries that filter on multiple columns.

We'll continue to work with `factbook.db`, a SQLite database that contains information about each country in the world. Recall that this database contains just the `facts` table and each row represents a single country. While we created indexes for the facts table in this database in the previous mission, this version of `factbook.db` contains no indexes. <br>

Here are some of the columns:

* `name` -- the name of the country.
* `area` -- the total land and sea area of the country.
* `population` -- the population of the country.
* `birth_rate` -- the birth rate of the country.
* `created_at` -- the date the record was created.
* `updated_at` -- the date the record was updated.

and the first few rows of facts:


In [1]:
import pandas as pd
import sqlite3

In [2]:
conn = sqlite3.connect('data/factbook.db')

In [3]:
pd.read_sql('select * from facts limit 5', conn)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at | leader |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 | |
| 1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 | |
| 2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 2015-11-01 13:19:59.961286 | 2015-11-01 13:19:59.961286 | |
| 3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 | 2015-11-01 13:20:03.659945 | 2015-11-01 13:20:03.659945 | |
| 4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 | 2015-11-01 13:20:08.625072 | 2015-11-01 13:20:08.625072 | |


In the last mission, we limited ourselves to queries that filtered on only one column, like:

`SELECT * FROM facts WHERE name = 'India';`

In this mission, we'll explore how to create indexes for speeding up queries that filter on multiple columns, like:

`SELECT * FROM facts WHERE population > 1000000 AND population_growth < 2.0;`

We'll also explore how to modify the queries we write to better take advantage of indexes. For example, if we create an index for the name column, we'll explore why the following query:

`SELECT name FROM facts WHERE name = 'India';`

will be faster than:

`SELECT * FROM facts WHERE name = 'India';`

To start, let's write and run a query that involves filtering on more than 1 column and use the `EXPLAIN QUERY PLAN` statement to understand what SQLite is doing to return the results. Our intuition suggests that SQLite will have to perform a full table scan. It will have to check if each row in the table meets the WHERE constraints since there are no indexes in the table to take advantage of. <br>

We've already imported the `sqlite3` library and initialized a connection to `factbook.db` in the coding cell.
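
Before running the query against `factbook.db`, the intuition above can be checked on a throwaway in-memory table. This is a minimal sketch: the `conn_demo` name and the three rows are made up for illustration, and the exact wording of the plan varies between SQLite versions (newer releases print `SCAN facts` rather than `SCAN TABLE facts`).

```python
import sqlite3

# Throwaway in-memory table with hypothetical rows; this is not factbook.db.
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table facts (id integer primary key, name text, "
    "population integer, population_growth real)"
)
conn_demo.executemany(
    "insert into facts (name, population, population_growth) values (?, ?, ?)",
    [("A", 5000000, 0.01), ("B", 900000, 1.20), ("C", 2000000, 0.03)],
)

plan = conn_demo.execute(
    "explain query plan select * from facts "
    "where population > 1000000 and population_growth < 0.05"
).fetchall()

# The last field of each plan row is the human-readable description of the step.
details = [row[-1] for row in plan]
print(details)
```

With no indexes defined, the single plan step should describe a scan of `facts` rather than a search through an index.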

In [None]:
# practice

In [4]:
plan_one = '''
    explain query plan select * from facts
                        where population > 1000000 
                                and
                                population_growth < 0.05;
'''
query_plan_one = conn.execute(plan_one).fetchall()

print(query_plan_one)

[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_idx (population>?)')]


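### As expected, SQLite has to perform a full table scan when no indexes exist. 
On a copy of `factbook.db` without indexes, the plan for this query reads `SCAN TABLE facts`. The output shown above names `pop_idx`, most likely because the cell was run again after that index had been created. Let's add indexes for both the `population` and `population_growth` columns to see how SQLite uses these indexes to answer the same query.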

In [5]:
# practice

In [6]:
create_pop_idx = 'create index if not exists pop_idx on facts(population);'
conn.execute(create_pop_idx)

<sqlite3.Cursor at 0x113ca6030>

In [7]:
create_pop_growth_idx = 'create index if not exists pop_growth_idx on facts(population_growth);'
conn.execute(create_pop_growth_idx)

<sqlite3.Cursor at 0x113ca60a0>

In [9]:
plan_two = '''
    explain query plan select * from facts where population > 1000000
                                            and population_growth < 0.05;
'''

query_plan_two = conn.execute(plan_two).fetchall()

print(query_plan_two)

[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_growth_idx (population_growth<?)')]


## Explanation

If you recall, SQLite returns only a high-level query plan when you use the `EXPLAIN QUERY PLAN` statement in front of a query. This means that you'll often have to augment the returned query plan with your own understanding of the available indexes. In this case, the `facts` table has 2 indexes:

* one ordered by `population` called `pop_idx`,
* one ordered by `population_growth`, called `pop_growth_idx`.

### SQLite struggles to take advantage of both indexes since each index is optimized for lookups on just that column. 
SQLite can use the indexes to quickly find the row id values where either `population` is greater than 1000000 or `population_growth` is less than 0.05, but not both at once. 
* If SQLite uses the index of population values to return all of the row id values where population is greater than 1000000, it can't use those id values to search the pop_growth_idx index quickly to find the rows where population_growth is less than 0.05.

If you look at the query plan, you can infer that 
* SQLite first decided to use the pop_growth_idx index to return the id values for the rows where population_growth was less than 0.05. 
* Then, SQLite used a binary search on the facts table to access the row at each id value, added that row to a temporary collection if the value for population was greater than 1000000, and returned the collection of rows.

You may be wondering why SQLite chose `pop_growth_idx` instead of `pop_idx`. When 2 possible indexes are available, **SQLite tries to estimate which index will result in better performance**. Unfortunately, to keep SQLite lightweight, its planner was given only limited ability to estimate and plan accurately, and **SQLite's choice between comparable indexes is often effectively arbitrary**.
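
To see that only one of the two single-column indexes ends up in the plan, here is a small sketch against an in-memory table. The `conn_demo` connection and the cut-down schema are assumptions for illustration; which index gets picked may differ between SQLite versions, so the code only checks that exactly one of them is used.

```python
import sqlite3

# Hypothetical, cut-down version of the facts table (not factbook.db).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table facts (id integer primary key, name text, "
    "population integer, population_growth real)"
)
conn_demo.execute("create index pop_idx on facts(population)")
conn_demo.execute("create index pop_growth_idx on facts(population_growth)")

plan = conn_demo.execute(
    "explain query plan select * from facts "
    "where population > 1000000 and population_growth < 0.05"
).fetchall()
details = [row[-1] for row in plan]

# Record which of the two indexes appears in the plan text.
used_pop = any("pop_idx" in d for d in details)
used_growth = any("pop_growth_idx" in d for d in details)
print(details, used_pop, used_growth)
```

The plan contains a single step that names one index; the other condition in the `WHERE` clause is checked against rows fetched from the table.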

## Multi-column Index

In cases like this, we need to create a **multi-column** index that contains values from both of the columns we're filtering on. This way, 
* both criteria in the `WHERE` clause can be evaluated in the index itself 
* and the `facts` table will only be queried at the end when we have the specific row id values.

Here's what a multi-column index for `population` and `population_growth` would look like:

![](img/13.png)

While the single-column indexes we've created in the past contain just the primary key column (`population`) and the row id column (`id`), this multi-column index contains the `population_growth` column as well. SQLite can:

* use binary search to find the first row in this index where `population` is greater than **1000000**,
* add the row to a temporary collection if `population_growth` is less than **0.05**,
* advance to the next row (*the index is ordered by population*) and check if its `population` is greater than **1000000**,
  * add the row to a temporary collection if `population_growth` is less than **0.05**,
* when the end of the index is reached, look up each row in `facts` using the `id` values from the temporary collection.

This way the `facts` table is only accessed at the end and the index is used to process the `WHERE` criteria. <br>
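
The steps above can be mimicked in plain Python. This is only a rough model of the idea, not how SQLite actually stores or walks its B-tree indexes; the rows and the `matching_ids` helper are hypothetical.

```python
from bisect import bisect_right

# Hypothetical rows: (id, population, population_growth).
rows = [
    (1, 32564342, 2.32),
    (2, 3029278, 0.30),
    (3, 85580, 0.12),
    (4, 7200000, 0.03),
    (5, 1400000, 0.01),
    (6, 500000, 0.02),
]

# Model of the multi-column index: entries ordered by population first,
# then population_growth, each carrying the row id.
index = sorted((pop, growth, row_id) for row_id, pop, growth in rows)
populations = [entry[0] for entry in index]


def matching_ids(min_population, max_growth):
    # Binary search for the first entry whose population exceeds the threshold.
    start = bisect_right(populations, min_population)
    # Walk the rest of the index and test the second column without
    # touching the table itself.
    return [row_id for _, growth, row_id in index[start:] if growth < max_growth]


ids = matching_ids(1000000, 0.05)
print(ids)  # [5, 4]
```

Only the ids collected at the end would then be looked up in the table, which is the step that a real index saves until last.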

When creating a **multi-column index**, we need to specify which of the columns we want as the primary key. In the example above, this means that SQLite can use binary search to quickly jump to the first row that matches a specific population value but not for jumping to the first row that matches a specific population_growth value.
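
The effect of column order can be checked with a short sketch. It assumes an in-memory stand-in for `facts` and a hypothetical `plan_detail` helper, and it assumes `ANALYZE` has not been run, so SQLite has no statistics that might let it use other strategies.

```python
import sqlite3

# Hypothetical, cut-down version of the facts table (not factbook.db).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table facts (id integer primary key, name text, "
    "population integer, population_growth real)"
)
conn_demo.execute(
    "create index pop_pop_growth_idx on facts(population, population_growth)"
)


def plan_detail(sql):
    # Return only the descriptive text of each step in the query plan.
    return [row[-1] for row in conn_demo.execute("explain query plan " + sql)]


# Filtering on the leading column can use the index for a search.
by_population = plan_detail("select * from facts where population > 1000000")
# Filtering only on the second column cannot, so SQLite scans instead.
by_growth = plan_detail("select * from facts where population_growth < 0.05")

print(by_population)
print(by_growth)
```

If queries often filter on `population_growth` alone, an index that leads with that column would be the one to consider.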

## Creating a multi-column index

To create a multi-column index, we use the same `CREATE INDEX` syntax as before but specify 2 columns in the `ON` clause:

`CREATE INDEX index_name ON table_name(column_name_1, column_name_2);`

The important thing to know here is 
### that the first column in the parentheses becomes the primary key for the index. 
Let's create a multi-column index for the `population` and `population_growth` columns and return the query plan for the query we've been working with.

In [10]:
# primary key for multi-index = population

create_pop_pop_growth_idx = '''
    create index pop_pop_growth_idx on facts(population,
                                            population_growth);
'''
conn.execute(create_pop_pop_growth_idx)

<sqlite3.Cursor at 0x113ca61f0>

In [12]:
plan_three = '''
        explain query plan select * from facts
                            where population > 1000000
                            and
                            population_growth < 0.05;
'''

In [13]:
query_plan_three = conn.execute(plan_three).fetchall()

print(query_plan_three)

[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_pop_growth_idx (population>?)')]


## Covering Index

This time, SQLite used the multi-column index `pop_pop_growth_idx` that we created instead of either `pop_growth_idx` or `pop_idx`. SQLite only needed to access the `facts` table to return the rest of the column values for the rows that met the `WHERE` criteria, because `pop_pop_growth_idx` contains only the `population` and `population_growth` values (plus the row id).

### What if we restricted the columns in the SELECT that we want returned to just population and population_growth? 
In this case, SQLite will not need to interact with the `facts` table since the `pop_pop_growth_idx` can service the query. **When an index contains all of the information necessary to answer a query, it's called a:**

### covering index.

Since the index covers for the actual table and can return the requested results to the query, SQLite doesn't need to query the actual table. For many queries, especially as your data gets larger, this can be much more efficient. <br>

Let's write a query that uses the index we created as a covering index and return its query plan.

In [14]:
plan_four = '''
    explain query plan select population, population_growth
                        from facts
                        where population > 1000000
                        and
                        population_growth < 0.05;
'''

query_plan_four = conn.execute(plan_four).fetchall()

print(query_plan_four)

[(0, 0, 0, 'SEARCH TABLE facts USING COVERING INDEX pop_pop_growth_idx (population>?)')]


## Covering index for single column

There are two things that stand out in the query plan from the previous screen:

* instead of `USING INDEX` the query plan says `USING COVERING INDEX`,
* the query plan still contains `SEARCH TABLE facts` as before.

Even though the query plan indicates that a binary search on `facts` was performed, this is misleading: SQLite was instead able to use the covering index. You can read more about that [in the documentation](https://www.sqlite.org/queryplanner.html#covidx).<br>
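
To compare the two wordings side by side, here is a minimal sketch on an in-memory table. The schema is a cut-down, hypothetical version of `facts`, and `plan_detail` is a helper introduced only for this example.

```python
import sqlite3

# Hypothetical, cut-down version of the facts table (not factbook.db).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table facts (id integer primary key, name text, "
    "population integer, population_growth real)"
)
conn_demo.execute(
    "create index pop_pop_growth_idx on facts(population, population_growth)"
)

where = " from facts where population > 1000000 and population_growth < 0.05"


def plan_detail(sql):
    # Return only the descriptive text of each step in the query plan.
    return [row[-1] for row in conn_demo.execute("explain query plan " + sql)]


# Needs columns that are not in the index, so the table must be visited.
star_plan = plan_detail("select *" + where)
# Needs only columns stored in the index, so the index can cover the query.
covered_plan = plan_detail("select population, population_growth" + where)

print(star_plan)
print(covered_plan)
```

Only the second plan should mention `COVERING INDEX`; the first uses the same index but still has to fetch the remaining columns from the table.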


Covering indexes don't apply just to multi-column indexes. If a query we write only touches a column in the database that we have a single-column index for, SQLite will use only the index to service the query. Let's test this by writing a query that can take advantage of just the index, `pop_idx`, for the `population` column.

In [None]:
# practice

In [15]:
plan_five = '''
    explain query plan select population from facts
                        where population > 1000000
                        and
                        population_growth < 0.05;
'''

query_plan_five = conn.execute(plan_five).fetchall()

print(query_plan_five)

[(0, 0, 0, 'SEARCH TABLE facts USING COVERING INDEX pop_pop_growth_idx (population>?)')]


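Although the query returns only `population` values, its `WHERE` clause also filters on `population_growth`, so SQLite chose `pop_pop_growth_idx`, which contains both columns, as the covering index and **didn't have to access the `facts` table**. A query that references only `population` could be served the same way by `pop_idx` alone. <br>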
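
As a rough check of the single-column case, the sketch below creates only `pop_idx` on an in-memory table (a hypothetical stand-in for `facts`) and runs a query that touches nothing but `population`. The `plan_detail` helper is an assumption made for this example.

```python
import sqlite3

# Hypothetical, cut-down version of the facts table (not factbook.db).
conn_demo = sqlite3.connect(":memory:")
conn_demo.execute(
    "create table facts (id integer primary key, name text, "
    "population integer, population_growth real)"
)
# Only the single-column index exists in this sketch.
conn_demo.execute("create index pop_idx on facts(population)")


def plan_detail(sql):
    # Return only the descriptive text of each step in the query plan.
    return [row[-1] for row in conn_demo.execute("explain query plan " + sql)]


# Every column the query references lives in pop_idx.
only_population = plan_detail(
    "select population from facts where population > 1000000"
)
# name is not stored in pop_idx, so the table has to be consulted as well.
with_name = plan_detail("select name from facts where population > 1000000")

print(only_population)
print(with_name)
```

The first plan should report `COVERING INDEX pop_idx`, while the second uses the same index without the word `COVERING`.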

In this mission, we explored how to create multi-column indexes and how to restrict our query to utilize an index if we don't always need information on column values only available in the table.