# SQL and Databases: Advanced
## Introduction To Indexing

In this mission, we'll explore how queries are executed in SQLite. After exploring this at a high level, we'll look at how to create and use indexes for better performance. As our data gets larger and our queries more complex, it's important to be able to tweak the queries we write and **optimize a database's schema to ensure that we're getting results back quickly**. <br>

To explore database performance, we'll work with `factbook.db`, a SQLite database that contains information about each country in the world. We'll be working with the `facts` table in the database. Each row in facts represents a single country, and contains several columns, including:

* name -- the name of the country.
* area -- the total land and sea area of the country.
* population -- the population of the country.
* birth_rate -- the birth rate of the country.
* created_at -- the date the record was created.
* updated_at -- the date the record was updated.

Here are the first few rows of `facts`:

In [2]:
import sqlite3
import pandas as pd

In [10]:
conn = sqlite3.connect("data/factbook.db")

In [11]:
pd.read_sql('select * from facts limit 5', conn)

| | id | code | name | area | area_land | area_water | population | population_growth | birth_rate | death_rate | migration_rate | created_at | updated_at | leader |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | af | Afghanistan | 652230 | 652230 | 0 | 32564342 | 2.32 | 38.57 | 13.89 | 1.51 | 2015-11-01 13:19:49.461734 | 2015-11-01 13:19:49.461734 | |
| 1 | 2 | al | Albania | 28748 | 27398 | 1350 | 3029278 | 0.3 | 12.92 | 6.58 | 3.3 | 2015-11-01 13:19:54.431082 | 2015-11-01 13:19:54.431082 | |
| 2 | 3 | ag | Algeria | 2381741 | 2381741 | 0 | 39542166 | 1.84 | 23.67 | 4.31 | 0.92 | 2015-11-01 13:19:59.961286 | 2015-11-01 13:19:59.961286 | |
| 3 | 4 | an | Andorra | 468 | 468 | 0 | 85580 | 0.12 | 8.13 | 6.96 | 0.0 | 2015-11-01 13:20:03.659945 | 2015-11-01 13:20:03.659945 | |
| 4 | 5 | ao | Angola | 1246700 | 1246700 | 0 | 19625353 | 2.78 | 38.78 | 11.49 | 0.46 | 2015-11-01 13:20:08.625072 | 2015-11-01 13:20:08.625072 | |


In [12]:
schema = conn.execute('pragma table_info(facts)').fetchall()

In [15]:
for sc in schema:
    print(sc)

(0, 'id', 'INTEGER', 1, None, 1)
(1, 'code', 'varchar(255)', 1, None, 0)
(2, 'name', 'varchar(255)', 1, None, 0)
(3, 'area', 'integer', 0, None, 0)
(4, 'area_land', 'integer', 0, None, 0)
(5, 'area_water', 'integer', 0, None, 0)
(6, 'population', 'integer', 0, None, 0)
(7, 'population_growth', 'float', 0, None, 0)
(8, 'birth_rate', 'float', 0, None, 0)
(9, 'death_rate', 'float', 0, None, 0)
(10, 'migration_rate', 'float', 0, None, 0)
(11, 'created_at', 'datetime', 0, None, 0)
(12, 'updated_at', 'datetime', 0, None, 0)
(13, 'leader', 'text', 0, None, 0)


## Query planner

When you execute a SQL query, SQLite performs many steps before returning the results to you. First, it tokenizes and parses your query to look for any syntax errors. If there are any syntax errors, the query execution process halts and the error message is returned to you. If the parser was able to successfully parse the query, then SQLite moves on to the query planning and optimization phase. <br>
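As a minimal sketch of the parsing step rejecting a query, the snippet below runs a deliberately misspelled statement. It assumes a throwaway in-memory database with a simplified, hypothetical `facts` table rather than `factbook.db`, so it can be run on its own.

```python
import sqlite3

# Assumption: an in-memory database with a simplified facts table
# stands in for factbook.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT)")

try:
    # "SELEC" is a deliberate typo, so the parser rejects the query
    # before any planning or data access happens.
    demo.execute("SELEC * FROM facts")
    parse_error = None
except sqlite3.OperationalError as exc:
    parse_error = str(exc)

print(parse_error)
demo.close()
```

The sqlite3 library surfaces the parser's message as a `sqlite3.OperationalError`, which is why the `try`/`except` is needed to inspect it.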

There are many different ways for SQLite to access the underlying data in a database. When working with a database that's stored on disk as a file, it's crucial to minimize the number of disk reads to avoid long running times. The **query optimizer** generates cost estimates for the various ways to access the underlying data, factoring in the schema of the tables and the operations the query requires. The heuristics and algorithms involved in query optimization are complex and out of this mission's scope. <br>

The optimizer quickly assesses the various ways to access the data and generates a best guess for the fastest **query plan**. This high-level query plan is then compiled into lower-level bytecode, which SQLite's virtual machine (itself written in C) runs to interact with the database file on disk. Thankfully, we can observe the query plan to understand what SQLite is doing to return our results.

## Explain query plan

We can place the `EXPLAIN QUERY PLAN` statement before any query to see the high-level query plan SQLite would use, without running the query itself. If you write a `SELECT` statement and place the `EXPLAIN QUERY PLAN` statement before it:

`EXPLAIN QUERY PLAN SELECT * FROM facts;`

the results of the `SELECT` query **won't be returned and instead the high level query plan will be**:

`[(0, 0, 0, 'SCAN TABLE facts')]`

We'll focus on the value at index `3` (the last element) in the returned tuple in this mission. `SCAN TABLE` means that every row in the entire table (`facts`) had to be accessed to evaluate the query. Since the `SELECT` query we wrote returns all of the columns and rows in the `facts` table, the entire table had to be accessed to get the results we requested. <br>

When running the query using the sqlite3 library, you'll still need to use the `fetchall()` method or the query plan won't be returned:

```python
query_plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM facts;").fetchall()
```
The query plan is returned as a list of tuples, one tuple per step of the plan, which is how the sqlite3 library represents query results.
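To make the shape of the result concrete, here is a small sketch that pulls the description string out of each plan step. It assumes an in-memory database with a simplified, hypothetical `facts` table instead of `factbook.db`.

```python
import sqlite3

# Assumption: a simplified in-memory facts table stands in for factbook.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, area INTEGER)")

plan_rows = demo.execute("EXPLAIN QUERY PLAN SELECT * FROM facts").fetchall()

# fetchall() returns a list; each element is one plan step, as a tuple.
for step in plan_rows:
    detail = step[-1]  # the human-readable description is the last element
    print(detail)

demo.close()
```

The exact wording of the description depends on the SQLite version: newer releases print `SCAN facts` rather than `SCAN TABLE facts`, so it is safer to check for the leading `SCAN` or `SEARCH` keyword than to compare the whole string.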

### practice
* Return the query plan for the query that returns all columns and rows where area exceeds 40000. Assign the results to query_plan_one.
* Return the query plan for the query that returns only the area column for all rows where area exceeds 40000. Assign the results to query_plan_two.
* Return the query plan for the query that returns the row for the country Czech Republic. Assign the results to query_plan_three.
* Use the print function to display each query plan.

In [17]:
plan_one = '''
    explain query plan select * from facts where area > 40000;
'''

query_plan_one = conn.execute(plan_one).fetchall()

In [16]:
plan_two = '''
    explain query plan select area from facts where area > 40000;
'''

query_plan_two = conn.execute(plan_two).fetchall()

In [20]:
plan_three = '''
    explain query plan select * from facts where name = 'Czech Republic';
'''

query_plan_three = conn.execute(plan_three).fetchall()

In [21]:
print(query_plan_one)
print(query_plan_two)
print(query_plan_three)

[(0, 0, 0, 'SCAN TABLE facts')]
[(0, 0, 0, 'SCAN TABLE facts')]
[(0, 0, 0, 'SCAN TABLE facts')]


## Data representation

**You'll notice that all 3 query plans are exactly the same**. The entire `facts` table had to be accessed to return the data we needed for all 3 queries. Even though all the queries asked for a subset of the `facts` table, SQLite still ends up scanning the entire table. Why is this?

This is because of the way SQLite represents data. <br>

For the `facts` table, we set the `id` column as the primary key and SQLite uses this column to order the records in the database file. Since the rows are ordered by `id`, SQLite can search for a specific row based on its `id` value using binary search. Unless we provide specific `id` values in the `WHERE` clause of the query, SQLite can't take advantage of binary search and has to instead scan the entire table, row by row. To return the results for the first 2 queries, SQLite has to:

* access the first row in the table (lowest id value),
  * check if that row's value for area exceeds 40000 and store the row separately in a temporary collection if it is,
* move onto the next row,
  * check if that row's value for area exceeds 40000 and store the row separately in a temporary collection if it is,
* repeat moving and checking each row for the rest of the table,
* return the final collection of rows that meet the criteria.

Here's a diagram of what that looks like:

![](img/9.png)
![](img/10.png)
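The scan steps listed above can be sketched in plain Python. This is only an illustration of the idea, not how SQLite is implemented; the `full_table_scan` function is a hypothetical helper, and the rows are the five sample rows shown at the start of this mission.

```python
# The five sample rows from the preview of facts (id, name, area only).
rows = [
    {"id": 1, "name": "Afghanistan", "area": 652230},
    {"id": 2, "name": "Albania", "area": 28748},
    {"id": 3, "name": "Algeria", "area": 2381741},
    {"id": 4, "name": "Andorra", "area": 468},
    {"id": 5, "name": "Angola", "area": 1246700},
]

def full_table_scan(table, min_area):
    """Visit every row in id order and keep the ones whose area exceeds min_area."""
    matches = []        # the temporary collection of matching rows
    rows_visited = 0
    for row in table:   # start at the lowest id and move to the next row each time
        rows_visited += 1
        if row["area"] > min_area:
            matches.append(row)
    return matches, rows_visited

matches, rows_visited = full_table_scan(rows, 40000)
print([row["name"] for row in matches], rows_visited)
```

Every row is visited no matter how many of them match, which is why the cost of a scan grows in step with the size of the table.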

If we were instead interested in a row with a specific `id` value, like in the following query:

`SELECT * FROM facts WHERE id=15;`

SQLite can use binary search to quickly find the corresponding row at that `id` value. Instead of performing a full table scan, SQLite would:

* use **binary search** to find the row where the `id` value matches `15` in `O(log N)` time and store this row in a temporary collection,
* return the collection, which holds at most one row.

Because `id` is declared as the `INTEGER PRIMARY KEY` (see the last field of the `pragma table_info(facts)` output above), its values are guaranteed to be unique, so SQLite can stop searching as soon as it finds the matching row. If the rows were ordered by a column that allowed duplicates, SQLite would instead have to keep advancing to the following rows until the value no longer matched. <br>
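The binary search step can be sketched in plain Python. This mirrors the simplified picture used in this mission; SQLite actually stores tables in a B-tree structure, and the `binary_search` helper and the list of 1,000 consecutive ids are assumptions made for illustration.

```python
def binary_search(sorted_ids, target):
    """Return (position, steps) for target in a sorted list, or (None, steps)."""
    low, high = 0, len(sorted_ids) - 1
    steps = 0
    while low <= high:
        steps += 1
        mid = (low + high) // 2
        if sorted_ids[mid] == target:
            return mid, steps
        if sorted_ids[mid] < target:
            low = mid + 1    # target can only be in the upper half
        else:
            high = mid - 1   # target can only be in the lower half
    return None, steps

# Assumption: 1,000 consecutive ids, purely for illustration.
ids = list(range(1, 1001))
position, steps = binary_search(ids, 15)
print(ids[position], steps)
```

Each step halves the remaining range, so even with 1,000 ids the loop needs at most 10 comparisons, compared with up to 1,000 for a row-by-row scan.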

If you need a refresher on algorithmic complexity, head to our mission on [Algorithms](https://www.dataquest.io/mission/93/algorithms/). If you want to dive into binary search, check out our mission on [Binary Search](https://www.dataquest.io/mission/94/binary-search/).

In [22]:
# practice

plan_four = '''
    explain query plan select * from facts where id=20;
'''

query_plan_four = conn.execute(plan_four).fetchall()

print(query_plan_four)

[(0, 0, 0, 'SEARCH TABLE facts USING INTEGER PRIMARY KEY (rowid=?)')]


Instead of using **a full table scan**:

`[(0, 0, 0, 'SCAN TABLE facts')]`

SQLite performed **binary search on the facts table using the integer primary key**:

`[(0, 0, 0, 'SEARCH TABLE facts USING INTEGER PRIMARY KEY (rowid=?)')]`

SQLite uses rowid to refer to the primary key of a table. The alias rowid will be displayed in the query plan, no matter what you name the primary key column for that table. Either SCAN or SEARCH will always appear at the start of the query explanation for SELECT queries.
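A quick way to see the `rowid` alias in action is to name the primary key something other than `id`. The sketch below assumes a hypothetical in-memory `countries` table whose integer primary key is called `country_id`.

```python
import sqlite3

# Assumption: a hypothetical table whose integer primary key is not named "id".
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE countries (country_id INTEGER PRIMARY KEY, name TEXT)")

plan_rows = demo.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM countries WHERE country_id = 20"
).fetchall()
detail = plan_rows[0][-1]
print(detail)

demo.close()
```

The description mentions `rowid=?` rather than `country_id=?`, since an `INTEGER PRIMARY KEY` column is an alias for the table's `rowid`.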

## Indexing

SQLite can take advantage of speedy lookups when searching for a specific primary key. Unfortunately, we don't always have the primary keys for the rows we're interested in beforehand. When we're expressing our intent as a SQL query, we're often thinking in terms of row and column values. We need to find a way that allows us to benefit from the speed of primary key lookups without actually knowing the primary key in advance. <br>

To that end, we could create a separate table that's optimized for lookups by a different column from the facts table instead of by the id. We can make the column we want to query by the primary key, so we get the speed benefits, and embed the id value from the facts table corresponding to that row. We call this table an index and each row in the index contains:

* the value we want to be able to search by, as the **primary key**,
* an `id` value for the corresponding row in `facts`.

Let's walk through a concrete example. If we wrote a `SELECT` query to look up the population of India from the facts table:

`SELECT population FROM facts WHERE name = 'India';`

SQLite would need to **perform a full table scan on facts** to find the specific row where the value for name was India. We can instead create **an index that's ordered by name values (primary key)** and where each row contains the corresponding row's id from the facts table. Here's what that index would look like:

![](img/11.png)

We can write a query that uses the primary key, the country name, of the index table, which we'll call `name_idx`, to look up the row we're interested in and then extract the `id` value for that row in `facts`. Then, we can write a separate query that uses the id value returned from the previous query to look up the specific row in the `facts` table that contains information on India and then return just the population value. <br>

Instead of performing a single full table scan of `facts`, SQLite would perform a binary search on the index then another binary search on facts using the `id` value. Both queries are taking advantage of the primary key for the index and the facts table to quickly return the results we want. Here's a diagram of these concepts:

![](img/12.png)
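The two-step lookup can be tried out by hand. The sketch below assumes an in-memory database holding three of the sample rows shown earlier, plus a hand-built `name_idx` table with a hypothetical `facts_id` column. It looks up Algeria rather than India because India's row isn't among the sample rows.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)")
demo.executemany(
    "INSERT INTO facts (id, name, population) VALUES (?, ?, ?)",
    [(1, "Afghanistan", 32564342), (2, "Albania", 3029278), (3, "Algeria", 39542166)],
)

# Hand-built index: name is the primary key, facts_id points back at facts.
demo.execute("CREATE TABLE name_idx (name TEXT PRIMARY KEY, facts_id INTEGER)")
demo.execute("INSERT INTO name_idx SELECT name, id FROM facts")

# Step 1: use the index's primary key (the name) to find the id in facts.
(facts_id,) = demo.execute(
    "SELECT facts_id FROM name_idx WHERE name = ?", ("Algeria",)
).fetchone()

# Step 2: use that id, the primary key of facts, to fetch the population.
(population,) = demo.execute(
    "SELECT population FROM facts WHERE id = ?", (facts_id,)
).fetchone()

print(facts_id, population)
demo.close()
```

The drawback of this manual approach is that `name_idx` has to be kept in sync with `facts` by hand every time a row changes.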

## Create an Index

Instead of creating a separate table and updating it ourselves, **we can specify a column we want an index table for and SQLite will take care of the rest**. SQLite, like most databases, makes it easy to create indexes on the columns we plan to query often. To create an index we use the [`CREATE INDEX` statement](https://www.sqlite.org/lang_createindex.html). Here's the pseudo-code for that statement:

`CREATE INDEX index_name ON table_name(column_name);`

As you can see from the pseudo-code above, each index we create needs a name (to replace `index_name`). Similar to when you add a table to a database, you can use the `IF NOT EXISTS` clause to stay safe: without it, attempting to create an index that already exists causes SQLite to throw an error. To create an index for the `area` column called `area_idx`, we write the following query:

`CREATE INDEX IF NOT EXISTS area_idx ON facts(area);`



An empty list will be returned when you run the query and call `fetchall()`. The main benefit of having SQLite handle the maintenance of indexes we create is that the indexes are used automatically when we execute a query whenever there will be any speed advantages. As our queries become more complex, letting SQLite decide how and when to use the indexes we create helps us be much more productive. <br>
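Here is a short sketch of how `IF NOT EXISTS` behaves when the index is already there. It assumes an in-memory database with a simplified, hypothetical `facts` table.

```python
import sqlite3

# Assumption: a simplified in-memory facts table stands in for factbook.db.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, area INTEGER)")

# Creating the index returns no rows, so fetchall() gives an empty list.
first = demo.execute("CREATE INDEX IF NOT EXISTS area_idx ON facts(area)").fetchall()

# Running it again is a harmless no-op thanks to IF NOT EXISTS.
second = demo.execute("CREATE INDEX IF NOT EXISTS area_idx ON facts(area)").fetchall()

try:
    # Without the clause, re-creating an existing index raises an error.
    demo.execute("CREATE INDEX area_idx ON facts(area)")
    duplicate_error = None
except sqlite3.OperationalError as exc:
    duplicate_error = str(exc)

print(first, second, duplicate_error)
demo.close()
```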

If we create an index for the `area` column in the `facts` table, **SQLite will use the index whenever we search for rows in `facts` using that column**. This index would be similar to the one we worked with in the previous step: each row would contain only the area value and the corresponding row's id value. The index would be ordered by the area values for quick lookups.

* All three of the following queries would take advantage of the `area_idx` index:

```sql
SELECT * FROM facts WHERE area = 10000;
SELECT * FROM facts WHERE area > 10000;
SELECT * FROM facts WHERE area < 10000;
```

Since the `area_idx` index would be ordered by the `area` values, SQLite would:

* search for the first instance in the index where area equaled 10000 and store the id value in a temporary collection.
* it would then advance to the next row in the index to check if the WHERE condition was still met.
  * if not, then the temporary collection would be returned and the process completes.
  * if so, then SQLite would add that id value to the collection and check the next row.
* when SQLite finds a value for area that doesn't match the WHERE condition,
  * it will look up and return the rows in facts using the id values stored in the temporary collection.
  * each of these lookups will be O(log N) time complexity and while this could add up, it will still be faster than a full table scan.
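The steps above can be sketched in plain Python for the equality case. This is an illustration only: `area_idx` is modelled as a sorted list of `(area, id)` pairs, a dictionary stands in for the lookup by `id` in `facts`, and `lookup_area_equals` is a hypothetical helper. The rows are the five sample rows shown earlier.

```python
from bisect import bisect_left

# Sample rows keyed by id: (name, area). The dict stands in for a lookup by id.
facts = {
    1: ("Afghanistan", 652230),
    2: ("Albania", 28748),
    3: ("Algeria", 2381741),
    4: ("Andorra", 468),
    5: ("Angola", 1246700),
}

# The index: (area, id) pairs kept in area order.
area_idx = sorted((area, row_id) for row_id, (_, area) in facts.items())

def lookup_area_equals(index, table, target_area):
    """Collect ids from the index while area matches, then fetch the rows."""
    # Binary search for the first index entry whose area is >= target_area.
    pos = bisect_left(index, (target_area,))
    matching_ids = []  # the temporary collection of id values
    while pos < len(index) and index[pos][0] == target_area:
        matching_ids.append(index[pos][1])
        pos += 1       # advance to the next index row and re-check the condition
    # Look up each matching id in the table itself.
    return [table[row_id] for row_id in matching_ids]

print(lookup_area_equals(area_idx, facts, 468))
print(lookup_area_equals(area_idx, facts, 10000))
```

For a range condition such as `area > 10000`, the same idea applies: find the first qualifying entry with a binary search, then keep walking the index in order.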

This process allows us to write just one query instead of 2 and have SQLite maintain and interact with the index. A table can have many indexes, and most tables in production environments usually do. Every time you add a row to the table or delete a row from it, all of the indexes will be updated. If you edit the values in a row, SQLite will figure out which indexes are affected by the changes and update those indexes. <br>

While creating indexes gives us tremendous speed benefits, they come at the cost of space. Each index needs to be stored in the database file. In addition, adding, editing, and deleting rows takes longer since each of the affected indexes needs to be updated. Since indexes can be created after a table is created, it's recommended to only create an index when you find yourself querying on a specific column frequently. Throughout the rest of the course, we'll explore how to understand the tradeoffs and you'll develop a better sense of how to create indexes in an optimal way.
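One rough way to see the space cost is to compare the number of database pages before and after creating an index. The sketch below is only an approximation under stated assumptions: it uses an in-memory database, a simplified, hypothetical `facts` table, and 5,000 made-up rows, so the exact page counts will differ from `factbook.db`.

```python
import sqlite3

# Assumption: 5,000 made-up rows in a simplified in-memory facts table.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, population INTEGER)")
demo.executemany(
    "INSERT INTO facts (population) VALUES (?)",
    [(n * 1000,) for n in range(5000)],
)
demo.commit()

# PRAGMA page_count reports how many pages the database currently occupies.
pages_before = demo.execute("PRAGMA page_count").fetchone()[0]

demo.execute("CREATE INDEX pop_idx ON facts(population)")
demo.commit()

pages_after = demo.execute("PRAGMA page_count").fetchone()[0]
print(pages_before, pages_after)
demo.close()
```

The index stores a copy of every `population` value alongside the matching row's id, so the page count goes up once it exists.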

In [23]:
# practice

plan_six = '''
    explain query plan select * from facts where population > 10000;
'''

query_plan_six = conn.execute(plan_six).fetchall()
print(query_plan_six)

[(0, 0, 0, 'SCAN TABLE facts')]


In [24]:
create_index_pop = '''
    create index if not exists pop_idx on facts(population)
'''
conn.execute(create_index_pop)

<sqlite3.Cursor at 0x114130dc0>

In [25]:
plan_seven = plan_six
query_plan_seven = conn.execute(plan_seven).fetchall()

print(query_plan_seven)

[(0, 0, 0, 'SEARCH TABLE facts USING INDEX pop_idx (population>?)')]


Instead of ending in `USING INDEX pop_idx (population)`, the query plan ended in `USING INDEX pop_idx (population>?)`. This is to indicate the granularity of the lookup that SQLite had to do for that index.
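To compare how the description changes with the condition, the sketch below asks for the plan of a range query and of an equality query against the same index. It assumes an empty in-memory `facts` table with a `pop_idx` index and a hypothetical `plan_detail` helper; the planner's choice can differ on other SQLite versions or once table statistics are available.

```python
import sqlite3

# Assumption: an empty, simplified in-memory facts table with the same index name.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE facts (id INTEGER PRIMARY KEY, name TEXT, population INTEGER)")
demo.execute("CREATE INDEX pop_idx ON facts(population)")

def plan_detail(query):
    """Return the description of the first step of the query plan."""
    rows = demo.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return rows[0][-1]

range_plan = plan_detail("SELECT * FROM facts WHERE population > 10000")
equal_plan = plan_detail("SELECT * FROM facts WHERE population = 10000")

print(range_plan)
print(equal_plan)
demo.close()
```

The text in parentheses echoes the comparison SQLite applies to the indexed column, with `?` standing in for the value being compared.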

* In this mission, we explored how SQLite accesses data and how to create and take advantage of indexes. In the next mission, we'll dive deeper into database performance and learn how to create more complex indexes, including multi-column indexes.

In [26]:
conn.close()