## BTree
A BTree is a self-balancing tree data structure often used for database indexes. It improves on the data locality of structures such as a binary search tree (BST) and is better suited to being stored on disk (in a typical implementation, the size of each node is matched to the disk block size). A BTree of order $m$ has the following properties:
- each node has at most $m$ children
- each internal node, other than the root, has at least $\lceil \frac{m}{2} \rceil$ children
- each non-leaf node with $k$ children has $k-1$ elements
- the root node has at least 2 children, unless it is the sole node
- all leaves are at the same level, making it *balanced*
- within each node, elements are ordered
- for any element in a node, the child subtree to its left contains only elements less than it, and the child subtree to its right contains only elements greater than it

A sample BTree as elements get added to it:  
![BTree Basics](./images/btree_basics.png)

### Operations
**Searching:**
This ordering makes searching straightforward. Starting at the root:
- check whether the current node contains the element we are looking for
- if it does not, use the ordering of the node's elements to pick the one child that could contain it
- repeat the check in that child, stopping when the element is found or a leaf has been searched without a match
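
The search steps above can be sketched in Python as follows. This is a minimal illustration rather than a production implementation: `Node` and `search` are hypothetical names introduced here, and the tree lives in memory instead of in disk blocks.

```python
from bisect import bisect_left
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list                                     # sorted elements in this node
    children: list = field(default_factory=list)   # empty for a leaf, else len(keys) + 1 subtrees

def search(node, target):
    """Return True if target is stored in the subtree rooted at node."""
    while True:
        i = bisect_left(node.keys, target)         # first position with keys[i] >= target
        if i < len(node.keys) and node.keys[i] == target:
            return True                            # found in the current node
        if not node.children:
            return False                           # leaf searched without a match
        node = node.children[i]                    # the only child that could hold target

# Hypothetical three-leaf tree of order 3.
root = Node([10, 20], [Node([3, 7]), Node([12, 15]), Node([25, 30])])
print(search(root, 15), search(root, 8))           # True False
```

Only one node per level is visited, so the number of node reads is bounded by the height of the tree.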

**Inserting:**
Elements in a BTree are inserted in leaf nodes. For any new element, find the correct leaf node and add the element in ascending order. If the leaf node is full:  
- the node needs to be split in two. Pick the median element and move it up to the parent node. The remaining elements are divided between two separate nodes: those smaller than the median go to the left node and those larger go to the right node  
  ![BTree Insert 1](./images/btree_add_1.png)
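- What if the parent node is full? In that case the parent node gets split in the same way. Splits can propagate up to the root; when the root itself splits, a new root is created and the tree grows one level taller:  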
  ![BTree Insert 2](./images/btree_add_2.png)
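
A minimal Python sketch of insertion with splitting is shown below, assuming distinct keys and an in-memory tree. `Node`, `insert` and `_insert` are hypothetical names used only for illustration. In this sketch a node is allowed to overflow by one element and is then split around its median, which is one common way to realize the rule described above.

```python
from bisect import bisect_left, insort
from dataclasses import dataclass, field

@dataclass
class Node:
    keys: list = field(default_factory=list)
    children: list = field(default_factory=list)   # empty for a leaf

def _insert(node, key, max_keys):
    """Insert key below node; return (median, right_node) if node had to split."""
    if not node.children:
        insort(node.keys, key)                     # leaf: add in ascending order
    else:
        i = bisect_left(node.keys, key)            # child whose range covers key
        promoted = _insert(node.children[i], key, max_keys)
        if promoted:                               # child split: take its median
            median, right = promoted
            node.keys.insert(i, median)
            node.children.insert(i + 1, right)
    if len(node.keys) <= max_keys:
        return None
    mid = len(node.keys) // 2                      # overflow: split around the median
    median = node.keys[mid]
    right = Node(node.keys[mid + 1:], node.children[mid + 1:])
    node.keys, node.children = node.keys[:mid], node.children[:mid + 1]
    return median, right

def insert(root, key, order=3):
    """Insert key and return the (possibly new) root."""
    promoted = _insert(root, key, order - 1)       # a node holds at most order - 1 elements
    if promoted:                                   # root split: tree grows one level
        median, right = promoted
        return Node([median], [root, right])
    return root

root = Node()
for k in [10, 20, 30, 40, 50, 60, 70]:
    root = insert(root, k)
print(root.keys, [c.keys for c in root.children])  # [40] [[20], [60]]
```

Inserting 70 here overflows a leaf, whose median then overflows the root, so the root is split in turn and `40` ends up alone in a new root.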

### Application
In a database, each element of a node is a key/value pair:  
![BTree Key Value Pair](./images/btree_kv.png)

Databases like MySQL use a variant of the BTree called a B+Tree, which:
- stores only keys (and child pointers) in non-leaf nodes
- stores keys and values in the leaf nodes
- connects the nodes at each level to form a linked list

![B+Tree](./images/bplustree.png)

MySQL (with its default InnoDB storage engine) stores data using a **clustered index**, which keeps table rows directly in a B+Tree. A new B+Tree is created for every table and all of its rows are stored in it. The primary key forms the key and the rest of the columns form the value. When an index is created on some other column, another B+Tree is created. In this B+Tree, the key is the newly indexed column and the value is the primary key. For search queries involving the newly indexed column, the B+Tree associated with the index is searched first to get the list of matching primary keys, and then the table B+Tree is searched for each of those keys.
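
The two-step lookup can be illustrated with a small Python sketch. Plain dictionaries stand in for the two B+Trees purely to show the flow of a query; the sample rows and the names `table_tree`, `name_index` and `find_by_name` are hypothetical.

```python
# Table B+Tree (clustered index): primary key -> remaining columns.
table_tree = {
    1: ("Alice", 1990),
    2: ("Bob", 1985),
    3: ("Alice", 1988),
}

# Secondary index B+Tree on name: indexed column -> matching primary keys.
name_index = {
    "Alice": [1, 3],
    "Bob": [2],
}

def find_by_name(name):
    primary_keys = name_index.get(name, [])                 # step 1: search the index tree
    return [(pk, *table_tree[pk]) for pk in primary_keys]   # step 2: search the table tree

print(find_by_name("Alice"))   # [(1, 'Alice', 1990), (3, 'Alice', 1988)]
```

Each primary key found in step 1 costs one more tree search in step 2, which is why a lookup through a secondary index does more work than a lookup by primary key.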

Postgres, on the other hand, stores data in **heap files** and does not keep the rows themselves in a B+Tree. A heap is an unordered collection of rows stored on disk pages: basically, a flat file with no inherent order. An index is a separate B+Tree that maps keys to the locations of rows in the heap.
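
As a rough sketch of the heap layout, the Python snippet below models the heap as a list of pages and the index as a mapping from key to a (page, slot) location. The data and the names `heap`, `id_index` and `find_by_id` are hypothetical, and a dictionary stands in for the B+Tree.

```python
# Heap file: pages hold rows in arrival order, not in key order.
heap = [
    [(2, "Bob", 1985), (1, "Alice", 1990)],   # page 0
    [(3, "Alice", 1988)],                     # page 1
]

# Index on id: key -> (page number, slot within the page).
id_index = {1: (0, 1), 2: (0, 0), 3: (1, 0)}

def find_by_id(row_id):
    location = id_index.get(row_id)           # search the index for the row location
    if location is None:
        return None
    page, slot = location
    return heap[page][slot]                   # fetch the row directly from the heap

print(find_by_id(3))   # (3, 'Alice', 1988)
```

In this layout every index, including the one on the primary key, points into the heap in the same way.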

[More details](https://planetscale.com/blog/btrees-and-database-indexes)

## Database Index
One way to search a database is to load one row after another and check each against the condition, known as a *full table scan*. However, this is practical only for small tables. Another way is to use an index.

A database index speeds up query execution by storing data in a specialized data structure such as a BTree or B+Tree (other data structures are used as well). Whenever rows are added, updated or removed, the corresponding indexes have to be updated as well. This means:
- more indexes mean slower writes
- more indexes consume more space

An index is not guaranteed to speed up queries. A query that filters only on non-indexed columns sees no benefit.

**Fast Lookup:** As an example, consider the query below, where the `birth_year` column is indexed:
```sql
WHERE birth_year = 2002;
```
![Fast Lookup](./images/query1.png)

Indexes also help with range queries:
```sql
WHERE birth_year > 2005 ORDER BY birth_year LIMIT 3;
```
![Scan in one direction](./images/query2.png)
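
Both lookups can be sketched in Python with a sorted list standing in for the leaf level of the index. The entries are hypothetical `(birth_year, row id)` pairs, the function names are made up for this example, and years are assumed to be integers.

```python
from bisect import bisect_left

# Hypothetical leaf level of an index on birth_year: sorted (birth_year, row id) pairs.
index = [(1998, 7), (2002, 2), (2002, 9), (2005, 4),
         (2006, 1), (2008, 5), (2010, 3), (2011, 8)]

def equal_to(year):
    """Row ids for WHERE birth_year = year."""
    lo = bisect_left(index, (year,))         # first entry with this year
    hi = bisect_left(index, (year + 1,))     # first entry with a later year
    return [row_id for _, row_id in index[lo:hi]]

def greater_than(year, limit):
    """Row ids for WHERE birth_year > year ORDER BY birth_year LIMIT limit."""
    start = bisect_left(index, (year + 1,))  # skip every entry with birth_year <= year
    return [row_id for _, row_id in index[start:start + limit]]

print(equal_to(2002))          # [2, 9]
print(greater_than(2005, 3))   # [1, 5, 3]
```

Because the entries are already sorted, the range query needs one binary search to find its starting point and then reads forward, with no separate sort step for the `ORDER BY`.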

**Multi-column index:** An index can be created from multiple columns. In this case the key is a tuple of the indexed columns in the specified order. Consider the index:
```sql
CREATE INDEX idx_name_birth_year ON members(name, birth_year);
```
on table:
```
id  name   birth_year
1   Alice  1990
2   Bob    1985
3   Alice  1988
4   Carol  1990
5   Bob    1992
```

A query like the one below can be answered with a single lookup on the tuple key:
```sql
WHERE name = 'Bob' AND birth_year = 1992;
```
![Multi Column Index](./images/query3.png)

If, however, the query filtered only on a column that does not lead the index, like:
```sql
WHERE birth_year = 1990;
```
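the multi-column index created above could not be used for a fast lookup: its entries are sorted by `name` first, so rows with the same `birth_year` are scattered across the index. A multi-column index starting with `birth_year` would work, though. Therefore the order of columns matters.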
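
To make the ordering argument concrete, here is a small Python sketch that models the `(name, birth_year)` index as a sorted list of tuples built from the members table above, with the row id appended to each entry. The function names are hypothetical.

```python
from bisect import bisect_left

# Index entries sorted by (name, birth_year); the last field is the row id.
index = sorted([
    ("Alice", 1990, 1), ("Bob", 1985, 2), ("Alice", 1988, 3),
    ("Carol", 1990, 4), ("Bob", 1992, 5),
])

def by_name_and_year(name, year):
    """Binary search on the full tuple key, then read the adjacent matches."""
    i = bisect_left(index, (name, year))
    ids = []
    while i < len(index) and index[i][:2] == (name, year):
        ids.append(index[i][2])
        i += 1
    return ids

def by_year_only(year):
    """No usable sort order on birth_year alone: every entry must be checked."""
    return [row_id for _, y, row_id in index if y == year]

print(by_name_and_year("Bob", 1992))   # [5]
print(by_year_only(1990))              # [1, 4]
print([i for i, e in enumerate(index) if e[1] == 1990])   # [1, 4]
```

The last line prints the positions of the `birth_year = 1990` entries within the index. They are not adjacent, so a binary search cannot isolate them.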

Consider another example where index was created on three columns:
```sql
CREATE INDEX idx_name_birth_year_and_place ON members(name, birth_year, birth_place);
```
If the query skipped `birth_year`, the index created above would still help, though not to the same extent:
```sql
WHERE name = 'Bob' AND birth_place = 'Detroit';
```
All the index entries matching `name = 'Bob'` would be located first (fast lookup), and then each of them would be tested against `birth_place = 'Detroit'`. Note that if we have an index on `(name, birth_year, birth_place)`, then an index on `(name, birth_year)` would be redundant, because it is a prefix of the larger index.
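
A minimal Python sketch of this "seek on the prefix, then filter" behaviour follows. The index is again modelled as a sorted list of tuples. The `birth_place` values, the extra row and the function name are invented for illustration.

```python
from bisect import bisect_left

# Hypothetical entries of the (name, birth_year, birth_place) index, row id last.
index = sorted([
    ("Alice", 1990, "Austin", 1), ("Bob", 1985, "Detroit", 2),
    ("Alice", 1988, "Detroit", 3), ("Carol", 1990, "Boston", 4),
    ("Bob", 1992, "Chicago", 5), ("Bob", 1979, "Detroit", 6),
])

def by_name_and_place(name, place):
    """Return (matching row ids, number of index entries examined)."""
    i = bisect_left(index, (name,))          # fast lookup: jump to the first entry for name
    ids, examined = [], 0
    while i < len(index) and index[i][0] == name:
        examined += 1
        if index[i][2] == place:             # birth_place is tested entry by entry
            ids.append(index[i][3])
        i += 1
    return ids, examined

print(by_name_and_place("Bob", "Detroit"))   # ([6, 2], 3)
```

All three `Bob` entries are examined to find the two matches, whereas a query that also fixed `birth_year` could seek straight to the matching entries.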

In most cases, the column we plan to use for range queries should come last in a multi-column index (after the equality-filtered columns). Consider the query:
```sql
WHERE country = 'India' AND employed = 'Y' AND age > 28;
```
If the index were created on `(country, age, employed)`, the search would look like:  
![Ranged Query 1](./images/query5.png)  
However, if the index were created on `(country, employed, age)`, the search would look like:  
![Ranged Query 2](./images/query4.png)
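
The difference between the two column orders can be approximated with a short Python sketch that counts how many index entries each one has to examine. The rows and all names are hypothetical, and ages are assumed to be integers so that `age > 28` can be expressed as a seek to 29.

```python
from bisect import bisect_left

# Hypothetical rows as (country, employed, age).
rows = [
    ("India", "Y", 25), ("India", "N", 31), ("India", "Y", 35),
    ("India", "N", 40), ("India", "Y", 29), ("Japan", "Y", 33),
]
index_cae = sorted((c, a, e) for c, e, a in rows)   # (country, age, employed)
index_cea = sorted(rows)                            # (country, employed, age)

def scan_country_age_employed():
    """Return (matches, entries examined) with the range column in the middle."""
    i = bisect_left(index_cae, ("India", 29))       # first India entry with age > 28
    matches = examined = 0
    while i < len(index_cae) and index_cae[i][0] == "India":
        examined += 1
        if index_cae[i][2] == "Y":                  # employed is checked entry by entry
            matches += 1
        i += 1
    return matches, examined

def scan_country_employed_age():
    """Return (matches, entries examined) with the range column last."""
    i = bisect_left(index_cea, ("India", "Y", 29))  # first India/Y entry with age > 28
    matches = examined = 0
    while i < len(index_cea) and index_cea[i][:2] == ("India", "Y"):
        examined += 1
        matches += 1                                # every entry in this range matches
        i += 1
    return matches, examined

print(scan_country_age_employed())   # (2, 4)
print(scan_country_employed_age())   # (2, 2)
```

With the range column last, the matching entries form one contiguous run, so nothing is read and then discarded. With the range column in the middle, entries with `employed = 'N'` are mixed into the scanned range.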