## Partitioning
Partitioning refers to dividing data into multiple pieces that can be stored, accessed, and managed separately. It offers the following benefits:
- **Scalability:** as the database grows, it may no longer fit on a single node. We can partition the data and store the partitions on separate nodes.
- **Performance:** spreading data across multiple nodes reduces read contention, because requests to the database are now handled by multiple machines.
- **Availability:** even if one node goes down, the other nodes can still serve the data they hold. Partitioning is often combined with data replication.

Partitioning does not require partitions to be stored on different nodes. In fact, creating partitions with the `PARTITION BY` clause creates partitions that reside on the same node. When partitions are spread across multiple nodes, this is called **sharding**.

Even when partitioning involves only one node, it can be beneficial:
- improved querying performance by skipping irrelevant partitions
- a query can run in parallel across multiple partitions

### Horizontal vs Vertical Partitioning

## Partition Method
When we partition data, one of the main considerations is to ensure that data is distributed evenly among the partitions. One way to do this is to assign data to partitions at random. But this makes reads slow: we don't know which partition holds the data we are looking for, so we have to search all of them.

### Partition by Range
The table is partitioned into “ranges” defined by a key, with no overlap between the ranges of values assigned to different partitions. It is up to the database administrator to define the ranges so that the partitions fill evenly.

In Postgres, to define a range-partitioned table:

In [None]:
%%sql
-- # Table below is the parent table and it doesn't store the data itself
-- # In Postgres, the primary key must include the partition column (login_date)
CREATE TABLE employee_logins (
    id SERIAL,
    employee_id INT NOT NULL,
    login_date DATE NOT NULL,
    device TEXT,
    PRIMARY KEY (id, login_date)
) PARTITION BY RANGE (login_date);

-- # Define partitions (FROM is inclusive, TO is exclusive):
-- # 2024
CREATE TABLE employee_logins_2024
    PARTITION OF employee_logins
    FOR VALUES FROM ('2024-01-01') TO ('2025-01-01');
-- # 2025
CREATE TABLE employee_logins_2025
    PARTITION OF employee_logins
    FOR VALUES FROM ('2025-01-01') TO ('2026-01-01');

-- # Any other date
CREATE TABLE employee_logins_default
    PARTITION OF employee_logins DEFAULT;

-- # Select query using the parent table
SELECT * FROM employee_logins WHERE login_date = '2024-05-10';

-- # Insert using the parent table
INSERT INTO employee_logins (employee_id, login_date, device)
VALUES (101, '2024-01-15', 'Laptop');

MySQL does things a little differently:

In [None]:
%%sql
CREATE TABLE employee_logins (
    id INT NOT NULL AUTO_INCREMENT,
    employee_id INT NOT NULL,
    login_date DATE NOT NULL,
    device VARCHAR(100),
    PRIMARY KEY (id, login_date)  -- # must include partition column (login_date)
)

PARTITION BY RANGE (YEAR(login_date)) (
    PARTITION employee_logins_2024 VALUES LESS THAN (2025),
    PARTITION employee_logins_2025 VALUES LESS THAN (2026),
    PARTITION employee_logins_default VALUES LESS THAN MAXVALUE
);

Range queries on the partition column work well with this kind of partitioning, because the database can easily work out which partitions hold the requested data.
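
To make the pruning step concrete, here is a minimal Python sketch of how a range-partitioned layout can map a date, or a date range, to the partitions worth scanning. It is not how Postgres or MySQL implement pruning internally; `UPPER_BOUNDS`, `partition_for` and `partitions_for_range` are hypothetical names, and the bounds follow the MySQL-style `VALUES LESS THAN` layout above (dates before 2024 also land in the first partition).

```python
from bisect import bisect_right
from datetime import date

# Hypothetical layout mirroring the yearly partitions above.
# Each entry is the exclusive upper bound of a partition, in ascending order;
# the last partition has no upper bound (MAXVALUE).
UPPER_BOUNDS = [date(2025, 1, 1), date(2026, 1, 1)]
PARTITION_NAMES = [
    "employee_logins_2024",
    "employee_logins_2025",
    "employee_logins_default",
]


def partition_for(day):
    """Return the name of the partition that holds `day`."""
    return PARTITION_NAMES[bisect_right(UPPER_BOUNDS, day)]


def partitions_for_range(start, end):
    """Return only the partitions that can hold start <= login_date <= end."""
    first = bisect_right(UPPER_BOUNDS, start)
    last = bisect_right(UPPER_BOUNDS, end)
    return PARTITION_NAMES[first:last + 1]


print(partition_for(date(2024, 5, 10)))
print(partitions_for_range(date(2024, 11, 1), date(2025, 2, 1)))
```

A query confined to a few months of 2024 only needs the first partition, while one spanning the new year needs two; the remaining partitions are skipped entirely.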

### Partition by Hash
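To prevent skew and hotspots, we can instead partition by hash: the key is hashed to determine the target partition. Postgres, for example, takes the hash of the key modulo the number of partitions, while MySQL's `HASH` partitioning applies the modulo directly to the integer value of the partitioning expression: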

In [None]:
%%sql
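-- # In Postgres, the primary key must include the partition column (employee_id)
CREATE TABLE employee_logins (
    id SERIAL,
    employee_id INT NOT NULL,
    login_date DATE NOT NULL,
    device TEXT,
    PRIMARY KEY (id, employee_id)
) PARTITION BY HASH (employee_id);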

-- # Create 4 hash partitions
CREATE TABLE employee_logins_p0 PARTITION OF employee_logins
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE employee_logins_p1 PARTITION OF employee_logins
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);

CREATE TABLE employee_logins_p2 PARTITION OF employee_logins
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);

CREATE TABLE employee_logins_p3 PARTITION OF employee_logins
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

MySQL:

In [None]:
%%sql
CREATE TABLE employee_logins (
    id INT NOT NULL AUTO_INCREMENT,
    employee_id INT NOT NULL,
    login_date DATE NOT NULL,
    device VARCHAR(100),
    PRIMARY KEY (id, employee_id)   -- # must include partition column (employee_id)
)

PARTITION BY HASH(employee_id)
PARTITIONS 4;

Distributed databases, however, generally avoid `hash % N` because adding or removing nodes changes `N` and forces most keys to move to a different node. Instead, they divide the hash space into contiguous ranges, which can be moved between nodes dynamically for rebalancing. *Consistent Hashing* is one such approach, typically used by *key-value* stores.
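
The rebalancing cost can be seen in a small experiment. The sketch below compares `hash % N` with a toy consistent-hash ring when a cluster grows from 4 to 5 nodes. Everything in it is an assumption made for illustration: the MD5-based `stable_hash`, the `HashRing` class, the node names and the choice of 100 virtual nodes per node; real systems use their own hash functions and ring designs.

```python
import hashlib
from bisect import bisect_right


def stable_hash(value):
    """Deterministic 64-bit hash (Python's built-in hash() of str is salted per process)."""
    return int.from_bytes(hashlib.md5(value.encode()).digest()[:8], "big")


def modulo_node(key, node_count):
    """Classic hash % N placement."""
    return stable_hash(key) % node_count


class HashRing:
    """Toy consistent-hash ring with virtual nodes; not a production design."""

    def __init__(self, nodes, vnodes=100):
        # Each node owns `vnodes` points scattered around the ring.
        self._ring = sorted(
            (stable_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First ring point clockwise from the key's hash, wrapping at the end.
        idx = bisect_right(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"employee-{i}" for i in range(10_000)]

# Growing from 4 to 5 nodes with hash % N.
moved_modulo = sum(modulo_node(k, 4) != modulo_node(k, 5) for k in keys)

# The same growth on the ring.
old_ring = HashRing(["n0", "n1", "n2", "n3"])
new_ring = HashRing(["n0", "n1", "n2", "n3", "n4"])
moved_ring = sum(old_ring.node_for(k) != new_ring.node_for(k) for k in keys)

print(f"hash % N moved {moved_modulo / len(keys):.0%} of keys")
print(f"hash ring moved {moved_ring / len(keys):.0%} of keys")
```

With `hash % N`, a key stays put only when its hash has the same remainder modulo 4 and modulo 5, so roughly four out of five keys move. On the ring, only the keys that now fall to the new node move, which is roughly one in five.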

Range queries (`BETWEEN`, `>`, `<`) perform poorly with hash partitioning, since keys that were once adjacent are now scattered across all the partitions.
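
A small Python sketch of this effect, assuming 4 partitions and an MD5-based hash (an illustrative stand-in, not the hash function Postgres or MySQL actually use). The `range_partition` layout with a width of 1000 ids per partition is likewise hypothetical.

```python
import hashlib

PARTITIONS = 4


def hash_partition(employee_id):
    """Partition chosen by a stable hash of the key, modulo the partition count."""
    digest = hashlib.md5(str(employee_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % PARTITIONS


def range_partition(employee_id, width=1000):
    """Hypothetical range layout: ids 0-999 in partition 0, 1000-1999 in 1, ..."""
    return employee_id // width


wanted = range(100, 200)  # WHERE employee_id BETWEEN 100 AND 199
hash_touched = {hash_partition(i) for i in wanted}
range_touched = {range_partition(i) for i in wanted}

print("hash partitions touched: ", sorted(hash_touched))
print("range partitions touched:", sorted(range_touched))
```

Under the range layout the whole interval sits in a single partition, whereas the hashed ids are spread out, so the same query has to visit several partitions.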

## Partitioning and Indexes
How are indexes defined in conjunction with partitions? There are two approaches:
- **Local Index:** each partition maintains its own index, covering only the rows present in that partition. Consider the table below:
  ![Local Index](./images/local_index.png)  
  If the age column is indexed, it is not guaranteed that people having same age would be present in the same partition. So if the query is to search by age, then the database will have to look inside all partitions.
- **Term-Partitioned Global Index:** we create a global index that covers all rows and partition it by the indexed term, independently of how the rows themselves are partitioned by primary key. Reads are fast because only the index partition that owns the term has to be consulted. Writes are slower because if a row is added to partition A, its index entry may live in partition B.
  ![Global Index](./images/global_index.png)  
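
The difference between the two approaches can be sketched in Python with plain dictionaries. The rows, the `id % 2` data partitioning and the age split at 30 used by `index_partition` are all made up for illustration and are not taken from the figures above.

```python
from collections import defaultdict

# Hypothetical rows: (id, name, age). Data is partitioned by id % 2.
rows = [(1, "Ann", 30), (2, "Bob", 30), (3, "Cy", 25), (4, "Dee", 41)]
PARTITIONS = 2

partitions = defaultdict(dict)
for row_id, name, age in rows:
    partitions[row_id % PARTITIONS][row_id] = (name, age)

# Local index: each data partition indexes only its own rows by age.
local_index = {p: defaultdict(list) for p in range(PARTITIONS)}
for p, data in partitions.items():
    for row_id, (_, age) in data.items():
        local_index[p][age].append(row_id)


def index_partition(age):
    """Term partitioning of the global index: ages below 30 in 0, the rest in 1."""
    return 0 if age < 30 else 1


# Global index: covers all rows, split by the indexed term rather than by id.
global_index = {0: defaultdict(list), 1: defaultdict(list)}
for row_id, _, age in rows:
    global_index[index_partition(age)][age].append(row_id)


def find_by_age_local(age):
    """Scatter/gather: every partition's local index has to be probed."""
    hits, probed = [], 0
    for p in range(PARTITIONS):
        probed += 1
        hits.extend(local_index[p].get(age, []))
    return sorted(hits), probed


def find_by_age_global(age):
    """Only the index partition that owns this age is probed."""
    return sorted(global_index[index_partition(age)].get(age, [])), 1


print(find_by_age_local(30))
print(find_by_age_global(30))
```

Both lookups return the same row ids, but the local variant probes every partition while the global one probes a single index partition. The write cost shows up in the same data: row 3 is stored in data partition 1, yet its index entry lives in index partition 0, so inserting it touches two partitions.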

## Rebalancing Partitions