# Data Fragmentation (Partitioning)

- division of relation $r$ into fragments $r_1, r_2, ..., r_n$ that contain sufficient information to reconstruct relation $r$



- **horizontal fragmentation**: each tuple of $r$ is assigned to one or more fragments stored at different servers
    - advantages: allows parallel processing on fragments of a relation, which leads to scalability; allows a relation to be split so that tuples are located where they are most frequently accessed
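As a minimal sketch of horizontal fragmentation, the snippet below splits a relation by the value of one attribute and rebuilds it as the union of the fragments. The `account` rows, the attribute names, and the helper functions are hypothetical and only serve as an illustration.

```python
# Hypothetical account relation; the attribute names and rows are illustrative only.
account = [
    {"account_number": "A-101", "branch_name": "Hillside", "balance": 500},
    {"account_number": "A-215", "branch_name": "Valleyview", "balance": 700},
    {"account_number": "A-305", "branch_name": "Hillside", "balance": 350},
]

def horizontal_fragments(relation, key):
    """Group tuples into fragments by the value of one attribute."""
    fragments = {}
    for row in relation:
        fragments.setdefault(row[key], []).append(row)
    return fragments

def reconstruct(fragments):
    """Rebuild the relation as the union of its fragments."""
    return [row for fragment in fragments.values() for row in fragment]

fragments = horizontal_fragments(account, "branch_name")
rebuilt = reconstruct(fragments)
```

Each fragment could be stored on a different server; here every tuple lands in exactly one fragment, so the union contains no duplicates.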

<img src="img/Snip20191118_42.png" width=80%/>




- **vertical fragmentation**: the schema for relation $r$ is split into several smaller schemas
    - all schemas must contain a common candidate key (or superkey) to ensure the lossless join property
    - a special attribute, the tuple-id attribute, may be added to each schema to serve as a candidate key
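The following sketch illustrates vertical fragmentation with an added tuple-id, assuming a hypothetical relation with attributes `(branch_name, customer_name, balance)`. The rows and variable names are made up for illustration.

```python
# Hypothetical relation r(branch_name, customer_name, balance); rows are illustrative only.
r = [
    ("Hillside", "Lowman", 500),
    ("Hillside", "Camp", 350),
    ("Valleyview", "Camp", 700),
]

# Add a tuple-id so that every vertical fragment shares a candidate key.
with_tid = [(tid,) + row for tid, row in enumerate(r, start=1)]

# Fragment 1 keeps (tuple_id, branch_name, customer_name).
frag1 = {tid: (branch, customer) for tid, branch, customer, _ in with_tid}
# Fragment 2 keeps (tuple_id, balance).
frag2 = {tid: balance for tid, _, _, balance in with_tid}

# Reconstruct r by joining the fragments on tuple_id, then dropping tuple_id.
rebuilt = [frag1[tid] + (frag2[tid],) for tid in sorted(frag1)]
```

Because both fragments carry the tuple-id, the join pairs each half with exactly one counterpart, which is what makes the decomposition lossless.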
    
<img src="img/Snip20191118_43.png" width=80%/>


- vertical and horizontal fragmentation can be mixed


# Practical Considerations

- suppose that a relation has $k$ attributes and $n$ tuples
    - horizontal fragmentation makes it possible to partition the table across up to $n$ fragments
    - vertical fragmentation makes it possible to partition the table across up to $k$ fragments
    - $n \gg k$ in practice, thus horizontal partitioning enables greater scalability


- \[**Hash Partitioning**\] the simplest strategy for horizontal fragmentation is to determine the fragment by applying a hash function to a **partitioning key**, usually the primary key or a prefix of the primary key
    - tends to spread the storage load uniformly across fragments or partitions
    - does not always spread the computation load well since some tuples may be accessed much more frequently than others
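A small sketch of hash partitioning, under assumed parameters: 4 fragments, integer primary keys, and a SHA-256 based hash chosen only because it is stable across runs. The function name `partition_of` and the workload are hypothetical.

```python
import hashlib
from collections import Counter

def partition_of(key, num_partitions):
    """Map a partitioning key to a fragment number with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Storage load: 10,000 hypothetical primary keys over 4 fragments.
storage = Counter(partition_of(k, 4) for k in range(10_000))

# Computation load: a hypothetical workload in which one key is very hot.
accesses = [42] * 9_000 + list(range(1_000))
compute = Counter(partition_of(k, 4) for k in accesses)
```

The `storage` counts should come out close to each other, while `compute` is dominated by the fragment that holds key 42. This is the storage versus computation imbalance described above.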

- the hash function can be designed so that when the $n$th server is added to the cluster, only roughly $1/n$ of the fragments need to be re-hashed (moved); this can be accomplished using **consistent hashing**
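One common way to realize consistent hashing is a hash ring with virtual nodes. The sketch below is an illustration under assumptions, not a reference implementation: the class `HashRing`, the server names, and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib

def stable_hash(value):
    """64-bit hash that is stable across runs."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

class HashRing:
    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.points = []     # sorted (position, server) pairs on the ring
        self.positions = []  # positions only, for binary search
        for server in servers:
            self.add(server)

    def add(self, server):
        """Place `vnodes` points for the server on the ring."""
        self.points.extend(
            (stable_hash(f"{server}#{i}"), server) for i in range(self.vnodes)
        )
        self.points.sort()
        self.positions = [pos for pos, _ in self.points]

    def lookup(self, key):
        """A key belongs to the first ring point clockwise from its hash."""
        idx = bisect.bisect_right(self.positions, stable_hash(key))
        return self.points[idx % len(self.points)][1]

keys = [f"key-{i}" for i in range(10_000)]
ring = HashRing(["s1", "s2", "s3"])
before = {k: ring.lookup(k) for k in keys}
ring.add("s4")  # the 4th server joins the cluster
after = {k: ring.lookup(k) for k in keys}
fraction_moved = sum(before[k] != after[k] for k in keys) / len(keys)
```

With these assumptions, `fraction_moved` should land near $1/4$, and every key that moves is reassigned to the new server `s4`. A plain `hash(key) % n` scheme would instead reassign most keys when $n$ changes.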
    




# Data Modeling for Horizontal Fragmentation

- Goals
    - spread data evenly (accomplished through hash-partitioning)
    - minimize the number of partitions/fragments read (accomplished by choosing an appropriate partitioning key)
- Non-goals
    - minimize number of writes (ordinarily accomplished by avoiding data duplication)
    - minimize data duplication (ordinarily accomplished through normalization)


# Choosing the Partitioning Key

- a *small* query can be answered more quickly if it refers to only one fragment of the database
    - e.g., for `SELECT account_name FROM account WHERE branch_name='Hillside'`, the partitioning key that best serves the query is `(branch_name)`: all accounts of a branch are stored in the same partition, so the query reads only one partition
    - `(branch_name)` as partitioning key also works well for `SELECT branch_name, count(*) FROM account GROUP BY branch_name`
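To make the effect of the partitioning key concrete, the sketch below counts how many partitions hold tuples matching `branch_name='Hillside'` under two candidate keys. The data, the 8 partitions, the `account_number` attribute, and the helper names are assumptions for illustration.

```python
import hashlib

def partition_of(key, num_partitions=8):
    """Map a partitioning key to a partition number with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Hypothetical account tuples: (account_number, branch_name).
accounts = [(f"A-{i}", "Hillside" if i % 2 == 0 else "Valleyview") for i in range(200)]

def partitions_touched(rows, key_index, branch):
    """Partitions holding at least one matching tuple under a given partitioning key."""
    return {partition_of(row[key_index]) for row in rows if row[1] == branch}

by_branch = partitions_touched(accounts, 1, "Hillside")   # partitioning key (branch_name)
by_account = partitions_touched(accounts, 0, "Hillside")  # partitioning key (account_number)
```

With `(branch_name)` as the key, all matching tuples sit in one partition. With `(account_number)` they are scattered, and since the query does not name any account number, a real system would likely have to consult every partition.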


# Adding Redundancy: Example 1

- $instructor(\underline{\textrm{ID}}, name, dept\_name, salary)$
- $department(\underline{dept\_name}, building, budget)$


- consider `SELECT building FROM instructor NATURAL JOIN department WHERE ID = 123`

- we want to avoid accessing multiple fragments/partitions
- $instructor$ must be partitioned on $\textrm{ID}$ to allow efficient lookup of $dept\_name$
- $department$ must be partitioned on $dept\_name$ to allow efficient lookup of building
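The two-step lookup implied by these partitioning choices can be sketched with a toy partitioned store. The class `PartitionedTable`, the number of partitions, and the sample rows are hypothetical.

```python
import hashlib

NUM_PARTITIONS = 4

def partition_of(key):
    """Map a partitioning key to a partition number with a stable hash."""
    digest = hashlib.sha256(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

class PartitionedTable:
    """A toy table: one dict per partition, keyed by the partitioning key."""
    def __init__(self):
        self.partitions = [{} for _ in range(NUM_PARTITIONS)]
        self.reads = 0  # number of partition accesses

    def put(self, key, row):
        self.partitions[partition_of(key)][key] = row

    def get(self, key):
        self.reads += 1
        return self.partitions[partition_of(key)].get(key)

instructor = PartitionedTable()  # partitioned on ID
department = PartitionedTable()  # partitioned on dept_name
instructor.put(123, {"name": "Kim", "dept_name": "Physics", "salary": 90000})
department.put("Physics", {"building": "Watson", "budget": 70000})

# SELECT building FROM instructor NATURAL JOIN department WHERE ID = 123
dept_name = instructor.get(123)["dept_name"]
building = department.get(dept_name)["building"]
total_reads = instructor.reads + department.reads
```

Even in the best case the join costs two partition accesses, possibly on two different servers. The redundancy options that follow try to bring this down to one.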

## Option - Denormalized Schema with Duplicated Attribute


- $instructor(\underline{\textrm{ID}}, name, dept\_name, building, salary)$
- $department(\underline{dept\_name}, building, budget)$


- violates both 3NF and BCNF (because of the dependency $dept\_name \rightarrow building$), which introduces an update anomaly
- no joins needed for previous query: `SELECT building FROM instructor WHERE ID = 123`
- the query can be answered efficiently (i.e., by referring to one partition) if $instructor$ is partitioned by its primary key ID
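The trade-off can be sketched as follows, using plain dictionaries as stand-ins for the two relations. The rows and the function `move_department` are hypothetical.

```python
# Hypothetical rows for the denormalized instructor relation, keyed by ID.
instructor = {
    123: {"name": "Kim", "dept_name": "Physics", "building": "Watson", "salary": 90000},
    456: {"name": "Gold", "dept_name": "Physics", "building": "Watson", "salary": 87000},
}
department = {"Physics": {"building": "Watson", "budget": 70000}}

# SELECT building FROM instructor WHERE ID = 123  (one lookup, no join)
building = instructor[123]["building"]

def move_department(dept_name, new_building):
    """Moving a department must update every copy of `building`."""
    department[dept_name]["building"] = new_building
    updated = 0
    for row in instructor.values():
        if row["dept_name"] == dept_name:
            row["building"] = new_building
            updated += 1
    return updated

rows_rewritten = move_department("Physics", "Taylor")
```

The read is a single key lookup, but one logical change to a department now rewrites every instructor row of that department. Missing any of them would leave the copies inconsistent, which is the update anomaly.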

## Option - Introducing Redundant Relation

- $id\_building(\underline{\textrm{ID}}, building)$
- satisfies both 3NF and BCNF
- no joins: `SELECT building FROM id_building WHERE ID = 123`
- the query can be answered efficiently if $id\_building$ is partitioned by its primary key ID


- **disadvantages**
    - both ID and building are now duplicated
    - the additional relation is only useful for answering one query (i.e., look up *building* given ID)
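One possible way to populate such a redundant relation is to derive it from the join of the base relations. The sketch below uses an in-memory SQLite database as a single-node stand-in; the column types and sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE instructor (ID INTEGER PRIMARY KEY, name TEXT, dept_name TEXT, salary REAL);
CREATE TABLE department (dept_name TEXT PRIMARY KEY, building TEXT, budget REAL);
INSERT INTO instructor VALUES (123, 'Kim', 'Physics', 90000), (456, 'Gold', 'Music', 40000);
INSERT INTO department VALUES ('Physics', 'Watson', 70000), ('Music', 'Packard', 80000);

-- Redundant relation derived from the join; it must be refreshed on writes.
CREATE TABLE id_building (ID INTEGER PRIMARY KEY, building TEXT);
INSERT INTO id_building
    SELECT ID, building FROM instructor NATURAL JOIN department;
""")

# SELECT building FROM id_building WHERE ID = 123  (no join at query time)
building = conn.execute(
    "SELECT building FROM id_building WHERE ID = ?", (123,)
).fetchone()[0]
row_count = conn.execute("SELECT count(*) FROM id_building").fetchone()[0]
```

The join cost is paid once at write time instead of on every read. Any later change to `instructor.dept_name` or `department.building` has to be propagated to `id_building` as well.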
    



# Adding Redundancy: Example 2

- assume instructor-to-department mapping is many-to-many
    - $instructor(\underline{\textrm{ID}}, name, salary)$
    - $department(\underline{dept\_name}, building, budget)$
    - $inst\_dept(\underline{\textrm{ID}}, \underline{dept\_name})$


- consider `SELECT building FROM inst_dept NATURAL JOIN department WHERE ID = 123`

## Option - Denormalized Schema with Duplicated Attribute & Multi-valued Attribute

- $instructor(\underline{\textrm{ID}}, name, salary, \textrm{List}<building>)$


- speeds up building query (one fragment / one row accessed)
- **disadvantages**
    - multiple buildings stored in one instructor tuple (1NF violation)
    - multi-valued attributes not supported directly by most databases


- going further, the schema can be reduced to one relation: $instructor(\underline{\textrm{ID}}, name, salary, \textrm{List}<dept\_name, building, budget>)$
    - lookup of instructor by `dept_name` becomes hard
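A sketch of the single-relation design, modeling the list-valued attribute as a nested list of records. The field name `departments` and the rows are hypothetical.

```python
# Hypothetical nested instructor documents, keyed by ID.
instructor = {
    123: {"name": "Kim", "salary": 90000,
          "departments": [
              {"dept_name": "Physics", "building": "Watson", "budget": 70000},
              {"dept_name": "Music", "building": "Packard", "budget": 80000},
          ]},
    456: {"name": "Gold", "salary": 87000,
          "departments": [
              {"dept_name": "Physics", "building": "Watson", "budget": 70000},
          ]},
}

# Buildings for one instructor: a single-row lookup by the partitioning key ID.
buildings = [d["building"] for d in instructor[123]["departments"]]

# Instructors in one department: no key to look up, so every row is scanned.
physics_ids = [
    inst_id for inst_id, row in instructor.items()
    if any(d["dept_name"] == "Physics" for d in row["departments"])
]
```

The lookup by ID touches one row, while the lookup by `dept_name` has to inspect every instructor, which in a partitioned setting means every partition. The department data is also copied into each instructor that belongs to it.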

## Option - Denormalized Schema with Duplicated Attribute

- $instructor(\underline{\textrm{ID}}, name, salary)$
- $department(\underline{dept\_name}, building, budget)$
- $inst\_dept(\underline{\textrm{ID}}, \underline{dept\_name}, building)$

- the query can be answered directly from `inst_dept`, and efficiently (by referring to one partition) if `inst_dept` is partitioned by ID alone
    - this is not the case if the partitioning key is the entire primary key (ID, `dept_name`), because the department names would then have to be looked up first to locate the partitions
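The difference between the two partitioning keys can be sketched by checking where the tuples of one instructor end up. The 8 partitions and the rows are assumptions, and the number of departments per instructor is deliberately exaggerated to make the scattering visible.

```python
import hashlib

def partition_of(key, num_partitions=8):
    """Map a partitioning key (scalar or tuple) to a partition number."""
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions

# Hypothetical inst_dept tuples (ID, dept_name, building) for one instructor.
inst_dept = [(123, f"Dept{i}", f"Building{i}") for i in range(20)]

# Partitioning key = ID, a prefix of the primary key.
by_id = {partition_of(row[0]) for row in inst_dept if row[0] == 123}
# Partitioning key = the entire primary key (ID, dept_name).
by_full_key = {partition_of((row[0], row[1])) for row in inst_dept if row[0] == 123}
```

Partitioning on ID keeps all rows of instructor 123 together, so the building query reads one partition. Partitioning on the full primary key spreads them out, and without knowing the department names in advance the query cannot tell which partitions to read.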

