# Properties of Storage Media

- **speed** with which data can be accessed
     - latency per I/O operation (IOP) vs throughput
     - reads vs writes, sequential vs random access, empty vs full
 - **cost** (e.g., dollar per GB, Joules per bit stored or accessed)
 - **capacity** (raw vs usable)
 - **density** (e.g., bits per square inch)
 - **volatility**
     - **volatile media**: loses content when power is switched off
         - e.g., CPU registers, cache memory, main memory
     - **non-volatile media**: content persists when power is switched off
         - e.g., hard drives, solid-state drives, battery-backed memory
 - **reliability**
     - mean time between failures (MTBF)
     - number of write cycles until wear-out

# Physical Storage Media

## CPU registers and Cache
- **volatile**, SRAM-based, managed by processor
- capacities commonly ~10KB (L1), ~100KB (L2), ~10MB (L3)

## Main Memory
- **volatile**, SDRAM-based, managed by OS/application
- latency 10~100ns, capacities ~10GB up to ~1TB (increasingly large/inexpensive enough to store the entire DB)

## Flash Memory (e.g., NAND flash)
- **non-volatile**, used in SSDs and USB drives
- latency 10~100us, capacities ~100GB to ~1TB
- drawbacks
    - cells must be erased before they can be re-written
    - cells wear out after 10K ~ 1M write/erase cycles
- the **flash translation layer** remaps logical addresses to physical addresses to mask erase latency and worn-out pages; it also performs **wear leveling** and **sparing** to increase longevity
    - the mapping is stored as an array of physical page numbers, indexed by logical page number; this representation costs one page address of overhead per page
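
The array-based mapping described above can be sketched in a few lines. This is a toy model only: the class name `SimpleFTL`, the free-page queue, and the stale-page list are assumptions made for illustration, and garbage collection, block erasure, and wear leveling are left out.

```python
class SimpleFTL:
    """Toy page-level flash translation layer (illustrative only)."""

    def __init__(self, num_logical_pages, num_physical_pages):
        # mapping[lpn] = physical page number, or None if never written
        self.mapping = [None] * num_logical_pages
        self.free_pages = list(range(num_physical_pages))  # erased pages
        self.stale_pages = []  # pages holding obsolete data, awaiting erase

    def write(self, lpn):
        """Write logical page `lpn` out of place; return its new physical page."""
        if not self.free_pages:
            raise RuntimeError("no erased pages left; garbage collection needed")
        old = self.mapping[lpn]
        if old is not None:
            self.stale_pages.append(old)  # old copy is invalidated, not erased
        new = self.free_pages.pop(0)
        self.mapping[lpn] = new
        return new

    def read(self, lpn):
        """Return the physical page currently holding logical page `lpn`."""
        return self.mapping[lpn]


ftl = SimpleFTL(num_logical_pages=4, num_physical_pages=8)
ftl.write(2)  # first write of logical page 2
ftl.write(2)  # the overwrite lands on a fresh physical page
print(ftl.read(2), ftl.stale_pages)
```

Because an overwrite only updates one array entry, the slow block erase can be deferred until the stale pages are reclaimed.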

## Conventional (magnetic) Hard Disk
- **non-volatile**, mature, less expensive than flash
- most popular medium for **reliable long-term storage** of data
- latency between 3ms (enterprise) and 15ms (mobile)
    - seek time (<1ms) to move read/write heads
    - rotational latency (a few ms) to position platter under head
- capacities commonly in the hundreds of GB up to several TB
- **much slower for random I/O than for sequential I/O**
    - e.g., < 1 MB/s random vs 150MB/s sequential
- susceptible to mechanical vibrations and shocks

## Optical Disks
- **non-volatile**, least expensive, mostly read-only
- latency on the order of 100ms, capacities up to 100GB
- used to store DBMS binaries and data sets, record backups

<img src="img/Snip20191028_28.png" width=80%/>

<img src="img/Snip20191028_29.png" width=80%/>

# Solid State Drive Internals

<img src="img/Snip20191028_30.png" width=50%/>

- SSD provides sector read and write operations to file system, like a HD
- internally, an SSD maps **sectors** (logical) to **pages** (physical)
- the SSD can write data one page at a time, but it must **erase** an entire **block** of pages before overwriting data
- erasing a block is much slower than reading or writing a page
- overwriting the same flash cell repeatedly leads to **wear-out**
- the flash translation layer (FTL) remaps pages and blocks to enable efficient overwriting, as well as to perform wear leveling

# Conventional Block I/O Devices

<img src="img/Snip20191028_31.png" width=50%/>

- **block**: a contiguous sequence of sectors from a single track
     - unit of storage **allocation** and **data transfer**
     - sizes range from 512 bytes to several kilobytes
         - smaller blocks: more transfers from disk
         - larger blocks: more space wasted due to partially filled blocks
         - typical block sizes today range from 4 to 16 kilobytes
- *secondary storage devices in general are block devices*

# RAID

- **Redundant Arrays of Inexpensive/Independent Disks**
    - a disk organization technique that uses multiple physical disks to provide the illusion of a single more reliable and/or more performant disk
    - increases **capacity** and **speed** by using multiple disks in parallel
    - increases **reliability** through redundancy, ensuring survival of data if a small enough subset of disks fails

- the chance that some disk out of a set of $N$ disks will fail is much higher than the chance that a specific single disk will fail
    - e.g., a system with 100 disks, each with a MTBF of 100,000 hours (approx. 11 years), will have a system MTBF of 1000 hours (approx. 41 days)
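
The arithmetic behind this example can be checked with a short sketch. It assumes independent disks with a constant failure rate, so that the expected time until the first failure among $N$ disks is the single-disk MTBF divided by $N$; the function name is made up for illustration.

```python
def system_mtbf(disk_mtbf_hours, num_disks):
    """Expected time until the first disk failure in an array (independent disks)."""
    return disk_mtbf_hours / num_disks


hours = system_mtbf(100_000, 100)
print(hours, "hours, or about", round(hours / 24, 1), "days")
```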

# Improvement in Reliability via Redundancy

- **redundancy**: store extra information that can be used to rebuild information lost in a disk failure
    - **mirroring**: duplicate every disk
        - one logical disk comprises two physical disks
        - every write is carried out on both disks but reads can be served using either disk
        - if one disk in a pair fails, data remains available on the other; data loss occurs only if both the disk and its mirror fail before the system is repaired
    - **parity bits**: store additional bits to compensate for corrupted ones 
        - additional bits used to detect and possibly correct errors
- **reliability**: a measure of how infrequently failures occur; measured in terms of **mean time to data loss** (MTDL), which depends on the mean time between failures (MTBF) and the mean time to repair (MTR)
    - e.g., MTBF = 100,000 hours, MTR = 10 hours => MTDL = MTBF^2 / (2 * MTR) = 500 * 10^6 hours (approx. 57,000 years) for a mirrored pair of disks (assuming independent failures)
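
The figure for the mirrored pair can be reproduced with the usual approximation that data is lost only when the second disk fails while the first is still being repaired, which gives MTBF squared divided by twice the mean time to repair. The sketch below assumes independent failures; the function name is illustrative.

```python
HOURS_PER_YEAR = 24 * 365


def mirrored_pair_mttdl(mtbf_hours, repair_hours):
    """Approximate mean time to data loss for two mirrored disks."""
    return mtbf_hours ** 2 / (2 * repair_hours)


mttdl = mirrored_pair_mttdl(100_000, 10)
print(mttdl, "hours, or about", round(mttdl / HOURS_PER_YEAR), "years")
```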
    

# Improvement in Performance via Parallelism

- two main goals of parallelism in a disk system
    - increase throughput by distributing small I/O requests across multiple disks
    - reduce response time by parallelizing large I/O requests
- the dominant technique for parallelizing I/O is **striping**, which writes data across multiple disks to improve throughput
    - *mirroring alone only allows us to parallelize reads*
- practical systems generally use **block-level striping**: with $n$ disks, block $i$ of a file goes to disk number $(i\mod n) + 1$
    - requests for different blocks can run in parallel if the blocks reside on different disks
    - a request for a long sequence of blocks can utilize all disks in parallel


# RAID Storage and Performance Calculations

- definitions for storage
    - **raw capacity**: total amount of physical storage in the RAID
    - **effective/usable capacity**: how much application data can actually be stored in the RAID

- definitions for performance
    - **I/O operations per second (IOPS)**: how many small (e.g., 4KB) *random* reads or writes a RAID can perform in one second
    - **Raw IOPS**: how many small reads or writes are actually performed on the disks used in a RAID; one IOP applied to a RAID might require several raw IOPs, which leads to a performance penalty
    - **throughput**: how many bytes per second can be read from or written to a RAID; can be measured for random/sequential RW

# RAID Levels

- RAID levels are **schemes for providing redundancy at lower cost by using disk striping combined with mirroring or parity bits**

# RAID 10
- **RAID Level 1 + 0 (aka RAID 10)**: mirrored disks with block striping
    - two copies of everything (relatively expensive)
    - *best* write performance, supports reading in parallel from both mirrors
    - straightforward recovery when disk fails (copy from mirror)
    - good for update-heavy workloads
- **RAID 0** stripes data evenly across multiple disks without parity information, redundancy, or fault tolerance
- **RAID 1** consists of an exact copy (mirror) of a set of data on multiple disks

## RAID 10 Read

<img src="img/Snip20191028_33.png" width=80%/>

- small reads: one thread
    - penalty factor = 1

## RAID 10 Write

<img src="img/Snip20191028_34.png" width=80%/>


# RAID 5: block-interleaved distributed parity

- partitions data and parity among $N+1$ disks ($N\ge 2$)
    - e.g., with 5 disks, parity block for $n$th set of blocks is stored on disk $(n \mod 5) + 1$, with the data blocks stored on the other 4 disks
    - recovery entails *reading $N$ remaining disks*, slower than RAID 10
    - more cost-effective than RAID 10, but slower in some cases
    - good for reads and large writes, but small writes pay a penalty: they must read-modify-write data to update parity
- RAID 5 stores only one parity block per stripe, so it can survive the failure of a single disk
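
A minimal sketch of parity placement and recovery, assuming the parity block is the bytewise XOR of the data blocks in a stripe (the standard choice for RAID 5). The block contents and function names are made up for illustration.

```python
from functools import reduce


def parity_disk(stripe_no, num_disks=5):
    """Disk (numbered from 1) that holds the parity block of stripe `stripe_no`."""
    return (stripe_no % num_disks) + 1


def xor_blocks(blocks):
    """Bytewise XOR of equally sized blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Four data blocks of one stripe (hypothetical contents) and their parity.
data = [b"\x01\x02", b"\x10\x20", b"\x0f\x0f", b"\xaa\x55"]
parity = xor_blocks(data)

# Simulate losing data[2] and rebuilding it from the survivors plus parity.
survivors = data[:2] + data[3:]
rebuilt = xor_blocks(survivors + [parity])
print(rebuilt == data[2])
```

Rebuilding a block requires reading every surviving disk in the stripe, which is why recovery is slower than copying from a mirror.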
  
## RAID 5 Read
- small read: 1 logical IOP maps to 1 Raw IOP, throughput is up to 5x better than using one disk (no penalty)
- large reads: parity blocks need to be skipped; if all five disks are read in parallel, 4/5 of the blocks are data and 1/5 are parity, so large-read throughput can be estimated as 4x that of one disk

<img src="img/Snip20191028_35.png" width=80%/>

## RAID 5 Write

- small write: read and then write one data block and one parity block; 1 logical IOP turns into 4 Raw IOPs, so small-write throughput is at most (5/4)x that of one disk (4x penalty)
- large write: write both data blocks and parity blocks; 4 logical IOPs turn into 5 Raw IOPs, so write throughput is 4x that of one disk: if all five disks are written in parallel, 4/5 of the blocks are data and 1/5 are parity

<img src="img/Snip20191028_36.png" width=80%/>
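
The four raw IOPs of a small write come from the read-modify-write parity update. The sketch below assumes XOR parity, under which the new parity equals the old parity XOR the old data XOR the new data; block contents and names are illustrative.

```python
def xor(a, b):
    """Bytewise XOR of two equally sized blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


def small_write(old_data, old_parity, new_data):
    """Return the new parity block for a single-block update.

    Raw IOPs: read old data, read old parity, write new data, write new parity.
    """
    return xor(xor(old_parity, old_data), new_data)


# Stripe with 4 data blocks (made-up contents) and their parity.
stripe = [b"\x01", b"\x02", b"\x04", b"\x08"]
parity = b"\x0f"  # XOR of the four blocks above

new_block = b"\x07"
new_parity = small_write(stripe[1], parity, new_block)
stripe[1] = new_block
print(new_parity.hex())
```

The other three data blocks are never read, so the cost stays at four raw IOPs regardless of how many disks the stripe spans.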

## Peak Performance

Workloads | Random (Small Files) | Sequential (Large File)
---|---|---
Reads|**many threads**|**one thread**
Writes|**many threads**|**one thread**

## RAID Performance

\begin{equation}
\textrm{RAID performance} = \frac{\textrm{performance of one disk} \times N}{\textrm{penalty}}
\end{equation}

- $N$: number of disks in the RAID
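
As a quick check, the formula can be applied to the 5-disk RAID 5 cases discussed earlier. The single-disk figure of 100 IOPS is a made-up number, and the large-read penalty of 5/4 is simply a restatement of the "4/5 of the blocks are data" estimate.

```python
def raid_performance(perf_one_disk, num_disks, penalty):
    """Peak RAID performance from single-disk performance and a penalty factor."""
    return perf_one_disk * num_disks / penalty


disk_iops = 100  # hypothetical single-disk random IOPS

raid5_small_read = raid_performance(disk_iops, 5, penalty=1)
raid5_small_write = raid_performance(disk_iops, 5, penalty=4)
raid5_large_read = raid_performance(disk_iops, 5, penalty=5 / 4)
print(raid5_small_read, raid5_small_write, raid5_large_read)
```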





# File Structures
- DB can be considered as a collection of **files** representing relations
- each file is a sequence of **records** representing tuples
- a record is a sequence of **fields** representing attributes

## Fixed-Length Records
- assume record size is fixed; each file has records of one particular type only; different files are used for different relations

- simple approach
    - store record $i$ starting from byte $n \times (i - 1)$ where $n$ is the size of each record
    - record access is simple but records may cross blocks, remedy: do not allow records to cross block boundaries
- alternatives for deletion of record $i$
    - move records $i+1, ..., n$ to $i, ..., n - 1$: expensive
    - move record $n$ to $i$: destroys sort order
    - do not move records, but link all free records in a *free list*

### Free Lists
- store the address of the first deleted record in the file header
- the $i$th deleted record records a pointer to the $(i+1)$st deleted record
- **problem**: if the table is sorted by instructor ID, how can we insert a record efficiently without reclaiming a deleted record?

<img src="img/Snip20191028_37.png" width=80%/>
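
A free list can be sketched with an in-memory stand-in for the file. In this toy version (class and attribute names are assumptions) a deleted slot stores the index of the next free slot, and the most recently deleted slot becomes the new head, which is one possible ordering of the chain.

```python
class FixedLengthFile:
    """Toy in-memory file of fixed-length records with a free list."""

    def __init__(self):
        self.slots = []        # slot i holds a record, or the index of the next free slot
        self.free_head = None  # "file header": index of the first deleted slot

    def insert(self, record):
        """Reuse the first free slot if there is one, else append at the end."""
        if self.free_head is not None:
            slot = self.free_head
            self.free_head = self.slots[slot]  # follow pointer to next free slot
            self.slots[slot] = record
        else:
            slot = len(self.slots)
            self.slots.append(record)
        return slot

    def delete(self, slot):
        """Link the slot into the free list instead of moving other records."""
        self.slots[slot] = self.free_head  # deleted slot stores the old head
        self.free_head = slot


f = FixedLengthFile()
for name in ["a", "b", "c", "d"]:
    f.insert(name)
f.delete(1)
f.delete(3)
reused = [f.insert("e"), f.insert("f"), f.insert("g")]
print(reused, f.slots)
```

Reusing slots this way ignores any sort order, which is exactly the problem raised above for a file sorted by instructor ID.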

## Variable-Length Records

- variable-length records arise in database systems in several ways
    - storage of multiple record types in a file
    - record types that allow variable lengths for one or more fields such as varchar strings
    - record types that allow repeating fields (used in some older data models)
- attributes are stored in a fixed order; variable length attributes represented by fixed size (offset, length), with actual data stored after all fixed length attributes
- null values represented efficiently using a **null bitmap**

<img src="img/Snip20191028_38.png" width=80%/>
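
The layout above can be sketched with the standard `struct` module. The concrete choices here are assumptions for illustration: three variable-length string attributes, one fixed-length integer attribute (a salary), 16-bit offsets and lengths, and a one-byte null bitmap.

```python
import struct

VAR_FIELDS = 3  # e.g., ID, name, and department name stored as strings
# Three (offset, length) pairs, one 8-byte salary, one null-bitmap byte.
HEADER = struct.Struct("<" + "HH" * VAR_FIELDS + "qB")


def encode(var_values, salary):
    """var_values: list of VAR_FIELDS strings or None; salary: int or None."""
    bitmap = 0
    pairs = []
    body = b""
    for i, value in enumerate(var_values):
        if value is None:
            bitmap |= 1 << i       # null attribute: no data bytes stored
            pairs += [0, 0]
        else:
            raw = value.encode("utf-8")
            pairs += [HEADER.size + len(body), len(raw)]
            body += raw            # variable data follows the fixed part
    if salary is None:
        bitmap |= 1 << VAR_FIELDS
        salary = 0
    return HEADER.pack(*pairs, salary, bitmap) + body


def decode(record):
    """Inverse of encode: return (list of strings or None, salary or None)."""
    *pairs, salary, bitmap = HEADER.unpack_from(record)
    values = []
    for i in range(VAR_FIELDS):
        if bitmap & (1 << i):
            values.append(None)
        else:
            offset, length = pairs[2 * i], pairs[2 * i + 1]
            values.append(record[offset:offset + length].decode("utf-8"))
    if bitmap & (1 << VAR_FIELDS):
        salary = None
    return values, salary


rec = encode(["10101", "Srinivasan", None], 65000)
print(len(rec), decode(rec))
```

Any attribute can be located from the fixed-size prefix alone, without scanning the preceding variable-length data.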

### Slotted Page Structure

<img src="img/Snip20191028_39.png" width=80%/>

- **Slotted page** header
    - number of record entries
    - end of free space in the block
    - location and size of each record
- records can be moved around within a page to keep them contiguous as long as the corresponding entry in the header is updated
- pointers to records from other structures should not point directly to a record, instead they should point to the header entry for the record
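
A minimal slotted-page sketch follows. To keep it short, the header entries live in a Python list rather than inside the page bytes, deleted entries are marked with length 0, and records are assumed to be non-empty; all names are illustrative.

```python
class SlottedPage:
    """Toy slotted page: records grow from the page end toward the header."""

    def __init__(self, size=4096):
        self.data = bytearray(size)
        self.slots = []        # header entries: [offset, length] per record
        self.free_end = size   # end of free space in the block

    def insert(self, record):
        """Store a non-empty record and return its slot number."""
        if len(record) > self.free_end:
            raise ValueError("not enough free space in page")
        self.free_end -= len(record)
        self.data[self.free_end:self.free_end + len(record)] = record
        self.slots.append([self.free_end, len(record)])
        return len(self.slots) - 1  # slot number: the stable external identifier

    def read(self, slot):
        offset, length = self.slots[slot]
        return bytes(self.data[offset:offset + length])

    def delete(self, slot):
        """Delete a record, then compact the survivors toward the page end."""
        self.slots[slot] = [0, 0]  # mark the header entry as deleted
        live = [(s, self.read(s)) for s in range(len(self.slots))
                if self.slots[s][1] > 0]
        self.free_end = len(self.data)
        for s, rec in live:
            self.free_end -= len(rec)
            self.data[self.free_end:self.free_end + len(rec)] = rec
            self.slots[s] = [self.free_end, len(rec)]  # only the header changes


page = SlottedPage(size=64)
a = page.insert(b"alpha")
b = page.insert(b"bravo!")
c = page.insert(b"charlie")
page.delete(b)
print(page.read(a), page.read(c), page.slots)
```

After the delete, record `c` has moved within the page, yet slot number `c` still resolves to it, which is why external pointers should refer to header entries.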

# Organization of Records in Files

- **Heap**: a record can be placed anywhere in a file where there is space
    - unsorted
- **Sequential**: store records in sequential order, based on the value of the **search key** of each record
    - sorted
    - OLAP friendly
    - MySQL MyISAM
- Records of each relation may be stored in a separate file, but in a **multitable clustering file organization** the records of several different relations can be stored in the same file
- **Index-organized table**: records are stored using a dictionary structure such as a hash (unordered) or a B-tree (ordered)
    - this type of organization supports fundamental operations (lookup, insert, update, delete) efficiently
    - in practice tables are often stored as B-trees ordered by the primary key

## Sequential File Organization
- records in the file are ordered by a **search key**
- deletion: use pointer chains
- insertion: locate the position where the record is to be inserted
    - if there is space, then insert there
    - if there is no free space, insert the record in an **overflow block**
    - pointer chain must be updated regardless
- need to reorganize the file from time to time to restore the sequential order
    - since OLAP does not need to update the records, this cost is not a problem

<img src="img/Snip20191101_95.png" width=60%/>

## Multitable Clustering File Organization

- store several relations in one file using a **multitable clustering** file organization
- good for queries involving *department* $\bowtie$ *instructor*, and for queries involving one single department and its instructors
- bad for queries involving only *department*

<img src="img/Snip20191101_96.png" width=60%/>

- results in variable size records
- can add pointer chains to link records of a particular relation

<img src="img/Snip20191101_98.png" width=60%/>

## Data Dictionary Storage

- the **data dictionary** (also called **system catalog**) stores **metadata**
    - information about relations
        - names of relations
        - names, types, and lengths of attributes of each relation
        - names and definitions of views
        - integrity constraints
    - user and accounting information
    - statistical and descriptive data
        - number of tuples in each relation
    - physical file organization information
        - how the relation is stored (sequential, hash, B-tree)
        - physical location of the relation
    - information about indexes

### Relational Representation of Metadata

<img src="img/Snip20191101_99.png" width=60%/>

# Storage Access

- a file is partitioned into fixed-length storage units called **blocks**, which are units of both storage allocation and data transfer
- database system seeks to minimize the number of block transfers between the disk and memory
    - the number of disk accesses can be reduced by keeping as many blocks as possible in main memory
- **buffer**: portion of main memory available to store copies of disk blocks
- **buffer pool**: collection of buffers used by the database
- **buffer pool manager**: subsystem responsible for allocating buffer space in main memory, loading blocks from disk into the buffer pool, evicting blocks as needed to create space in the buffer pool, and writing dirty (i.e., modified) blocks back to disk
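
The responsibilities of a buffer pool manager can be sketched as follows. The eviction policy is not specified above; this sketch assumes least-recently-used (LRU) replacement, uses a plain dictionary as a stand-in for the disk, and omits pinning and concurrency control.

```python
from collections import OrderedDict


class BufferPool:
    """Toy buffer pool manager with LRU eviction (policy assumed here)."""

    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk             # dict: block id -> bytes, stands in for the disk
        self.frames = OrderedDict()  # block id -> [data, dirty flag], in LRU order
        self.disk_reads = 0

    def get(self, block_id):
        """Return the block's data, loading it from disk on a miss."""
        if block_id in self.frames:
            self.frames.move_to_end(block_id)  # mark as most recently used
        else:
            if len(self.frames) >= self.capacity:
                self._evict()
            self.frames[block_id] = [self.disk[block_id], False]
            self.disk_reads += 1
        return self.frames[block_id][0]

    def put(self, block_id, data):
        """Modify a block in memory and mark it dirty."""
        self.get(block_id)  # make sure the block is buffered
        self.frames[block_id] = [data, True]

    def _evict(self):
        victim, (data, dirty) = self.frames.popitem(last=False)  # least recently used
        if dirty:
            self.disk[victim] = data  # write back before dropping the frame


disk = {0: b"a", 1: b"b", 2: b"c"}
pool = BufferPool(capacity=2, disk=disk)
pool.get(0)
pool.put(1, b"B")
pool.get(0)  # hit: no disk access
pool.get(2)  # miss: evicts block 1 and writes it back
print(pool.disk_reads, disk[1], list(pool.frames))
```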

# Textbook RAID notes




# 10.3 RAID

## 10.3.1 Improvement of Reliability via Redundancy

- mirroring: to duplicate every disk
    - a logical disk consists of two physical disks, and every write is carried out on both disks

## 10.3.2 Improvement in Performance via Parallelism

- block-level striping: stripes blocks across multiple disks
    - treats the array of disks as a single large disk, and gives blocks logical numbers
    - assumes block numbers start from 0
    - with an array of $n$ disks, block-level striping assigns logical block $i$ of the disk array to disk $(i \textrm{ mod } n) + 1$, and uses the $\lfloor i/n \rfloor$th physical block of the disk to store the logical block $i$


- example: with 8 disks, logical block 0 is stored in physical block 0 of disk 1, while logical block 11 is stored in physical block 1 of disk 4


- when reading a large file, block-level striping fetches $n$ blocks at a time in parallel from the $n$ disks, giving a **high data transfer rate for large reads**; when a single block is read, the data-transfer rate is the same as on one disk, but the remaining $n-1$ disks are free to perform other actions
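
The block-level striping rule can be written as a small function and checked against the 8-disk example; the function name is illustrative.

```python
def locate(logical_block, num_disks):
    """Map a logical block (numbered from 0) to (disk, physical block)."""
    disk = (logical_block % num_disks) + 1       # disks are numbered from 1
    physical_block = logical_block // num_disks  # floor(i / n)
    return disk, physical_block


for i in (0, 11):
    print(i, locate(i, num_disks=8))
```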


## 10.3.3 RAID Levels

<img src="img/Snip20191124_25.png" width=50%/>

- **RAID level 0** refers to disk arrays with striping at the level of blocks, without any redundancy


- **RAID level 1/1+0/10** refers to disk mirroring with block striping


- **RAID level 5** refers to block-interleaved distributed parity
    - partitioning data and parity among all $N+1$ disks
    - all disks can participate in satisfying read requests
    - for each set of $N$ logical blocks, one of the disks stores the parity, and the other $N$ disks store the blocks
    - **a parity block cannot store parity for blocks in the same disk**

<img src="img/Snip20191124_26.png" width=40%/>
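
The storage cost of the three levels can be compared with a rough sketch of usable capacity. It assumes equally sized disks and ignores metadata and spare space: RAID 0 keeps everything, RAID 10 keeps half because every block is stored twice, and RAID 5 gives up one disk's worth of space to parity.

```python
def usable_capacity(level, num_disks, disk_capacity):
    """Usable capacity of an array of equally sized disks (same unit as input)."""
    raw = num_disks * disk_capacity
    if level == 0:
        return raw                                # striping only, no redundancy
    if level == 10:
        return raw / 2                            # two copies of everything
    if level == 5:
        return raw * (num_disks - 1) / num_disks  # one parity block per stripe
    raise ValueError("unsupported RAID level")


for level, disks in [(0, 4), (10, 4), (5, 5)]:
    print(level, usable_capacity(level, disks, disk_capacity=4))
```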

