## Storage Systems (StoSys) XM\_0092

## **Lecture 3: FTL and GC**

Animesh Trivedi Autumn 2020, Period 2



## Syllabus outline

- 1. Welcome and introduction to NVM (today)
- 2. Host interfacing and software implications
- 3. Flash Translation Layer (FTL) and Garbage Collection (GC)



- 4. NVM Block Storage File systems
- 5. NVM Block Storage Key-Value Stores
- 6. Emerging Byte-addressable Storage
- 7. Networked NVM Storage
- 8. Trends: Specialization and Programmability
- 9. Distributed Storage / Systems I
- 10. Distributed Storage / Systems II

## **NAND Flash Constraints**

- Complex internal geometry
- No in-place updates
  - Any in-place writes must do a full block read-erase-write cycle
- Limited Program/Erase (P/E) cycles
- No mechanical latency
- Asymmetric read-write performance
  - Reads from flash chips is typically faster than writes
- Sequential page writes/programming within a block
- Anything else?









Logical Page Addresses (LPA) is what a host software sees

- A host can issue a read/write operation on a particular LBA
- Based on the device, it can be 512 bytes (backward compatible BIOS might not know about 4kB pages) or some kB (to match the actual flash page size, 4 or 8 kB)



What happens when the whole device is written once? Without Erase?



What happens when the whole device is written once? Without Erase?

**Garbage collection** - move data around to prepare blocks for erase



## What Does GC do?

block0 block0 page0 page0 If the whole page1 page1 block was page2 page2 old mappings Erase page3 page3 (in-place) (b) (a)

IF the whole block (all pages) contains old data (shown as red) then the FTL can issue the erase command on the block

- Block will be erased, and ready for re-programming
- Put in the pool of free blocks

Easy, right?

## What Does GC do?



What else should we consider?

## **How does GC find free blocks?**

If a device is 8GB, and we wrote 8GB of data then where is the free space?

## **Concept: Over Provisioning (OP)**

If a device is 8GB, and we wrote 8GB of data then where is the free space?

Answer: Over Provisioning, have more than what you report (or report less to user/OS)



Same over provisioned blocks are also used to manage bad blocks

So in essence, your 16 GB flash drive internally might be 17, 18, or 20 GB flash drive. Sometimes, OP is configurable, other times it is a trade-secret

- **More**: better performance, more space for GC, expensive
- **Low**: cheaper, but less margins for errors

## **Concept: Write Amplification (WA)**



What is happening here : **copy-merge-write** cycle

- 1. The OS/kernel writes 4kB page
- 2. Internally device has to read 16kB and then write out again 16kB after the modified page

```
data written to device
Write amplification = ------data written by user
```

Here amplification is **4x**. The exact value depends upon the page/block size, how many free blocks, FTL design, and many more...

A device can also expose smaller than the page I/O units like 512 bytes sectors

## **Concept: Wear Leveling**

Different flash pages get written and programmed at a different rate

- How frequently they are written
- How frequently they are erased

As a result - their remaining age is different. For SLC chips, 100K is a typical number of cycles available

- Page 0 and 2 are really healthy and not much written
- Page 1 has only half of its PE cycle remaining
- ullet Page 3 has close to death  $\to$  error rate will increase significantly, and block will be marked bad

An ideal flash drive will try to spread the load evenly across all pages - wear leveling



## **Concepts: Steady State**

When you have a fresh SSD (Fresh out of the Box, of **FoB**):

- All pages/blocks are erased
- All pages have full life time
- GC has not done much work
- FTL mappings all empty

If you benchmark your SDD in this state you get a very different performance results, because

- No erasure has been done so far
- No running out of the blocks, need to look for new blocks
- No FTL lookups to update entries

In fact, of you get a FoB flash SSD - it might be the case that you get < 1 useconds for reading pages and blocks - can you guess why?

## **Steady State: Impact**



A Order Of Magnitude Gap

Almost

## **FTL Responsibilities**

- Address Translation
  - a. Map user visible pages to internal flash locations
- 2. Garbage Collection
  - a. Clear old blocks which do not have "live" data
- Wear Leveling
  - a. Make sure not to "burn" one block and all blocks "age" uniformly i. BTW is this uniform aging a good idea?
- 4. Parallelism and Load Balancing
  - a. Use all parallel units for performance
- 5. Bad Block Management
  - a. Error corrections
  - b. When blocks are bad (they develop a high read/write error rate, mark them and don't use)

Also, work in a resource constrained environment with limited CPU power and DRAM (*why*?)



## **Flash Device Setup**



## **Design1: A Directly Mapped FTL**



#### A simple directly mapped FTL

- Trivial look up fast and simple
- When read comes up for a LPA(x), read the PPA(x)
- When write comes up
  - If the page is in "E" (erased) state then simple, write it out (write1, W1)
  - If the page is in "I" or "W" state then the FTL needs to "Erase" the block, not just the page (for write2, W2)

## **Design1: A Directly Mapped FTL**

2. Update the new content



So, what is the write amplification factor here?

What are the challenges with such directly mapped design?

## **Issues with Directly Mapped FTL**

#### 1. Performance

- a. Every over-write to a page will result in a full copy-erase-write cycle
- b. Very high **write amplification factor** 
  - i. Directly proportional to the number of pages in a block

#### 2. Wear-leveling

- a. Not all pages are equally popular
- b. Physical pages containing frequently written data (called HOT data) will be erased many times, thus **wearing them out** 
  - i. Every flash page has a finite number of P/E cycle (e.g., 100K for SLC)
  - ii. After exhausting their quotas, they will develop high error rates
- c. One way to fix would be to tell the application to write evenly across the whole device (but applications rarely write pages, file systems do)

#### 3. In-place updates

a. During a failure, the in SRAM content might be lost

### Let's do a Bit Better



| LPA | PPA | Valid/Flags |
|-----|-----|-------------|
| 0   | 0x3 | I           |
| 1   | 0x0 | Е           |
| 2   | 0x1 | W           |
| 3   | 0x3 | W           |

Invalid, Written, Erased

Lets keep track of "mappings" in a table, and we can rearrange them when they needs to updated

Two types (same idea as the page tables in the CPU):

- Directly mapped from logical to physical mappings fast lookup (but failure prone)
- Inversely mapped from physical to logical mappings, easy to reverse lookup, and construct a mappings after failure

A device can use one or mix of these two.

## **Page-Mapped FTLs**

Block 0

Does tracking of LPA (arbitrary) → PPA (arbitrary)



Block 1

Block 2



Invalid, Written, Erased

## **Page-Mapped FTLs**



| LPA | PPA | Valid/Flags |
|-----|-----|-------------|
| 0   | 0   | E -> V      |
| 100 | 1   | E -> V      |
| 101 | 2   | E -> V      |
| 400 | 3   | E -> V      |

Invalid, Written, Erased

#### **For Write**

- As the new content comes in, we fill in the next pages we have around.
- Simple strategy (multiple policies possible regarding the selection of the next free page ...any guesses?)

#### For read

- Lookup in the table the location of PPA and then issue a read to the flash chip

## **Page-Mapped FTLs: Overwriting**



Invalid, Written, Erased

|   | LPA | PPA          | Valid/Flags |
|---|-----|--------------|-------------|
|   | 0   | 0            | V           |
|   | 100 | <del>1</del> | "I"         |
|   | 101 | 2            | V           |
|   | 400 | 3            | V           |
| ) | 100 | 4            | V           |

When a new content comes in, FTL marks the old location invalid, and choose a new page to write the new content

- If the write was for a full page, then just write to the new location and update the table
- If the writes was less than the page then
- Read the old content, merge, and then write the merged paged to page 4
   This is called **out-of-place writes** helps with the failure (as the old content is kept)

## The Good and Bad about Page-level FTLs

#### The Good

• Most flexible, the best performance, least amount of WA as only a single page is merged

#### The Bad

- How much memory 1TB SSD need with 4kB pages?
  - Let's assume at least 8 bytes entry per page
  - o 1TB / 4kB = 256 million entries
  - o In total the size of the FTL table: 2 GB (this much SRAM in your SSD)
  - We are not even talking about space for other items yet
- So why cannot we put 2 GB of memory in SSDs?
  - Complexity form factor, electricity, circuitry, power
  - o Price, will get super expensive. Memory is expensive

## **Block-Level Mapping**

Instead of keeping track of per-page mapping, keep track of per-block mapping

**Recall**: A block is a collect of pages (10-100s of pages), it is the unit of erase

|                 | Block Size | Page Size | OOB Size  | # Pages/Block |
|-----------------|------------|-----------|-----------|---------------|
| Small-block SLC | 16KB       | 512 bytes | 16 bytes  | 32            |
| Large-block SLC | 128KB      | 2KB       | 64 bytes  | 64            |
| Large-block MLC | 512KB      | 4KB       | 128 bytes | 128           |

So by what factor, the size of the table will be decreased? Number of pages in a block. So, let's do the calculation again:

- 1 TB / 128 KB x 8 bytes = 64 MB (still large, but manageable)
- SSDs can have 256-512kB blocks.

#### So this is it then?

## **Basic Working: Reads and Writes**



|     | LBA  | РВА | Valid/Flags |
|-----|------|-----|-------------|
| (2) | 8000 | 0   | V           |
|     | 400  | 1   | V           |

Invalid, Written, Erased

Nice, small FTL table

**Read:** find the block level mapping,

- R(8100) comes in : block address (8000) + page offset (1)
- Lookup PBA of 8000, this is 0
- read(0 + 1) for the content of LPA(8100)

Offset based address calculation ← **Important!** 

## **Challenges with the Block-Level Mappings**

Assume a size of 100 bytes for each page, block size 400 bytes

W(8000, a1), W(400,a2), W(700 a3), W(8100, a4), W(8000, b1), W(500, b2)



| LBA  | PBA | Valid/Flags |
|------|-----|-------------|
| 8000 | 0   | V           |
| 400  | 1   | V           |

Invalid, Written, Erased

#### Now what?

- What to do with W(8000)? This is an overwrite
- What to do with W(500)? This is a **write in a middle of a block**

So, why cannot we program page 5 in block 1 where write to LPA(500) is suppose to go?

The only thing we can do is to copy, merge, and write out the content of blocks 0 and 1 to a new blocks (block 2, if no more available then erase blocks)  $\rightarrow$  **High Write Amplification factor** 

## **Challenges with the Block-Level Mappings**

Assume a size of 100 bytes for each page, block size 400 bytes



| LBA  | РВА | Valid/Flags |
|------|-----|-------------|
| 8000 | 0   | V           |
| 400  | 1   | V           |

Invalid, Written, Erased

The key problem is that due to the **offset calculation** the location of a page is fixed inside a block. We cannot just choose the next free page inside a block

• Expensive read-merge-erase-write cycle that we saw with design 1

The second problem is dependency on the write pattern

## **FTL Design Options so far**

#### 1. Page-level FTLs

- a. Good mapping granularity
- b. Any page can be mapped anywhere, less sensitive to workload patterns
- c. But, large FTL size and need large SRAM in flash

#### 2. Block-level FTLs

- a. Small FTL size
- b. High write amplification factor
- c. Sensitive to workload patterns

There are variants to both page-, and block-level FTLs that helps to mitigate these issues to a certain extend (we are skipping them here)

# Question: What is the most important data structure you know?

(obviously, it can not be a factual answer, but what do we think?)



## Log: A very powerful data structure

#### A sequential appending data structure

- Write only at the end (tail)
- Read from anywhere

#### Many unique properties



- No in-place updates, once written, the data becomes immutable
- Serialized writing, one point of writing, the tail either the write succeeds or not (atomicity)
- Converts a random write to a sequential one ← very important
- Parallel reading
- Ordering of events (writes, transactions, or whatever)
- Logging events (used in DBs, file systems) for failure recovery

We will see use of logs in FTLs, flash file system designs, distributed systems. One of the most important data structures around, and it matches NAND flash properties!

## **Hybrid-Log FTLs**

Divide the device into two parts, same intuition as we saw with the block devices (block level mappins were restrictive)

\*\*Reads\*\*
Write\*\*

#### Data blocks

- Contains the stable data
- Mapped per-block basis

#### Log blocks

- All fresh writes go to log pages
- All filled in a strictly sequential pattern, from low page to higher pages
- Mapped per-page basis

At some point in time data is moved from the log to the data blocks



## Working of a Hybrid-Log FTL - Simple Case



#### A nice sequential pattern

- Block 1 is currently treated as a log block which absorbs incoming writes
- Once filled, then we can convert it to a "data" block and map it as a "block-level" entry

#### **Data Block Mappings**

| LBA PE | A Valid/Flags |
|--------|---------------|
|--------|---------------|

#### **Log Page Mappings**

| LPA  | PPA | Valid/Flags |
|------|-----|-------------|
| 8000 | 4   | V           |
| 8001 | 5   | V           |
| 8002 | 6   | V           |
| 8003 | 7   | V           |

## Working of a Hybrid-Log FTL - Simple Case

W(8000, a1), W(8100,a2), W(8200, a3), W(8400, a4)



#### A nice sequential pattern

- Block 1 is currently treated as a log block which absorbs incoming writes
- Once filled, then we can convert it to a "data" block and map it as a "block-level" entry

This is called "Switch Merge"

#### **Data Block Mappings**

| LBA  | РВА | Valid/Flags |
|------|-----|-------------|
| 8000 | 1   | v           |

#### **Log Page Mappings**

| LPA             | PPA | Valid/Flags |
|-----------------|-----|-------------|
| 8000            | 4   | ¥           |
| <del>8001</del> | 5   | ₩           |
| 8002            | 6   | ¥           |
| 8003            | 7   | ₩           |

## Working of a Hybrid-Log FTL - Updates to the Block



New updates to already written blocks (but still in order)

- Pick up a new "log" block, say block 2
- Use page 8 and 9 to write new values

Then at some point we can merge the data block 1 and log block 2

#### **Data Block Mappings**

| LBA  | PBA | Valid/Flags |
|------|-----|-------------|
| 8000 | 1   | v           |

#### **Log Page Mappings**

| LPA  | PPA | Valid/Flags |
|------|-----|-------------|
| 8000 | 8   | v           |
| 8100 | 9   | v           |

# Working of a Hybrid-Log FTL - Updates to the Block



**Data Block Mappings** 

| LBA  | РВА          | Valid/Flags |
|------|--------------|-------------|
| 8000 | <b>1</b> → 2 | v           |

#### **Log Page Mappings**

| LPA             | PPA | Valid/Flags |
|-----------------|-----|-------------|
| 8000            | 8   | ₩           |
| <del>8100</del> | 9   | ₩           |

New updates to already written blocks (but still in order)

- Pick up a new "log" block, say block 2
- Use page 8 and 9 to write new values

Then at **some point** we can merge the data block 1 and log block 2

This is called "Partial Merge"

# Working of a Hybrid-Log FTL - Updates to the Block

W(8400,b1), W(2300,b2), W(7400,b3), W(8100,b4),



Now completely random writes, take block 0 as the log block

- Multiple writes might have come here
  - Some for already valid entries (like 8000-8400 range)
  - Some for a new writes, previously unmapped

What happens now?

#### **Data Block Mappings**

| LBA  | РВА | Valid/Flags |
|------|-----|-------------|
| 8000 | 2   | v           |

#### **Log Page Mappings**

| LPA  | PPA | Valid/Flags |
|------|-----|-------------|
| 8400 | 0   | V           |
| 2300 | 1   | V           |
| 7400 | 2   | V           |
| 8100 | 3   | V           |

## Working of a Hybrid-Log FTL - Random Writes



#### **Data Block Mappings**

| LBA  | PBA | Valid/Flags |
|------|-----|-------------|
| 8000 | 1   | v           |

#### **Log Page Mappings**

| LPA             | PPA | Valid/Flags |
|-----------------|-----|-------------|
| <del>8400</del> | 0   | ¥           |
| 2300            | 1   | V           |
| 7400            | 2   | V           |
| <del>8100</del> | 3   | ¥           |

# Convert from a log to data block

## Working of a Hybrid-Log FTL - Random Writes

b1

b2

a3

a4



a3

c1

**c**3

c4

b1

c4

#### **Data Block Mappings**

| LBA  | РВА | Valid/Flags |
|------|-----|-------------|
| 8000 | 1   | v           |

#### **Log Page Mappings**

| LPA  | PPA | Valid/Flags |
|------|-----|-------------|
| 8400 | 0   | ₩           |
| 2300 | 1   | V           |
| 7400 | 2   | V           |
| 8100 | 3   | ¥           |

# **Choices between Data and Log Blocks**

Which log blocks should track updates from which data blocks?

- 1. **Block associative:** a log block uniquely logs updates to a data block (1:1)
  - a. trivial updates and merging (only switch and partial, never full), but 2x capacity waste
- Set associative: log blocks log updates from "set" of contiguous data blocks (n:m)
  - a. Balance between the merge cost and flexibility
- 3. **Fully associative:** any log block can track updates from any data blocks (any:any)
  - a. most flexible, but need full merge

A device can implement a mix of any of these for various workload patterns

### So far we have seen

- Read and write access pattern can affect the performance
- Random writes in between a block can be bad
  - Leads to partial/full merge (cannot do just switch merge)
- Sequential writes are good

Often FTL management and data caching (in SRAM) in the device is managed together

- Whenever a data is ready to be flushed out to NAND chips, corresponding FTL entries are also written out
- One keeps a "working set" of hot data pages and FTL mappings in SRAM memory

Hence, locality matters with SSDs (did not we say random and sequential performance is the same for SSDs?)

### Also to consider

#### Are sequential reads also better than the random ones?

- Because even with the hybrid strategy the size of block-level data table, and the page-level log table can be large
  - They are stored on the flash and brought in demand to SRAM
  - A single block-level entry covers "x" numbers of pages
  - Hence, sequential/close LPAs have faster performance than the random ones

Freshly formatted device (no data) have a very low write/read latency (< 10 usec)

- The device knows there is no old data, write to the best location
- The device knows no data has been written so far, read with zeros
- Even with writes, most writes can be absorbed with on-device SRAM (needs flush)

# A Typical Example



# **Garbage Collection**

#### So far we have implicitly discussed that

- A new free block is always available
- Old marked blocks are someone Erased in due time

How does this happen? The process of erasing blocks with old data to make them suitable for new data is called **Garbage Collection (GC)** 

A block to be erased might contain some live and some expired/dead data

#### **Terminology:**

- **Live data** which is the correct, freshest copy of the data
- Dead/expired old data, which is no longer needed
- **Age of a block/page** representing the number of P/E cycle a page/block has gone through (recall: there is a finite number of cycles)

# Why Bother about GC?



A flash device has limited resources. If the embedded CPU is occupied running the GC, how is it going to server a user request (interference)? Bad impact on worst case latencies (95 and 99 percentiles)

Hence, it is important to understand GC internals and do "SSD-friendly" data management on flash devices

# A Typical GC Cycle



#### Multiple design choices

- How to find live-dead data?
- 2. When to invoke GC?
- 3. How to select the target block(s)?
- 4. How to minimize interference with the user/system I/O?
- 5. How to stage data for better GC?
- 6. How to do high-performance GC?

# Identifying "live" vs "dead" data pages

#### Internally FTL keep track of LPA to PPA mappings

• Every time a new version of the page is written, the previous one can be marked as "dead"

Consider the case below ....



In this case, various files which were created at the systems / file system and application level were created, written to, and then deleted. How does an FTL know about blocks mapped to these files? Technically they contain "dead" data.



### **TRIM Command**



TRIM: a new command (apart from read/write) for SSDs

- tells an SSD if a range of page addresses are no longer needed
- remove FTL entries
- mark those pages dead
- can trigger GC to reclaim those pages

### When to invoke GC?

#### As quickly as possible

- interferences with the user I/O request
- contention on shared data and control channels
- useless work if the data written was temporary

#### As late as possible

- might run out of free blocks
- difficult to handle bursty data (if not free block available)

#### Typically, during the **idle periods**

- check the command queue and model when will be the next free window for GC
- when run out of block, then FTL MUST run the GC (no other option)
- hard to predict the future (ML, anyone?)

# **How to Select a Target/Victim Block**

- **Random:** a good and simple strategy if writing load is uniform across the whole device (how often is this the case?)
- FIFO, RR: pick the last block that was erased and use it
  - Advantage: uniform wear leveling across all blocks

Is this good enough? Any other ideas? Or what other criterias should we consider?

### Which Block is Preferred?





Block (a) contains ¾ dead pages Block (b) contains ¼ dead pages

Block (a) gives maximum free space for minimum amount of data copies, also reduces write amplification. Pick the one with least amount of live data. **Greedy approach. Is that it?** 

What if the Block (a) was an old block, close to 90K PE cycles (out of 100K) and Block (b) was at 50K? What is a better option now?

#### Some sort of **cost-benefit Trade-off (CBT)**

- Cost is a function of (amount of work, age, SRAM needed?)
- Benefit is amount of free space reclaimed

# **How to Select a Target/Victim Block**

- **Random:** a good and simple strategy if writing load is uniform across the whole device (how often is this the case?)
- FIFO, RR: pick the last block that was erased and use it
  - Advantage: uniform wear leveling across all blocks
  - o Problem: what if the last block contained all "live" data?
- Greedy: Pick the block which has the least amount of "live" data
  - o Advantage: minimum amount of live data copy, maximum free space reclamation
  - Challenge: what about wear leveling?
- **Cost-Benefit**: various strategies that mix (i) live data amount; (ii) age of the block; (iii) amount of memory required for state keeping;

As you can see there is not strategy that wins all. Modern device FTLs use a combinations of these strategies

# At this point

- 1. FTL seems complicated, and there is no single win strategy for all
  - a. Hidden from the systems and applications (no API to talk to it)
- 2. GC seems complicated and there is no single win strategy for all
  - a. Hidden from the systems and applications (no API to talk to it)
- 3. Both of them are resource heavy
  - a. Need CPU power
  - b. Need DRAM/SRAM on board
- 4. A large design space with multiple objectives
  - a. Performance, very much workload dependent
  - b. Wear-leveling, minimize interference, binning and grouping of data

Is there is a better way to tackle these challenges ...

### **Host-Based FTLs**

Implement all the complex logic on host, why?

- Host has a powerful CPUs (multicore Intel CPUs)
- Host has a lot of DRAM (10-100s of GB)
- All GC and user traffic is visible to the host FTL, no hidden state
- Application-specific customization possible, no need to restrict yourself on the Block interface

This is what you are doing using the OpenChannel SSDs and LightNVM infrastructure

On device FTLs are called "Embedded FTLs" vs "Host-based FTLs"

### **Baidu's Software Defined Flash**

#### SDF: Software-Defined Flash for Web-Scale Internet Storage Systems

Jian Ouyang Shiding Lin
Baidu, Inc.
{ouyangjian, linshiding}@baidu.com

Song Jiang \*

Peking University and
Wayne State University
sjiang@wayne.edu

Zhenyu Hou Yong Wang
Yuanzheng Wang
Baidu, Inc.
{houzhenyu, wangyong03,
wangyuanzheng | @baidu.com

#### Abstract

In the last several years hundreds of thousands of SSDs have been deployed in the data centers of Baidu. China's largest Internet search company. Currently only 40% or less of the raw bandwidth of the flash memory in the SSDs is delivered by the storage system to the applications. Moreover, because of space over-provisioning in the SSD to accommodate nonsequential or random writes, and additionally, parity coding across flash channels, typically only 50-70% of the raw capacity of a commodity SSD can be used for user data. Given the large scale of Baidu's data center, making the most effective use of its SSDs is of great importance. Specifically, we seek to maximize both bandwidth and usable capacity.

To achieve this goal we propose software-defined flash (SDF), a hardware/software co-designed storage system to maximally exploit the performance characteristics of flash memory in the context of our workloads. SDF exposes individual flash channels to the host software and eliminates space over-provisioning. The host software, given direct access to the raw flash channels of the SSD, can effectively organize its data and schedule its data access to better realize the SSD's raw performance potential.

Currently more than 3000 SDFs have been deployed in Baidu's storage system that supports its web page and image repository services. Our measurements show that SDF can deliver approximately 95% of the raw flash bandwidth and provide 99% of the flash capacity for user data. SDF

\* This work was performed during his visiting professorship at Peking University.

increases I/O bandwidth by 300% and reduces per-GB hardware cost by 50% on average compared with the commodity-SSD-based system used at Baidu.

Categories and Subject Descriptors B.3.2 [Memory Structures]: Design Styles - mass storage (e.g., magnetic, optical, RAID)

Keywords Solid-State Drive (SSD), Flash Memory, Data Center.

#### 1. Introduction

To accommodate ever-increasing demand on I/O performance in Internet data centers, flash-memory-based solidstate drives (SSDs) have been widely deployed for their high throughput and low latency. Baidu was one of the first large-scale Internet companies to widely adopt SSDs in their storage infrastructures and has installed more than 300,000 SSDs in its production system over the last seven years to support I/O requests from various sources including indexing services, online/offline key-value storage, table storage, an advertisement system, mySOL databases, and a content delivery network. Today SSDs are widely used in data centers, delivering one order of magnitude greater throughput, and two orders of magnitude greater input/output operations per second (IOPS), than conventional hard disks. Given SSD's much higher acquisition cost per unit capacity, achieving its full performance and storage potential is of particular importance, but we have determined that both raw bandwidth and raw storage capacity of commodity SSDs, in a range of performance categories, are substantially under-

### **Software-Defined Flash Architecture**





- Full control over I/O Stack (exposed parallelism)
- Full control over GC (reduce interference with user I/O)
- Full control over device provisioning, can be customized for workloads
- Predictable Performance when to do read/write/erase

# **SDF: Performance Example**



### The Idea itself is Not New

Fusion-IO (and other companies, Violin Memory) did it back in 2007, run FTL on the host-CPU

Multiple embedded systems do the same, FTL runs on their embedded CPU which is running the application

OCSSD, and LightNVM framework is built on the same idea

• Uses a more standard NVMe interface for multi-vendor support

Scale and deployment of execution is unique in Baidu's case

Pretty cool and influential work

# Design of FTL is a very large field

#### What we are not covering yet

- 1. How to extract parallelism... (allocation)
  - For example, if reading 1MB from a device,
     it would be nice if its blocks and pages were
     distributed across parallel dies for performance
    - i. **Striping:** horizontal, vertical, n-dimensionals



- 2. Wear-leveling
  - a. Static vs Dynamic wear leveling
  - b. Keeping track of block ages
  - c. Keeping track of hot and cold data and separate them into various logical "age bins"
- 3. Bad block management
  - a. ECC checks and corrections
  - b. Maintenance of bad block maps

### **Based on all these details: Unwritten Contract**

#### The Unwritten Contract of Solid State Drives

Jun He Sudarsun Kannan Andrea C. Arpaci-Dusseau Remzi H. Arpaci-Dusseau

Department of Computer Sciences, University of Wisconsin-Madison

#### Abstract

We perform a detailed vertical analysis of application performance atop a range of modern file systems and SSD FTLs. We formalize the "unwritten contract" that clients of SSDs should follow to obtain high performance, and conduct our analysis to uncover application and file system designs that violate the contract. Our analysis, which utilizes a highly detailed SSD simulation underneath traces taken from real workloads and file systems, provides insight into how to better construct applications, file systems, and FTLs to realize robust and sustainable performance.

as hard drives, how higher layers utilize said interface can greatly affect overall throughput and latency.

Our first contribution is to formalize the "unwritten contract" between file systems and SSDs, detailing how upper layers must treat SSDs to extract the highest instantaneous and long-term performance. Our work here is inspired by Schlosser and Ganger's unwritten contract for hard drives [82], which includes three rules that must be tacitly followed in order to achieve high performance on Hard Disk Drives (HDDs); similar rules have been suggested for SMR (Shingled Magnetic Recording) drives [46].

### **What are Unwritten Contract**

**The Written contract** is the API from the device, essential to get the basic function working

Read, Write, Flush, Trim ← The essentials

#### The Unwritten contract is for performance

- Fluctuation in the performance due to the internal SSD complexity that you have seen so far
- Multi-level details to take into consideration

How can you develop your application for the best performance?

### **Unwritten Contract**

### 1. Request Scale

a. Leverage parallelism

#### 2. Locality

a. Respect locality with the FTL mappings

### 3. Align Sequentially

a. Access aligned data to help the FTL

#### 4. Group by Death Time

a. Gives the best chance to GC for clean up

| Rule                                | Impact                  | Metric                |
|-------------------------------------|-------------------------|-----------------------|
| Decement Scale                      | $7.2 \times, 18 \times$ | Read bandwidth        |
| Request Scale                       | $10\times, 4\times$     | Write bandwidth       |
| Lacality                            | 1.6×                    | Average response time |
| Locality                            | $2.2 \times$            | Average response time |
| All Comments                        | $2.5 \times$            | Execution time        |
| Aligned Sequentiality               | $2.4 \times$            | Erasure count         |
|                                     | 4.8×                    | Write bandwidth       |
| Grouping by Death Time              | $1.6 \times$            | Throughput (ops/sec)  |
| CONT. 5 COLO TO TO THE TOTAL TO THE | $1.8 \times$            | Erasure count         |
| Uniform Data Lifetime               | 1.6×                    | Write latency         |

#### 5. Uniform Data Lifetime

a. Group data with the same lifetime to uniformly wear out flash pages

### What You Should Know from this Lecture

- Concepts: wear-leveling, over provisioning, write amplification, steady state
- 2. Three basic FTL designs: page-level, block-level, and hybrid
  - a. Their advantages and disadvantages
- 3. GC design choices
  - a. Trim command
  - b. Various victim block selection algorithms
- 4. Embedded vs host FTL design options
  - a. Advantages of host-based FTL designs
- 5. How do these design choices influence the performance of an SSD

# **Further Reading**

- [Important] Chapter 44, Flash-based SSDs, <a href="http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf">http://pages.cs.wisc.edu/~remzi/OSTEP/file-ssd.pdf</a>
- An Evaluation of Different Page Allocation Strategies on High-Speed SSDs, <a href="https://www.usenix.org/system/files/conference/hotstorage12/hotstorage12-final55.pdf">https://www.usenix.org/system/files/conference/hotstorage12-final55.pdf</a>
- Wear Unleveling: Improving NAND Flash Lifetime by Balancing Page Endurance, <a href="https://www.usenix.org/conference/fast14/technical-sessions/presentation/iimenez">https://www.usenix.org/conference/fast14/technical-sessions/presentation/iimenez</a>
- Jun He, Sudarsun Kannan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 2017. *The Unwritten Contract of Solid State Drives.* In Proceedings of the 12th ACM EuroSys, 2017.
- LightNVM: The Linux Open-Channel SSD Subsystem, <a href="https://www.usenix.org/system/files/conference/fast17/fast17-bjorling.pdf">https://www.usenix.org/system/files/conference/fast17/fast17-bjorling.pdf</a>
- Matias Bjørling, Javier González, and Philippe Bonnet. 2017. LightNVM: the Linux open-channel SSD subsystem. In Proceedings of the 15th Usenix Conference on File and Storage Technologies (FAST'17). USENIX Association, USA, 359–373.
- Eran Gal and Sivan Toledo. 2005. Algorithms and data structures for flash memories. ACM Comput. Surv. 37, 2 (June 2005), 138–163.
- Dongzhe Ma, Jianhua Feng, and Guoliang Li. 2014. A survey of address translation technologies for flash memories. ACM Computing Surveys 46, 3, Article 36 (January 2014), 39 pages.
- Luc Bouganim, Björn Þór Jónsson, Philippe Bonnet: uFLIP: Understanding Flash IO Patterns. CIDR 2009.
- Andrew Birrell, Michael Isard, Chuck Thacker, and Ted Wobber. 2007. A design for high-performance flash disks. *SIGOPS Operating Systems Reviews*. 41, 2 (April 2007), 88–93.
- Feng Chen, Tian Luo, and Xiaodong Zhang. 2011. CAFTL: a content-aware flash translation layer enhancing the lifespan of flash memory based solid state drives. In Proceedings of the 9th USENIX FAST, 2011.
- Aayush Gupta, Youngjae Kim, and Bhuvan Urgaonkar. 2009. DFTL: a flash translation layer employing demand-based selective caching of page-level address mappings. In Proceedings of the 14th ACM ASPLOS, 2009.
- Ji-Yong Shin, Zeng-Lin Xia, Ning-Yi Xu, Rui Gao, Xiong-Fei Cai, Seungryoul Maeng, and Feng-Hsiung Hsu. 2009. FTL design exploration in reconfigurable high-performance SSD for server applications. In Proceedings of the 23rd ACM ICS 2009.

# **Backup Slides**

# Static vs. Dynamic Wear Leveling



|   | LBA         | РВА            | Valid/Flags |
|---|-------------|----------------|-------------|
| / | 8000        | <mark>0</mark> | v           |
|   | 400         | 1              | v           |
|   |             | -              | -           |
|   |             | I              | I           |
|   | LBA         | РВА            | Valid/Flags |
|   | LBA<br>8000 | PBA<br>2       | Valid/Flags |

**Dynamic wear-leveling:** In this example, block 0 and 2 are written continuously, thus, also aged continuously

**But what about block 1?** The FTL did not change the block location 1 because it was never updated.

**Static Wear-leveling:** Cycle around all the blocks (even, the cold, static blocks) too. Time to time, FTL will read the old blocks and just move them around for even wear-leveling

# **Impact of Workload Patterns**



Hot data - frequently updated - is mixed with the cold data - rarely updated in a single block

Everytime the block is updated, we need to copy the cold data from the target block to the new one

If the whole block was hot data ("all validated" at the same time"), easy - just erase

A new write, that is

# **Grouping Data Together**

**Why group together**? To group various write/update patterns together to minimize the effort required to "prepare" a block for GC

We cannot avoid not doing GC - but we can minimize the "prep" time

#### How to group data?

- Based on age: all pages with the "similar" creation and deletion time should be group together
- **Based on temperate:** how (in)frequently a page is updated. Frequently updated data together will expire together quickly, hence, easy to discard the whole block and just erase (no live data)
- Mix of various other policies -- stream/namespace specific policies

# FTL Design Exploration in Reconfigurable High-Performance SSD for Server Applications, ICS 2009

| FTL design choices                                   | Best conditions to use                                                                                 | Pros                                                                                        | Cons                                                                                         |
|------------------------------------------------------|--------------------------------------------------------------------------------------------------------|---------------------------------------------------------------------------------------------|----------------------------------------------------------------------------------------------|
| Static allocation (b, c, d)                          | Dominant sequential IO requests                                                                        | High parallelism, Even distribu-<br>tion of requests                                        | Less benefit for random IO, un-<br>controllable wear level among<br>flash modules            |
| Static allocation (a, e, f, g)                       | No benefit for most cases                                                                              | N.A.                                                                                        | Bad figures for most of evalua-<br>tion metrics                                              |
| Page striping unit                                   | High IO rate                                                                                           | Utilizes many flash modules in<br>parallel                                                  | Small advantage from data local-<br>ity increasing cleaning operation                        |
| Block striping<br>unit                               | Low IO rate                                                                                            | Small number of cleaning opera-<br>tions                                                    | Uneven request distribution in-<br>creasing response time                                    |
| Dynamic alloca-<br>tion with chip<br>allocation pool | Similar ratio of random and<br>sequential IO                                                           | Parallelism for both sequential<br>and random IOs, moderate per-<br>formance for most cases | Potential for data and request<br>skew                                                       |
| Dynamic alloca-<br>tion with SSD<br>allocation pool  | Dominant random IO requests                                                                            | High parallelism for random IO,<br>Even distribution of random re-<br>quests                | Potential for data and request<br>skew, relatively worse perfor-<br>mance for sequential IOs |
| Load balancing                                       | Skewed IO request to few<br>flash modules                                                              | Even utilization of flash modules                                                           | Increased page migration                                                                     |
| Hot/Cold Separa-<br>tion                             | Evenly distributed IO request and hot/cold data                                                        | Less erase, and page copying                                                                | Misclassification of data de-<br>grades performance                                          |
| Large wear level-<br>ing cluster                     | Unevenly distributed IO and<br>erase requests, requirements<br>for even wear level through-<br>out SSD | Even wear level throughout large<br>cluster                                                 | Larger overhead and response time                                                            |
| Small wear level-<br>ing cluster                     | Evenly distributed IO and<br>erase requests, Requirements<br>for small IO response time                | Small overhead for wear leveling                                                            | Potential for uneven wear level<br>among flash modules                                       |

Table 6: Summary of FTL exploration and tradeoffs

# Page-Mapped FTLs: Failure Analysis

1

W(0, a1), W(100,a2), W(101, a3), W(400, a4), W(100, b1)



Invalid, Written, Erased

| LPA | PPA          | Valid/Flags |
|-----|--------------|-------------|
| 0   | 0            | V           |
| 100 | <del>1</del> | "I"         |
| 101 | 2            | V           |
| 400 | 3            | V           |
| 100 | 4            | V           |

Let's consider when a failure happens

- Before 1 : no write has happened then
- Between 1 and 2: (while writing p4) then no state has been changed, the last state remains
- Between 2 and 3: new content has been written on the page 4, but no FTL entry, the last state remains
- Between 3 and 4: new content written, new FTL entry, but the old is not invalidated, device can find out which is the last written state with timestamp and that wins
- After 4 : everything committed, failure will have no effect

In any case - either you get the old content or the new one, the date is not lost