

# Silo: Speculative Hardware Logging for Atomic Durability in Persistent Memory

Ming Zhang, Yu Hua

Huazhong University of Science and Technology, China

# Persistent Memory (PM)



# **Atomic Durability**

- > A group of updates is written to PM in an all-or-nothing manner
- Current 64-bit CPUs support only 8B atomic writes<sup>[1-3]</sup>



Partial updates!
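The partial-update problem can be illustrated with a minimal Python sketch. It assumes a toy byte array standing in for PM and a hypothetical helper `persist_with_crash` that stops after a given number of 8B units; neither comes from the slides.

```python
import struct

ATOMIC_UNIT = 8  # bytes the CPU can write atomically (per the slide)

def persist_with_crash(pm: bytearray, offset: int, new: bytes, crash_after_units: int) -> None:
    """Write `new` into `pm` in 8-byte units, stopping early to mimic a crash."""
    for unit, start in enumerate(range(0, len(new), ATOMIC_UNIT)):
        if unit == crash_after_units:
            return  # power failure: the remaining units never reach PM
        pm[offset + start: offset + start + ATOMIC_UNIT] = new[start:start + ATOMIC_UNIT]

# A 16-byte record (two 64-bit fields) that must change together.
pm = bytearray(struct.pack("<QQ", 1, 1))
persist_with_crash(pm, 0, struct.pack("<QQ", 2, 2), crash_after_units=1)
print(struct.unpack("<QQ", pm))  # prints (2, 1): a partial update
```

Each 8B unit lands atomically, yet the record as a whole is left half old and half new.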

# Write-Ahead Logging in Transaction

> Back-up data before updates to ensure atomic durability












# **Hardware Logging**

#### Software Logging

```
Tx_begin
create Log
write Log
flush Log
sfence
write data
flush data
sfence
Tx_end
```



Log operations exist on the critical path
Throughput decreases by up to **70**%<sup>[1]</sup>
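The software logging sequence above can be sketched in Python as a toy undo log. `SimPM`, `tx_update`, and `recover` are hypothetical names for this illustration only; the volatile cache and the persistent media are modeled as plain dictionaries, and `flush` stands in for a flush followed by `sfence`.

```python
class SimPM:
    """Toy persistent memory: stores sit in a volatile cache until flushed."""
    def __init__(self, initial):
        self.cache = {}              # volatile, lost on a crash
        self.media = dict(initial)   # persistent, survives a crash

    def read(self, addr):
        return self.cache.get(addr, self.media.get(addr))

    def write(self, addr, value):
        self.cache[addr] = value

    def flush(self, addr):
        """Stand-in for flush + sfence: the value is durable on return."""
        if addr in self.cache:
            self.media[addr] = self.cache[addr]

    def crash(self):
        self.cache = {}


def tx_update(pm, updates, crash_mid_flush=False):
    """Follow the slide's order: log, flush log, then data, flush data."""
    log = {addr: pm.read(addr) for addr in updates}   # create Log (old values)
    pm.write("log", log)                               # write Log
    pm.flush("log")                                    # flush Log, sfence
    for addr, value in updates.items():
        pm.write(addr, value)                          # write data
    for flushed, addr in enumerate(updates):
        if crash_mid_flush and flushed == 1:
            pm.crash()                                 # only part of the data is durable
            return
        pm.flush(addr)                                 # flush data, sfence
    pm.write("log", None)                              # Tx_end: invalidate the log
    pm.flush("log")


def recover(pm):
    """Roll back with the undo log if a transaction did not finish."""
    log = pm.media.get("log")
    if log:
        for addr, old in log.items():
            pm.media[addr] = old
        pm.media["log"] = None


pm = SimPM({"A": 1, "B": 1})
tx_update(pm, {"A": 2, "B": 2}, crash_mid_flush=True)
recover(pm)
print(pm.media["A"], pm.media["B"])  # prints 1 1
```

Every step of `tx_update` runs on the critical path of the transaction, which is the cost the slide points at.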

#### Hardware Logging





Offload logging operations to hardware

- ✓ Better performance
- ✓ Easy programming

### State-of-The-Art

## Log as backup



# Challenges

Tx\_begin
write A
write B
Tx\_end

**Heavy Writes** 



Logging enables data recovery after a system crash, but increases write traffic

→ Hurts PM endurance





#### Hardware undo+redo

- **FWB**<sup>[1]</sup> writes logs to PM before the updated data for each write
- MorLog<sup>[2]</sup> flushes logs to PM before commit to ensure durability
  - → Increase latency

# **Key Ideas**

- > Speculative Logging
  - Crashes are rare for a single machine<sup>[1-2]</sup>
  - → Do not conservatively write logs to PM in common cases (no failures)
  - → Only write logs to PM in rare cases (e.g., crashes) to guarantee atomic durability
- > Log as Data
  - Logs are able to record the new data
  - → Use on-chip logs to in-place update the PM data after commit in common cases

Make the common case fast and guarantee recoverability
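The two key ideas can be modeled with a small Python sketch. `SiloModel` and its methods are hypothetical names; treating uncommitted entries as undo logs and committed entries as redo logs on a crash is an assumption of this sketch, not something the slide states.

```python
class SiloModel:
    """Toy model: PM data region, PM log region, battery-backed on-chip log buffer."""
    def __init__(self, data):
        self.pm_data = dict(data)   # persistent data region
        self.pm_log = []            # persistent log region (written only on a crash)
        self.log_buffer = {}        # on-chip: addr -> (old, new)
        self.committed = False

    def tx_write(self, addr, new):
        # Keep the oldest value seen in this transaction as the undo value.
        old = self.log_buffer[addr][0] if addr in self.log_buffer else self.pm_data[addr]
        self.log_buffer[addr] = (old, new)

    def evict(self, addr):
        """A cacheline eviction may push uncommitted new data to PM early."""
        self.pm_data[addr] = self.log_buffer[addr][1]

    def commit(self):
        self.committed = True       # common case: no log is written to PM

    def log_update(self):
        """Common case: after commit, apply new data from on-chip logs in place."""
        for addr, (_, new) in self.log_buffer.items():
            self.pm_data[addr] = new
        self.log_buffer.clear()
        self.committed = False

    def crash(self):
        """Rare case: battery power flushes the on-chip logs to the PM log region."""
        kind = "redo" if self.committed else "undo"   # assumption of this sketch
        self.pm_log = [(kind, a, o, n) for a, (o, n) in self.log_buffer.items()]
        self.log_buffer.clear()

    def recover(self):
        for kind, addr, old, new in self.pm_log:
            self.pm_data[addr] = new if kind == "redo" else old
        self.pm_log = []
        self.committed = False


m = SiloModel({"A": 1, "B": 1})
m.tx_write("A", 2)
m.tx_write("B", 2)
m.commit()
m.log_update()
print(m.pm_data, m.pm_log)  # prints {'A': 2, 'B': 2} []
```

In the failure-free path the PM log region stays empty; it is only written inside `crash`.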



# Silo: Speculative Hardware Logging



# Log Reduction

> Reduce the size of on-chip log buffer based on write behaviors

#### **Log Ignorance**

A write does not modify the data

- E.g., copy and assignment<sup>[1]</sup>
- Old data == New data



Log generator ignores this write

Does not produce log entry

#### **Log Merging**

Multiple writes modify the same data

- Temporal locality of programs
- Only the oldest and newest data are required
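Log ignorance and log merging can be sketched together in a few lines of Python. The function name `generate_logs` and the `{addr: (oldest, newest)}` layout are assumptions made for illustration; the real log generator is a hardware unit.

```python
def generate_logs(writes, memory):
    """Return log entries {addr: (oldest, newest)} for a list of (addr, value) writes."""
    logs = {}
    current = dict(memory)          # values as seen by the running transaction
    for addr, new in writes:
        old = current[addr]
        if new == old:
            continue                # log ignorance: the write changes nothing
        if addr in logs:
            logs[addr] = (logs[addr][0], new)   # log merging: oldest old, newest new
        else:
            logs[addr] = (old, new)
        current[addr] = new
    return logs


memory = {"A": 0, "B": 5}
writes = [("A", 1), ("B", 5), ("A", 2), ("A", 3)]
print(generate_logs(writes, memory))  # prints {'A': (0, 3)}
```

Four writes produce a single log entry here: the write to `B` is ignored and the three writes to `A` are merged.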



# Log Update

- > Use the new data in on-chip logs to in-place update the data region
- Does not block cacheline evictions
  - Set the flush-bit to 1 to discard the log after commit if an updated cacheline is evicted



#### > Benefits

- Write reduction: Don't write logs to PM in common cases
- No ordering constraints: Don't wait for flushing logs (and cachelines) to the log (and data) regions

# **Write Coalescing**

- Silo allows two update paths
  - 8B: Log in-place Updates (LU)
  - 64B: Cacheline Evictions (CE)
- LU and CE are coalesced inside an on-PM buffer
  - W1-W3 have overlapped bytes
  - W4-W5 are not overlapped
  - W6 is merged into cachelines
- > Correctness: No race risk
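A rough Python sketch of coalescing 8B log updates (LU) and 64B cacheline evictions (CE) per cacheline follows. `CoalescingBuffer` is a hypothetical name, and the policy that the later arrival wins on overlapping bytes is an assumption of this sketch; the slide does not spell out how overlaps are resolved.

```python
LINE = 64   # cacheline size in bytes
WORD = 8    # size of one log in-place update

class CoalescingBuffer:
    def __init__(self):
        self.lines = {}  # cacheline base address -> {byte offset: byte value}

    def log_update(self, addr, data: bytes):
        """8B path: an in-place update taken from an on-chip log entry."""
        assert len(data) == WORD and addr % WORD == 0
        self._merge(addr, data)

    def cacheline_evict(self, base, data: bytes):
        """64B path: a whole cacheline evicted from the cache hierarchy."""
        assert len(data) == LINE and base % LINE == 0
        self._merge(base, data)

    def _merge(self, addr, data):
        base = addr - addr % LINE
        pending = self.lines.setdefault(base, {})
        for i, b in enumerate(data):
            pending[addr - base + i] = b   # assumption: later arrival wins

    def drain(self, pm: bytearray) -> int:
        """Write each pending cacheline once; return the number of media writes."""
        writes = 0
        for base, pending in self.lines.items():
            for off, b in pending.items():
                pm[base + off] = b
            writes += 1
        self.lines.clear()
        return writes


pm = bytearray(128)
buf = CoalescingBuffer()
buf.log_update(0, b"\x01" * 8)
buf.log_update(0, b"\x02" * 8)             # overlaps the previous update
buf.log_update(8, b"\x03" * 8)             # same cacheline, different bytes
buf.cacheline_evict(64, bytes([9]) * 64)
buf.log_update(72, b"\x04" * 8)            # merged into the evicted cacheline
print(buf.drain(pm))  # prints 2
```

Five incoming writes drain as two cacheline-sized writes in this toy model.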



- ① Flush-bit in log is 1. CE updates the data region
- ② LU and CE are coalesced to update the data region
- ③ LU writes the data region. CE will not write twice\*



### **Rare Cases**

> Silo writes logs to guarantee correctness in two rare cases



### **Evaluation**

#### > Benchmarks

- Micro-benchmarks
  - Array, Btree, Hash, Queue, RBtree
- Macro-benchmarks
  - TPCC, YCSB

#### > Comparisons

- Base: A hardware logging baseline
- **FWB**<sup>[2]</sup>: The hardware logging design of FWB
- MorLog<sup>[3]</sup>: The morphable hardware logging
- LAD<sup>[4]</sup>: The logless atomic durability design
- Silo: Our speculative logging design

#### Gem5 Simulation

| Component             | Configuration                                  |
|-----------------------|------------------------------------------------|
| **Processor**         |                                                |
| Cores                 | 8 cores, x86-64, 2 GHz                         |
| L1 I/D                | Private, 64B per line, 32KB, 8-way, 4 cycles   |
| L2                    | Private, 64B per line, 256KB, 8-way, 12 cycles |
| LLC                   | Shared, 64B per line, 8MB, 16-way, 28 cycles   |
| Mem Ctrl              | FRFCFS, 64-entry queue in ADR domain           |
| Log Buffer            | 680B per core, FIFO, 8 cycles, battery-backed  |
| **Persistent Memory** |                                                |
| Capacity              | 16GB phase-change memory                       |
| Latency               | Read / Write: 50 / 150 ns <sup>[1]</sup>       |

# **Transaction Throughput**

"Log as data": No ordering constraints

Do not wait to persist logs and cachelines



Annotations on the compared designs:

- Wait to persist logs and cachelines
- Wait to persist cachelines: $L1 \rightarrow LLC \rightarrow MC$

| Silo's throughput improvement over     | 1 core | 8 cores |
|----------------------------------------|--------|---------|
| Existing hardware logging designs      | 1.4x   | 4.3x    |
| Existing hardware logless design (LAD) | 1.1x   | 1.5x    |

### **Write Traffic**



# Overhead of Log Buffer



| Battery consumption*                     | Intel's eADR | BBB@HPCA'21   | Our Silo      |
|------------------------------------------|--------------|---------------|---------------|
| Flush Size for 8 cores (KB)              | 10,496       | 16            |               |
| Flush Energy (µJ)                        | 54,377       | 194           |               |
| Supercapacitor (size: mm³; area: mm²)    | 151; 28.4    | 0.54; 0.66    | 0.17; 0.31    |
| Lithium thin-film (size: mm³; area: mm²) | 1.51; 1.32   | 0.0054; 0.031 | 0.0017; 0.014 |

Silo's battery is smaller than eADR's by 888.2x (size) and 91.6x (area), and smaller than BBB's by 3.2x (size) and 2.1x (area).
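As a quick check, the 888.2x; 91.6x and 3.2x; 2.1x factors quoted with the table match the supercapacitor row (size; area) when each competitor's value is divided by Silo's:

$$
\frac{151}{0.17} \approx 888.2, \qquad \frac{28.4}{0.31} \approx 91.6, \qquad \frac{0.54}{0.17} \approx 3.2, \qquad \frac{0.66}{0.31} \approx 2.1
$$

The first two ratios compare Silo with eADR and the last two compare it with BBB.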

### More Results

- Handle large transactions
  - Log overflow occurs
  - Throughput decreases by only 7.4%
- Change latency of log buffer
  - A 128-cycle log buffer only decreases the throughput by 3.3% over an 8-cycle one

Find more details in our paper!



[...] overflowed logs to be flushed in parallel with generating new logs. Fig. 14b shows that the write traffic only increases by up to 1.9× on average since Silo flushes the overflowed undo logs in a batch manner to mitigate the write amplification to PM media. The performance on Btree, Hash, Queue, and RBtree decreases when running large transactions due to writing extra overflowed logs. Note that Array shows stable performance since most of the logs are ignored as analyzed in § VI-D. Hence, the logs do not frequently overflow. Moreover, the results on TPCC and YCSB keep stable due to their good locality, which enables substantial logs to be merged on chip. In summary, the log overflow does not always occur in large transactions. Even if it occurs, Silo does not abort transactions or incur severe performance degradations.

We study how the access latency of the log buffer affects the performance. We change the latency from 8 to 128 cycles to cover various buffer types (e.g., SRAM). The throughputs of micro-/macro-benchmarks are normalized to Array/TPCC using an 8-cycle buffer. Fig. 15 shows that the throughput generally keeps stable when increasing the latency. In Silo, the CPU store does not need to wait for writing logs to the buffer during transaction execution, and the new data in logs are read from the buffer to update the data region in the background after commit. Thus, reading or writing the log buffer is not on the critical path. Using a 128-cycle buffer only decreases the throughput by 3.3% over an 8-cycle one on average. Moreover, the write traffic is not affected when changing the latency. In summary, the latency of log buffer has negligible effect on the efficiency of Silo.

#### VII. RELATED WORK

WAL for Atomic Durability. Software loggings [12], [14], [48], [56] rely on CPU instructions to enforce the durability order between logs and data. DudeTM [34] and SoftWrAP [17] use a DRAM cache to remove the persist operations from the critical path, but need to track the data versions. Unlike these studies, Silo adopts the hardware logging approach.

[...] transaction execution. Prior hardware undo loggings [28], [46] need to persist all the updated data before commit. ASAP [2] asynchronously persists the undo logs and the updated data [...]cies. Existing redo schemes [16], [25], [27], [51] enforce the ordering between redo logs and data. DHTM [27] writes redo logs to provide durability for hardware transactional memory, but the transaction size is limited by LLC. CCHL [51] compresses and consolidates logs to reduce writes. Legacy undo+redo designs [38], [52] exploit the benefits of undo and redo loggings, but still write extra logs. Unlike them, Silo uses the on-chip logs to directly in-place update the data region in common failure-free cases, thus reducing the overheads.

Multi-Versioning Schemes for Atomic Durability. Atomic durability can be ensured by multi-versioning [10], [18], [35], [61]. Kiln [61] uses a non-volatile last level cache (NVLLC) to store the updated data. LAD buffers the updated cachelines in memory controller until committed to PM. Kamino-Tx [35] maintains the main and backup versions of data regions in PM. HOOP [10] designs an indirection layer that redirects the addresses for out-of-place updates. Unlike them, Silo adopts hardware logging to ensure atomic durability, while enabling in-place updates without the need for NVLLC, data region backups, and physical address redirections.

Crash Consistency for Single Operations. Some studies guarantee the crash consistency for single operations on PM. They can be divided into two categories. First, the software-based data structures, such as NVTree [59], Fast&Fair [20], Level Hashing [64], and MOD [19], leverage customized schemes to ensure the consistency for single updates. Second, the hardware-based schemes, such as eADR [22] and BBB [5], adopt battery-backed caches to persist CPU writes. Orthogonal to these studies, our Silo focuses on the atomic durability for a group of updates based on the ACID transaction.

In order to ensure atomic durability for persistent memory (PM), this paper proposes Silo, a speculative hardware logging approach that leverages the new data in the on-chip logs to in-place update the PM data region in common failure-free cases. Hence, it is unnecessary to write logs to the PM log region to back up data, thus improving the performance and reducing the overheads. Only in rare cases, e.g., system crashes, Silo selectively flushes necessary on-chip logs to PM for data recovery without any loss of correctness. Experimental results demonstrate that Silo significantly outperforms state-of-the-art studies in terms of transaction throughput and write traffic.

#### ACKNOWLEDGMENTS

[...]ence Foundation of China (NSFC) under Grant No. 62125202 and U22B2022, and Key Laboratory of Information Storage System, Ministry of Education of China, [...]

### Conclusion

- > Ensuring atomic durability is important for persistent memory (PM)
- Prior hardware logging studies: Log as Backup
  - Heavy writes to PM
  - Ordering constraints between persisting logs and data
- > We propose a speculative logging design Silo: Log as Data
  - Use on-chip logs to in-place update data (<u>Make common case fast</u>)
  - Write logs to back up data in rare cases (<u>Guarantee recoverability</u>)

#### > Benefits

- Improve transaction throughput
- Reduce write traffic to PM
- Low hardware overhead

# Thank you!