#### Abstract

#### Optimizing Memory Management for Disaggregated Architectures

Yupeng Tang

2024

The increasing demand for scalable and efficient data center architectures has led to the adoption of resource disaggregation, which separates compute, memory, and storage resources across various interconnects. This paradigm shift from traditional monolithic server architectures allows for more flexible resource allocation and utilization. Memory disaggregation, in particular, addresses the bottleneck issues of traditional setups by decoupling memory resources, presenting them as pooled resources accessible on demand. This approach enhances efficiency, scalability, and adaptability, especially for memory-intensive workloads.

However, transitioning existing applications to a disaggregated architecture presents significant challenges due to the mismatch between current cloud stacks designed for monolithic systems and the requirements of disaggregated systems. These challenges span across different layers of the stack, including application interfaces, OS support, performance overheads, and the limitations of existing interconnect technologies. This dissertation focuses on addressing these challenges, particularly in the context of memory management within disaggregated architectures.

Our approach involves a comprehensive examination of the requirements for successful disaggregation, proposing strategies to mitigate performance penalties and enhance resource management. By adopting a top-down perspective, we aim to bridge the gap between service layers and core hardware elements, ultimately facilitating the transition to disaggregated data center architectures.

### Optimizing Memory Management for Disaggregated Architectures

A Dissertation
Presented to the Faculty of the Graduate School
of
Yale University
in Candidacy for the Degree of
Doctor of Philosophy

by
Yupeng Tang

Dissertation Director: Anurag Khandelwal

December 2024

Copyright © 2024 by Yupeng Tang
All rights reserved.

# Contents

| Section | Title |
|---------|-------|
|         | Acknowledgements |
| 1       | Introduction |
| 1.1     | Limitations of Existing Approaches |
| 1.2     | Thesis Overview |
| 1.2.1   | Memory management as a Service |
| 1.2.2   | In-network memory management OS-design |
| 1.2.3   | Memory management adaptation for new-generation interconnects |
| 1.3     | Outline and Previously Published Work |
| 2       | Memory Management as a Service |
| 2.1     | Elastic memory management for data analytics |
| 2.2     | Introduction |
| 2.3     | Motivation |
| 2.4     | Jiffy Design |
| 2.4.1   | Overview |
| 2.4.2   | Hierarchical Addressing |
| 2.4.3   | Data Lifetime Management |
| 2.4.4   | Flexible Data Repartitioning |
| 2.5     | Implementation |
| 2.5.1   | Jiffy Interface |
| 2.6     | Implementation |
| 2.6.1   | Jiffy Controller |
| 2.6.2   | Jiffy Data Plane |
| 2.7     | Jiffy Programming Model |
| 2.7.1   | Map-Reduce Model |
| 2.7.2   | Dataflow and Streaming Dataflow Models |
| 2.7.3   | Piccolo |
| 2.8     | Applications and Evaluation |
| 2.9     | Related Work |
| 2.10    | Conclusion |
| 3       | Operating System Layer |
| 3.1     | Hierarchical OS design |
| 3.2     | In-Network Memory Management |
| 3.2.1   |  |
| 3.2.2   | Background and Motivation |
| 3.2.3   | MIND Design |
| 3.2.4   |  |
| 3.2.5   | Evaluation |
| 3.2.6   | Discussion and Conclusion |
| 3.3     | Near Memory Processing |
| 3.3.1   | Introduction |
| 3.3.2   | Motivation and Pulse Overview |
| 3.4     | PULSE Programming Model |
| 3.4.1   | Accelerating Pointer Traversals on a Node |
| 3.5     | Distributed Pointer Traversals |
| 3.6     | Evaluation |
| 3.6.1   | Performance for Real-world Applications |
| 3.6.2   | Understanding Pulse Performance |
| 3.7     | Future Trends and Research |
| 4       | Hardware Layer |
| 4.1     | Next generation Interconnects |
| 4.2     | Introduction |
| 4.3     | Background and Methodology |
| 4.3.1   | Compute Express Link (CXL) Overview |
| 4.3.2   | Hardware Support for CXL |
| 4.3.3   | Software Support for CXL |
| 4.3.4   | Experimental Platform Description |
| 4.4     | CXL 1.1 Performance Characteristics |
| 4.4.1   | Experimental Configuration |
| 4.4.2   | Basic Latency and Bandwidth Characteristics |
| 4.4.3   | Different Read-Write Ratios & Access Pattern |
| 4.4.4   | Key insights |
| 4.5     | Memory Capacity-bound Applications |
| 4.5.1   | In-memory key-value stores |
| 4.5.2   | Spark SQL |
| 4.5.3   | Spare Cores for Virtual Machine |
| 4.6     | Memory Bandwidth-Bound applications |
| 4.6.1   | Methodology and Software Configurations |
| 4.6.2   | Analysis |
| 4.6.3   | Insights |
| 4.7     | Cost Implications |
| 5       | Future Work |
| 5.1     | Introduction |
| 5.2     | CXL-based KV Cache Offloading |
| 5.3     | Performance Evaluation |
| 5.3.1   | Measurements on CXL-GPU interconnect performance |
| 5.3.2   | Evaluation on TTFT under varying input context length |
| 5.3.3   | Evaluation on serving throughput while adhering SLO |
| 5.4     | Cost-Efficiency Modeling |
| 5.5     | Conclusion |
| A       | Appendix |
| A.      | Multiplexing $M+N$ Iterator Executions for Maximizing Pipeline Utilization |
| A.1     | PULSE Empirical Analysis |
| B.      | PULSE Supported Data Structures |
| B.1     | List data structure in STL library |
| B.2     | List data structure in Boost library |
| B.3     | Tree data structure in Google library |
| B.4     | Tree data structure in STL library |
| B.5     | Tree data structure in Boost library |
| C.      | PULSE Additional Evaluation Results |
| C.1     | Traditional Core Architecture vs. PULSE |
| C.2     | Network and Memory Bandwidth Utilization |
| C.3     | PULSE Sensitivity Analysis |

# List of Figures

| Figure | Caption |
|--------|---------|
| 1.1    | Cloud Stack of Disaggregated Architecture. |
| 3.1    | Need for accelerating pointer traversals. (top) The performance of pointer traversals in disaggregated architectures is bottlenecked by slow memory interconnect. (bottom) Just as caches offer limited but fast caches near CPUs, we argue that memory needs a counterpart for traversal-heavy workloads: a lightweight but fast accelerator for cache-unfriendly pointer traversals. |
| 3.2    | Time cloud applications spend in pointer traversals. See §3.3.2 for details. |
| 3.3    | PULSE Overview. Developers use PULSE's iterator interface (§3.4) to express pointer traversals, translated to Pulse ISA by its dispatch engine (§3.4.1). During execution, Pulse accelerator ensures energy efficiency (§3.4.1) and in-network design enable distributed traversals (§3.5). |
| 3.4    | PULSE accelerator architecture. (top) Traditional multi-core architectures with tightly coupled logic and memory pipelines result in low utilization and longer execution times. (bottom) PULSE accelerator's disaggregated design with an unequal number of logic and memory pipelines efficiently multiplexes concurrent iterator executions across them for near-optimal utilization and performance. |
| 3.5    | PULSE accelerator overview. See §3.4.1 for details. |
| 3.6    | Hierarchical translation & distributed traversal (§3.5). |
| 3.7    | Application latency (top) & throughput (bottom) (§3.6.1). The darker color indicates the time spent on cross-node pointer traversals, which increases with the number of memory nodes in WiredTiger and BTrDB. |
| 3.8    | Application energy consumption per operation (§3.6.1). |
| 3.9    | Impact of distributed pointer traversals (§3.6.2). |
| 3.10   | Latency breakdown for PULSE accelerator (§3.6.2). |
| 3.11   | Slowdown with simulated CXL interconnect (§3.7). |
| 4.1    | CXL Overview. In this study, we focus on commercial CXL 1.1 Type-3 devices, leveraging CXL.io and CXL.mem protocols for memory expansion in single-server environments. |
| 4.2    | CXL Experimental Platform. (a) Each CXL server is equipped with two A1000 memory expansion cards. SNC-4 (§4.4.1) is enabled only for the raw performance benchmarks (§4.4) and bandwidth-bound benchmarks (§4.6), and each SNC Domain is equipped with two DDR5 channels. (a) illustrates Socket 0; Socket 1 shares a similar setup except for the absence of CXL memory. (b) Our platform comprises two CXL servers and one baseline server. The baseline server replicates the same configuration but lacks any CXL memory cards. |
| 4.3    | Overall effect of read-write ratio on MMEM and CXL across different distances. The workloads are represented by read:write ratios (e.g., 0:1 for write-only, 1:0 for read-only). Accessing CXL memory locally incurs higher latency compared to MMEM but is more comparable to accessing MMEM on a remote socket. MMEM bandwidth peaks at 67 GB/s, versus 54.6 GB/s for CXL memory. Performance significantly declines when accessing CXL memory on a remote socket (§4.4.2). In specific scenarios, such as the write-only workload (0:1) in (b), the plot may show instances where bandwidth decreases and latency increases with heavier loads. The Y-axis is on a logarithmic scale. |
| 4.4    | A detailed comparison of MMEM versus CXL over diverse NUMA/socket distances and workloads. (a)-(f) shows the latency-bandwidth trend difference of accessing data from different distances in sequential access pattern, sorted by the proportion of write. We refer to main memory as MMEM, with MMEM-r and CXL-r representing remote socket MMEM and cxl memory access, respectively. The Y-axis is on a logarithmic scale. |
| 4.5    | KeyDB YCSB latency and throughput under different configurations. (a) Average throughput of four YCSB workload under different system configuration. (b) Tail latency of YCSB-A (c) Tail latency CDF of YCSB-C, both reported by the YCSB client [199]. |
| 4.6    | Spark memory layout and shuffle spill. Each Spark executor possesses a fixed-size On-Heap memory, which is dynamically divided between execution and storage memory. If there is insufficient memory during shuffle operations, the Spark executor will spill the data to the disk. |
| 4.7    | Spark execution time and shuffle percentage. (a) Execution time of each TPC-H query normalized to the execution time running on MMEM. (b) The percentage of time spent of shuffle operation for each query. The solid bars represent shuffle writes, while hollow bars represent shuffle reads. |
| 4.8    | KeyDB Performance with YCSB-C on CXL/MMEM. |
| 4.9    | LLM inference framework. The Httpserver receive requests and forward the tokenized requests to the CPU inference backend. The CPU inference backend serves the requests and reply the next token. |
| 4.10   | CPU LLM inference. |
| 5.1    | Experiment results. Please use the same y-axis title for (b) and (c). |
| 5.2    | Example of ROI modeling: replace computation with memory access. |
| A.1    | Time cloud applications spend in pointer traversals based on prior studies. |
| A.2    | Network and memory bandwidth utilization. PULSE and RPC utilize over 90% of the available memory bandwidth, while the cache-based approach suffers from swap system overhead. In Webservice, the network bandwidth becomes the bottleneck due to large 8 KB data transfers. |
| A.3    | (a) PULSE latency is up to $1.3\times$ lower for skewed than uniform access patterns due to caching. (b) Offloaded allocations in PULSE improve the WebService request latencies as the proportion of writes increases by reducing the number of round trips per allocation. |
| A.4    | Sensitivity to traversal length and the number of memory pipelines. (a) PULSE latency scales linearly with the length of traversal. (b) PULSE accelerator can saturate memory bandwidth with just two PULSE memory pipelines. |
| A.5    | Allocation policy. PULSE performs better with the partitioned allocation since it minimizes cross-node traversals. |
| A.6    | Application performance using workload with uniform distribution. |

# List of Tables

| Table | Caption |
|-------|---------|
| 3.1   | PULSE adapts a restricted subset of RISC-V ISA (§3.4.1) |
| 3.2   | Workloads used in our evaluation (§3.6). $t_c$ and $t_d$ correspond to compute and memory access time at the PULSE accelerator |
| 4.1   | Configurations used in capacity experiments |
| 4.2   | Intel Processor Series |
| 4.3   | Parameters of our Abstract Cost Model |
| 5.1   | ROI Modeling |
| A.1   | Additional data structure supported by PULSE |
| A.2   | Comparison between traditional core architecture and Pulse architecture |

# Acknowledgements

A lot of people are awesome. Probably your family, friends, advisor, and that one super special high school teacher who believed in you.

## Chapter 1

## Introduction

The growing demand for scalable and efficient data center architectures has led to the emergence of resource disaggregation [1–9]. This modern paradigm represents a significant shift from traditional monolithic server architectures. In conventional setups, servers are typically equipped with a fixed combination of compute, memory, and storage resources. In contrast, resource-disaggregated systems physically separate these resources and distribute them across various interconnects, such as networks [1–3], CXL [10,11], and others. This separation allows for more flexible resource allocation and utilization.

Within the broader context of resource disaggregation in modern data center architectures, memory disaggregation [4–9] plays a crucial and foundational role. In traditional monolithic server configurations, memory often becomes a bottleneck, limiting the scalability and adaptability of applications. This issue has been frequently observed and reported in production data centers [12–21]. By decoupling memory resources from compute and storage elements and presenting them as pooled, disaggregated resources [22,23], data centers can achieve increased efficiency, scalability, and adaptability. Memory-intensive applications [24–26] can access the memory they need on demand, without being constrained by the limitations of individual servers.

Memory disaggregation is the first step toward realizing the full potential of resource disaggregation, enabling data centers to efficiently allocate and utilize resources based on dynamic application needs. This ultimately leads to improved performance and better resource utilization.



Fig. 1.1: Cloud Stack of Disaggregated Architecture.

### 1.1 Limitations of Existing Approaches

While resource disaggregation offers numerous advantages, transitioning existing applications to a disaggregated architecture is far from straightforward. Recent research has explored various approaches to address this challenge. Some efforts focus on adapting applications to optimize their use of disaggregated memory [27–30], while others aim to transparently port applications, shifting the responsibility of mitigating performance penalties—arising from the mismatch between disaggregated architectures and traditional software interfaces—to the service or operating system layer [1,2,31–34].

The core issue lies in the fundamental mismatch between the existing cloud stack, designed for monolithic architectures, and the requirements of disaggregated architectures (Figure 1.1). The current cloud and hardware stacks are not inherently aware of the unique characteristics of disaggregated memory, leading to distinct challenges across different layers of the stack:

**Application interface.** In disaggregated architectures, applications face unique challenges compared to traditional monolithic systems. The primary difference is resource distribution: compute, memory, and storage are spread across multiple nodes instead of centralized in one server. This requires complex communication and data management strategies to handle increased latency and resource management needs. In contrast, monolithic architectures offer integrated resources, simplifying application interaction. Adapting to disaggregated systems involves significantly redesigning applications for effective resource utilization and management.

**OS support.** Unlike monolithic servers where the OS manages resources within a single server, the placement and function of the OS in disaggregated architectures are still subjects of debate in both industry and academia. Options include centralizing the OS at a single point [1] in the architecture or disaggregating its functions across different resource nodes [2].

**Performance overheads of disaggregation.** Transitioning existing applications to a disaggregated architecture transparently introduces a spectrum of performance challenges. These include, but are not limited to, managing memory partitioning [35] and addressing applications with irregular memory access patterns [36]. Various other issues, such as latency sensitivity, bandwidth limitations, and the overhead of remote resource management, compound this complexity. These factors contribute to the overall performance penalty that disaggregated systems must carefully consider and mitigate.

**Future interconnects.** Using networks as interconnects for resource disaggregation has been a subject of exploration in academia and industry. However, networks have inherent challenges, such as performance slowdowns compared to intra-server resource access and a lack of inherent coherency. Advanced hardware technologies like Compute Express Link (CXL) [10, 11, 37] offer promising enhancements with faster access times and hardware-supported cache coherence. Yet, the current state of hardware prototypes and software support for these technologies remains limited.

### 1.2 Thesis Overview

In this dissertation, we take a top-down approach and explore optimal memory management solutions for the three most significant layers of disaggregated memory architectures: the service, OS, and hardware layers.

#### 1.2.1 Memory management as a Service

With minimal modification to lower layers such as the OS and hardware, we explore the design requirements and challenges of providing memory management as a service. We propose an end-to-end system design called Jiffy, which enables multiple applications and tasks to multiplex memory in an elastic manner. Jiffy also provides interfaces for several popular data structures and can be easily applied to existing cloud applications.

#### 1.2.2 In-network memory management OS-design

Because disaggregated architectures decouple compute and memory resources, there is no single host, as there is in a monolithic architecture, on which to implement the key unit of resource management: the operating system. We propose a new-generation operating system design that places OS functionality inside the interconnect. We start with a system called MIND, which addresses the basic problems in memory management, such as memory address translation, memory protection, and cache coherence between multiple hosts. Such resource decoupling and in-network memory management serves cache-friendly workloads well, but performs poorly for cache-unfriendly workloads due to the back-and-forth communication over the slower interconnect. We therefore develop optimizations for cache-unfriendly workloads: we design and implement a near-memory accelerator from scratch, named PULSE. PULSE builds on an analysis of popular pointer-traversal applications and identifies a common but simple interface that can be easily integrated into existing cloud applications.
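To illustrate why cache-unfriendly pointer traversals suffer from back-and-forth communication, the following is a minimal back-of-the-envelope sketch in Python. It is a simplified model under stated assumptions, not the PULSE design or a measurement from this dissertation: the function name `traversal_latency_us` and the latency values are hypothetical and chosen only for illustration.

```python
import math


def traversal_latency_us(hops, rtt_us, mem_access_us, offloaded):
    """Estimate the latency of following `hops` dependent pointers in remote memory.

    Simplified model: each pointer dereference depends on the previous one,
    so the accesses cannot be overlapped.
    """
    if offloaded:
        # One request/response crosses the interconnect; every hop then
        # executes next to the memory that holds the data.
        return rtt_us + hops * mem_access_us
    # Without offload, the compute node pays one interconnect round trip
    # per dereference, because it must see each pointer before the next fetch.
    return hops * (rtt_us + mem_access_us)


# Assumed, illustrative numbers (not taken from the dissertation).
HOPS, RTT_US, MEM_US = 10, 5.0, 0.1

per_hop = traversal_latency_us(HOPS, RTT_US, MEM_US, offloaded=False)
near_memory = traversal_latency_us(HOPS, RTT_US, MEM_US, offloaded=True)
print(f"per-hop round trips: {per_hop:.1f} us, near-memory: {near_memory:.1f} us")
```

Under these assumed values, the interconnect round trip dominates the per-hop case, which is the effect that motivates executing traversals near memory.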

#### 1.2.3 Memory management adaptation for new-generation interconnects

In prior work [1,2], Ethernet is considered the most popular interconnect for disaggregated data centers. However, as new memory interconnects such as Compute Express Link (CXL) emerge, memory management must be adapted to the new interconnect interface. Within the context of disaggregated architecture, new problems arise, such as how applications can leverage multiple tiers of memory. Therefore, we start with a performance analysis of CXL 1.1 single-host extended memory, and then propose a new system design that integrates a disaggregated CXL memory pool with an emerging, popular application: LLM inference.

### 1.3 Outline and Previously Published Work

This dissertation is organized as follows. Chapter 2 introduces Jiffy, a distributed memory management system that decouples memory capacity and lifetime from compute in the serverless paradigm. Chapter 3 describes two innovative system designs: (1) MIND, a rack-scale memory disaggregation system that uses programmable switches to embed memory management logic in the network fabric, and (2) PULSE, a framework centered on enhancing in-network optimizations for irregular memory accesses within disaggregated data centers. Chapter 4 presents our exploration of the latest Compute Express Link (CXL) hardware. We conclude with our contributions and possible future work directions in Chapter 5.

Chapter 2 revises material from [35]. Chapter 3 revises material from [1] and [36]. Finally, Chapter 4 revises material from [38].

## Chapter 2

## Memory Management as a Service

The service layer, positioned above the OS layer, plays a pivotal role in facilitating efficient and seamless memory sharing across multiple computing and memory nodes within a disaggregated architecture. As application software, it provides greater flexibility than the operating system, allowing for a variety of services to be offered to applications. These adaptable services enable applications to choose options best suited to their specific needs. However, this requires that storage and compute be easily decoupled; otherwise, application developers must spend enormous effort modifying the application to use the memory management service.

Serverless architectures offer on-demand elasticity of compute and storage and decouple them logically. Recent work on serverless analytics has demonstrated the benefit of using serverless architectures for resource- and cost-efficient data analytics. The key idea of serverless analytics is to use a remote low-latency, high-throughput shared far-memory system for (1) inter-task communication and (2) for multi-stage jobs, storing intermediate data beyond the lifetime of the task that produced the data. This makes it a perfect target for disaggregated memory, since compute and memory are already decoupled logically when the serverless task is assigned.

Designing a memory management service is a non-trivial task. Our discussion begins with an outline of the essential requirements for such memory management services, focusing on the unique challenges introduced by disaggregation. We then highlight our current efforts to tackle these challenges and explore potential directions for future research in this rapidly evolving domain.

**Elasticity.** Memory usage in modern computing environments can be highly variable, with applications experiencing fluctuating memory demands [35]. Elasticity allows the memory service to dynamically allocate and deallocate memory resources based on current requirements, optimizing resource utilization. In typical applications with dynamic memory requirements, such as data analytics, applications are organized into jobs that contain multiple tasks. Each task can be assigned to run on an arbitrary compute node, and tasks communicate with each other using memory as intermediate storage. Previous solutions [39] tend to allocate resources at job granularity: jobs specify their memory demands before submission, and the system reserves that amount of memory for the entire job lifetime. The tradeoff between performance and resource utilization for such job-level resource allocation is well studied in prior work [35]. On the one hand, if jobs specify their average memory demand, performance degrades, because running out of memory leads to swapping data out to slower storage media (e.g., S3 storage); on the other hand, reserving for peak demand results in resource wastage.

Isolation. The second requirement is isolation between different compute tasks. Since multiple compute threads can use the same disaggregated memory pool, it is essential to multiplex the pool across applications to improve resource efficiency while keeping the memory of different threads isolated from each other, meaning that the memory usage of one application should not affect other applications. In serverless analytics, the number of tasks reading and writing to the shared disaggregated memory can change rapidly, which makes the problem even more severe.

Lifetime management. Decoupling compute tasks from their intermediate storage means that tasks can fail independently of the intermediate data; we therefore need mechanisms for explicit lifetime management of intermediate data.

**Data repartitioning.** Decoupling tasks from their intermediate data also means that data partitioning upon elastic scaling of memory capacity becomes challenging, especially for certain data types used in serverless analytics (e.g., key-value stores). If the application is responsible for such repartitioning, every change in capacity involves large network transfers between compute tasks and the far-memory system, along with massive read/write operations. Moreover, the application needs to implement different partitioning strategies for the different kinds of data structures it uses. Therefore, new mechanisms that efficiently enable data repartitioning within the far-memory system are essential.

We present Jiffy, an elastic disaggregated-memory system for stateful serverless analytics. Jiffy allocates memory resources at the granularity of small fixed-size memory blocks: multiple memory blocks store intermediate data for individual tasks within a job. Jiffy's design is motivated by virtual memory in operating systems, which likewise allocates memory to individual processes at the granularity of fixed-size blocks (pages). Jiffy adapts this design to stateful serverless analytics. Performing resource allocation at the granularity of small memory blocks allows Jiffy to elastically scale the memory resources allocated to individual jobs without a priori knowledge of intermediate data sizes, and to meet instantaneous job demands at timescales of seconds. As a result, Jiffy can efficiently multiplex the available fast memory capacity across concurrently running jobs, thus minimizing the overheads of reads and writes to significantly slower secondary storage (e.g., S3 or disaggregated storage).

## 2.1 Elastic memory management for data analytics

Data analytics applications, which utilize disaggregated memory for inter-task communication and intermediate data storage, are becoming increasingly common. As discussed in [39–42], these applications handle user requests in the form of jobs, each defining its memory needs upon creation. The dilemma of balancing performance with resource efficiency for job-level memory allocation has been extensively studied [43,44]. If a job's allocation is based on its average demand, performance may decline during peak demand periods due to inadequate memory, causing data to spill to slower secondary storage, such as SSDs. Conversely, allocating memory for peak demand leads to underutilization of resources whenever the actual demand is below peak. Evaluations on Snowflake's workload, as shown in [43], indicate significant fluctuation in the ratio of peak to average demand, sometimes varying by two orders of magnitude within minutes.

In response to the challenges of dynamically allocating memory resources in data analytics applications, we have developed Jiffy [35], an elastic memory service tailored for disaggregated architectures. As shown in Figure ??, Jiffy allocates memory in small, fixed-size blocks, enabling the dynamic adjustment of memory allocation for individual jobs without prior knowledge of intermediate data sizes. Jiffy employs a hierarchical address space that reflects the structure of the analytics job, facilitating efficient management of the relationship between memory blocks and tasks while ensuring task-level isolation.

#### 2.2 Introduction

Serverless architectures offer flexible compute and storage options, charging users for precise resource usage. Initially used for web microservices, IoT, and ETL tasks, recent advancements show their efficacy in data analytics. Serverless analytics leverage remote, high-throughput memory systems for inter-task communication and storing intermediate data. However, existing far-memory systems face limitations, allocating resources at the job level, leading to performance issues and underutilization.

To address this, we introduce Jiffy, an elastic far-memory system for stateful serverless analytics. Unlike conventional systems, Jiffy allocates memory in small, fixed-size blocks, enabling dynamic scaling and efficient resource utilization. This approach resolves challenges unique to serverless analytics, including task mapping, task isolation, and data lifetime management.

Our implementation of Jiffy features an intuitive API for seamless data manipulation. We demonstrate its versatility by implementing popular distributed frameworks like MapReduce, Dryad, StreamScope, and Piccolo. Evaluation against state-of-the-art systems indicates Jiffy's superior resource utilization and application performance, achieving up to 3x better efficiency and 1.6–2.5x performance improvements.

#### 2.3 Motivation

The leading system for stateful serverless analytics is Pocket, a distributed system designed for high-throughput, low-latency storage of intermediate data. Pocket effectively tackles several key challenges in stateful serverless analytics, including:

Centralized management. Pocket's architecture features separate control, metadata, and data planes. While data storage is distributed across multiple servers, management functions are centralized, simplifying resource allocation and storage organization. A single metadata server can handle significant request loads, supporting thousands of serverless tasks.

Multi-tiered data storage. Pocket's data plane stores job data across multiple servers and serves them via a key-value API. It supports storage across different tiers like DRAM, Flash, or HDD, enabling flexibility based on performance and cost constraints.

**Dynamic resource management.** Pocket can scale memory capacity by adding or removing memory servers based on demand. The controller allocates resources for jobs and informs the metadata plane for proper data placement.

Analytics execution with Pocket. Jobs interact with Pocket by registering with the control plane, specifying memory resources needed. The controller allocates resources and informs the metadata plane. Serverless tasks can access data directly from memory servers. Once a job finishes, it deregisters to release resources.

In our analysis, we focus on challenges in Pocket's resource allocation. Pocket allocates memory at the job level, which poses challenges in accurately predicting intermediate data sizes and leads to performance degradation or resource underutilization. This issue persists due to the dynamic nature of intermediate data sizes across different stages of execution.

## 2.4 Jiffy Design

#### 2.4.1 Overview

Jiffy facilitates precise sharing of far-memory capacity among concurrent serverless analytics tasks for intermediate data storage. Drawing inspiration from virtual memory, Jiffy divides memory capacity into fixed-sized blocks, akin to virtual memory pages, and performs allocations at this granular level. This approach yields two key benefits: firstly, Jiffy can swiftly adapt to instantaneous job demands, adjusting capacity at the block level within seconds. Secondly, Jiffy doesn't necessitate prior knowledge of intermediate data sizes from jobs; instead, it dynamically manages resources as tasks write or delete data.

It's worth noting that multiplexing available memory capacity differs from merely scaling the memory pool's overall capacity. While prior systems like Pocket focus on the latter, adding or removing memory servers based on job arrivals or completions, Jiffy prioritizes efficient sharing of available capacity among concurrent jobs. This approach minimizes underutilization of existing capacity, a common issue in job-level resource allocation systems. Even during high memory capacity utilization, Jiffy can augment capacity by adding memory servers akin to Pocket. Notably, by efficiently multiplexing capacity across concurrent jobs, Jiffy reduces the need for frequent additions or removals of memory servers.

In addressing the challenges posed by serverless analytics, Jiffy implements hierarchical addressing, data lifetime management, and flexible data repartitioning. These mechanisms are discussed in detail in subsequent sections, with illustrative examples provided in Fig. 3, depicting a typical analytics job's execution plan organized as a directed acyclic graph (DAG) with computation tasks represented as serverless functions exchanging intermediate data via Jiffy.

#### 2.4.2 Hierarchical Addressing

Analytics jobs typically follow a multi-stage or directed acyclic graph structure. In serverless analytics, where compute elasticity is integral, each job may entail tens to thousands of individual tasks. Consequently, achieving fine-grained resource allocation necessitates an efficient mechanism for maintaining an updated mapping between tasks and allocated memory blocks. Additionally, the rapidly changing number of tasks accessing shared memory underscores the importance of isolation at the task level to prevent performance degradation across jobs. In this context, Jiffy's hierarchical addressing system plays a crucial role.

Instead of relying on a network structure, Jiffy employs a hierarchical addressing mechanism tailored to the execution structure of analytics jobs. It organizes intermediate data within a virtual address hierarchy, reflecting the dependencies between tasks in the job's DAG. For instance, internal nodes represent tasks, while leaf nodes denote memory blocks storing intermediate data. The addressing scheme enables precise resource allocation at the task level, independent of other tasks, akin to virtual memory's process-level isolation.

This hierarchical addressing facilitates efficient management of resource allocations, ensuring that overflow into persistent storage doesn't impact the performance of other tasks. Each memory block, once allocated, remains dedicated to its task until explicitly released, guaranteeing isolation at the task level regardless of concurrency. This approach aligns with virtual memory principles, where each process enjoys its own address space, ensuring isolation at the process level.
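To make the addressing scheme concrete, the following minimal Python sketch models a per-job hierarchy in which each node is an address-prefix that owns its own list of blocks. The class and method names (`PrefixNode`, `AddressHierarchy`, `create_prefix`, `lookup`) are hypothetical and are not Jiffy's actual API. The sketch also assumes a tree-shaped hierarchy, which is a simplification of the general DAG case.

```python
from dataclasses import dataclass, field


@dataclass
class PrefixNode:
    """One address-prefix (task) in a job's hierarchy. Hypothetical name."""
    name: str
    children: dict = field(default_factory=dict)  # child name -> PrefixNode
    blocks: list = field(default_factory=list)    # block IDs owned by this prefix


class AddressHierarchy:
    """Per-job hierarchy; the root corresponds to the job itself."""

    def __init__(self, job_id):
        self.root = PrefixNode(job_id)

    def create_prefix(self, path):
        """Create (or find) the node for a path such as 'T1/T3'."""
        node = self.root
        for part in path.split("/"):
            node = node.children.setdefault(part, PrefixNode(part))
        return node

    def lookup(self, path):
        """Return the node for an existing path; raises KeyError if absent."""
        node = self.root
        for part in path.split("/"):
            node = node.children[part]
        return node


hierarchy = AddressHierarchy("job42")
hierarchy.create_prefix("T1").blocks.append("b0")
hierarchy.create_prefix("T1/T3").blocks.extend(["b1", "b2"])
# Blocks of T1/T3 are tracked separately from those of its parent T1.
```

Because each prefix keeps its own block list, allocating or releasing blocks for one task does not touch the bookkeeping of any other task, which is the property the isolation argument above relies on.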

Jiffy's design considers two key aspects. Firstly, resource allocation is decoupled from policy enforcement, allowing seamless integration of fairness algorithms atop Jiffy's allocation mechanism. Secondly, address translation, handled centrally, enables addressing for arbitrary DAGs without imposing limitations on execution structure complexity. While Jiffy's hierarchical addressing introduces complexity at the controller, its scalability is validated in our evaluation, accommodating realistic deployment demands.

Regarding block sizing, Jiffy's approach, akin to traditional virtual memory's page sizing, balances metadata overhead and memory utilization. Larger block sizes reduce per-block metadata, but may lead to data fragmentation, while smaller sizes optimize memory utilization at the expense of increased metadata overhead. Jiffy mitigates fragmentation via data repartitioning and allows block size configuration during initialization for compatibility with analytics frameworks.
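The block-size tradeoff can be illustrated with a short calculation. The sketch below assumes a fixed, hypothetical per-block metadata cost of 64 bytes (the actual cost in Jiffy is not stated here) and reports, for a given amount of intermediate data, the number of blocks, the total metadata, and the unused space in the last block.

```python
def block_overheads(data_bytes, block_bytes, metadata_per_block=64):
    """Return (blocks, metadata bytes, unused bytes in the last block).

    metadata_per_block is an illustrative assumption, not a measured value.
    """
    blocks = -(-data_bytes // block_bytes)  # integer ceiling division
    metadata = blocks * metadata_per_block
    unused = blocks * block_bytes - data_bytes
    return blocks, metadata, unused


MB = 1024 * 1024
data = 300 * MB + 1  # one byte more than 300 MB of intermediate data

small = block_overheads(data, 1 * MB)    # (301, 19264, 1048575)
large = block_overheads(data, 128 * MB)  # (3, 192, 88080383)
```

With 1 MB blocks the data occupies 301 blocks and wastes just under 1 MB, whereas 128 MB blocks need only 3 blocks of metadata but leave roughly 84 MB unused.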

Isolation granularity in Jiffy is task-level by default, but can be adjusted finer or coarser by adapting the hierarchy. For most analytics frameworks, task-level isolation suffices, but custom hierarchies can be created using Jiffy's API to tailor isolation to specific needs.

#### 2.4.3 Data Lifetime Management

Existing far-memory systems for serverless analytics typically manage data lifetimes at the granularity of entire jobs, reclaiming storage only when a job explicitly deregisters. However, in serverless analytics, the intermediate data of a task is dissociated from its execution, residing in the far-memory system. This decoupling extends to fault domains: traditional mechanisms, such as reference counting, can result in dangling intermediate data if a task fails. To address this inefficiency, effective task-level data lifetime management mechanisms are required.

Jiffy tackles this challenge by integrating lease management mechanisms with hierarchical addressing. Each address-prefix in a job's hierarchical addressing is associated with a lease, and data remains in memory only as long as the lease is renewed. Consequently, jobs periodically renew leases for the address-prefixes of running tasks. Jiffy tracks lease renewal times for each node in the address hierarchy, updating them accordingly. Upon lease expiry, Jiffy reclaims allocated memory after flushing data to persistent storage, ensuring data integrity even in the event of network delays.

A novel aspect of Jiffy's lease management is its utilization of DAG-based hierarchical addressing to determine dependencies between leases. When a task renews its lease, Jiffy extends the renewal to the prefixes of tasks it depends on (parent nodes) and the prefixes of tasks dependent on it (descendant nodes), minimizing the number of renewal messages sent. This approach ensures that not only is a task's own data retained in memory while it's active, but also the data of tasks it depends on and tasks dependent on it. This mechanism strikes a balance between age-based eviction and explicit resource management, granting jobs control over resource lifetimes while tying resource fate to job status.

In an example scenario, task T7 periodically renews leases for its prefix during execution, ensuring the retention of intermediate data for blocks under it in memory. Lease renewals for T7's prefix also extend to its parent and descendant tasks, ensuring continuity of data access. However, leases for inactive tasks are not automatically renewed, preventing unnecessary resource retention.
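The propagation rule can be sketched as follows. This is an illustrative model only: the function name and the dictionary-based DAG representation are hypothetical, and the sketch assumes that a renewal reaches the direct parents of the task and all of its transitive descendants.

```python
def renew_lease(task, parents, children, renewed_at, now):
    """Stamp `now` on the task, its direct parents, and all of its descendants.

    parents/children map a task name to a list of task names (assumed layout).
    """
    renewed_at[task] = now
    for parent in parents.get(task, ()):
        renewed_at[parent] = now
    seen = set()
    stack = list(children.get(task, ()))
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        renewed_at[current] = now
        stack.extend(children.get(current, ()))
    return renewed_at


# T1 and T2 feed T3, which feeds T4; T5 -> T6 is an unrelated branch.
children = {"T1": ["T3"], "T2": ["T3"], "T3": ["T4"], "T5": ["T6"]}
parents = {"T3": ["T1", "T2"], "T4": ["T3"], "T6": ["T5"]}
renewed_at = {t: 0.0 for t in ("T1", "T2", "T3", "T4", "T5", "T6")}
renew_lease("T3", parents, children, renewed_at, now=10.0)
# T1, T2, T3, and T4 now carry timestamp 10.0; T5 and T6 are untouched.
```

A single renewal message for T3 therefore keeps its inputs and the data of its consumers alive, while the unrelated branch is left to expire on its own schedule.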

Lease duration in Jiffy involves a tradeoff between control plane bandwidth and system utilization. Longer lease durations reduce network traffic but may lead to underutilization of resources until leases expire. Jiffy's sensitivity to lease durations is evaluated in the subsequent section.

#### 2.4.4 Flexible Data Repartitioning

Decoupling compute tasks from their intermediate data in serverless analytics poses a challenge in achieving memory elasticity efficiently at fine granularities. When memory is allocated or deallocated to a task, repartitioning the intermediate data across the remaining memory blocks becomes necessary. However, due to the decoupling and the high concurrency of tasks, it's impractical to expect the application to handle this repartitioning. For instance, in many existing serverless analytics systems, key-value stores are used to store intermediate data. If a compute task were to handle repartitioning upon memory scaling, it would need to fetch key-value pairs from the store over the network, compute new data partitions, and then write back the data, incurring significant network latency and bandwidth overheads.

As discussed in §5, Jiffy already incorporates standard data structures utilized in data analytics frameworks, ranging from files to key-value pairs to queues. Analytics jobs leveraging these data structures can delegate repartitioning of intermediate data upon resource allocation/deallocation to Jiffy. Each block allocated to a Jiffy data structure monitors the fraction of memory capacity currently utilized for data storage. When usage surpasses a high threshold, Jiffy allocates a new block to the corresponding address-prefix. Subsequently, the overloaded block initiates data structure-specific repartitioning to migrate some data to the new block. Conversely, when block usage falls below a low threshold, Jiffy identifies another block with low usage within the address-prefix for potential data merging. The block then undergoes the necessary repartitioning before deallocation by Jiffy.
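The per-block decision described above can be sketched as a small function. The threshold values of 0.95 and 0.05 are illustrative assumptions, as are the function name and the returned labels.

```python
def scaling_action(used_bytes, capacity_bytes, high=0.95, low=0.05):
    """Decide what a block should request based on its current usage.

    Returns "split" above the high threshold, "merge" below the low
    threshold, and "none" otherwise. Thresholds are assumed values.
    """
    usage = used_bytes / capacity_bytes
    if usage > high:
        return "split"   # ask for a new block and migrate part of the data
    if usage < low:
        return "merge"   # fold remaining data into another lightly used block
    return "none"


actions = [scaling_action(u, 100) for u in (96, 50, 2)]
# -> ["split", "none", "merge"]
```

Moving the two thresholds closer together triggers scaling more often, which raises utilization at the cost of more repartitioning traffic.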

By tasking the target block with repartitioning instead of the compute task, Jiffy circumvents network and computational overheads for the task itself. Furthermore, data repartitioning in Jiffy occurs asynchronously, enabling data access operations across data structure blocks to proceed even during repartitioning. This ensures minimal impact on application performance due to repartitioning.

The data structures integrated into Jiffy enable the implementation of serverless versions of various powerful distributed programming frameworks, including MapReduce, Dryad, StreamScope, and Piccolo. Notably, the simplicity of repartitioning mechanisms required by analytics framework data structures allows serverless applications utilizing these programming models to seamlessly run on Jiffy and leverage its adaptable data repartitioning without any modifications.

Regarding thresholds for elastic scaling, the high and low thresholds in Jiffy present a tradeoff between data plane network bandwidth and task performance on one side and system utilization on the other. Optimizing these thresholds balances the frequency of elastic scaling triggers and system utilization efficiency. We evaluate Jiffy's sensitivity to threshold selections in §6.6.

### 2.5 Implementation

We implement Jiffy on top of Pocket, a prior serverless memory management system. We reuse Pocket's scalable and fault-tolerant metadata plane, system-wide capacity scaling, and analytics execution model, among other components. On top of these, Jiffy implements hierarchical addressing, lease management, and efficient data repartitioning to resolve the unique challenges introduced by the serverless environment.

#### 2.5.1 Jiffy Interface

We describe Jiffy interface in terms of its user-facing API and internal API.

User-facing API. Jiffy's user-facing interface (Table 1) is divided along its two core abstractions: hierarchical addresses and data structures. Jobs add a new address-prefix to their address hierarchy using createAddrPrefix, specifying the parent address-prefix, along with optional arguments such as initial capacity. Jiffy also provides a createHierarchy interface to directly generate the complete address hierarchy from the application's execution plan (i.e., DAG), and flush/load interfaces to persist/load address-prefix data from external storage (e.g., S3). Jiffy provides three built-in data structures that can be associated with an address-prefix (via initDataStructure), and a way to define new data structures using its internal API.

Similar to existing systems, data structures also expose a notification interface, so that tasks that consume intermediate data can be notified on data availability. For instance, a task can subscribe to write operations on its parent task's data structure, and obtain a listener handle. Jiffy asynchronously notifies the listener upon a write to the data structure, which the task can get via listener.get().
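The subscribe-and-notify pattern can be sketched in a few lines of Python. This is a single-process model built on the standard library, not Jiffy's client library: the `Listener` and `NotifyingStore` classes and the event format are hypothetical, and only the `get()` call on the listener mirrors the interface named above.

```python
import queue


class Listener:
    """Handle returned by subscribe(); buffers notifications for the task."""

    def __init__(self):
        self._events = queue.Queue()

    def notify(self, event):
        self._events.put(event)

    def get(self, timeout=None):
        """Block until a notification arrives (raises queue.Empty on timeout)."""
        return self._events.get(timeout=timeout)


class NotifyingStore:
    """Toy data structure that notifies subscribers of write operations."""

    def __init__(self):
        self.data = {}
        self.subscribers = {}  # operation name -> list of listeners

    def subscribe(self, op):
        listener = Listener()
        self.subscribers.setdefault(op, []).append(listener)
        return listener

    def put(self, key, value):
        self.data[key] = value
        for listener in self.subscribers.get("put", []):
            listener.notify(("put", key))


store = NotifyingStore()
listener = store.subscribe("put")   # downstream task subscribes to writes
store.put("partition-0", b"rows")   # upstream task writes intermediate data
event = listener.get(timeout=1)     # downstream task learns data is ready
```

In this model the consumer never polls the store; it blocks on its listener until the producer's write generates an event.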

Internal API. The data layout within blocks in Jiffy is unique to the data structure that owns it. As such, Jiffy blocks expose a set of data structure operators (Fig. 6) that uniquely define how data structure requests are routed across their blocks and how data is accessed or modified. These operators are used internally within Jiffy for its built-in data structures (§5) and are not exposed to jobs directly.

The getBlock operator determines which block an operation request is routed to based on the operation type and operation-specific arguments (e.g., based on key hashes for a KV-store) and returns a handle to the corresponding block. Each Jiffy block exposes writeOp, readOp, and deleteOp operators to facilitate data structure-specific access logic (e.g., get, put, and delete for KV-store). Jiffy executes individual operators atomically using sequence numbers, but does not support atomic transactions that span multiple operators.
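A minimal sketch of how these operators might compose for a KV-store is shown below. The operator names follow the text, but their Python signatures are assumptions, and routing by CRC32 modulo the number of blocks is a deliberate simplification. Sequence numbers and atomicity are not modeled.

```python
import zlib


class Block:
    """One memory block holding part of a KV-store. Signatures are assumed."""

    def __init__(self, block_id):
        self.block_id = block_id
        self.data = {}

    def write_op(self, key, value):
        self.data[key] = value

    def read_op(self, key):
        return self.data.get(key)

    def delete_op(self, key):
        return self.data.pop(key, None)


class KVStoreSketch:
    def __init__(self, blocks):
        self.blocks = blocks

    def get_block(self, key):
        """Route a request to a block using a deterministic key hash."""
        return self.blocks[zlib.crc32(key.encode()) % len(self.blocks)]

    def put(self, key, value):
        self.get_block(key).write_op(key, value)

    def get(self, key):
        return self.get_block(key).read_op(key)

    def delete(self, key):
        return self.get_block(key).delete_op(key)


kv = KVStoreSketch([Block(i) for i in range(4)])
kv.put("user:1", "alice")
value = kv.get("user:1")  # routed to the same block as the put
```

The data structure only decides where a request goes; what happens inside the block is left to the block's own read, write, and delete operators.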

### 2.6 Implementation

Jiffy's high-level design components are similar to Pocket's, except for one difference: Jiffy combines the control and metadata planes into a unified control plane. We found this design choice allowed us to significantly simplify interactions between the control and metadata components, without affecting their performance. While this does couple their fault domains, standard fault-tolerance mechanisms are still applicable to the unified control plane.

#### 2.6.1 Jiffy Controller

The Jiffy controller (Fig. 7) maintains two pieces of system-wide state. First, it stores a free block list, which lists the set of blocks that have not been allocated to any job yet, along with their corresponding physical server addresses. Second, it stores an address hierarchy per job, where each node in the hierarchy stores a variety of metadata for its address prefix, including access permissions (for enforcing access control), timestamps (for lease renewal), a block-map (to locate the blocks associated with the address prefix in the data plane), along with metadata to identify the data structure associated with the address prefix and how data is partitioned across its blocks. The mapping between job IDs (which uniquely identify jobs) and their address hierarchies is stored in a hash table at the controller.

**Block allocator.** When a job creates an address prefix in Jiffy, the block allocator at the control plane assigns it the number of blocks corresponding to the requested initial capacity from its pool of free blocks. While assigning the blocks, the controller updates its state: the free block list, access permissions, and block-map for that address prefix. Assignment of blocks across address prefixes is akin to virtual memory in traditional operating systems: Jiffy multiplexes its physical memory pools at the data plane across different prefixes at block granularity, while individual tasks operate under the illusion that their prefixes have infinite memory resources.
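The bookkeeping performed by the block allocator can be sketched as a free list plus a per-prefix block-map. The class below is a hypothetical, single-threaded model; access permissions and the choice of which free block to hand out are omitted.

```python
class BlockAllocator:
    """Toy controller-side allocator; names and policies are assumptions."""

    def __init__(self, free_blocks):
        self.free = list(free_blocks)  # (block_id, server_address) pairs
        self.block_map = {}            # address prefix -> allocated blocks

    def allocate(self, prefix, count=1):
        """Move `count` blocks from the free list to the prefix's block-map."""
        if count > len(self.free):
            raise MemoryError("no free blocks left in the pool")
        taken = [self.free.pop() for _ in range(count)]
        self.block_map.setdefault(prefix, []).extend(taken)
        return taken

    def release(self, prefix):
        """Return all blocks of a prefix to the free list."""
        blocks = self.block_map.pop(prefix, [])
        self.free.extend(blocks)
        return blocks


allocator = BlockAllocator([("b0", "srv1"), ("b1", "srv1"), ("b2", "srv2")])
allocator.allocate("job42/T1", count=2)  # initial capacity of two blocks
allocator.allocate("job42/T2")           # a second prefix shares the pool
allocator.release("job42/T1")            # T1's blocks become reusable
```

Because blocks return to a shared free list as soon as a prefix releases them, capacity freed by one task is immediately available to any other prefix.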

Metadata manager. The metadata manager tracks the partitioning information specific to different data structures (§5) and assists clients in maintaining a consistent view of how the data is organized across the blocks allocated to each data structure. We defer the discussion of data structure-specific metadata stored at the control plane to §5, but note that this metadata is updated whenever blocks allocated to an address prefix are scaled. A client detects that a scaling has occurred when it queries the data plane and updates its view of the partitioning metadata by querying the control plane.

Lease manager. The lease manager implements lifetime management in Jiffy. It comprises a lease renewal service that listens for renewal requests from jobs and updates the lease renewal timestamp of relevant nodes in its address hierarchy, and a lease expiry worker that periodically traverses all address hierarchies, marking nodes with timestamps older than the associated lease period as expired.
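One pass of the expiry worker might look like the sketch below. It is an illustrative model under stated assumptions: the function name, the flat dictionary of prefixes, and the `flush` callback are hypothetical, and the flush-before-reclaim ordering follows the lease design described earlier.

```python
def reclaim_expired(renewed_at, block_map, free_blocks, lease_period, now, flush):
    """Flush and reclaim every prefix whose lease has expired.

    renewed_at: prefix -> last renewal timestamp
    block_map:  prefix -> list of block IDs
    flush:      callback(prefix, blocks) that persists data before reclaiming
    """
    expired = sorted(p for p, ts in renewed_at.items() if now - ts > lease_period)
    for prefix in expired:
        blocks = block_map.pop(prefix, [])
        flush(prefix, blocks)       # persist first so no data is lost
        free_blocks.extend(blocks)  # then hand the blocks back to the pool
        del renewed_at[prefix]
    return expired


renewed_at = {"job/T1": 9.5, "job/T1/T3": 9.5, "job/T2": 4.0}
block_map = {"job/T1": ["b0"], "job/T1/T3": ["b1"], "job/T2": ["b2", "b3"]}
free_blocks, flushed = [], []
expired = reclaim_expired(renewed_at, block_map, free_blocks,
                          lease_period=5.0, now=10.0,
                          flush=lambda prefix, blocks: flushed.append(prefix))
# Only job/T2 was last renewed more than 5 time units ago.
```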

Controller scaling and fault tolerance. In order to scale the control plane, Jiffy can employ multiple controller servers, each managing control operations for a non-overlapping subset of address hierarchies (across jobs) and blocks (across memory servers at the data plane). Jiffy employs hash partitioning to distribute both address prefixes and memory blocks (via their block IDs) across controller servers. Moreover, Jiffy employs the same approach to scale its control plane to multiple cores on a multi-core server. Jiffy adopts primary-backup based mechanisms from prior work [8, 69] at each controller server for fault-tolerance.

#### 2.6.2 Jiffy Data Plane

Jiffy's data plane is responsible for two main tasks: providing jobs with efficient, data-structure-specific atomic access to data, and repartitioning data across blocks allocated by the control plane during resource scaling. It partitions the resources in a pool of memory servers across fixed-sized blocks. Each memory server maintains, for the blocks managed by it, a mapping from unique block IDs to pointers to raw memory allocated to the blocks, along with two additional pieces of metadata: data structure-specific operator implementations as described in §4.1, and a subscription map that maps data structure operations to client handles that have subscribed to receive notifications for that operation.

We implement a high-performance RPC layer at the data plane using Apache Thrift [70] for interactions between clients and memory servers. While Thrift already provides low-overhead serialization/deserialization protocols, we add two key optimizations at the RPC layer. First, our server-side implementation employs asynchronous framed IO to multiplex multiple client sessions, permitting requests across different sessions to be processed in a non-blocking manner for lower latency and higher throughput. Second, while our client-side library is implemented in Python for compatibility with AWS Lambda, it employs thin Python wrappers around Thrift's C-libraries to minimize performance overheads.

Data repartitioning for a Jiffy data structure is implemented as follows: when a block's usage grows above the high threshold, the block sends a signal to the control plane, which, in turn, allocates a new block to the address prefix and responds to the overloaded block with its location. The overloaded block then repartitions and moves part of its data to the new block (see Fig. 8); a similar mechanism is used when the block's usage falls below the low threshold.

For applications that require fault tolerance and persistence for their intermediate data, Jiffy supports chain replication [71] at block granularity and synchronously persisting data to external stores (e.g., S3) at address-prefix granularity.

## 2.7 Jiffy Programming Model

#### 2.7.1 Map-Reduce Model

A Map-Reduce (MR) program [53] comprises map functions that process a series of input key-value (KV) pairs to generate intermediate KV pairs, and reduce functions that merge all intermediate values for the same intermediate key. MR frameworks [53, 67, 72] parallelize map and reduce functions across multiple workers. Data exchange between map and reduce workers occurs via a shuffle phase, where intermediate KV pairs are distributed in a way that ensures values belonging to the same key are routed to the same worker.

MR on Jiffy executes map/reduce tasks as serverless tasks. A master process launches, tracks progress of, and handles failures for tasks across MR jobs. Jiffy stores intermediate KV pairs across multiple shuffle files, where shuffle files contain a partitioned subset of KV pairs collected from all map tasks. Since multiple map tasks can write to the same shuffle file, Jiffy's strong consistency semantics ensure correctness. The master process handles explicit lease renewals.
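The shuffle pattern can be illustrated with a word-count example in which in-memory lists stand in for shuffle files. The function names and the use of CRC32 for partitioning are assumptions made for this sketch, and serverless invocation and lease renewal are not modeled.

```python
import zlib
from collections import defaultdict


def shuffle_file_for(key, num_reducers):
    """Pick the shuffle file for a key so equal keys always land together."""
    return zlib.crc32(key.encode()) % num_reducers


def map_task(words, shuffle_files):
    """Emit (word, 1) pairs, appending each to the shuffle file for its key."""
    for word in words:
        shuffle_files[shuffle_file_for(word, len(shuffle_files))].append((word, 1))


def reduce_task(pairs):
    """Sum the values of every key found in one shuffle file."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)


shuffle_files = [[] for _ in range(2)]          # one per reduce task
map_task(["a", "b", "a"], shuffle_files)        # first map task
map_task(["b", "c"], shuffle_files)             # second map task, same files
results = [reduce_task(f) for f in shuffle_files]
```

Both map tasks append to the same shuffle files, which is why concurrent appends must be handled correctly by the storage layer.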

Jiffy Files. A Jiffy file is a collection of blocks, each storing a fixed-sized chunk of the file. The controller stores the mapping between blocks and file offset ranges managed by them at the metadata manager; this mapping is cached at clients accessing the file, and updated whenever the number of blocks allocated to the file is scaled in Jiffy. The getBlock operator forwards requests to different file blocks based on the offset range for the request. Files support sequential reads, and writes via append-only semantics. For random access, files support seek with arbitrary offsets. Jiffy uses the provided offset to identify the corresponding block and forwards subsequent read requests to it. Finally, since files are append-only, blocks can only be added to a file (not removed), and files do not require repartitioning when new blocks are added.
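The offset-based routing can be sketched with an in-memory file whose blocks are fixed-size byte arrays. The class name and method signatures are hypothetical; the point is only that an offset maps to a block by integer division and that appends never move existing data.

```python
class FileSketch:
    """Append-only file split into fixed-size blocks (illustrative only)."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = []  # block i holds offsets [i*block_size, (i+1)*block_size)

    def get_block(self, offset):
        """Index of the block responsible for a file offset."""
        return offset // self.block_size

    def append(self, data):
        while data:
            if not self.blocks or len(self.blocks[-1]) == self.block_size:
                self.blocks.append(bytearray())  # scale up by one block
            room = self.block_size - len(self.blocks[-1])
            self.blocks[-1] += data[:room]
            data = data[room:]

    def read(self, offset, length):
        out = bytearray()
        while length > 0 and self.get_block(offset) < len(self.blocks):
            start = offset % self.block_size
            chunk = self.blocks[self.get_block(offset)][start:start + length]
            if not chunk:
                break  # reached the end of the written data
            out += chunk
            offset += len(chunk)
            length -= len(chunk)
        return bytes(out)


f = FileSketch(block_size=4)
f.append(b"hello world")  # stored as b"hell", b"o wo", b"rld"
middle = f.read(3, 5)     # spans the first two blocks: b"lo wo"
```

Appending only ever fills the last block or adds a new one, so existing blocks keep their offset ranges and no data needs to be repartitioned.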

#### 2.7.2 Dataflow and Streaming Dataflow Models

In the dataflow programming model, programmers provide DAGs to describe an application's communication patterns. DAG vertices correspond to computations, while data channels form directed edges between them. We use Dryad [54] as a reference dataflow execution engine, where channels can be files, shared memory FIFO queues, etc. The Dryad runtime schedules DAG vertices across multiple workers based on their dataflow dependencies. A vertex is scheduled when all its input channels are ready: a file channel is ready if all its data items have been written, while a queue is ready if it has any data item. Streaming dataflow [55] employs a similar approach, except channels are continuous event streams.

Dataflow on Jiffy maps each DAG vertex to a serverless task, while a master process handles scheduling, fault tolerance, and lease renewals for Jiffy. We use Jiffy FIFO queues and files as data channels. Since queue-based channels are considered ready as long as some vertex is writing to them, Jiffy allows downstream tasks to efficiently detect if items produced by upstream tasks are available via notifications.

Jiffy Queues. A FIFO queue in Jiffy is a continuously growing linked list of blocks, where each block stores multiple data items, and a pointer to the next block in the list. The queue size can be upper-bounded (in number of items) by specifying a maxQueueLength. The controller only stores the head and the tail blocks in the queue's linked list, which the client caches and updates whenever blocks are added/removed. The queue supports enqueue/dequeue to add/remove items. The getBlock operator routes enqueue and dequeue operations to the current tail and head blocks in the linked list, respectively. While blocks can be both added and removed from a queue, queues do not need subsequent data repartitioning. Finally, the queue leverages Jiffy notifications to asynchronously detect when there is data in the queue to consume, or space in the queue to add more items, via subscriptions to enqueue and dequeue, respectively.
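A queue built as a linked list of blocks can be sketched as follows. The class names and the per-block item capacity are assumptions. The maximum queue length and notifications are left out, and a drained head block is simply dropped where a real system would return it to the free pool.

```python
class QueueBlock:
    """One block in the queue's linked list."""

    def __init__(self, capacity):
        self.items = []
        self.capacity = capacity
        self.read_pos = 0   # index of the next item to dequeue
        self.next = None    # pointer to the next block in the list


class QueueSketch:
    def __init__(self, items_per_block):
        self.items_per_block = items_per_block
        self.head = self.tail = QueueBlock(items_per_block)

    def enqueue(self, item):
        """Append at the tail block, adding a new block when it is full."""
        if len(self.tail.items) == self.tail.capacity:
            self.tail.next = QueueBlock(self.items_per_block)
            self.tail = self.tail.next
        self.tail.items.append(item)

    def dequeue(self):
        """Remove from the head block, advancing past drained blocks."""
        while self.head.read_pos == len(self.head.items):
            if self.head.next is None:
                raise IndexError("queue is empty")
            self.head = self.head.next  # the drained block can be released
        item = self.head.items[self.head.read_pos]
        self.head.read_pos += 1
        return item


q = QueueSketch(items_per_block=2)
for i in range(5):
    q.enqueue(i)                      # spans three blocks
first_three = [q.dequeue() for _ in range(3)]
```

Enqueues touch only the tail and dequeues only the head, so adding or removing a block never requires moving items between blocks.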

#### 2.7.3 Piccolo

Piccolo [56] is a data-centric programming model that enables distributed compute machines to share mutable, distributed state. In Piccolo, kernel functions specify sequential application logic and share state with concurrent kernel functions through a KV interface, while centralized control functions manage and coordinate both the shared KV stores and the instances of kernel functions. Concurrent updates to the same key in the KV store are resolved using user-defined accumulators.

Piccolo on Jiffy runs kernel functions across serverless tasks, while control tasks are managed by a centralized master process. The shared state is distributed across Jiffy's KV-store data structures (detailed below). KV-stores can be created either per kernel function or shared across multiple functions, depending on the application requirements. The master process also handles periodic lease renewals for Jiffy KV-stores. Similar to Piccolo, Jiffy checkpoints KV-stores by flushing them to an external store.

Jiffy KV-store. The Jiffy KV-store hashes each key to one of H hash slots in the range [0, H-1] (H=1024 by default). The KV-store shards key-value pairs across multiple Jiffy blocks, with each block responsible for one or more hash slots within this range. Each hash slot is entirely contained within a single block. The controller stores the mapping between the blocks and the hash slots they manage; this metadata is cached at the client and updated during resource scaling. Each block stores the key-value pairs that hash to its slots in a hash table, with Jiffy utilizing cuckoo hashing [73] to support highly concurrent KV operations. The KV-store supports typical get, put, and delete operations through implementations of readOp, writeOp, and deleteOp operators. The getBlock operator routes requests to the appropriate KV-store blocks based on key hashes.

Unlike files and queues, data in the KV-store must be repartitioned when a block is added or removed. When a block nears its capacity, Jiffy reassigns half of its hash slots to a new block, transfers the corresponding key-value pairs, and updates the block-to-hash-slot mapping at the controller. Similarly, when a block is nearly empty, its hash slots are merged with another block.
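The slot-based sharding and split-on-growth behaviour can be sketched as follows. This is an illustration under simplifying assumptions, not Jiffy's code: `ShardedKV`, `KVBlock`, and `block_capacity` are hypothetical names, each block uses a plain Python dictionary instead of cuckoo hashing, CRC32 stands in for the unspecified hash function, and merging of nearly empty blocks is omitted.

```python
import zlib
from typing import Dict, List, Optional

NUM_SLOTS = 1024  # H in the text


def slot_of(key: str) -> int:
    """Hash a key to one of the H hash slots (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SLOTS


class KVBlock:
    def __init__(self, slots: List[int]) -> None:
        self.slots = slots               # hash slots owned by this block
        self.data: Dict[str, str] = {}   # stand-in for a cuckoo hash table


class ShardedKV:
    def __init__(self, block_capacity: int = 8) -> None:
        self.block_capacity = block_capacity  # hypothetical per-block item limit
        first = KVBlock(list(range(NUM_SLOTS)))
        self.blocks = [first]
        # Block-to-slot mapping; in the text this metadata lives at the controller.
        self.slot_to_block = {s: first for s in range(NUM_SLOTS)}

    def get_block(self, key: str) -> KVBlock:
        return self.slot_to_block[slot_of(key)]

    def put(self, key: str, value: str) -> None:
        block = self.get_block(key)
        block.data[key] = value
        if len(block.data) >= self.block_capacity and len(block.slots) > 1:
            self._split(block)

    def get(self, key: str) -> Optional[str]:
        return self.get_block(key).data.get(key)

    def delete(self, key: str) -> bool:
        block = self.get_block(key)
        if key in block.data:
            del block.data[key]
            return True
        return False

    def _split(self, block: KVBlock) -> None:
        # Reassign half of the hash slots to a new block and move their pairs.
        half = len(block.slots) // 2
        moved = block.slots[half:]
        block.slots = block.slots[:half]
        new_block = KVBlock(moved)
        moved_set = set(moved)
        for k in [k for k in block.data if slot_of(k) in moved_set]:
            new_block.data[k] = block.data.pop(k)
        for s in moved:
            self.slot_to_block[s] = new_block
        self.blocks.append(new_block)
```

Since every hash slot belongs to exactly one block, a split only moves the pairs whose slots change owner; keys in the remaining slots stay in place.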

## 2.8 Applications and Evaluation

#### 2.9 Related Work

#### 2.10 Conclusion

## Chapter 3

## Operating System Layer

In the previous chapter, we explored a memory management design for disaggregated architectures at the service layer. However, integrating general applications with an external memory service is challenging. In this chapter, we explore how to follow the classic operating system design and keep memory management functionality within the operating system. Transparency is an important consideration when migrating existing data center applications to a disaggregated architecture. The operating system layer plays a crucial role in supporting the core functionality of a disaggregated architecture, including tasks like thread scheduling and data movement (paging). One of the key questions that arises is where the operating system should be situated within this architecture. There are two main options to consider:

Centralized OS Management. One approach is to place the operating system at a central point within the system, providing it with a global view. The advantage of this approach is that it maintains a well-defined operating system structure, requiring only minor modifications for application integration. However, the central OS design must not introduce significant overhead, since the operating system typically lies on the critical path of application operations such as paging.

**Disaggregation of OS Functions.** An alternative approach involves the disaggregation of operating system functions across various resource blades, a concept explored in [2]. The rationale behind this approach is that many OS functionalities are closely intertwined with specific resources and remain largely independent of other system components. For instance, GPU driver functionality can be situated within GPU resource pools rather than near compute or memory nodes. While this approach offers enhanced flexibility, it requires a substantial effort to overhaul the operating system. It may also introduce synchronization overhead due to the inherently distributed nature of the system, necessitating additional coordination.

In the upcoming subsections, we present a hierarchical OS design, combining elements from the previously discussed options. Subsequently, we delve into our validation efforts concerning centralized and disaggregated OS functionality. Finally, we introduce prospective avenues for future work.

### 3.1 Hierarchical OS design

Rather than exclusively opting for one of these two approaches, we advocate for a hybrid OS design that integrates elements from both options mentioned earlier. Our observation suggests that operating system functionality can be classified into two distinct groups:

Non-disaggregated Functionalities. This category encompasses OS functionality that necessitates a holistic view of the entire system, including tasks like thread scheduling and memory management tasks such as memory address translation, protection, and paging. The operating system actively monitors the whole system, including available memory and compute resources, dynamically allocating computing and data resources to optimize system performance.

Disaggregated Functionalities. In contrast, this category comprises OS functions closely intertwined with specific resource types, including memory, SSD, or GPU drivers. In these contexts, it is more logical to position the functionality near the respective resource itself. Regarding memory management, this entails the implementation of memory access optimizations, such as enhancing the speed of irregular memory access. These optimization processes do not interact with other system components, obviating the need for a global view of the system.

### 3.2 In-Network Memory Management

### 3.2.1 Introduction

Data center network bandwidth is rapidly approaching that of intra-server resource interconnects and is projected to surpass it soon. This shift has ignited considerable interest in both academia and industry in memory disaggregation, a paradigm where compute and memory are physically decoupled into network-attached resource blades. This transformation promises to improve resource utilization, hardware diversity, resource scalability, and fault tolerance compared to conventional data center architectures.

However, memory disaggregation presents formidable challenges, primarily revolving around three key requisites. Firstly, remote memory access demands low latency and high throughput, with previous studies targeting latency under 10 microseconds and bandwidth exceeding 100 Gbps per compute blade to minimize performance degradation in applications. Secondly, both memory and compute resources must exhibit elastic scalability, aligning with the essence of disaggregation. Lastly, seamless adoption and immediate deployment necessitate compatibility with unaltered applications.

Despite years of concerted research efforts directed towards enabling memory disaggregation, existing approaches have failed to concurrently meet all three requirements. Most strategies mandate application modifications due to alterations in hardware, programming models, or memory interfaces. Recent endeavors facilitating transparent access to disaggregated memory have encountered limitations on application compute elasticity—processes are confined to compute resources on a single blade to mitigate cache coherence traffic over the network, driven by performance apprehensions.

We introduce MIND, a memory management system for rack-scale memory disaggregation that fulfills all three requirements for disaggregated memory. At the core of MIND lies a novel concept: embedding memory management logic and metadata within the network fabric. This approach capitalizes on the insight that the network fabric in a disaggregated memory architecture essentially functions as a CPU-memory interconnect. In MIND, programmable network switches, strategically positioned for in-network processing, act as Memory Management Units (MMUs), enabling a high-performance shared memory abstraction. By leveraging programmable hardware that operates at line rate, MIND minimizes latency and bandwidth overheads.

However, the realization of in-network memory management necessitates navigating through the unique constraints imposed by programmable switch ASICs. These challenges include limited on-chip memory capacity, constraints on computational cycles per packet, and staged packet processing pipelines spread across physically decoupled match-action stages.

To address the trifecta of requirements for memory disaggregation, MIND ingeniously maneuvers through these constraints and harnesses the capabilities of contemporary programmable switches to enable in-network memory management for disaggregated architectures. This is achieved through a systematic overhaul of traditional memory management mechanisms:

MIND adopts a globally shared virtual address space, partitioned across memory blades to minimize the volume of address translation entries stored in the on-chip memory of switch ASICs. Simultaneously, it implements a physical memory allocation mechanism that evenly distributes allocations across memory blades for optimal memory throughput.

MIND incorporates domain-based memory protection, inspired by capability-based schemes, facilitating fine-grained and flexible protection by dissociating the storage of memory permissions from address translation entries. Interestingly, this decoupling reduces on-chip memory overheads in switch ASICs.

MIND adapts directory-based MSI coherence to the in-network setting, leveraging network-centric hardware primitives like multicast in switch ASICs to efficiently realize its coherence protocol.

To mitigate the performance impact of coarse-grained cache directory tracking due to limited on-chip memory in switch ASICs, MIND introduces a novel Bounded Splitting algorithm that dynamically sizes memory regions to constrain both switch storage requirements and performance overheads stemming from false invalidations.

The MIND design is realized on a disaggregated cluster emulated using traditional servers connected by a programmable switch. Results demonstrate that MIND facilitates transparent resource elasticity for real-world workloads while matching or even surpassing the performance of prior memory disaggregation proposals. However, workloads characterized by high read-write contention exhibit sub-linear scaling with additional threads due to the limitations of current hardware: present x86 architectures hinder the implementation of relaxed consistency models commonly employed in shared memory systems, and the switch TCAM capacity nears saturation with cache directory entries for such workloads. Potential approaches for enhancing scalability with future advancements in switch ASIC and compute blade architectures are discussed.

### 3.2.2 Background and Motivation

This section motivates MIND. We discuss key enabling technologies, followed by challenges in realizing memory disaggregation goals using existing designs.

Assumptions: We focus on memory disaggregation at the rack-scale, where memory and compute blades are connected by a single programmable switch. We restrict our scope to partial memory disaggregation: while most of the memory is network-attached, CPU blades possess a small amount (few GBs) of local DRAM as cache.

Enabling Technologies. We now briefly describe MIND's enabling technologies.

Programmable switches: In recent years, programmable switches have evolved along two well-coordinated directions: development of a flexible programming language for network switches and the design of switch hardware that can be programmed with it. These switches host an application-specific integrated circuit (ASIC), along with a general-purpose CPU with DRAM. The switch ASIC comprises ingress pipelines, a traffic manager, and egress pipelines, which process packets in that order. Programmability is facilitated through a programmable parser and match-action units in the ingress/egress pipelines.

The program defines how the parser parses packet headers to extract a set of fields, and multiple stages of match-action units process them. The general-purpose CPU is connected to the switch ASIC via a PCIe interface and serves two functions: (i) performing packet processing that cannot be performed in the ASIC due to resource constraints, and, (ii) hosting controller functions that compute network-wide policies and push them to the switch ASIC.

While this discussion focuses on switch ASICs with Reconfigurable Match Action Tables (RMTs), it is possible to realize MIND using FPGAs, custom ASICs, or even general-purpose CPUs. Each exposes different tradeoffs, but we adopt RMT switches due to their performance, availability, power, and cost efficiency.

DSM Designs: Traditionally, shared memory has been explored in the context of NUMA and distributed shared memory (DSM) architectures. In such designs, the virtual address space is partitioned across the various nodes, i.e., each partition has a home node that manages its metadata, e.g., the page table. Each node also has a cache to facilitate performance for frequently accessed memory blocks. We distinguish memory blocks from pages since caching granularities can be different from memory access granularities.

With the copies of blocks potentially residing across multiple node caches, coherence protocols are required to ensure each node operates on the latest version of a block. In popular directory-based invalidation protocols like MSI (used in MIND), each memory block can be in one of three states: Modified (M), where a single node has exclusive read and write access to the block; Shared (S), where one or more caches have shared read-only access to the block; and Invalid (I), where the block is not present in any cache. A directory tracks the state of each block, along with the list of nodes that currently hold the block in their cache. The directory is typically partitioned across the various nodes, with each home node tracking directory entries for its own address space partition. Memory access for a block that is not local involves contacting the home node for the block, triggering a state transition and potential invalidation of the block across other nodes, followed by retrieving the block from the node that owns it.

While it is possible to realize more sophisticated coherence protocols, we restrict our focus to MSI in this work due to its simplicity.
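The MSI state transitions described above can be summarized in a short sketch of a directory for a single home node. The names `State`, `Directory`, and `access` are hypothetical, and the sketch makes one simplifying assumption that the text does not specify: on a read to a Modified block, the previous owner writes back and keeps a read-only copy (downgrading to Shared) rather than dropping the block.

```python
from enum import Enum
from typing import Dict, List, Set, Tuple


class State(Enum):
    I = "Invalid"
    S = "Shared"
    M = "Modified"


class Directory:
    """Tracks the MSI state and the set of caching nodes for each block."""

    def __init__(self) -> None:
        self.entries: Dict[int, Tuple[State, Set[int]]] = {}

    def state(self, block: int) -> State:
        return self.entries.get(block, (State.I, set()))[0]

    def sharers(self, block: int) -> Set[int]:
        return set(self.entries.get(block, (State.I, set()))[1])

    def access(self, block: int, node: int, write: bool) -> List[int]:
        """Apply an access by `node`; return the nodes that must be contacted
        (invalidated on a write, asked to write back and downgrade on a read)."""
        state, holders = self.entries.get(block, (State.I, set()))
        if write:
            # Any -> M: every other cached copy is invalidated.
            to_invalidate = sorted(holders - {node})
            self.entries[block] = (State.M, {node})
            return to_invalidate
        if state is State.M:
            if node in holders:
                return []  # the owner already has read/write access
            # M -> S: the owner writes back and both nodes share the block.
            self.entries[block] = (State.S, holders | {node})
            return sorted(holders)
        # I -> S or S -> S: add the reader to the sharer list.
        self.entries[block] = (State.S, holders | {node})
        return []
```

The returned list makes the cost of each access visible: reads of Shared blocks contact no other node, while a write to a widely shared block must reach every sharer.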

As outlined earlier, extending the benefits of resource disaggregation to memory and making them widely applicable to cloud services demands (i) low-latency and high-throughput access to memory, and (ii) a transparent memory abstraction that supports elastic scaling of memory and compute resources without requiring modifications to existing applications. Unfortunately, prior designs for memory disaggregation expose a hard tradeoff between these two goals. Specifically, transparent elastic scaling of an application's compute resources necessitates a shared memory abstraction over the disaggregated memory pool, which imposes non-trivial performance overheads due to the cache-coherence required for both application data and memory management metadata. We now discuss why this tradeoff is fundamental to existing designs. We focus on page-based memory disaggregation designs here.

Transparent designs: While transparent distributed shared memories (DSMs) have been studied for several decades, their adaptation to disaggregated memory has not been explored. We consider two possible adaptations for the approach outlined earlier to understand their performance overheads and shed light on why they have remained unexplored thus far. The first is a compute-centric approach, where each compute blade owns a partition of the address space and manages the corresponding metadata, but the memory itself is disaggregated. A compute blade must now wait for several sequential remote requests to be completed for every un-cached memory read or write, for example, to the remote home compute blade to trigger state transition for the block and invalidate relevant blades, and to fetch the memory block from the blade that currently owns the block.

An alternate memory-centric design that places metadata at corresponding home memory blades still suffers multiple sequential remote requests for a memory access as before, with the only difference being that the home node accesses are now directed to memory blades. While these overheads can be reduced by caching the metadata at compute blades, it necessitates coherence for the metadata as well, incurring additional design complexity and performance overheads.

Non-transparent designs: Due to the anticipated overheads of adapting DSM to memory disaggregation, existing proposals limit processes to a single compute blade, i.e., while compute blades cache data locally, different compute blades do not share memory to avoid sending coherence messages over the network. As such, these proposals achieve memory performance only by limiting transparent compute elasticity for an application to the resources available on a single compute blade, requiring application modifications if they wish to scale beyond a compute blade.

### 3.2.3 MIND Design

To break the tradeoff highlighted above, we place memory management in the network fabric for three reasons. First, the network fabric enjoys a central location in the disaggregated architecture. Therefore, placing memory management in the data access path between compute and memory resources obviates the need for metadata coherence. Second, modern network switches permit the implementation of such logic in integrated programmable ASICs. These ASICs are capable of executing at line rate even for multi-terabit traffic. In fact, many memory management functionalities have similar counterparts in networking, allowing us to leverage decades of innovation in network hardware and protocol design for disaggregated memory management.

Finally, placing the cache coherence logic and directory in the network switch permits the design of specialized in-network coherence protocols with reduced network latency and bandwidth overheads. Effective in-network memory management requires: (i) efficient storage by minimizing in-network metadata given the limited memory on the switch data plane; (ii) high memory throughput by load-balancing memory traffic across memory blades; and (iii) low access latency to shared memory via efficient cache coherence design that hides the network latency.

Next, we outline three design principles followed by MIND to realize the above goals and provide an overview of its design.

### MIND Design Principles

MIND adheres to three key principles to achieve the memory disaggregation goals outlined earlier:

P1: Decouple memory management functionalities to allow each to be optimized for its specific objectives.

P2: Utilize a centralized control plane's global view of the disaggregated memory subsystem to compute optimal policies for each memory management functionality.

P3: Leverage network-centric hardware primitives within the programmable switch ASIC to efficiently implement the policies determined by P2.

MIND applies P1 by separating memory allocation from addressing, address translation from memory protection, and cache access and eviction from coherence protocol execution. P2 and P3 are employed to efficiently realize these objectives. Traditional server-based operating systems, however, are unable to take advantage of these principles due to their reliance on fixed-function hardware modules, such as the MMU and memory controller, which typically couple various memory management tasks (e.g., address translation and memory protection in page-table walkers) for reasons of complexity, performance, and power efficiency.

#### Overview

MIND provides a transparent virtual memory abstraction to applications, similar to traditional server-based OSes. However, unlike previous disaggregated memory designs, MIND places all memory management logic and metadata in the network, rather than on CPU or memory blades, or a separate global controller.

In MIND's design, CPU blades run user processes and threads and possess a small amount of local DRAM used as a cache. Memory allocations and deallocations from user processes are intercepted at the CPU blade and forwarded to the switch control plane. The control plane, which has a global view of the system, performs memory allocations, assigns permissions, and responds to user processes. All memory load/store operations are handled by the CPU blade's cache. This cache is virtually addressed and stores permissions to enforce memory protection. If a page is not cached locally, a page fault is triggered, causing the CPU blade to fetch the page from memory blades using RDMA requests, evicting other cached pages if necessary. If a memory access requires a coherence state update (e.g., a store on a shared block), a page fault triggers the cache coherence logic at the switch.

MIND performs page-level remote accesses due to its page-fault-based design, although future CPU architectures may support more flexible access granularities. Since CPU blades do not store memory management metadata, the RDMA requests contain only virtual addresses, without any endpoint information for the memory blade holding the page. The switch data plane intercepts these requests, handles cache coherence by updating the cache directory, and performs cache invalidations on other CPU blades. It also ensures that the requesting process has the appropriate permissions. If no CPU blade cache holds the page, the data plane translates the virtual address to a physical one and forwards the request to the appropriate memory blade.

In this design, memory blades merely store the actual memory pages and serve RDMA requests for physical pages. Unlike earlier approaches that rely on RPC handlers and polling threads, MIND uses one-sided RDMA operations to eliminate the need for CPU cycles on disaggregated memory blades, moving towards true hardware resource disaggregation where memory blades do not need general-purpose CPUs. Placing memory management logic and metadata in the network enables simultaneous optimization for both memory performance and resource elasticity. We now explain how MIND optimizes for the goals of memory allocation and addressing, memory protection, and cache coherence, while adhering to the constraints of programmable switches. We also discuss how MIND handles failures.

Memory Allocation & Addressing. Traditional virtual memory uses fixed-sized pages as basic units for translation and protection, which can lead to inefficiencies in storage due to memory fragmentation. Smaller pages reduce fragmentation but require more translation entries, and larger pages have the opposite effect. To address this, MIND decouples address translation from protection. MIND's translation is blade-based, while protection is virtual memory area (vma)-based.

Storage-efficient address translation: MIND avoids page-based protection and instead uses a single global virtual address space across all processes, allowing shared translation entries. MIND partitions the virtual address space across different memory blades, mapping each blade's portion to a contiguous physical address range. This approach reduces the storage needed for translation entries in the switch's data plane. The mapping is adjusted when memory blades are added, removed, or when memory is moved.

Balanced memory allocation & reduced fragmentation: The control plane tracks total memory allocation across blades and places new allocations on the blade with the least allocated memory, achieving load balancing. Additionally, MIND minimizes fragmentation within each memory blade by using traditional virtual memory allocation schemes, which produce non-overlapping virtual memory areas (vmas).
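The combination of blade-based translation and least-allocated placement can be illustrated with a small sketch. It is a simplification under stated assumptions, not MIND's control-plane code: `Blade`, `Allocator`, `allocate`, and `translate` are hypothetical names, allocation within a blade is a bump pointer with no frees, and each blade's slice of the global virtual address space maps to physical offsets starting at zero.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Blade:
    blade_id: int
    va_base: int        # start of this blade's slice of the global VA space
    size: int           # bytes of memory on the blade
    allocated: int = 0  # bytes handed out so far (bump pointer, no frees)


class Allocator:
    def __init__(self, blade_sizes: List[int]) -> None:
        self.blades: List[Blade] = []
        base = 0
        for i, size in enumerate(blade_sizes):
            self.blades.append(Blade(i, base, size))
            base += size

    def allocate(self, length: int) -> int:
        """Place the allocation on the least-allocated blade that can fit it."""
        candidates = [b for b in self.blades if b.size - b.allocated >= length]
        if not candidates:
            raise MemoryError("no blade has enough free memory")
        blade = min(candidates, key=lambda b: b.allocated)
        va = blade.va_base + blade.allocated
        blade.allocated += length
        return va

    def translate(self, va: int) -> Tuple[int, int]:
        """One range entry per blade: VA -> (blade id, physical offset)."""
        for b in self.blades:
            if b.va_base <= va < b.va_base + b.size:
                return b.blade_id, va - b.va_base
        raise ValueError("address outside the global virtual address space")
```

Note that the translation table has one entry per blade regardless of how many allocations exist, which is the storage saving the blade-based scheme is after.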

Isolation: MIND's global virtual address space does not compromise process isolation. The switch control plane intercepts all allocation requests and ensures that they do not overlap between processes. MIND's vma-based protection allows for flexible access control within a global virtual address space.

Support for static virtual addresses: MIND supports unmodified applications with static virtual addresses embedded in their binaries or OS optimizations like page migration. It maintains separate range-based address translations for static virtual addresses or migrated memory, ensuring correctness through longest-prefix matching in the switch's TCAM.
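Longest-prefix matching lets a narrow, range-specific entry override a broad default without removing it. The sketch below emulates this in software; `PrefixTable`, its 48-bit address width, and the string targets are illustrative assumptions and do not describe the actual switch table layout.

```python
from typing import List, Optional, Tuple


class PrefixTable:
    """Software emulation of longest-prefix matching over virtual addresses."""

    def __init__(self, addr_bits: int = 48) -> None:
        self.addr_bits = addr_bits
        self.entries: List[Tuple[int, int, str]] = []  # (prefix value, length, target)

    def add(self, base: int, prefix_len: int, target: str) -> None:
        # Keep only the top `prefix_len` bits of the base address.
        shift = self.addr_bits - prefix_len
        self.entries.append((base >> shift, prefix_len, target))

    def lookup(self, addr: int) -> Optional[str]:
        best: Optional[Tuple[int, str]] = None
        for value, plen, target in self.entries:
            if addr >> (self.addr_bits - plen) == value:
                if best is None or plen > best[0]:
                    best = (plen, target)  # prefer the most specific match
        return None if best is None else best[1]


table = PrefixTable()
table.add(0x0, 16, "blade0")                  # broad blade-based default entry
table.add(0x40000000, 36, "static-range")     # one 4 KB range with its own mapping
```

A lookup inside the 4 KB range resolves to the more specific entry, while neighbouring addresses still fall through to the blade-based default.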

Memory Protection. MIND decouples translation from protection by using a separate table to store memory protection entries in the data plane. Applications can assign access permissions to vmas of any size, and the protection table stores entries for these vmas. This flexible protection system allows MIND to efficiently manage memory protection with a relatively small number of entries.

Fine-grained, flexible memory protection: MIND introduces two abstractions: protection domains and permission classes. Protection domains define which entities can access a memory region, while permission classes specify the types of access allowed. MIND's control plane provides APIs that allow applications to assign protection domains and permission classes to vmas. These entries are stored in the protection table, and MIND efficiently supports this matching using TCAM-based range matches in the switch ASIC.

Optimizing for TCAM storage: MIND ensures storage efficiency by aligning virtual address allocations to power-of-two sizes, allowing regions to be represented using a single TCAM entry. Adjacent entries with the same protection domain and permission class are coalesced to further reduce storage requirements.
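The benefit of power-of-two alignment can be seen by counting how many aligned power-of-two blocks, each expressible as one ternary (prefix) match, are needed to cover a range. The sketch below is an assumption-laden illustration: `range_to_prefixes` and `ProtectionTable` are hypothetical names, protection domains are modelled as integer IDs, permission classes as the strings `"r"` and `"rw"`, and coalescing of adjacent entries is omitted.

```python
from typing import List, Tuple


def range_to_prefixes(base: int, length: int) -> List[Tuple[int, int]]:
    """Split [base, base + length) into aligned power-of-two blocks (base, size)."""
    out: List[Tuple[int, int]] = []
    end = base + length
    while base < end:
        # Largest power of two dividing `base` (unbounded when base is 0).
        align = base & -base if base else 1 << end.bit_length()
        # Largest power of two that still fits in the remaining range.
        size = min(align, 1 << ((end - base).bit_length() - 1))
        out.append((base, size))
        base += size
    return out


class ProtectionTable:
    """Entries of (protection domain, range) -> permission class."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int, int, str]] = []

    def grant(self, domain: int, base: int, length: int, permission: str) -> None:
        for b, s in range_to_prefixes(base, length):
            self.entries.append((domain, b, s, permission))

    def allowed(self, domain: int, addr: int, want_write: bool) -> bool:
        for d, b, s, perm in self.entries:
            if d == domain and b <= addr < b + s:
                return perm == "rw" or (perm == "r" and not want_write)
        return False  # no matching entry: access denied
```

An allocation of size `0x10000` aligned at `0x10000` needs a single entry, whereas an unaligned range such as 12 KB starting at 4 KB needs two.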

Caching & Cache Coherence. In MIND, caches reside on compute blades, while the coherence directory and logic are located in the switch. This placement reduces latency for coherence protocol execution. MIND addresses challenges in adapting traditional cache management to an in-network setting by decoupling cache and directory granularities and dynamically optimizing region sizes.

Decoupling cache access & directory entry granularities: MIND decouples cache access from directory entry granularity. Cache accesses and memory movements are performed at fine granularities (e.g., 4 KB pages), while directory entries are tracked at larger, variable-sized regions. Invalidation of a region triggers the invalidation of all dirty pages tracked by the CPU blade caches.

Storage & performance-efficient sizing of regions: MIND uses the global view of memory traffic at the switch control plane to dynamically adjust region sizes, balancing between performance (minimizing false invalidations) and directory storage efficiency.

Handling Failures. MIND leverages prior work to handle CPU and memory blade failures. For switch failures, the control plane is consistently replicated at a backup switch, ensuring that data plane state can be reconstructed.

Communication failures: MIND uses ACKs and timeouts to detect packet losses. In case of a timeout during invalidation, the compute blade sends a reset message to the control plane, which flushes the data and removes the corresponding cache directory entry, preventing deadlocks during state transitions.

### 3.2.4 MIND Implementation

MIND integrates with the Linux memory and process management system call APIs and splits its kernel components across CPU blades and the programmable switch. We will now describe these kernel components, along with the RDMA logic required for the memory blades.

CPU Blade. MIND uses a partial disaggregation model, where CPU blades have a small amount of local DRAM that acts as a cache. In our prototype, traditional servers are used for the CPU blades, with no hardware modifications. We implemented MIND's CPU blade kernel components as modifications to the Linux 4.15 kernel, providing transparent access to disaggregated memory by modifying how vmas and processes are managed and how page faults are handled.

Managing vmas: The kernel module intercepts process heap allocation and deallocation requests, such as brk, mmap, and munmap, forwarding them to the control plane at the switch over a reliable TCP connection. The switch creates new vma entries and returns the corresponding values (e.g., the virtual address of the allocated vma), ensuring transparency for user applications. Error codes like ENOMEM are returned for errors, similar to standard Linux system calls.

Managing processes: The kernel module also intercepts and forwards process creation and termination requests, such as exec and exit, to the switch control plane, which maintains internal process representations (i.e., Linux's task\_struct) and manages the mapping between compute blades and the processes they host. Threads across CPU blades are assigned the same PID if they belong to the same process, enabling them to share the same address space transparently through the memory protection and address translation rules installed at the switch. We place threads and processes across compute blades in a round-robin fashion without focusing on scheduling.

Page fault-driven access to remote memory: When a user application attempts to access a memory address not present in the CPU blade cache, a page fault handler is triggered. The CPU blade sends a one-sided RDMA read request to the switch with the virtual address and requested permission class (read or write). The page is registered to the NIC as the receiving buffer, eliminating the need for additional data copies. Once the page is received, the local memory structures are populated, and control is returned to the user. The CPU blade DRAM cache handles cache invalidations for coherence, tracking writable pages locally and flushing them when receiving invalidation requests.

This approach provides transparent access to disaggregated memory but restricts MIND to the stronger Total Store Order (TSO) memory consistency model. Weaker consistency models, such as Partial Store Order (PSO), which allow asynchronous propagation of writes, are challenging to implement on traditional x86 and ARM architectures due to the inability to trigger page faults only on reads without also triggering them on writes. This limitation affects scalability for workloads with high read/write contention to shared memory regions.

Memory Blade. MIND does not require any compute or data plane processing logic on memory blades, eliminating the need for general-purpose CPUs. In our prototype, memory blades are traditional Linux servers, so we use a kernel module to perform RDMA-specific initializations. When a memory blade comes online, its kernel registers physical memory addresses to the RDMA NIC and reports them to the global controller. After this, one-sided RDMA requests from CPU blades are handled directly by the memory blade NIC without CPU involvement. Ideally, future memory blades could be fully implemented in hardware, without requiring a CPU, to reduce costs and simplify design.

Programmable Switch. MIND's programmable switch is implemented on a 32-port EdgeCore Wedge switch with a 6.4 Tbps Tofino ASIC, an Intel Broadwell processor, 8 GB of RAM, and 128 GB of SSD storage. The general-purpose CPU hosts the MIND control program, handling process, memory, and cache directory management, while the ASIC performs address translation, memory protection, directory state transitions, and virtualizes RDMA connections between compute and memory blades.

Process & memory management: The control plane hosts a TCP server to handle system call intercepts from CPU blades and maintains traditional Linux data structures for process and memory management. Upon receiving a system call, the control plane updates these structures and responds with system call return values to maintain transparency.

Cache directory management: MIND reserves SRAM at the switch's data plane for directory entries, partitioned into fixed-size slots, one per memory region. The control plane maintains a free list of available slots and a hash table mapping base virtual addresses of cache regions to their corresponding directory entries in the SRAM. When a directory entry is created or a region is split, slots are allocated or deallocated as needed. Directory state transitions are handled across multiple match-action units (MAUs) due to limited compute capabilities in each unit, with state transitions split between them and recirculating the packet within the switch data plane as needed.
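The control-plane bookkeeping described above (a free list of fixed-size slots plus a hash table keyed by region base address) can be sketched as follows. `DirectorySlots` and its methods are hypothetical names; the sketch covers only slot allocation and region splitting into two equal halves, and does not model the data-plane state transitions or packet recirculation.

```python
from typing import Dict, List, Tuple


class DirectorySlots:
    """Free list of directory slots plus a map from region base VA to its slot."""

    def __init__(self, num_slots: int) -> None:
        # Reverse order so that pop() hands out slot 0 first.
        self.free: List[int] = list(range(num_slots - 1, -1, -1))
        self.regions: Dict[int, Tuple[int, int]] = {}  # base VA -> (slot, size)

    def create(self, base: int, size: int) -> int:
        if not self.free:
            raise MemoryError("no free directory slots")
        slot = self.free.pop()
        self.regions[base] = (slot, size)
        return slot

    def remove(self, base: int) -> None:
        slot, _ = self.regions.pop(base)
        self.free.append(slot)  # the slot becomes reusable

    def split(self, base: int) -> Tuple[int, int]:
        """Halve a region; the lower half keeps its slot, the upper gets a new one."""
        slot, size = self.regions[base]
        half = size // 2
        upper_slot = self.create(base + half, half)  # raises if no slot is free
        self.regions[base] = (slot, half)
        return base + half, upper_slot
```

Because a split consumes one additional slot, the number of free slots bounds how finely regions can be tracked, which is the storage side of the tradeoff discussed earlier.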

Virtualizing RDMA connections: MIND virtualizes RDMA connections between all possible CPU and memory blade pairs by transforming and redirecting RDMA requests and responses. Once a request's destination is identified through address translation or cache coherence, the switch updates the packet header fields (IP/MAC addresses and RDMA parameters) before forwarding the request to the correct memory blade.

### 3.2.5 Evaluation

#### 3.2.6 Discussion and Conclusion

We start at a relatively modest scale, namely rack scale [45, 46]. Our perspective aligns with placing the operating system functionality for non-disaggregated resources within the interconnect, which serves as the network infrastructure in a rack-scale system (or potentially utilizing CXL, as discussed in §??). The advantage of housing this functionality in the interconnect is that it grants the system a global view, as every compute-memory operation must traverse the interconnect.

The network emerges as a compelling choice for an interconnect in memory disaggregation due to several key factors. First, the expansion of network bandwidth surpassing that of memory bandwidth [47] positions it as a prime candidate for serving as a disaggregation interconnect. Furthermore, advancements in programmable networking, exemplified by programmable switches [48–51], enable capabilities such as data storage (state-keeping) and processing at line-rate [52]. These capabilities empower the network to implement critical OS functionality effectively.

There are several essential requirements for memory management within a disaggregated architecture. Firstly, the interconnect operating system must operate without additional overhead, ensuring minimal latency and facilitating high-throughput access to remote memory. Additionally, given that programs may utilize various resources across compute and memory blades, the operating system should enable elastic scaling for both memory and computational resources. Another advantageous aspect of housing OS functionality within the interconnects is the ability to shield the application entirely from the OS logic, thereby promoting compatibility with unmodified applications.

To fulfill the three essential requirements, we have developed a system known as MIND [1], leveraging the capabilities of contemporary programmable switches to facilitate in-network memory management. Drawing inspiration from the similarity between memory address translation and network address lookups, we utilize the existing ingress/egress pipelines and Reconfigurable Match Action Tables (RMTs) [53] within programmable switches to implement address translation tables and protection entries. Additionally, we implement a directory-based MSI coherence protocol [54], as data may be accessed coherently by multiple compute nodes. These operations are performed at line rate, ensuring low-latency, high-throughput memory access. It's worth noting that our implementation is confined to the interconnect (programmable switch) and the compute node OS kernel, allowing applications to run seamlessly on MIND.

Figure ?? illustrates the fundamental structure of the MIND system. Compute nodes house CPUs and a limited cache, while memory nodes exclusively contain memory resources. The programmable switch is situated atop the rack, with the control plane managing coarse-grained operations like memory allocation, permission assignment, and memory coherence directory management. Meanwhile, the data plane handles memory address translation, protection, and coherence lookup at line rate.

The dataflow (Figure ??) of a memory access begins with a load/store instruction from the compute node CPU. When the compute node OS kernel detects that the required data is not present on the node, it triggers a page fault and issues a network request to the switch for permission updates and data retrieval. This request traverses the switch's data plane, which fetches the required data from the memory node. Simultaneously, the switch invalidates existing copies of the data on other compute nodes if the source node requests exclusive access.
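To make the coherence step in this dataflow concrete, the following is a simplified software model of a directory-based MSI entry. It is only a sketch under stated assumptions: the names `DirEntry`, `Actions`, and `access` are hypothetical, sharers are tracked in a 32-bit mask, and the real MIND logic is realized in switch match-action stages rather than in C++.

```cpp
#include <cassert>
#include <cstdint>

enum class State { Invalid, Shared, Modified };

// One directory entry: coherence state plus a bitmask of compute nodes
// that currently hold a copy of the region.
struct DirEntry {
  State state;
  std::uint32_t sharers;
};

// Side effects the directory must trigger on other compute nodes.
struct Actions {
  std::uint32_t invalidate;  // nodes whose copies must be dropped
  std::uint32_t downgrade;   // nodes that must write back and keep a read-only copy
};

// Applies one access by `node` and returns the resulting actions.
Actions access(DirEntry& e, unsigned node, bool exclusive) {
  assert(node < 32);
  const std::uint32_t me = 1u << node;
  Actions a = {0, 0};
  if (exclusive) {
    // Write: every other copy is invalidated and the requester becomes owner.
    a.invalidate = e.sharers & ~me;
    e.sharers = me;
    e.state = State::Modified;
  } else if (e.state == State::Modified && e.sharers != me) {
    // Read of a region owned by another node: the owner is downgraded.
    a.downgrade = e.sharers;
    e.sharers |= me;
    e.state = State::Shared;
  } else if (e.state != State::Modified) {
    // Read of an Invalid or Shared region: add the requester as a sharer.
    e.sharers |= me;
    e.state = State::Shared;
  }
  // Remaining case: the requester already owns the region, so nothing changes.
  return a;
}
```

The `invalidate` mask corresponds to the invalidations that the switch issues to other compute nodes when the source node requests exclusive access.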

We've faced two main challenges with programmable switch ASICs: limited on-chip memory and restricted computational power. The few megabytes of memory on switch ASICs are inadequate for traditional page tables managing terabytes of disaggregated memory. Moreover, the ASICs' computational constraints, necessary for maintaining line-rate processing, are evident in complex tasks like cache coherence. To counter these issues, we've separated memory addressing and protection to save hardware space. Additionally, we've utilized unique switch primitives like multicast operations to navigate computational limitations effectively.

# 3.3 Near Memory Processing

### 3.3.1 Introduction

Driven by increasing demands for memory capacity and bandwidth [55–61], poor scaling [62–64] and resource inefficiency [32,65] of DRAM, and improvements in Ethernet-based network speeds [47,66], recent years have seen significant efforts towards memory disaggregation [1,2,4,31,32]. Rather than scaling up a server's DRAM capacity and bandwidth, such proposals advocate disaggregating much of the memory over the network. The result is a set of CPU nodes equipped with a small amount of DRAM used as cache<sup>1</sup>, accessing memory across a set of network-attached memory nodes with large DRAM pools (Fig. 3.1 (top)). With allocation flexibility across CPU and memory nodes, disaggregation enables high utilization and elasticity.

<sup>1.</sup> Not to be confused with die-stacked hardware DRAM caches [67–69].

Fig. 3.1: **Need for accelerating pointer traversals.** (top) The performance of pointer traversals in disaggregated architectures is bottlenecked by slow memory interconnect. (bottom) Just as caches offer limited but fast memory near CPUs, we argue that memory needs a counterpart for traversal-heavy workloads: a lightweight but fast accelerator for cache-unfriendly pointer traversals.

Despite drastic improvements in recent years, the limited bandwidth and high latency of network-attached memory remain a hurdle in adopting disaggregated memory, with speed-of-light constraints making it impossible to improve network latency beyond a point. Even with near-terabit links and hardware-assisted protocols like RDMA [70], remote memory accesses are an order of magnitude slower than local memory accesses [3]. Emerging CXL interconnects [10] share a similar trend: around 300 ns of CXL memory latency compared to 10–20 ns of L3 cache latency [37]. Although efficient caching strategies at the CPU node can reduce average memory access latency and volume of network traffic to remote memory, the benefit of such strategies is limited by data locality and the size of the cache on the CPU node. In many cases, remote memory accesses are unavoidable, especially for applications that rely on efficient in-memory pointer traversals on linked data structures, such as lookups on index structures [71–81] in databases and key-value stores, and traversals in graph analytics [82–85] (Fig. 3.2, §3.3.2).



Fig. 3.2: **Time cloud applications spend in pointer traversals.** See §3.3.2 for details.

Similar to how CPUs have small but fast memory (i.e., caches) for quick access to popular data, we argue that memory nodes should also include lightweight but fast processing units with high-bandwidth, low-latency access to memory to speed up pointer traversals (Fig. 3.1 (bottom)). Moreover, the interconnect should facilitate efficient and scalable distributed traversals for deployments with multiple memory nodes that cater to large-scale linked data structures. Prior works have explored systems and API designs for such processing units under multiple settings, ranging from near-memory processing and processing-in-memory approaches [86–114] for single-server architectures, to the use of CPUs [28,115–119] or FPGAs [120,121] near remote/disaggregated memory, but these have several key shortcomings.

Specifically, existing approaches are limited in scale and expose a three-way tradeoff between expressiveness, energy efficiency, and performance. First, and perhaps most crucially, none of the existing approaches can accelerate pointer traversals that span *multiple* network-attached memory nodes.

This limits memory utilization and elasticity since applications must confine their data to a single memory node to accelerate pointer traversals. Their inability to support distributed pointer traversals stems from complex management of address translation state that is required to identify if a traversal can occur locally or must be re-routed to a different memory node (§3.3.2). Second, existing single-node approaches use full-fledged CPUs for expressive and performant execution of pointer traversals [28,115–117]. However, coupling large amounts of processing capacity with memory (which has utility in reducing data movement in PIM architectures [86–98]) goes against the very spirit of memory disaggregation since it leads to poor utilization of compute resources and, consequently, poor energy efficiency.

Approaches that use wimpy processors at SmartNICs [122, 123] instead of CPUs retain expressiveness, but the limited processing speeds of wimpy nodes curtail their performance and ultimately lead to lower energy efficiency due to their lengthened executions (§3.6.1, [120]). Lastly, FPGA-based [120, 121, 124] and ASIC-based [113, 114] approaches achieve performance and energy efficiency by hard-wiring pointer traversal logic for specific data structures, limiting their expressiveness.

We design PULSE<sup>2</sup>, a distributed pointer-traversal framework for rack-scale disaggregated memory, to meet all of the above needs — namely, expressiveness, energy efficiency, performance — via a principled redesign of near-memory processing for disaggregated memory. Central to PULSE's design is an expressive iterator interface that readily lends itself to a unifying abstraction across most pointer traversals in linked data structures used in key-value stores [24, 125], databases [76–78,80,126], and big-data analytics [82–85] (§3.4). PULSE's use of this abstraction not only makes it immediately useful in this large family of real-world traversal-heavy use cases, but also enables (i) the use of familiar compiler toolchains to support these use cases with little to no application modifications and (ii) the design of tractable hardware accelerators and efficient distributed traversal mechanisms that exploit properties unique to iterator abstractions.

In particular, PULSE enables transparent and efficient execution of pointer traversals for our iterator abstraction via a novel accelerator that employs a disaggregated architecture to decouple logic and memory pipelines, exploiting the inherently sequential nature of compute and memory accesses in iterator execution (§3.4.1). This permits high utilization by provisioning more memory and fewer logic pipelines to cater to memory-centric pointer traversal workloads. A scheduler breaks pointer traversal logic from multiple concurrent workloads across the two sets of pipelines and employs a novel multiplexing strategy to maximize their utilization. While our implementation leverages an FPGA-based SmartNIC due to the high cost and complexity of ASIC fabrication, our ultimate vision is an ASIC-based realization for improved performance and energy efficiency.

<sup>2.</sup> Processing Unit for Linked StructurEs.

We enable distributed traversals by leveraging the insight that pointer traversal across network-attached memory nodes is equivalent to packet routing at the network switch (§3.5). As such, PULSE leverages a programmable network switch to inspect the next pointer to be traversed within iterator requests and determine the next memory node to which the request should be forwarded — both at line rate. We implement a real-system prototype of PULSE on a disaggregated rack of commodity servers, SmartNICs, and a programmable switch with full-system effects. None of PULSE's hardware or software changes are invasive or overly complex, ensuring deployability. Our evaluation of end-to-end real-world workloads shows that PULSE outperforms disaggregated caching systems with 9–34× lower latency and 28–171× higher throughput. Moreover, our Xilinx XRT [127] and Intel RAPL [128]-based power analysis shows that PULSE consumes 4.5–5× less energy than RPC-based schemes (§3.6).

#### 3.3.2 Motivation and PULSE Overview

#### Need for Accelerating Pointer Traversals

Memory-intensive applications [55–61] often require traversing linked structures like lists, hash tables, trees, and graphs. While disaggregated architectures provide large memory pools across network-attached memory nodes, traversing pointers over the network is still slow [3]. Recent proposals [1–3,31,32] alleviate this slowdown by using the DRAM at the CPU nodes to cache "hot" data, but such caches often fare poorly for pointer traversals, as we show next.

Pointer traversals in real-world workloads. Prior studies [61,84,125,129–132] have shown that real-world data-centric cloud applications spend anywhere from 21% to 97% of execution time traversing pointers. We empirically analyze the time spent in pointer traversals for three representative cloud applications (a WebService frontend [28], indexing on WiredTiger [126], and time-series analysis on BTrDB [133]) with swap-based disaggregated memory [32]<sup>3</sup>. We vary the cache size at the CPU node from 6.25% to 100% of each application's working set size. Fig. 3.2(a) shows that (i) all three applications spend a significant fraction of their execution time (13.6%, 63.7%, and 55.8%, respectively) traversing pointers even when their entire working set is cached, and (ii) the time spent traversing pointers (and thus, the end-to-end execution time) increases with smaller CPU node caches. While the impact of access skew is application-dependent, pointer traversals dominate application execution times when more of the application's working set size is remote.

<sup>3.</sup> We defer the details of the data structures and workloads employed by these applications, as well as the disaggregated memory setup, to §3.6.

Distributed traversals. As the number of applications and the working-set size per application grows larger, disaggregated architectures must allocate memory across multiple memory nodes to keep up. Such approaches [1,2,31,32] tend to strive for the smallest viable allocation granularity with reasonable metadata overheads (e.g., 1 GB in [2], 2 MB in [1]) since smaller allocations permit better load balancing and high memory utilization. Unfortunately, finer-grained allocations may cause an application's linked structures to get fragmented across multiple network-attached memory nodes, necessitating many distributed traversals.

Fig. 3.2(b) illustrates this impact on a setup with 1 compute and 4 memory nodes: even with large 1 GB allocations, WiredTiger and BTrDB require over 97% and 75% of their requests, respectively, to cross memory node boundaries at least once, with the volume of cross-node traffic increasing at smaller granularities. Fig. 3.2(c) shows the CDF of requests that require a certain number of memory node crossings. While the randomly ordered data in WiredTiger necessitate many cross-node traversals even for large allocations, the time-ordered data in BTrDB reduce cross-node traversals for larger allocation granularities by confining large time windows to the same memory node. However, smaller to moderate allocation granularities — required for high memory utilization — still require many cross-node traversals.
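The relationship between allocation granularity and cross-node traversals can be illustrated with a small counting routine. This is a sketch only: it assumes that allocation chunks are assigned to memory nodes round-robin, which is a hypothetical placement policy chosen for simplicity and not necessarily the one used in the measurements above; `node_of` and `count_crossings` are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Maps an address to a memory node, assuming chunks of `granularity` bytes
// are placed round-robin across `num_nodes` nodes (an illustrative policy).
unsigned node_of(std::uint64_t addr, std::uint64_t granularity, unsigned num_nodes) {
  assert(granularity > 0 && num_nodes > 0);
  return static_cast<unsigned>((addr / granularity) % num_nodes);
}

// Counts how often consecutive pointers on a traversal path live on
// different memory nodes, i.e., the number of memory node crossings.
unsigned count_crossings(const std::vector<std::uint64_t>& path,
                         std::uint64_t granularity, unsigned num_nodes) {
  unsigned crossings = 0;
  for (std::size_t i = 1; i < path.size(); ++i) {
    if (node_of(path[i], granularity, num_nodes) !=
        node_of(path[i - 1], granularity, num_nodes)) {
      ++crossings;
    }
  }
  return crossings;
}
```

For the same traversal path, shrinking `granularity` spreads nearby addresses over more nodes, which is the effect that makes fine-grained allocation costly for pointer traversals.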

#### Shortcomings of Prior Approaches

No prior work achieves all four properties required for pointer traversals on disaggregated memory: distributed execution, expressiveness, energy efficiency, and performance. We focus on network-attached memory, although a similar analysis extends to in-memory processing [86–114].

No support for distributed execution. Distributed pointer traversals are required to ensure applications can efficiently access large pools of network-attached memory nodes. Unfortunately, to our knowledge, none of the prior works support efficient multi-node pointer traversals. Therefore, applications must confine their data to a single node for efficient traversals, exposing a tradeoff between application performance and scalability. Recent proposals [29, 134–139] explore specialized data structures that co-design partitioning and allocation policies to reduce distributed pointer traversals atop disaggregated memory. Such approaches complement our work since they still require efficient distributed traversals when their optimizations are not applicable, e.g., not many data structures benefit from such specialized co-designs.



Fig. 3.3: **PULSE Overview.** Developers use PULSE's iterator interface (§3.4) to express pointer traversals, which are translated to the PULSE ISA by its dispatch engine (§3.4.1). During execution, the PULSE accelerator ensures energy efficiency (§3.4.1), and its in-network design enables distributed traversals (§3.5).

Poor utilization/power-efficiency in CPUs. Many prior works have explored remote procedure call (RPC) interfaces to enable offloading computation to CPUs on memory nodes [28,115–118]. While CPUs are performant and versatile enough to support most general-purpose computations, the same versatility makes them overkill for pointer traversal workloads in disaggregated architectures — the CPUs on memory nodes are likely to be underutilized and, consequently, waste energy (§3.6), since such workloads are memory-intensive and bounded by memory bandwidth rather than CPU cycles. Since inefficient power usage resulting from coupled compute and memory resources is the main problem disaggregation aims to resolve, leveraging CPUs at memory nodes essentially nullifies these benefits.

Limited expressiveness in FPGA/ASIC accelerators. Another approach explored in recent years uses FPGAs [120,121] or ASICs [113,114] at memory nodes for performance and energy efficiency. FPGA approaches exploit circuit programmability to realize performant on-path data processing, albeit only for specific data structures, limiting their expressiveness.

Although some FPGA approaches aim for greater expressiveness by serving RPCs [140], RPC logic must be pre-compiled before it is deployed and physically consumes FPGA resources. This limits how many RPCs can be deployed on the FPGA concurrently and also precludes runtime resource elasticity for different pointer traversal workloads. ASIC approaches either support a single data structure or provide a limited ISA specialized for a single data structure (e.g., linked-lists [113]), limiting their general applicability.

Poor performance/power efficiency in wimpy SmartNICs. The emergence of programmable SmartNICs has driven work on offloading computations to the onboard network processors. Some approaches utilize wimpy processors (e.g., ARM or RISC-V processors) [122] or RDMA processing units (PUs) [123] to support general-purpose computations near memory. While these wimpy processors can eliminate multiple network round trips in pointer traversal workloads, their processing speeds are far slower than CPU-based or FPGA-based accelerators. Often, such PUs can become a performance bottleneck, especially at high memory bandwidth (~500 Gbps) [3,123]. Moreover, wimpy processors tend not to be energy-efficient since their slower execution tends to waste more static power, resulting in higher energy per pointer traversal offload — an observation noted in prior work [120] and confirmed in our evaluation (§3.6).

#### PULSE Design Overview

PULSE innovates on three key design elements (Fig. 3.3). Central to PULSE's design is its iterator-based programming model (§3.4) that requires minimal effort to port real-world data structure traversals. PULSE supports *stateful* traversals using a *scratchpad* of pre-configured size, where developers can store and update arbitrary intermediate states (e.g., aggregators, arrays, lists, etc.) during the iterator's execution. Properties specific to iterator patterns enable tractable accelerator design and efficient distributed traversals in PULSE.

The iterator code provided by the data structure developer is translated into PULSE's instruction set architecture (ISA) to be executed by PULSE accelerators (§3.4.1). PULSE achieves energy efficiency and performance through a novel accelerator that employs disaggregated logic and memory pipelines and an ISA specifically designed for the iterator pattern. Our accelerator employs a scheduler specialized for its disaggregated architecture to ensure high utilization and performance.

PULSE supports scalable distributed pointer traversals by leveraging programmable network switches to reroute any requests that must cross memory node boundaries (§3.5). PULSE employs hierarchical address translation in the network, where memory node-level address translation is performed at the switch (i.e., a request is routed to the memory node based on its target address), and the memory node accelerator performs translation and protection for local accesses. During traversal, a memory node accelerator can return a request to the switch if it determines the address is not local; the switch re-routes the request to the correct memory node.
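The hierarchical routing described above can be sketched as a small simulation. It is a simplified model under stated assumptions: `SwitchTable`, `Range`, and `traverse` are hypothetical names, the linked structure is modeled as a map from an address to its next pointer, and the memory node's local ownership check reuses the switch's range table, whereas in PULSE each accelerator performs its own local translation and protection.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <unordered_map>
#include <vector>

// One contiguous address range owned by a memory node.
struct Range {
  std::uint64_t base;
  std::uint64_t size;
  unsigned node;
};

// Switch-level translation: address -> memory node.
class SwitchTable {
 public:
  void add(Range r) { ranges_.push_back(r); }

  std::optional<unsigned> route(std::uint64_t addr) const {
    for (const Range& r : ranges_) {
      if (addr >= r.base && addr - r.base < r.size) return r.node;
    }
    return std::nullopt;  // unmapped address
  }

 private:
  std::vector<Range> ranges_;
};

// Follows next pointers from `start` until a null (0) pointer and returns how
// many times the request passes through the switch: once to reach the first
// node, plus once per memory node crossing. Returns nullopt on a bad address.
std::optional<unsigned> traverse(
    const SwitchTable& sw,
    const std::unordered_map<std::uint64_t, std::uint64_t>& next_of,
    std::uint64_t start) {
  unsigned switch_passes = 0;
  std::optional<unsigned> current;  // node currently executing the request
  std::uint64_t cur = start;
  while (cur != 0) {
    const std::optional<unsigned> owner = sw.route(cur);
    if (!owner) return std::nullopt;
    if (owner != current) {  // address is not local: go back via the switch
      ++switch_passes;
      current = owner;
    }
    auto it = next_of.find(cur);
    if (it == next_of.end()) return std::nullopt;
    cur = it->second;
  }
  return switch_passes;
}
```

A traversal that stays on a single memory node passes through the switch once on the request path, while every boundary crossing adds one more pass.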

Assumptions. PULSE does not offload synchronization to its accelerators but instead requires the application logic at the CPU node to explicitly acquire/release appropriate locks for the offloaded operation. Recent efforts enable locking primitives on NICs [29, 134] and programmable switches [141]; these are orthogonal to our work and can be incorporated into PULSE. Finally, PULSE does not innovate on caching and adapts the caching scheme from prior work [28], which maintains a transparent cache within the data structure library.

# 3.4 PULSE Programming Model

We begin with PULSE's programming model since a carefully crafted interface is crucial to enable wide applicability for real-world traversal-heavy applications, as well as the design of tractable pointer traversal accelerators and efficient distributed traversal mechanisms. PULSE's interface is intended for data structure library developers to offload pointer traversals in linked data structures. Since PULSE code modifications are restricted to data structure libraries, existing applications utilizing their interfaces require no modifications.

We analyzed the implementations of a wide range of popular data structures [142–145] to determine the structures common to them in pointer traversals. We found that most traversals (1) initialize a start pointer using data structure-specific logic, (2) iteratively use data structure-specific logic to determine the next pointer to look up, and (3) check a data structure-specific termination condition at the end of each iteration to determine if the traversal should end. This structure resembles the *iterator* design pattern, a design motif common to almost all languages [144]. This is precisely what makes it an ideal candidate for the interface between the hardware and software layers for pointer traversals. As such, PULSE allows developers to program their data structure traversals using the iterator interface shown in Listing 3.1.

```
class pulse_iterator {
  virtual void init(void *) = 0;  // Implemented by developer
  virtual void *next() = 0;       // Implemented by developer
  virtual bool end() = 0;         // Implemented by developer

  unsigned char *execute() {      // Non-modifiable logic
    unsigned int num_iter = 0;
    while (!end() && num_iter++ < MAX_ITER)
      cur_ptr = reinterpret_cast<uintptr_t>(next());
    return scratch_pad;
  }

  uintptr_t cur_ptr;
  unsigned char scratch_pad[MAX_SCRATCHPAD_SIZE];
};
```

Listing 3.1: PULSE interface.

The interface exposes three functions that must be implemented by the user: (1) init(), which takes as input arbitrary data structure-specific state to initialize the start pointer, (2) next(), which updates the current pointer to the next pointer it must traverse to, and (3) end(), which determines if the pointer traversal should end (either in success or failure) based on the current pointer. PULSE then uses the provided implementations for these functions to execute the pointer traversal iteratively, using the execute() function. We discuss two key novel aspects of our iterator abstraction: stateful traversals, which increase the expressiveness of operations on linked data structures, and bounded computations, which limit it.

Stateful traversals. Pointer traversals in many data structures are stateful, and the nature of the state can vary widely. For instance, in hash table lookups, the state is the search key that must be compared against a linked list of keys in a hash bucket. In contrast, summing up values across a range of keys in a B-Tree requires maintaining a running variable for storing the sum and updating it for each value encountered in the range. To facilitate this, PULSE iterators maintain a scratch\_pad that the developer can use to store an arbitrary state. The state is initialized in init(), updated in next(), and finalized in end(). Since execute() in PULSE's iterator interface returns the contents of scratch\_pad (Line 10), developers can place the data that they want to receive in it.

Bounded computations. PULSE accelerators support only lightweight processing in memory-intensive operations for high memory bandwidth utilization. While init() is executed on the CPU node, next() and end() are offloaded to PULSE accelerators; hence, PULSE limits what memory accesses and computations can be performed in them in two ways. Within each iteration, PULSE disallows nondeterministic executions, such as unbounded loops, i.e., loops that cannot be unrolled to a fixed number of instructions.

Across iterations, execute() in Listing 3.1 limits the maximum number of iterations that a single request is allowed to perform. This ensures that a particularly long traversal does not block other requests for a long time. If a request exceeds the maximum iteration count, PULSE terminates the traversal and returns the scratch\_pad value to the CPU node, which can issue a new request to continue the traversal from that point.
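Both aspects can be seen together in a small host-only sketch: a stateful traversal that keeps a running sum in the scratchpad, and a CPU-side loop that re-issues the request whenever the iteration bound cuts a traversal short. The base class is adapted from Listing 3.1 (made public and given initializers so that it compiles standalone); `list_sum`, `sum_list`, the `Node` layout, and the constants `MAX_ITER = 4` and `MAX_SCRATCHPAD_SIZE = 64` are assumptions made purely for illustration, and everything runs locally instead of on an accelerator.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr unsigned MAX_ITER = 4;              // assumed bound, kept small on purpose
constexpr unsigned MAX_SCRATCHPAD_SIZE = 64;  // assumed scratchpad size in bytes

// Adapted from Listing 3.1 so that it compiles on its own.
class pulse_iterator {
 public:
  virtual ~pulse_iterator() = default;
  virtual void init(void* arg) = 0;
  virtual void* next() = 0;
  virtual bool end() = 0;

  unsigned char* execute() {
    unsigned int num_iter = 0;
    while (!end() && num_iter++ < MAX_ITER)
      cur_ptr = reinterpret_cast<std::uintptr_t>(next());
    return scratch_pad;
  }

  std::uintptr_t cur_ptr = 0;
  unsigned char scratch_pad[MAX_SCRATCHPAD_SIZE] = {};
};

struct Node {
  long value;
  Node* next;
};

// Stateful traversal: the running sum lives in the scratchpad.
class list_sum : public pulse_iterator {
 public:
  void init(void* head) override {
    cur_ptr = reinterpret_cast<std::uintptr_t>(head);
  }
  void* next() override {
    const Node* n = reinterpret_cast<const Node*>(cur_ptr);
    const long sum = read_sum() + n->value;
    std::memcpy(scratch_pad, &sum, sizeof(sum));
    return n->next;
  }
  bool end() override { return cur_ptr == 0; }

  long read_sum() const {
    long s = 0;
    std::memcpy(&s, scratch_pad, sizeof(s));
    return s;
  }
};

// CPU-node side: keep re-issuing the request, carrying cur_ptr and the
// scratchpad forward, until the traversal reports completion.
long sum_list(Node* head, unsigned* requests) {
  list_sum it;
  it.init(head);
  *requests = 0;
  do {
    it.execute();
    ++*requests;
  } while (!it.end());
  return it.read_sum();
}
```

With a ten-element list and a bound of four iterations per request, the traversal completes in three requests; the partial sum survives between requests because it is part of the returned scratchpad.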

An illustrative example. We demonstrate how the find() operation on C++ STL unordered\_map can be ported to PULSE. Listing 3.2 shows a simplified version of its implementation in STL — the pointer traversal begins by computing a hash function and determining a pointer to the hash bucket corresponding to the hash. It then iterates through a linked list corresponding to the hash bucket, terminating if the key is found or the linked list ends without it being found.

Listing 3.3 shows the corresponding iterator implementation in PULSE. Much of the implementation is unchanged, with minor restructuring for init(), next(), and end() functions. The main changes are how the state (the search key) is exchanged across the three functions and how the data is returned to the user via the scratch\_pad (an error message if the key is not found, or its value if it is).

### 3.4.1 Accelerating Pointer Traversals on a Node

### PULSE Dispatch Engine

The dispatch engine is a software framework running at the CPU node for two purposes. First, it translates the iterator realization for pointer traversal provided by a data structure library developer (§3.4) into PULSE's ISA. Second, it determines if the accelerator can support the computations performed during the traversal, and if so, ships a request to the accelerator at the memory node. If not, the execution proceeds at the CPU node with regular remote memory accesses.

Translating iterator code to PULSE ISA. To be readily implementable, PULSE plugs into existing compiler toolchains. The dispatch engine generates PULSE ISA instructions using widely known compiler techniques [146]. PULSE's ISA is a stripped-down RISC ISA, containing only operations necessary for basic processing and memory accesses to enable a simple and energy-efficient accelerator design (Table 3.1). There are, however, a few notable aspects to our adapted ISA and the translation of iterator code to it. First, as noted in §3.4, PULSE does not support unbounded loops within a single iteration, i.e., the ISA only supports conditional jumps to points ahead in code. This is similar to eBPF programs [147], where only forward jumps are supported to prevent the program from running indefinitely within the kernel. A backward jump can only occur when the next iteration starts; PULSE employs a special NEXT\_ITER instruction to explicitly mark this point so that the accelerator can begin scheduling the memory pipeline (§3.4.1). Second, again as noted in §3.4, developers can maintain state and return values using a scratch\_pad of pre-configured size; our ISA supports register operations directly on the scratch\_pad and provides a special RETURN instruction that simply terminates the iterator execution and yields the contents of the scratch\_pad as the return value.
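The forward-jump restriction lends itself to a simple static check, sketched below. The instruction encoding (`Op`, `Instr`) and the function `is_bounded_iteration` are hypothetical and far simpler than a real ISA; the sketch also assumes, for illustration, that a well-formed iteration body must end in a terminal instruction.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified instruction encoding for one iteration body.
enum class Op { Load, Store, Alu, Move, Compare, Jump, Return, NextIter };

struct Instr {
  Op op;
  std::size_t target;  // index of the jump target; ignored for non-jumps
};

// Accepts a program only if every jump goes strictly forward to a valid
// index and the body ends in RETURN or NEXT_ITER. Under these conditions
// one iteration executes at most prog.size() instructions.
bool is_bounded_iteration(const std::vector<Instr>& prog) {
  if (prog.empty()) return false;
  for (std::size_t pc = 0; pc < prog.size(); ++pc) {
    if (prog[pc].op != Op::Jump) continue;
    if (prog[pc].target <= pc || prog[pc].target >= prog.size()) return false;
  }
  const Op last = prog.back().op;
  return last == Op::Return || last == Op::NextIter;
}
```

Because the program counter can only increase within an iteration, the instruction count per iteration is bounded by the program length, which is the property that makes per-iteration compute time predictable.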

Finally, we found that the iterator traversal pattern typically can be broken down into two types of computation: fetching data<sup>4</sup> pointed to by cur\_ptr from memory, and processing the fetched data to determine what the next pointer should be, or if the iterator execution should terminate. If the translation from the iterator code to PULSE's ISA is done naively, it can result in multiple unnecessary loads within the vicinity of the memory location pointed to by cur\_ptr. For instance, the unordered\_map::find() realization shown in Listing 3.3 makes references to cur\_ptr->key, cur\_ptr->value and cur\_ptr->next at various points, and if each incurs a separate load, it will slow down execution and waste memory bandwidth. Consequently, PULSE's dispatch engine infers the range of memory locations accessed relative to cur\_ptr in the next() and end() functions via static analysis and aggregates these accesses into a single large LOAD (of up to 256 B) at the beginning of each iteration.

<sup>4.</sup> While the rest of the section focuses only on describing data fetches from memory, we note that writing data to memory proceeds similarly.

| Class    | Instructions                        | Description                                                                    |
|----------|-------------------------------------|--------------------------------------------------------------------------------|
| Memory   | LOAD, STORE                         | Load/store data from/to address.                                               |
| ALU      | ADD, SUB, MUL, DIV, AND, OR, NOT    | Standard ALU operations.                                                       |
| Register | MOVE                                | Move data between registers.                                                   |
| Branch   | COMPARE and JUMP_{EQ, NEQ, LT, ...} | Compare values & jump ahead based on condition (e.g., equal, less than, etc.). |
| Terminal | RETURN, NEXT_ITER                   | End traversal & return, or start next iteration.                               |

Table 3.1: PULSE adapts a restricted subset of RISC-V ISA (§3.4.1).
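The load aggregation step can be approximated as follows. This is a sketch, assuming that static analysis has already produced a list of (offset, size) accesses relative to cur\_ptr; the names `Access`, `Load`, and `coalesce` are hypothetical, and returning no result when the covering range exceeds 256 B is an assumed fallback rather than documented PULSE behavior.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// A field access relative to cur_ptr, as static analysis might report it.
struct Access {
  std::int64_t offset;
  std::uint32_t size;
};

// The single aggregated load issued at the start of an iteration.
struct Load {
  std::int64_t offset;
  std::uint32_t size;
};

constexpr std::uint32_t kMaxLoadBytes = 256;  // upper bound stated in the text

// Computes the smallest contiguous range covering all accesses. Returns
// nullopt if there are no accesses or the range exceeds the load limit.
std::optional<Load> coalesce(const std::vector<Access>& accesses) {
  if (accesses.empty()) return std::nullopt;
  std::int64_t lo = accesses[0].offset;
  std::int64_t hi = lo + static_cast<std::int64_t>(accesses[0].size);
  for (const Access& a : accesses) {
    lo = std::min(lo, a.offset);
    hi = std::max(hi, a.offset + static_cast<std::int64_t>(a.size));
  }
  if (hi - lo > static_cast<std::int64_t>(kMaxLoadBytes)) return std::nullopt;
  return Load{lo, static_cast<std::uint32_t>(hi - lo)};
}
```

For a node with 8-byte key, value, and next fields at offsets 0, 8, and 16, the three separate accesses collapse into one 24-byte load at offset 0.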

Bounding complexity of offloaded code. While PULSE's interface and ISA already limit the types of computation that can be performed per iteration, PULSE also needs to limit the amount of computation per iteration to ensure the operations offloaded to PULSE accelerators remain memory-centric. To this end, PULSE's dispatch engine analyzes the generated ISA for the iterator to determine the time required to execute computational logic ($t_c$) and the time required to perform the single data load at the beginning of the iteration ($t_d$).

PULSE exploits the known execution time of its accelerators in terms of time per compute instruction,  $t_i$ , to determine  $t_c = t_i \cdot N$ , where N is the number of instructions per iteration. The CPU node offloads the iterator execution only if  $t_c \leq \eta \cdot t_d$ , where  $\eta$  is a predefined accelerator-specific threshold. Note that since we only want to offload memory-centric operations,  $\eta \leq 1$ . As we will show in §3.4.1, the choice of  $\eta$  allows PULSE to maximize the memory bandwidth utilization and ensure processing never becomes a bottleneck for pointer traversals.
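The offload test above reduces to a single comparison, sketched here. The function name `should_offload` and the nanosecond units are assumptions for illustration; the numeric values in the usage note are made up and do not come from the evaluation.

```cpp
#include <cassert>

// Decides whether an iterator is memory-centric enough to offload:
// t_c = t_i * N must not exceed eta * t_d, with 0 < eta <= 1.
bool should_offload(unsigned num_instructions,  // N, instructions per iteration
                    double t_i_ns,              // time per compute instruction
                    double t_d_ns,              // time for the per-iteration load
                    double eta) {               // accelerator-specific threshold
  assert(eta > 0.0 && eta <= 1.0);
  const double t_c_ns = t_i_ns * static_cast<double>(num_instructions);
  return t_c_ns <= eta * t_d_ns;
}
```

As a purely hypothetical example, with $t_i = 2$ ns, $t_d = 100$ ns, and $\eta = 0.5$, an iterator with $N = 20$ instructions gives $t_c = 40$ ns and is offloaded, while one with $N = 30$ gives $t_c = 60$ ns and runs at the CPU node instead.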

Issuing network requests to accelerator. Once the dispatch engine decides to offload an iterator execution, it encapsulates the ISA instructions (code) along with the initial value of cur\_ptr and scratch\_pad (initialized by init()) into a network request. It issues the request, leaving the network to determine which memory node it should be forwarded to (§3.5). To recover from packet drops, the dispatch engine embeds a request identifier (ID) with the CPU node ID and a local request counter in the request packets, maintains a timer per request, and retransmits requests on timeout.
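The loss-recovery bookkeeping described above might look like the following sketch. The 16/48-bit split of the request ID, the `RetransmitTable` class, and the caller-supplied clock values are all assumptions for illustration; the text does not specify PULSE's actual packet layout or timer implementation.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

using Clock = std::chrono::steady_clock;

// Request ID = CPU node ID combined with a local request counter.
// The 16-bit / 48-bit split is an assumed encoding.
inline std::uint64_t make_request_id(std::uint16_t cpu_node_id, std::uint64_t counter) {
    return (static_cast<std::uint64_t>(cpu_node_id) << 48) | (counter & 0xFFFFFFFFFFFFULL);
}

struct PendingRequest {
    std::vector<std::uint8_t> packet;  // encoded code + cur_ptr + scratch_pad
    Clock::time_point deadline;        // when the per-request timer fires
};

class RetransmitTable {
public:
    RetransmitTable(std::uint16_t cpu_node_id, Clock::duration timeout)
        : cpu_node_id_(cpu_node_id), timeout_(timeout) {
        assert(timeout > Clock::duration::zero());
    }

    // Registers an outgoing request and starts its timer; returns its ID.
    std::uint64_t track(std::vector<std::uint8_t> packet, Clock::time_point now) {
        const std::uint64_t id = make_request_id(cpu_node_id_, next_counter_++);
        pending_[id] = PendingRequest{std::move(packet), now + timeout_};
        return id;
    }

    // Called when a response carrying this ID arrives.
    void complete(std::uint64_t id) { pending_.erase(id); }

    // Returns the IDs whose timers have fired and re-arms them for the resend.
    std::vector<std::uint64_t> expired(Clock::time_point now) {
        std::vector<std::uint64_t> ids;
        for (auto &entry : pending_) {
            if (entry.second.deadline <= now) {
                ids.push_back(entry.first);
                entry.second.deadline = now + timeout_;
            }
        }
        return ids;
    }

    std::size_t pending() const { return pending_.size(); }

private:
    std::uint16_t cpu_node_id_;
    Clock::duration timeout_;
    std::uint64_t next_counter_ = 0;
    std::unordered_map<std::uint64_t, PendingRequest> pending_;
};
```

Passing the current time in explicitly keeps the sketch deterministic; a real dispatch engine would poll `expired()` from its packet-processing loop and resend the stored packets.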

Practical deployability. Our software stack is readily deployable due to its use of real-world toolchains. Our user library adapts implementations of common data structures used in key-value stores [24, 125], databases [76–78, 80, 126], and big-data analytics [82–85] to PULSE's iterator interface (§3.4). PULSE's dispatch engine is implemented on an Intel DPDK-based [148] low-latency, high-throughput UDP stack. The PULSE compiler adapts the Sparc backend of LLVM [149] since its ISA is close to PULSE's ISA. Our LLVM frontend applies a set of analysis and optimization passes [150] to enforce PULSE constraints and semantics: the analysis pass identifies code snippets that require offloading, while the optimization pass translates pointer traversal code to PULSE ISA.

#### PULSE Accelerator Design

The accelerator is at the heart of PULSE design and is key to ensuring high performance for iterator executions with high resource and energy efficiency. Our motivation for a new accelerator design stems from two unique properties of iterator executions on linked structures:

- **Property 1:** Each iteration involves two clearly separated but sequentially dependent steps: (i) fetching data from memory via a pointer (e.g., a list or tree node), followed by (ii) executing logic on the fetched data to identify the next pointer. The logic cannot be executed concurrently with or before the data fetch, and the next data fetch cannot be performed until the logic execution yields the next pointer.
- **Property 2:** Iterators that benefit from offload spend more time in data fetch  $(t_d)$  than logic execution  $(t_c)$ , i.e.,  $t_c < \eta \cdot t_d$ , where  $\eta \le 1$ , as noted in §3.4.1.

Any accelerator for iterator executions must have a memory pipeline and a logic pipeline to support the execution steps (i) and (ii) above. The strict dependency between the steps (Property 1) renders many optimizations of traditional multi-core processors, such as out-of-order execution, ineffective. Moreover, since each core in such architectures has tightly coupled logic and memory pipelines, the memory-intensive nature of iterators (Property 2) results in the logic pipeline remaining idle most of the time. These two factors combined result in poor utilization and energy efficiency for such architectures. Fig. 3.4 (top) captures this through the execution of 3 iterators (A, B, C), each with 2 iterations (e.g., A1, A2, etc.), on a multi-core architecture. Since each iteration comprises a data fetch followed by a dependent logic execution, one of the pipelines remains idle while the other is busy. While thread-level parallelism permits iterator requests to be spread across multiple cores for increased overall throughput, per-core under-utilization of logic and memory pipelines persists, resulting in suboptimal resource and energy usage.

Disaggregated accelerator design. Motivated by the unique properties of iterators, we propose a novel accelerator architecture that disaggregates memory and logic pipelines, using a scheduler to multiplex corresponding components of iterators across them. First, such a decoupling permits an asymmetric number of logic and memory pipelines to maximize the utilization of either pipeline, in stark contrast to the tight coupling in multi-core architectures. In our design, if there are m logic and n memory pipelines, then the accelerator-specific threshold  $\eta < 1$  we alluded to in §3.4.1 is  $\frac{m}{n}$ , i.e., there are fewer logic pipelines than memory pipelines in keeping with Property 2. Fig. 3.4 (bottom) shows an example of our disaggregated accelerator design with one logic pipeline and two memory pipelines (i.e., m = 1, n = 2).

Even though data fetch and logic execution within each iterator must be sequential, the disaggregated design permits efficient multiplexing of data fetch and logic execution from different iterators across the disaggregated logic and memory pipelines to maximize utilization. To see how, recall that the logic execution time  $t_c$  for each offloaded iterator execution in PULSE is  $\leq \eta \cdot t_d$, where  $t_d$  is its data fetch time (§3.4.1). Consider the extreme case where  $t_c = \eta \cdot t_d$  for all offloaded iterator executions: in this case, it is always possible to multiplex m + n concurrent iterator executions to fully utilize all m logic and n memory pipelines. While we omit a theoretical proof for brevity, Fig. 3.4 (bottom) illustrates the multiplexed execution, orchestrated by a scheduler in our accelerator, for  $t_c = \frac{1}{2} \cdot t_d$  with 3 iterators. This is the ideal case; similar multiplexing is still possible if  $t_c \leq \eta \cdot t_d$  with complete utilization of memory pipelines, albeit with lower utilization of logic pipelines (since they will be idle for a  $\frac{\eta \cdot t_d - t_c}{\eta \cdot t_d}$  fraction of time). As such, we provision  $\eta = \frac{m}{n}$  to be as close as possible to the expected  $\frac{t_c}{t_d}$  for the workload to maximize the utilization of logic pipelines. It is possible to improve the logic pipelines' energy efficiency by dynamically down-scaling frequency [151]; we leave such optimizations to future work.
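As a worked check of this argument, consider a steady state in which all $n$ memory pipelines are busy and every iteration has the same $t_c$ and $t_d$. This is a simplifying assumption, and scheduling overheads are ignored. Iterations then complete at rate $n/t_d$, so the fraction of time each of the $m$ logic pipelines is busy is

$$U_{\text{logic}} = \frac{(n / t_d) \cdot t_c}{m} = \frac{t_c}{\eta \cdot t_d}, \qquad \eta = \frac{m}{n}.$$

For the example in Fig. 3.4 (bottom), $m = 1$, $n = 2$, and $t_c = \frac{1}{2} \cdot t_d$ give $U_{\text{logic}} = 1$. With $\eta = 0.75$ and the $t_c/t_d$ ratios listed in Table 3.2, the same expression gives roughly $0.06/0.75 = 0.08$ for WebService, $0.63/0.75 = 0.84$ for WiredTiger, and $0.71/0.75 \approx 0.95$ for BTrDB. These are back-of-the-envelope estimates under the stated assumptions, not measured utilization figures.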

While the memory pipeline is stateless, the logic pipeline must maintain the state for the iterator it executes. To multiplex several iterator executions, logic pipelines need efficient mechanisms for context switching. To this end, we maintain a dedicated workspace corresponding to each iterator's execution. Each workspace stores three distinct pieces of state:  $cur\_ptr$  and  $scratch\_pad$  to track the iterator state described in §3.4, and data, which holds the data loaded from memory for  $cur\_ptr$ . A dedicated workspace per iterator allows the logic pipeline to switch to any iterator's execution without delay when triggered by the scheduler, although it requires maintaining multiple workspaces (a maximum of m+n to accommodate any possible schedule, due to our bound on the number of concurrent iterators). We divide these workspaces equally across logic pipelines.
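The workspace state can be pictured with the following software sketch of a fixed pool of m + n workspaces. The scratch-pad size, the `in_use` flag, and the `WorkspacePool` interface are assumptions for illustration; the accelerator keeps this state in hardware registers, not in a C++ object.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kScratchPadBytes = 64;  // assumed size, not given in the text
constexpr std::size_t kMaxLoadBytes = 256;    // largest aggregated LOAD

// Per-iterator execution state held on behalf of a logic pipeline.
struct Workspace {
    std::uint64_t cur_ptr = 0;                               // current pointer
    std::array<std::uint8_t, kScratchPadBytes> scratch_pad{}; // iterator-defined state
    std::array<std::uint8_t, kMaxLoadBytes> data{};           // bytes loaded for cur_ptr
    bool in_use = false;
};

// Fixed pool of M + N workspaces (M logic pipelines, N memory pipelines).
template <std::size_t M, std::size_t N>
class WorkspacePool {
public:
    static constexpr std::size_t kCapacity = M + N;

    // Claims a free workspace for a new iterator; returns -1 if none is free.
    int acquire(std::uint64_t initial_ptr) {
        for (std::size_t i = 0; i < kCapacity; ++i) {
            if (!slots_[i].in_use) {
                slots_[i] = Workspace{};
                slots_[i].cur_ptr = initial_ptr;
                slots_[i].in_use = true;
                return static_cast<int>(i);
            }
        }
        return -1;
    }

    // Frees the workspace once the iterator's response has been sent.
    void release(int idx) { slot(idx).in_use = false; }

    Workspace &at(int idx) { return slot(idx); }

private:
    Workspace &slot(int idx) {
        assert(idx >= 0 && static_cast<std::size_t>(idx) < kCapacity);
        return slots_[static_cast<std::size_t>(idx)];
    }

    std::array<Workspace, kCapacity> slots_{};
};
```

Because each iterator owns its workspace for its whole lifetime, switching the logic pipeline between iterators only changes which index it reads and writes; no state has to be saved or restored.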

PULSE Accelerator Components. The PULSE accelerator comprises n memory and m logic pipelines for executing iterator requests, a scheduler that multiplexes requests across the logic and memory pipelines, and a network stack for parsing pointer-traversal requests from the network (Fig. 3.5).

Memory pipeline: Each memory pipeline loads data from the attached DRAM to the corresponding workspace assigned by the scheduler at the start of each iteration. This involves (i) address translation and (ii) memory protection based on page access permissions. We realize range-based address translations (simulated in prior work [152]) in our real-world implementation using TCAM to reduce on-chip storage usage. Once a memory access is complete, the memory pipeline signals the scheduler to continue the iterator execution or terminate it if there is a translation or protection failure.

Logic pipeline: Each logic pipeline runs PULSE ISA instructions other than LOAD/STORE to determine the cur\_ptr value for the next iteration, or to determine whether the termination condition has been met. Our logic pipeline comprises an ALU to execute the standard arithmetic and logic instructions, as well as modules to support register manipulation, branching, and the specialized RETURN instruction execution (Table 3.1). During a particular iterator's execution, the logic pipeline performs its corresponding instructions with direct reads and updates to its dedicated workspace registers. An iteration's logic can end in one of two possible ways: (i) the cur\_ptr has been updated to the next pointer, and the NEXT\_ITER instruction is reached, or (ii) the pointer traversal is complete, and the RETURN instruction is reached. In either case, the logic pipeline notifies the scheduler with the appropriate signal.

Scheduler: The scheduler handles new iterator requests received over the network and schedules each iterator's data fetch and logic execution across memory and logic pipelines:

1. On receiving a new request over the network, it assigns the iterator an empty workspace at a logic pipeline and signals one of the memory pipelines to execute the data fetch from memory based on the state in the workspace.
2. On receiving a signal from the memory pipeline that a data fetch has successfully completed, it notifies the appropriate logic pipeline to continue iterator execution via the corresponding workspace.
3. On receiving a signal from the logic pipeline that the next iteration can be started (via the NEXT\_ITER instruction), it notifies one of the memory pipelines to execute LOAD via the corresponding workspace.
4. When it receives a signal from the memory pipeline that an address translation or memory protection failed or a signal from the logic pipeline that the iterator execution has met its terminal condition (via the RETURN instruction), it signals the network stack to prepare a response containing the iterator code, cur\_ptr and scratch\_pad.

While the scheduler assigns memory and logic pipelines to an iterator in steps 1 and 3 in a manner that maximizes utilization of all memory pipelines (i.e., Fig. 3.4 (bottom)), it is possible to implement other scheduling policies.
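The four scheduler steps amount to a small event-to-action mapping, sketched below. The `Signal` and `Action` enumerations and the `trace_traversal` helper are hypothetical names; the sketch leaves out which specific pipeline or workspace is chosen, which is where the scheduling policy mentioned above would plug in.

```cpp
#include <cassert>
#include <vector>

// Signals the scheduler reacts to (steps 1-4 in the text).
enum class Signal {
    NewRequest,        // step 1: request arrived from the network stack
    FetchDone,         // step 2: memory pipeline completed the data fetch
    NextIter,          // step 3: logic pipeline reached NEXT_ITER
    Return,            // step 4: logic pipeline reached RETURN
    TranslationFault,  // step 4: address translation failed
    ProtectionFault    // step 4: memory protection check failed
};

// Where the scheduler hands the iterator next.
enum class Action {
    DispatchToMemoryPipeline,
    DispatchToLogicPipeline,
    SendResponse
};

Action schedule(Signal s) {
    switch (s) {
        case Signal::NewRequest:
        case Signal::NextIter:
            return Action::DispatchToMemoryPipeline;
        case Signal::FetchDone:
            return Action::DispatchToLogicPipeline;
        case Signal::Return:
        case Signal::TranslationFault:
        case Signal::ProtectionFault:
            return Action::SendResponse;
    }
    return Action::SendResponse;  // not reached for valid signals
}

// Actions taken for a fault-free traversal with the given number of iterations.
std::vector<Action> trace_traversal(int iterations) {
    assert(iterations >= 1);
    std::vector<Action> actions;
    actions.push_back(schedule(Signal::NewRequest));
    for (int i = 0; i < iterations; ++i) {
        actions.push_back(schedule(Signal::FetchDone));
        const bool last = (i + 1 == iterations);
        actions.push_back(schedule(last ? Signal::Return : Signal::NextIter));
    }
    return actions;
}
```

A traversal of k iterations therefore alternates k memory-pipeline dispatches with k logic-pipeline dispatches and ends with a single response.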

Network Stack: The network stack receives and transmits packets; when a new request arrives, it parses/deparses the payload to extract/embed the request ID, code, and state for the offloaded iterator execution (cur\_ptr, scratch\_pad).

The network stack uses the same format for both requests and responses, so a response can be sent back to the CPU node on traversal completion or rerouted as a request to a different memory node for continued execution (§3.5).

Implementation. We use an FPGA-based NIC (Xilinx Alveo U250) with two 100 Gbps ports, 64 GB on-board DRAM, 1,728K LUTs, and 70 MB BRAM. Since the board has two Ethernet ports and four memory channels, we partition its resources into two PULSE accelerators, each with a single Ethernet port and two memory channels. Our analysis of common data structures (§3.6) shows their  $t_c/t_d$  ratio tends to be < 0.75. As such, we set  $\eta = 0.75$ , i.e., there are four memory and three logic pipelines and a total of 7 workspaces on the accelerator. We use the Xilinx TCAM IP [153] (for page tables), 100 Gbps Ethernet IP, link-layer IPs [154], and burst data transfers [155] to improve memory bandwidth. The logic and memory pipelines are clocked at 250 MHz, while the network stack operates at 322 MHz for 100 Gbps traffic. Our FPGA prototype showcases PULSE's potential; we believe that ASIC implementations are the next natural step.

# 3.5 Distributed Pointer Traversals

By restricting pointer traversals to a single memory node (§3.3.2), prior approaches leave applications with two undesirable options. At one extreme, they can confine their data to a single memory node, but sacrifice application scalability. Conversely, they can spread their data across multiple nodes but have to return to the CPU node whenever the traversal accesses a pointer on another memory node. This affords scalability but costs additional network and software processing latency at the CPU node. To avoid the cost, one may replicate the entire translation and protection state for the cluster at every memory node so they can directly forward traversal requests to other memory nodes. This comes at the cost of increased space consumption for translation, which is challenging to contain within the accelerator's translation and protection tables. Moreover, duplicating this state across memory nodes requires complex protocols for ensuring their consistency (e.g., when the state changes), which have significant performance overheads.

PULSE breaks this tradeoff between performance and scalability by leveraging a programmable network switch to support rack-scale distributed pointer traversals. In particular, if the PULSE accelerator on one memory node detects that the next pointer lies on a different memory node, it forwards the request to the network switch, which routes it to the appropriate memory node for continuing the traversal. This cuts the network latency by half a round trip time and avoids software overheads at the CPU node, instead performing the routing logic in switch hardware. Since continuing the traversal across memory nodes is similar to packet routing, the switch hardware is already optimized to support it.

Enabling rack-scale pointer traversals, however, requires addressing two key challenges, as we discuss next.

Hierarchical translation. For the switch to forward the pointer traversal request to the appropriate memory node, it must be able to locate which memory nodes are responsible for which addresses. To minimize the logic and state maintained at the switch due to its limited resources, PULSE employs hierarchical address translation as shown in Fig. 3.6. In particular, the address space is range partitioned across memory nodes; PULSE only stores the base address to memory node mapping at the switch, while each memory node stores its own local address translation and protection metadata at the accelerator (①), as outlined in §3.4.1. The routing logic at the switch inspects the cur\_ptr field in the request (②) and consults its mapping to determine the target memory node (③). At the memory node, the traversal proceeds until the accessed pointer is not present in the local table (as in ①); it then sends the request back to the switch (§3.4.1), which can re-route the request to the appropriate memory node (④-⑥), or notify the CPU node if the pointer is invalid.
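The switch-side half of this hierarchy can be sketched as a range lookup keyed by base address. The `SwitchRouteTable` class is hypothetical, and storing an explicit upper limit per range is an assumption added here so that unmapped pointers can be detected; the text only states that the switch holds a base-address-to-node mapping. The real lookup runs in switch hardware, not in software.

```cpp
#include <cassert>
#include <cstdint>
#include <map>

// Coarse, switch-level mapping from address ranges to memory nodes.
// Fine-grained translation and protection stay at each memory node.
class SwitchRouteTable {
public:
    // Registers the range [base, limit) as owned by node_id.
    void add_range(std::uint64_t base, std::uint64_t limit, int node_id) {
        assert(base < limit);
        ranges_[base] = Range{limit, node_id};
    }

    // Returns the memory node that owns cur_ptr, or -1 if no range covers it
    // (in which case the CPU node would be notified of an invalid pointer).
    int route(std::uint64_t cur_ptr) const {
        auto it = ranges_.upper_bound(cur_ptr);  // first base strictly above cur_ptr
        if (it == ranges_.begin()) return -1;
        --it;                                    // largest base <= cur_ptr
        if (cur_ptr >= it->second.limit) return -1;
        return it->second.node_id;
    }

private:
    struct Range {
        std::uint64_t limit;
        int node_id;
    };
    std::map<std::uint64_t, Range> ranges_;
};
```

With two ranges `[0x0000, 0x1000)` on node 0 and `[0x1000, 0x2000)` on node 1, a request whose cur\_ptr is `0x1000` is routed to node 1, while `0x2000` falls outside every range and is reported as invalid.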

Continuing stateful iterator execution. One challenge of distributing iterator execution in PULSE lies in its stateful nature: since PULSE permits the storage of intermediate state in the iterator's scratch\_pad, how can such stateful iterator execution be continued on a different memory node? Fortunately, our design choices of confining all of the iterator state in scratch\_pad and cur\_ptr and keeping the request and response formats identical make this straightforward. The accelerator at the memory node simply embeds the up-to-date scratch\_pad within the response before forwarding it to the switch; when the switch forwards it to the next memory node, it can simply continue execution exactly as it would have if the last memory node had the pointer.

| Application      | Data Structure | $t_c/t_d$ | # Iterations |
|------------------|----------------|-----------|--------------|
| WebService       | Hash-table     | 0.06      | 48           |
| WiredTiger       | B+Tree         | 0.63      | 25           |
| BTrDB (1s to 8s) | B+Tree         | 0.71      | 38-227       |

Table 3.2: Workloads used in our evaluation (§3.6).  $t_c$  and  $t_d$  correspond to compute and memory access time at the PULSE accelerator.

### 3.6 Evaluation

Compared systems. We compare PULSE against: (1) a Cache-based system that relies solely on caches at CPU nodes to speed up remote memory accesses; we use Fastswap [31] as the representative system, (2) an RPC system that offloads pointer-traversals to a CPU on memory nodes, (3) RPC-ARM, an RPC system that employs wimpy ARM processors at memory nodes, and (4) a Cache+RPC approach that employs data structure-aware caches; we use AIFM [28] as the representative system. (1, 4) use a cache size of 2 GB, while (2, 3) use a DPDK-based RPC framework [156].

Our experimental setup comprises two servers, one for the CPU node and the other for memory nodes, connected via a 32-port switch with a 6.4 Tbps programmable Tofino ASIC. Both servers were equipped with Intel Xeon Gold 6240 Processors [157] and 100 Gbps Mellanox ConnectX-5 NICs. For a fair comparison, we limit the memory bandwidth of the memory nodes to 25 GB/s (FPGA's peak bandwidth) using Intel Resource Director [158] and report energy consumption of the minimum number of CPU cores needed to saturate the bandwidth. We use a Bluefield-2 [159] DPU as our ARM-based SmartNIC, with 8 Cortex-A72 cores and 16 GB DRAM. For PULSE, we placed two memory nodes on each FPGA NIC (one per port, a total of 4 memory nodes). Our results translate to larger setups since PULSE's performance and energy efficiency are independent of dataset size and cluster scale.

Applications & workloads. We consider 3 applications with varying data structure complexity, compute/memory-access ratio, and iteration count per request (Table 3.2): (1) Web Service [28] that processes user requests by retrieving user IDs from an in-memory hash table, using these IDs to fetch 8KB objects, which are then encrypted, compressed and returned to the user. Requests are generated using YCSB A (50% read/50% update), B (95% read/5% update), and C (100% read) workloads with Zipf distribution [160]. (2) WiredTiger Storage Engine (MongoDB backend [161]) uses B+Trees to index NoSQL tables. Our frontend issues range query requests over the network to WiredTiger and plots the results. Similar to prior work [28,162], we model user queries using the YCSB E workload with Zipf distribution [160] on 8B keys and 240B values. (3) BTrDB Time-series Database [133] is a database designed for visualizing patterns in time-series data. BTrDB reads the data from a B+Tree-based store for a given user query and renders the time-series data through an interactive user interface [163]. We run stateful aggregations (sum, average, min, max) for time windows of different resolutions, from 1s to 8s, on the Open  $\mu$ PMU Dataset [164] with voltage, current, and phase readings from LBNL's power grid [133].

# 3.6.1 Performance for Real-world Applications

Since AIFM [28] does not natively support B+-Trees or distributed execution, we restrict the Cache+RPC approach to the Web Service application on a single node.

Single-node performance. Fig. 3.7 demonstrates the advantages of accelerating pointer-traversals at disaggregated memory. Compared to the Cache-based approach, PULSE achieves 9–34.4× lower latency and 28–171× higher throughput across all applications using only one network round-trip per request. RPC-based systems observe 1–1.4× lower latency than PULSE due to their 9× higher CPU clock rates. We believe an ASIC-based realization of PULSE has the potential to close or even overcome this gap. Cache+RPC incurs higher latency than RPC due to its TCP-based DPDK stack [28, 165] and does not outperform RPC, indicating that data structure-aware caching is not beneficial due to poor locality.

Latency depends on the number of nodes traversed during a single request and the response size. WebService experiences the highest latency due to large 8KB responses and long traversal length per request. In BTrDB, the latency increases (and the throughput decreases) as the window size grows due to the longer pointer traversals (see Table 3.2). Interestingly, the Cache-based approach performs significantly better for BTrDB than WebService and WiredTiger due to the better data locality in time-series analysis of chronologically ordered data. However, its throughput remains significantly lower than both PULSE and RPC since it is bottlenecked by the swap system performance, which could not evict pages fast enough to bring in new data. This is verified in our analysis of resource utilization (deferred to Appendix for brevity); we find that RPC, RPC-ARM, Cache+RPC, and PULSE can utilize more than 90% of the memory bandwidth across the applications, while the Cache-based approach observes less than 1 Gbps network bandwidth. The other systems (PULSE, RPC, RPC-ARM, and Cache+RPC) can also saturate available memory bandwidth (around 25 GB/s) by offloading pointer traversals to the memory node, consuming only 0.5%–25% of the available network bandwidth.

**Distributed pointer traversals.** Fig. 3.7 shows that employing multiple memory nodes introduces two major changes in performance trends: (1) the latency increases when the pointer traversal spans multiple memory nodes, and (2) throughput increases with the number of nodes since the systems can exploit more CPUs or accelerators. WebService is an exception to the trend: since the hash table is partitioned across memory nodes based on primary keys, the linked list for a hash bucket resides in a single memory node.

PULSE observes lower latency than the compared systems due to in-network support for distributed pointer-traversals (§3.5). The latency increases significantly from one to two memory nodes for all systems since traversing to the next pointer on a different memory node adds 5–10 μs network latency. Also, even across two memory nodes, a request can trigger multiple inter-node pointer traversals incurring multiple network round-trips; for WiredTiger and BTrDB, 10%–30% of pointer traversals are inter-node. However, in-network traversals allow PULSE to reduce latency overheads by 33–98%, with 1.1–1.36× higher throughput than RPC.

**Energy consumption.** We compared energy consumed per request for PULSE and RPC schemes at a request rate that ensured memory bandwidth was saturated for both. We measure energy consumption using Xilinx XRT [127] for PULSE (all power rails) and Intel RAPL tools [128] for RPC on CPUs [157] (CPU package and DRAM only). For RPC-ARM on ARM cores, since there is no power-related performance counter [166] or open-source tool available, we adapt the measurement approach from prior work [120]. Specifically, we calculate the CPU package's energy using application CPU cycle counts and DRAM power using Micron's estimation tool [167]. Finally, we conservatively estimate ASIC power using our FPGA prototype: we scale down the ASIC energy only for the PULSE accelerator using the methodology employed in prior research [168] while using the unscaled FPGA energy for other components (DRAM, third-party IPs, etc.). As such, we measure an *upper bound* on PULSE and PULSE-ASIC energy use, and a *lower bound* for RPC, RPC-ARM, and Cache+RPC.

Fig. 3.8 shows that PULSE achieves a 4.5–5× reduction in energy use per operation compared to RPCs on a general-purpose CPU, due to its disaggregated architecture (§3.4.1). Our estimation shows that PULSE's ASIC realization can conservatively reduce energy use by an additional 6.3–7× factor. Finally, RPC-ARM's total energy consumption per request can exceed that of standard cores, as seen in the WebService workload. This observation aligns with prior studies [120], which attribute the increased energy use to their longer execution times, resulting in higher aggregate energy demands.

### 3.6.2 Understanding PULSE Performance

**Distributed pointer traversals.** We evaluate the impact of distributed pointer traversals (§3.5) by comparing PULSE against PULSE-ACC, a PULSE variant that sends requests back to the CPU node if the next pointer is not found on the memory node. Fig. 3.9 shows that while both have identical performance on a single memory node, PULSE-ACC observes 1.02–1.15× higher latency for two nodes. On the other hand, their throughput is the same since, under sufficient load, memory node bandwidth bottlenecks the system for both.

Latency breakdown for PULSE accelerator. Fig. 3.10 shows the latency contributions of various hardware components at the PULSE accelerator for the WebService application. The network stack first processes the pointer traversal request in about 430 ns, after which the WebService payload is processed by the scheduler and dispatched to an idle memory access pipeline in 5.1 ns. Then, the memory pipeline takes  $\sim 132$  ns to perform address translation, memory protection, and data fetch from DRAM. Finally, the logic pipeline takes 10 ns to check the termination conditions and determine the next pointer to look up. This process repeats until the termination condition is met. The time to send a response back over the network stack is symmetric to the request path.
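As a rough consistency check, these component latencies can be combined into a per-request estimate. The calculation below assumes that the scheduler dispatch, memory pipeline, and logic pipeline steps all recur on every iteration, and it ignores queueing, network propagation, and the transfer of the 8KB response object; it is an illustration, not a reported measurement.

$$t_{\text{iter}} \approx 5.1 + 132 + 10 = 147.1\ \text{ns}$$

$$t_{\text{accel}} \approx 2 \times 430 + k \cdot t_{\text{iter}}\ \text{ns}$$

For the $k = 48$ iterations listed for WebService in Table 3.2, this gives $860 + 48 \times 147.1 \approx 7921$ ns, or about 7.9 μs spent inside the accelerator per request. The memory pipeline accounts for roughly 90% of each iteration ($132 / 147.1$), consistent with the memory-centric nature of the offloaded traversals.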

# 3.7 Future Trends and Research

While PULSE is implemented atop Ethernet, its design is interconnect-agnostic and could be realized in ASIC-based or FPGA-attached memory devices over emerging interconnects like CXL [10,124,169]. We have verified these benefits in simulation atop detailed memory access and processing traces of our evaluated applications and workloads. The simulator maintains 2GB of cache in local (CPU-attached) DRAM, while the entire working set is stored on remote CXL memory. Following prior work [37], we model 10–20ns L3 cache latency, 80ns local DRAM latency, 300ns CXL-attached memory latency, and 256B access granularity. We simulate both a four-memory-node setup, which uses a CXL switch with PULSE logic and a PULSE accelerator at each memory node, and a single-node setup with no switch. We assume a conservative overhead for PULSE, using our hardware programmable Ethernet switch and FPGA accelerator latencies.

Fig. 3.11 shows the average slowdown for executing our evaluated workloads on CXL memory relative to running it completely locally (i.e., the entire application working set fits in local DRAM) — with and without PULSE. In the four-node setup, PULSE reduces CXL's slowdown by 19–33% across all applications.

In the single-node setup, PULSE still reduces the slowdown by 19–23% by minimizing high-latency traversals over the CXL interconnect. While a real hardware realization is necessary to precisely quantify PULSE's benefits, our simulation (which models the lowest possible CXL latency and highest possible PULSE overheads) highlights its potential for improving performance in emerging interconnects.

```
struct node {
  key_type key;
  value_type value;
  struct node *next;
};

value_type find(key_type key) {
  // Walk the bucket's linked list until the key is found or the list ends.
  for (struct node *cur_ptr = bucket_ptr(hash(key)); ; cur_ptr = cur_ptr->next) {
    if (key == cur_ptr->key) // Key found
      return cur_ptr->value;
    if (cur_ptr->next == nullptr) // Key not found
      break;
  }
  return KEY_NOT_FOUND;
}
```

Listing 3.2: C++ STL realization for unordered\_map::find().

```
class unordered_map_find : pulse_iterator {
  // Stash the search key in scratch_pad and point cur_ptr at the bucket head.
  void init(void *key) {
    memcpy(scratch_pad, key, sizeof(key_type));
    cur_ptr = bucket_ptr(hash(*(key_type *)key));
  }

  // Advance to the next node in the bucket.
  void* next() { return cur_ptr->next; }

  // Termination check; the result is returned through scratch_pad.
  bool end() {
    key_type key = *((key_type *)scratch_pad);
    if (key == cur_ptr->key) { // Key found
      *((value_type *)scratch_pad) = cur_ptr->value;
      return true;
    }
    if (cur_ptr->next == nullptr) { // Key not found
      *((unsigned int *)scratch_pad) = KEY_NOT_FOUND;
      return true;
    }
    return false;
  }
};
```

Listing 3.3: PULSE realization for unordered\_map::find().



Fig. 3.4: **PULSE accelerator architecture.** (top) Traditional multi-core architectures with tightly coupled logic and memory pipelines result in low utilization and longer execution times. (bottom) PULSE accelerator's *disaggregated* design with an unequal number of logic and memory pipelines efficiently multiplexes concurrent iterator executions across them for near-optimal utilization and performance.



Fig. 3.5: PULSE accelerator overview. See §3.4.1 for details.





Fig. 3.7: Application latency (top) & throughput (bottom) (§3.6.1). The darker color indicates the time spent on cross-node pointer traversals, which increases with the number of memory nodes in WiredTiger and BTrDB.



Fig. 3.8: Application energy consumption per operation (§3.6.1).



Fig. 3.9: Impact of distributed pointer traversals (§3.6.2).



Fig. 3.10: Latency breakdown for PULSE accelerator (§3.6.2).



Fig. 3.11: Slowdown with simulated CXL interconnect (§3.7).

# Chapter 4

# Hardware Layer

While network-based resource disaggregation has gained attention due to advancements in network bandwidth (§??), the inherent latency, limited by the speed of light, still imposes significant overheads. This section explores the potential of next-generation interconnects and their impact on resource disaggregation.

# 4.1 Next-generation Interconnects

Recent advancements in hardware have led to the development of new-generation interconnects by major hardware vendors, such as NVLink [170] from Nvidia and Compute Express Link (CXL) [10] from Intel. CXL, in particular, has been introduced as a promising solution to expand memory capacity and bandwidth by attaching external memory devices to PCIe slots, offering a dynamic and heterogeneous computing environment.

Compute Express Link (CXL). As depicted in Figure ??, CXL encompasses three key protocols: CXL.mem, CXL.cache, and CXL.io. CXL.io serves as the PCIe physical layer. CXL.mem enables processors to access memory over PCIe, while CXL.cache facilitates coherent memory access between processors and accelerators. These protocols allow for the construction of various CXL device types. The initial CXL 1.1 version serves as a memory expander for a single server. Subsequent versions, like CXL 2.0, extend this capability to multiple servers, incorporating CXL switches that coordinate access from different servers and enable various compute nodes to share a large memory pool. The forthcoming CXL 3.0 aims to scale up further, with cache coherency managed by hardware.

Despite extensive research on CXL [169,171,172], practical, commercial CXL hardware implementations remain in development, posing challenges in fully understanding performance and system support design for such hardware. Most studies have relied on simulations or FPGA-based CXL hardware [172,173], lacking empirical evaluations on ASIC-based CXL hardware. Moreover, existing research often focuses on single aspects of CXL, like capacity or bandwidth, using synthetic benchmarks and neglecting a comprehensive evaluation that includes cost considerations. To gauge the performance of real CXL hardware and assess its suitability for resource disaggregation, we evaluated the latest hardware available: Intel's 4<sup>th</sup> generation scalable processor (Sapphire Rapids) and AsteraLabs' CXL 1.1 memory expander (Type-3 device). Using Intel Memory Latency Checker (MLC) [174], we measured the latency of reading data from the CXL device and local memory equipped with the same number of DDR5 channels for local and cross-socket access. Figure ?? reveals that the latest CXL hardware exhibits a latency more than 2.5× higher than local memory. However, this gap narrows for cross-socket access, suggesting CXL as another memory tier. This raises questions about whether and how this information should be exposed to applications. Previous research [175] has investigated promoting hot pages from slower-tiered memory at the kernel level to enhance performance while maintaining application transparency.

This study represents the first available evaluation of real CXL 1.1 ASICs. The performance of CXL 2.0 and 3.0 remains to be explored in future work.

# 4.2 Introduction

In an age marked by the surge of memory-intensive applications, such as machine learning tasks and High-Performance Computing (HPC) applications, there is an urgent need for expanding memory capacity and bandwidth [176–178]. For instance, a machine learning application with a 175B-parameter model requires 700 GB of memory just to hold its parameters, not to mention memory requirements for intermediate results and others. That is, the memory requirements of modern applications could easily exceed the memory capability of a single machine due to physical constraints, such as availability of DDR DIMM slots and thermal issues, as well as cost considerations of employing high-density DIMMs [177, 178].

Fig. 4.1: **CXL Overview.** In this study, we focus on commercial CXL 1.1 Type-3 devices, leveraging CXL.io and CXL.mem protocols for memory expansion in single-server environments.

To meet such urgent demands, Compute Express Link (CXL) [10, 169, 171, 178] was introduced as a groundbreaking interconnect technology. CXL promises significant expansion of memory capacity and bandwidth by attaching external memory devices (e.g., DRAM, Flash, or persistent memory) to PCIe slots. Unlike its predecessors, CXL enables a more dynamic and heterogeneous computing environment, leading to various design trade-offs for performance and cost gains. Commercially debuting with version 1.1, CXL allows direct attachment of external memory devices to the host machine, enabling a unified and coherent memory address space. In such a configuration, CXL is predominantly used for memory expansion. For example, AsteraLabs' A1000 [179] CXL memory expansion card supports up to 4 DDR5 RDIMMs, enabling up to 2 TB of additional memory for a single server.

Although substantial studies on CXL memory have been performed in the past [37,169, 171,172,175,178,180,181], there remains a significant gap in applying these studies to guide the practical integration of CXL. In particular, we observe the following issues: (1) Much of the current literature has focused on evaluating CXL hardware through simulations [37,171] or using FPGA-based setups [172,181]. Although a limited number of studies have begun to assess the raw performance of ASIC-based CXL hardware [172,182], there remains a gap in understanding how different system configurations influence the performance of data center applications using CXL memory. Furthermore, the specific applications that could substantially benefit from CXL memory expansion are not yet fully identified. (2) While existing studies have begun to explore the cost implications of employing CXL technology, such as the work on memory pooling cost models presented in [183], a critical gap remains in understanding the cost-effectiveness of migrating particular types of applications or services to memory expansions facilitated by CXL. (3) Given the restricted availability of CXL ASIC hardware, the research community faces a notable scarcity of open-source empirical data. This limitation hinders efforts to fully comprehend the performance capabilities of such hardware or to develop performance models based on empirical evidence.

Our study aims to fill existing knowledge gaps by conducting detailed evaluations of CXL 1.1 for memory-intensive applications, leading to several *intriguing observations*: Contrary to the common perception that CXL memory, due to its higher latency, should be considered a separate, slower tier of memory [37,175], we find that shifting some workloads to CXL memory can significantly enhance performance, even if local memory's capacity and bandwidth are underutilized. This is because using CXL memory can decrease the overall memory access latency by alleviating bandwidth contention on DDR channels, thereby improving application performance. From our analysis of application performance, we have formulated an abstract cost model (§4.7) that predicts substantial cost savings in practical deployments.

In summary, the major contributions of this paper are:

- Empirical Evaluation of ASIC CXL Hardware: Our study comprehensively examines the performance of ASIC-based CXL hardware and system configurations in data center applications, offering insights on optimizing CXL memory utilization.
- Cost-Benefit Analysis: We undertake a comprehensive cost-benefit analysis and develop an Abstract Cost Model to evaluate how CXL memory could substantially reduce real-world applications' TCO (Total Cost of Ownership).
- Open-source data on CXL ASIC performance: We open source all data and testing configurations under https://github.com/bytedance/eurosys24-artifacts.



Fig. 4.2: **CXL Experimental Platform.** (a) Each CXL server is equipped with two A1000 memory expansion cards. SNC-4 (§4.4.1) is enabled only for the raw performance benchmarks (§4.4) and bandwidth-bound benchmarks (§4.6), and each SNC domain is equipped with two DDR5 channels. Panel (a) illustrates Socket 0; Socket 1 shares a similar setup except for the absence of CXL memory. (b) Our platform comprises two CXL servers and one baseline server. The baseline server replicates the same configuration but lacks any CXL memory cards.

The paper is organized as follows. §4.3 introduces background on CXL and the environment setup for the evaluations. §4.4 presents the basic performance characteristics of CXL memory expansion. §4.5 and §4.6 present findings and suggestions on using CXL to expand memory capacity and bandwidth for data center workloads. §4.7 provides a detailed analysis of the potential cost benefits brought by CXL. §?? discusses how our insights are applicable to future generations of CXL. §?? describes related work, and §?? concludes the paper.

# 4.3 Background and Methodology

This section presents an overview of CXL technology, followed by our experimental setup and methodologies.

# 4.3.1 Compute Express Link (CXL) Overview

Compute Express Link (CXL) [?] is a standardized interconnect technology that facilitates communication between processors and various devices, including accelerators, memory expansion units, and smart I/O devices. CXL is built upon the physical layer of PCI Express® (PCIe®) 5.0 [184], providing native support for x16, x8, and x4 link widths with data rates of 32.0 GT/s and 64.0 GT/s. The CXL transaction layer is implemented through three protocols: CXL.io, CXL.cache, and CXL.mem, as depicted in Fig. 4.1. *CXL.io* protocol is based on PCIe 5.0 and handles device discovery, configuration, initialization, I/O virtualization, and direct memory access (DMA). *CXL.cache* enables CXL devices to access the host processor's memory. *CXL.mem* allows the host to access memory attached to devices using load/store commands.

CXL devices are categorized into three types, each associated with specific use cases: (1) Type-1 devices like SmartNICs utilize CXL.io and CXL.cache for DDR memory communication. (2) Type-2 devices, including GPUs, ASICs, and FPGAs, employ CXL.io, CXL.cache, and CXL.mem to share memory with the processor, enhancing various workloads in the same cache domain. (3) Type-3 devices leverage CXL.io and CXL.mem for memory expansion and pooling. This allows for increased DRAM capacity, enhanced memory bandwidth, and the addition of persistent memory without sacrificing DRAM slots. Type-3 devices complement DRAM with CXL-enabled solutions, benefiting high-speed, low-latency storage.
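The mapping between device types and protocols described above can be summarized in a small lookup structure. The following is only an illustrative sketch; the names `Protocol`, `DEVICE_TYPES`, and `supports_host_load_store` are hypothetical and do not correspond to any CXL software interface.

```python
from enum import Flag, auto


class Protocol(Flag):
    """The three CXL transaction-layer protocols."""
    IO = auto()      # CXL.io: discovery, configuration, DMA
    CACHE = auto()   # CXL.cache: device accesses host memory
    MEM = auto()     # CXL.mem: host load/store to device-attached memory


# Protocol sets per device type, as listed in the text above.
DEVICE_TYPES = {
    "Type-1": Protocol.IO | Protocol.CACHE,
    "Type-2": Protocol.IO | Protocol.CACHE | Protocol.MEM,
    "Type-3": Protocol.IO | Protocol.MEM,
}


def supports_host_load_store(device_type: str) -> bool:
    """A host can issue load/store to device memory only if CXL.mem is present."""
    return Protocol.MEM in DEVICE_TYPES[device_type]


for name in DEVICE_TYPES:
    print(name, supports_host_load_store(name))
```

Only Type-2 and Type-3 devices expose memory to host load/store instructions, which is why memory expanders such as the one studied here are Type-3 devices.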

The commercially available version of CXL is 1.1, where a CXL 1.1 device can only serve as a single logical device accessible by one host at a time. Future generations of CXL, like CXL 2.0, are expected to support the partitioning of devices into multiple logical units, enabling up to 16 different hosts to access different portions of memory [185]. In this paper, our focus is on commercially available CXL 1.1 Type-3 devices, specifically addressing single-host memory expansion.

# 4.3.2 Hardware Support for CXL

Recent announcements have introduced CXL 1.1 support for Intel Sapphire Rapids processors (SPR) [186] and AMD Zen 4 EPYC "Genoa" and "Bergamo" processors [187]. While commercial CXL memory modules are provided by vendors such as AsteraLabs [179], Montage [188], Micron [167], and Samsung [182], CXL memory expanders are predominantly in prototype stages, with only limited samples available, making access difficult for university labs. Consequently, due to the scarcity of CXL hardware, research into CXL memory has largely depended on NUMA-based emulation [37,175] and FPGA implementations [172,181], each with inherent limitations:

**NUMA-based emulation.** Given the cache-coherent nature and comparable transfer speed of CXL and UPI/xGMI interconnects, NUMA-based emulation [37, 175] is widely adopted to enable fast application performance analysis and software prototyping, since CXL memory is exposed as a remote NUMA node. However, NUMA-based emulation fails to accurately capture the performance characteristics of CXL memory because of the differences between CXL and UPI/xGMI interconnects [189], as shown in previous research [172].

**FPGA-based implementation.** Intel and other hardware vendors use FPGA hardware to implement CXL protocols [173], bypassing the performance inconsistencies of NUMA-based emulation. However, FPGA-based CXL memory falls short in fully utilizing memory chip performance due to its lower operating frequency compared to ASICs [190]. FPGAs prioritize flexibility over performance and are suitable for early-stage CXL memory validation but not production deployment. Intel's recent evaluation [172] uncovered performance issues in FPGA implementations, including reduced memory bandwidth during concurrent thread execution. This hampers rigorous evaluations for memory capacity- and bandwidth-bound applications, which are key use cases for CXL memory expanders. Further discussion on the performance disparity between CXL ASIC and FPGA controllers is in §4.4.

To the best of our knowledge, we are among the first to uncover the performance characteristics of actual ASIC prototypes designed for CXL memory expansion. The ASIC CXL memory controller we employ is the A1000 [179] developed by AsteraLabs, which implements the CXL interface at speeds of up to 32 GT/s per lane and supports up to 16 lanes in total. This controller can accommodate up to 4 DDR5-5600 RDIMM slots, providing a total memory capacity of 2 TB.

#### 4.3.3 Software Support for CXL

While hardware vendors are actively advancing CXL production, a notable deficiency exists in software and OS kernel support for CXL memory. This deficiency has prompted the utilization of specific software enhancements. We summarize the most recent patches in the Linux Kernel that add CXL-aware support, namely: (1) the interleaving policy support (unofficial) and (2) the hot page selection support (official since Linux Kernel v6.1).

#### N:M Interleave Policy for Tiered Memory Nodes.

Traditional memory interleave policies distribute data evenly across memory banks, often using a 1:1 ratio. However, the advent of tiered memory systems, which feature CPU-less memory nodes with diverse performance traits, demands more nuanced strategies for optimizing memory bandwidth, especially for bandwidth-heavy applications. The interleave patch [191] introduces an innovative N:M interleave policy to address this, allowing for an allocation scheme where N pages are directed to high-performance (top-tier) nodes and M pages to lower-tier nodes. For example, using a 4:1 ratio directs 80% of traffic to top-tier nodes and 20% to low-tier nodes, adjustable through the vm.numa\_tier\_interleave parameter. While the patch showcases compelling evaluation results [191], it's crucial to note that optimal memory distribution depends on specific hardware and application characteristics. Given the higher latency of CXL memory, as demonstrated in §4.4, performance-sensitive applications should undergo thorough profiling and benchmarking to maximize the advantages of interleaving and mitigate potential performance trade-offs.
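The N:M placement rule described above can be illustrated with a short user-space sketch. It is not the kernel patch's implementation: the function name `nm_interleave` and the tier labels are hypothetical, and the sketch only shows how consecutive page allocations would be split between tiers for a given ratio.

```python
from collections import Counter
from itertools import cycle, islice


def nm_interleave(n_top: int, m_low: int, num_pages: int) -> list[str]:
    """Assign pages in rounds: n_top pages to the top tier, then m_low to the low tier."""
    if n_top < 0 or m_low < 0 or n_top + m_low == 0:
        raise ValueError("ratio must contain at least one positive term")
    pattern = ["top"] * n_top + ["low"] * m_low
    return list(islice(cycle(pattern), num_pages))


# A 4:1 ratio, as in the example in the text.
placement = nm_interleave(4, 1, 1000)
share = Counter(placement)
print(share["top"] / len(placement), share["low"] / len(placement))
```

With a 4:1 ratio, four out of every five pages land on the top tier, which matches the 80%/20% split mentioned above when accesses are spread evenly over the allocated pages.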

#### NUMA Balancing & Hot Page Selection.

The memory subsystem, now termed a memory tiering system, accommodates various memory types like PMEM and CXL Memory, each with differing performance characteristics. To optimize system performance, "hot pages" (frequently accessed) should reside in faster memory tiers like DRAM, while "cold pages" (less frequently accessed) should be in slower tiers like CXL memory. Recent Linux Kernel patches address this:

1. The *NUMA-balancing* patch [192] uses a latency-aware page migration strategy that focuses on promoting recently accessed (MRU) pages. It relies on NUMA-balancing page table scanning and hint page faults. However, it may not accurately identify high-demand pages because of its long scanning intervals, potentially causing latency issues for some workloads.
2. The *Hot Page Selection* patch [?] introduces a Page Promotion Rate Limit (PPRL) mechanism to control the rate of page promotions and demotions. While this lengthens the time needed to promote or demote pages, it improves workload latency. The hot page threshold is dynamically adjusted to align with the promotion rate limit.
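To make the interaction between the promotion rate limit and the dynamically adjusted hot threshold concrete, the following toy model walks through one promotion period. It is a simplified sketch and not kernel code: the function `promotion_round`, the millisecond units, and the fixed adjustment step are assumptions chosen for illustration only.

```python
def promotion_round(fault_latency_ms, threshold_ms, limit_pages, step_ms=1):
    """Simulate one period of rate-limited hot page promotion.

    fault_latency_ms[i] is the (hypothetical) time between page i being
    scanned and its next access; a shorter time means a hotter page.
    Returns the promoted page indices and the threshold for the next period.
    """
    candidates = [page for page, latency in enumerate(fault_latency_ms)
                  if latency < threshold_ms]
    promoted = candidates[:limit_pages]  # enforce the promotion rate limit
    if len(candidates) > limit_pages:
        # Too many candidates: tighten the threshold so fewer pages qualify.
        threshold_ms = max(1, threshold_ms - step_ms)
    else:
        # Budget not exceeded: relax the threshold to admit more pages.
        threshold_ms += step_ms
    return promoted, threshold_ms


latencies = [5, 50, 8, 200, 3]
promoted, threshold = promotion_round(latencies, threshold_ms=10, limit_pages=2)
print(promoted, threshold)
```

In this run three pages qualify as hot but only two fit within the limit, so the threshold is tightened for the next period.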

Additionally, research prototypes like TPP [175] share a similar concept with optimizations and are being considered for integration into the Linux Kernel [193]. However, we faced challenges with TPP when running memory-bandwidth-intensive applications, resulting in unexplained performance degradation. Hence, we rely on the well-tested kernel patches integrated into Linux Kernel since version 6.1.

#### 4.3.4 Experimental Platform Description

The evaluation testbed, as illustrated in Fig. 4.2(b), consists of three servers. Two of these servers are designated as CXL experiment servers. Each of them is equipped with dual Intel Xeon 4th Generation CPUs (Sapphire Rapids, or SPR), 1 TB of 4800 MHz DDR5 memory, two 1.92 TB SSDs, and a pair of A1000 CXL Gen5 x16 ASIC memory expander modules from AsteraLabs, each with 256 GB of 4800 MHz memory (for a total of 512 GB of CXL memory per server). Both A1000 memory modules are attached to socket 0. The third server serves as the baseline and is configured identically to the CXL experiment servers, except for the absence of the CXL memory expanders. It is used to initiate client requests and to run workloads that strictly use main memory during the application assessments. All servers are interconnected via 100 Gbps Ethernet links.

### 4.4 CXL 1.1 Performance Characteristics

In this section, we assess the performance of the CXL memory expander and compare it directly with main memory, which we designate as **MMEM** for clarity against CXL memory. We analyze workload patterns and evaluate performance differences between local and remote socket scenarios.

#### 4.4.1 Experimental Configuration

For each dual-channel A1000 ASIC CXL memory expander [179], we connect two DDR5-4800 memory channels, achieving a total capacity of 256 GB. To provide a fair comparison between MMEM and CXL-attached DDR5 memory, we utilize the Sub-NUMA Clustering (SNC) [194] feature to ensure the number of memory channels is the same in both settings.

Fig. 4.3: **Overall effect of read-write ratio on MMEM and CXL across different distances.** The workloads are represented by read:write ratios (e.g., 0:1 for write-only, 1:0 for read-only). Accessing CXL memory locally incurs higher latency compared to MMEM but is more comparable to accessing MMEM on a remote socket. MMEM bandwidth peaks at 67 GB/s, versus 54.6 GB/s for CXL memory. Performance significantly declines when accessing CXL memory on a remote socket (§4.4.2). In specific scenarios, such as the write-only workload (0:1) in (b), the plot may show instances where bandwidth decreases and latency increases with heavier loads. The Y-axis is on a logarithmic scale.

**Sub-NUMA Clustering (SNC).** Sub-NUMA Clustering (SNC) serves as an enhancement over the traditional NUMA architecture. It decomposes a single NUMA node into multiple smaller semi-independent sub-nodes (domains). Each sub-NUMA node possesses its own dedicated local memory, L3 caches, and CPU cores. In our experimental setup (Fig. 4.2(a)), we partition each CPU into four sub-NUMA nodes. Each sub-NUMA node is equipped with two DDR5 memory channels connected to two 64 GB DDR5-4800 DIMMs. Enabling SNC requires setting the IMC (Integrated Memory Controllers) to 1-way interleaving. According to the specifications, a single DDR5-4800 channel has a theoretical peak bandwidth of 38.4 GB/s [171]. Therefore, each sub-NUMA node has a combined memory bandwidth of up to 76.8 GB/s.
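The peak-bandwidth figures quoted for the SNC configuration follow from simple arithmetic on the DDR5-4800 transfer rate and the 64-bit data width of a DDR channel. The short calculation below reproduces them; the helper name `peak_bandwidth_gbs` is hypothetical, and the 67 GB/s value is the measured read-only peak reported in §4.4.2.

```python
# DDR5-4800 performs 4800 mega-transfers per second over a 64-bit (8-byte) data bus.
TRANSFERS_PER_SEC = 4800e6
BYTES_PER_TRANSFER = 8
CHANNELS_PER_SNC_DOMAIN = 2


def peak_bandwidth_gbs(channels: int) -> float:
    """Theoretical peak bandwidth in GB/s for the given number of DDR5-4800 channels."""
    return channels * TRANSFERS_PER_SEC * BYTES_PER_TRANSFER / 1e9


per_channel = peak_bandwidth_gbs(1)
per_domain = peak_bandwidth_gbs(CHANNELS_PER_SNC_DOMAIN)

# Measured read-only peak for local MMEM, taken from Section 4.4.2.
measured_read_only_gbs = 67.0
efficiency = measured_read_only_gbs / per_domain

print(f"{per_channel:.1f} GB/s per channel, {per_domain:.1f} GB/s per SNC domain")
print(f"read-only efficiency: {efficiency:.0%}")
```

The same per-domain peak applies to a dual-channel A1000 expander populated with DDR5-4800, which is what makes the two-channel SNC domain a like-for-like baseline.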

**Intel Memory Latency Checker (MLC).** We leverage Intel's Memory Latency Checker (MLC) to examine loaded latency for various read-write workloads, adopting a 64-byte access size, the same as in prior work [172]. We deploy 16 MLC threads. Note that while the thread count is a configurable parameter in MLC, it does not directly dictate memory request concurrency. MLC assigns separate memory segments for each thread to access simultaneously. Specifically, when evaluating loaded latency, MLC incrementally increases the operation rate of each thread. Our findings indicate that 16 MLC threads suffice to accurately measure both the idle and loaded latency and the point at which bandwidth becomes saturated. MLC accommodates a broad spectrum of workloads, including those with varied read-write mixes and non-temporal writes.



Fig. 4.4: **A detailed comparison of MMEM versus CXL over diverse NUMA/socket distances and workloads.** (a)-(f) show how the latency-bandwidth trend differs when accessing data from different distances with a sequential access pattern, sorted by the proportion of writes. We refer to main memory as MMEM, with MMEM-r and CXL-r representing remote-socket MMEM and CXL memory access, respectively. The Y-axis is on a logarithmic scale.

Our study is focused on addressing the following research questions:

- How does the performance of CXL-attached memory compare to that of local-socket and remote-socket main memory?
- What is the performance impact of the CXL memory under different read-write ratios and access patterns (random vs. sequential)?
- How do main memory and CXL memory behave under high memory load conditions?

### 4.4.2 Basic Latency and Bandwidth Characteristics

This section outlines our findings on memory access latency and bandwidth for different memory configurations: local-socket main memory (MMEM), remote-socket main memory (MMEM-r), CXL memory (CXL), and remote-socket CXL memory (CXL-r). Figure 4.3(a) shows the loaded latency curve for MMEM under varied read-write mixes. The read-only workload hits a peak bandwidth of roughly 67 GB/s, reaching 87% of its theoretical maximum. Yet, as write operations increase, bandwidth dips, with write-only tasks dropping to 54.6 GB/s. We note an initial memory latency of about 97 ns, which spikes exponentially as bandwidth nears full capacity, a sign of bandwidth contention [195, 196]. Interestingly, latency starts to significantly increase at 75%-83% of bandwidth utilization, surpassing prior estimates of 60% from earlier studies [195].

Figure 4.3(b) illustrates the latency differences when accessing MMEM via a remote socket. For read-only tasks, latency begins at approximately 130 ns, contrasting sharply with just 71.77 ns for write-only operations. This reduced latency for write-only workloads results from non-temporal writes, which proceed asynchronously without awaiting confirmation. Although read-only tasks achieve a maximum bandwidth comparable to that of local MMEM, incorporating more write operations significantly diminishes bandwidth, which we attribute to the additional UPI traffic required by cache coherence protocols. Interestingly, the write-only workload generates minimal UPI traffic but suffers the lowest bandwidth, as it utilizes only one direction of the bidirectional UPI link. Moreover, latency escalation occurs earlier in remote socket memory accesses than in local ones, primarily due to queue contention at the memory controller.

Fig. 4.3(c) illustrates the latency curve for CXL memory expansion, demonstrating a minimum latency of 250.42 ns. Interestingly, despite the additional PCIe and CXL memory controller overhead on the datapath, accessing CXL follows the same bandwidth-contention trend as MMEM. The latency of accessing CXL on the same socket remains relatively stable as bandwidth increases, with a maximum bandwidth of around 56.7 GB/s, achieved with a 2:1 read-write ratio. The reduction in maximum bandwidth compared to DRAM is attributed to PCIe overhead, such as extra headers. The maximum bandwidth for read-only workloads is lower because the PCIe link is bidirectional and a read-only workload cannot fully utilize both directions. Fig. 4.3(d) reveals the latency-bandwidth plot for accessing CXL from a remote socket, which incurs an exceptionally high idle latency of 485 ns. In addition, the maximum memory bandwidth is unexpectedly halved, reaching just 20.4 GB/s for a 2:1 read-write ratio, which is a much more severe performance drop than that of accessing MMEM from the remote NUMA node in Fig. 4.3(b). Since running a read-only workload against a CXL Type-3 device on the remote socket does not generate substantial coherence traffic, our initial speculation regarding cache coherence is ruled out. Further investigation using the Intel Performance Counter Monitor (PCM) [197] also confirms that the UPI utilization is consistently below 30%. Discussions with Intel suggest this performance bottleneck is likely due to limitations in the Remote Snoop Filter (RSF) on the current CPU platform, which are anticipated to be addressed in next-generation processors [198].

#### 4.4.3 Different Read-Write Ratios & Access Pattern

Fig. 4.4(a)-4.4(f) present a performance comparison for a specific workload with varying read-write ratios. The results align with our observation that accessing CXL from a remote socket introduces exceptionally high latency and low bandwidth. When accessing CXL from the same socket, latency is 2.4-2.6× that of local DDR and 1.5-1.92× that of remote-socket DDR. This suggests that running applications directly on CXL may significantly degrade performance. However, when workloads span multiple NUMA nodes within the same socket, accessing CXL locally is comparable to accessing remote NUMA node memory. Additionally, the latency-bandwidth knee-point shifts to the left as the proportion of write operations in the workload increases. Fig. 4.4(g) and 4.4(h) display the results of running read-only and write-only workloads with random access patterns instead of sequential access. Notably, we do not observe any significant performance disparities under these conditions.

#### 4.4.4 Key insights

**Avoiding Remote Socket CXL Access.** CXL memory expansion is commonly utilized for applications that are demanding in terms of memory, particularly those limited by memory capacity or bandwidth. In such contexts, accessing memory across sockets is not uncommon. It is important for software developers to recognize the potential decline in performance when CXL memory is accessed from a remote socket and to strategize against cross-socket CXL memory accesses in their applications. Additionally, hardware vendors should perform cooperative testing and validation of their products to ensure compatibility between CXL memory modules and the processors' CXL support. With adequate support for the CXL 1.1 protocol, we expect that the maximum bandwidth attainable when accessing CXL memory across sockets could approximate the bandwidth seen when accessing MMEM across sockets.

**Bandwidth Contention.** Previous research [171, 196] has brought attention to issues related to bandwidth contention. We further examine how memory latency varies with the read-write ratio under bandwidth contention. While latency remains relatively stable at low to moderate bandwidth utilization levels, it increases exponentially as bandwidth approaches higher levels, primarily due to queuing delays in the memory controller [195]. Furthermore, the knee-point in latency shifts to lower memory bandwidth when there is a higher proportion of write operations in the workload. Interestingly, CXL-attached memory has often been characterized by industry and the research community as 'tiered memory' [172, 191, 193], suggesting that it serves as a slower and less performant memory layer to be considered only when MMEM is fully utilized. However, we argue against this simplistic view of CXL memory. Allocators and kernel-level page placement policies should consider the available bandwidth in MMEM. Even if a substantial portion of memory bandwidth in MMEM remains unused, e.g., 30%, offloading a portion of the workload, e.g., 20%, to CXL memory can lead to overall performance improvements. Our recommendation is to regard CXL memory as a valuable resource for load balancing, even when local DRAM bandwidth is not fully utilized. Subsequent real-world evaluations support these insights (§4.6).
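The load-balancing argument can be illustrated with a deliberately simple queueing-style model. The model is an assumption introduced here for illustration and is not derived from our measurements: it takes loaded latency to be idle latency divided by (1 − utilization), which rises earlier and more gradually than the measured curves, so only the qualitative trend is meaningful. The idle latencies and peak bandwidths are the values reported in §4.4.2; all function and variable names are hypothetical.

```python
# Idle latency (ns) and peak bandwidth (GB/s) reported in Section 4.4.2.
MMEM = {"idle_ns": 97.0, "peak_gbs": 67.0}
CXL = {"idle_ns": 250.42, "peak_gbs": 56.7}


def loaded_latency_ns(tier: dict, demand_gbs: float) -> float:
    """Toy model (assumption): latency = idle / (1 - utilization)."""
    utilization = demand_gbs / tier["peak_gbs"]
    if not 0 <= utilization < 1:
        raise ValueError("demand must be below the tier's peak bandwidth")
    return tier["idle_ns"] / (1.0 - utilization)


def mean_latency_ns(total_gbs: float, cxl_fraction: float) -> float:
    """Traffic-weighted mean latency when cxl_fraction of the traffic goes to CXL."""
    mmem_fraction = 1.0 - cxl_fraction
    latency = mmem_fraction * loaded_latency_ns(MMEM, total_gbs * mmem_fraction)
    if cxl_fraction > 0:
        latency += cxl_fraction * loaded_latency_ns(CXL, total_gbs * cxl_fraction)
    return latency


# MMEM at 70% utilization (30% of its bandwidth unused), as in the example above.
demand_gbs = 0.7 * MMEM["peak_gbs"]
all_mmem = mean_latency_ns(demand_gbs, 0.0)
offload_20 = mean_latency_ns(demand_gbs, 0.2)
print(f"all MMEM: {all_mmem:.0f} ns, 20% offloaded to CXL: {offload_20:.0f} ns")
```

Under this toy model, moving 20% of the traffic to CXL lowers the traffic-weighted mean latency even though CXL's idle latency is far higher, because the remaining MMEM traffic sees less queuing.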

**Comparison with FPGA-based CXL implementations.** Intel recently disclosed latency and bandwidth performance metrics for their FPGA-based CXL prototype [172]. While they provided insights into relative latency and bandwidth efficiency for soft and hard IP implementations, performance under load was not shared. Our measurements indicate that the ASIC CXL solution introduces less than 2.5× overhead in access latency compared to MMEM, outperforming most of Intel's measurements. Moreover, the FPGA-based solution achieved only 60% of the PCIe bandwidth due to the inefficiency of the memory controller, while the AsteraLabs A1000 prototype reached 73.6% bandwidth efficiency, clearly outperforming Intel's FPGA-based solution.



Fig. 4.5: **KeyDB YCSB latency and throughput under different configurations.** (a) Average throughput of four YCSB workloads under different system configurations. (b) Tail latency of YCSB-A. (c) Tail latency CDF of YCSB-C. Both latencies are reported by the YCSB client [199].

# 4.5 Memory Capacity-bound Applications

One of the most significant advantages of integrating CXL memory into modern computing systems is the opportunity for significantly larger memory capacities. To elucidate the potential benefits, we focus on three particular use cases: (1) key-value stores, a commonly used application in data centers; (2) Big Data analytical applications; and (3) elastic computing from cloud providers.

#### 4.5.1 In-memory key-value stores

Redis [24] is an open-source in-memory key-value store and one of the most popular NoSQL databases. Redis employs a user-defined parameter, maxmemory, to limit its memory allocation for storing user data. Like traditional memory allocators (e.g., malloc()), Redis may not return memory to the system after key deletion, particularly if deleted keys were on a memory page with active ones. This necessitates memory provisioning based on peak demand, making memory capacity the major bottleneck for Redis deployments [200] in data centers. Google Cloud suggests keeping memory usage below 80% [201], whereas other sources recommend a limit of 75% [200].

Due to the substantial infrastructure costs of memory-only deployments, Redis Enterprise [202], the commercial variant extensively supported by leading cloud platforms (e.g., AWS, Google Cloud, or Azure), introduces "Auto Tiering" [203] to allow data to overflow to SSDs, offering an economically viable option for database expansion beyond the limits of RAM capacity. Given that Redis Enterprise is not accessible on our experiment platform, we employ KeyDB as an alternative. KeyDB extends Redis's capabilities by adding KeyDB FLASH, which uses RocksDB for persistent storage. With the FLASH feature, all data is written to disk for persistence, while hot data also remains in memory.

#### Methodology and Software Configurations.

In our study, we investigate the performance effects of maximizing memory utilization on a KeyDB server. We deploy a single KeyDB instance on a CXL-enabled server configured with seven server threads. Unlike Redis's single-threaded approach, KeyDB enhances performance by operating multiple threads to run the standard Redis event loop, akin to running several Redis instances simultaneously. We disable SNC and Transparent Hugepages and enable memory overcommitting within the kernel to minimize potential overhead from OS configurations. For KeyDB FLASH, we deactivate all forms of compression in RocksDB to minimize software overhead. Our empirical analysis uses the YCSB benchmark with four distinct workloads: (1) YCSB-A (50% read, 50% update) for update-intensive scenarios; (2) YCSB-B (95% read, 5% update) for read-heavy operations; (3) YCSB-C (100% read) for read-only tasks; and (4) YCSB-D (95% read, 5% insert) to simulate reading the most recent data. These workloads are tested under various system configurations as detailed in Table 4.1. Note that we use the term "MMEM" for main memory in order to distinguish it from CXL memory. For configurations utilizing SSD data spillover, we set the maxmemory parameter according to the portion of the workload expected to remain in memory. For Hot-Promote, we apply numactl to distribute half of the dataset across CXL memory while limiting the total main memory usage to half the dataset size. The experiments are conducted using a 1 KB key-value size, the YCSB default, with a Zipfian distribution for workloads A-C and the latest distribution for workload D. The total amount of working set data is 512 GB.
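As a quick sanity check on the configurations in Table 4.1, the sketch below converts each configuration's split into absolute sizes for the 512 GB working set. It assumes the in-memory portion for the SSD-spill configurations is exactly proportional to the stated fraction; the actual maxmemory values used in the experiments are not stated beyond this proportion, and the names `CONFIGS` and `placement_gb` are hypothetical.

```python
WORKING_SET_GB = 512

# (fraction in MMEM, fraction in CXL, fraction spilled to SSD), following Table 4.1.
CONFIGS = {
    "MMEM": (1.0, 0.0, 0.0),
    "MMEM-SSD-0.2": (0.8, 0.0, 0.2),
    "MMEM-SSD-0.4": (0.6, 0.0, 0.4),
    "3:1": (0.75, 0.25, 0.0),
    "1:1": (0.5, 0.5, 0.0),
    "1:3": (0.25, 0.75, 0.0),
    "Hot-Promote": (0.5, 0.5, 0.0),  # initial placement, before any promotion
}


def placement_gb(name: str) -> dict:
    """Return how many GB of the working set each tier holds for a configuration."""
    mmem, cxl, ssd = CONFIGS[name]
    return {
        "mmem": mmem * WORKING_SET_GB,
        "cxl": cxl * WORKING_SET_GB,
        "ssd": ssd * WORKING_SET_GB,
    }


for name in CONFIGS:
    sizes = placement_gb(name)
    print(f"{name:13s} MMEM={sizes['mmem']:6.1f} GB  "
          f"CXL={sizes['cxl']:6.1f} GB  SSD={sizes['ssd']:6.1f} GB")
```

The 1:1, 1:3, and Hot-Promote configurations fit within the 512 GB of CXL memory installed per server.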

#### Analysis.

Fig. 4.5 provides insights into the variations in throughput across different configurations. Notably, regardless of the specific workload, running the entire workload on MMEM consistently yields the highest throughput. This outcome can be attributed to the nature of our workload, which is primarily constrained by memory capacity rather than memory bandwidth. The Hot-Promote configuration, which leverages the Zipfian distribution to identify frequently accessed keys as hot pages and migrates them from CXL to MMEM, performs nearly as well as running the workload entirely on MMEM. This demonstrates the effectiveness of the Hot-Promote approach in optimizing performance. In contrast, interleaving data access between CXL and MMEM leads to a noticeable performance decrease, resulting in a 1.2x to 1.5x slowdown compared to running the workload directly in MMEM. This performance drop is primarily due to the higher access latency, as evident in the tail latency plots for workload A and workload C (Fig. 4.5(b)(c)). The MMEM-SSD-0.2 and MMEM-SSD-0.4 configurations perform worst, exhibiting nearly a 1.8x slowdown compared to the pure MMEM solution and a 1.55x slowdown compared to the CXL interleaving solution. This poor performance is mainly attributed to the high access latency required to retrieve data from the SSD. It is worth noting that our choice of a Zipfian distribution ensures that the working set is largely cached in MMEM. If the keys were distributed uniformly, we would anticipate worse performance due to increased SSD access times.

| Configuration | Description                                                                                                     |
|---------------|-----------------------------------------------------------------------------------------------------------------|
| MMEM          | Entire working set in main memory.                                                                              |
| MMEM-SSD-0.2  | 20% of the working set is spilled to SSD.                                                                       |
| MMEM-SSD-0.4  | 40% of the working set is spilled to SSD.                                                                       |
| 3:1           | Entire working set in memory (75% MMEM + 25% CXL, 3:1 interleaved).                                             |
| 1:1           | Entire working set in memory (50% MMEM + 50% CXL, 1:1 interleaved).                                             |
| 1:3           | Entire working set in memory (25% MMEM + 75% CXL, 1:3 interleaved).                                             |
| Hot-Promote   | Entire working set in memory (50% MMEM + 50% CXL), with hot page promotion kernel patches discussed in §4.3.    |

Table 4.1: Configurations used in capacity experiments.

#### Insights.

Our study shows that the additional memory capacity provided by CXL can be a game-changer for applications like key-value stores that are constrained by the capacity of traditional MMEM. Intelligent scheduling policies further accentuate the benefits, offering avenues for optimizing systems that leverage multiple memory types while also reducing operating costs.



Fig. 4.6: **Spark memory layout and shuffle spill.** Each Spark executor possesses a fixed-size On-Heap memory, which is dynamically divided between execution and storage memory. If there is insufficient memory during shuffle operations, the Spark executor will spill the data to the disk.

#### 4.5.2 Spark SQL

Big Data plays a crucial role in the workloads managed by data centers. Due to the scale of data involved in Big Data analytical applications, memory capacity often becomes a performance bottleneck [26]. Take Spark [61], one of the most common Big Data platforms, as an example: a typical query requires shuffling data from multiple tables for processing in the next stage. Operations like reduceByKey() first partition the data according to the key and then execute reduce operators on each key. Such a shuffling operation involves disk I/O and network communication between multiple nodes, imposing significant overhead on the query. In some cases, shuffling can dominate the performance of the workload [204]. During the shuffling process (Fig. 4.6), memory usage could grow beyond the



Fig. 4.7: **Spark execution time and shuffle percentage.** (a) Execution time of each TPC-H query normalized to the execution time running on MMEM. (b) The percentage of time spent on shuffle operations for each query. The solid bars represent shuffle writes, while hollow bars represent shuffle reads.

capacity or a certain threshold (e.g., **spark.shuffle.memoryFraction**). When this happens, Spark can be configured to spill data to disk to avoid the risk of out-of-memory failure. Since disk I/O is orders of magnitude slower than memory, this could significantly impact the workload's performance.

#### Methodology and Software Configurations.

In our experiment, we aim to test whether we can reduce the number of servers needed for a specific workload with minimal effect on overall performance. Therefore, we compare the performance of Spark running TPC-H [205] on three servers without CXL memory expansion against two servers with CXL memory expansion. We assume the maximum amount of MMEM that can be used on each server is 512 GB; the three baseline servers therefore provide 1.5 TB of MMEM in total, while the two CXL servers provide 1 TB of MMEM plus 1 TB of CXL memory. In order to trigger data spill within the workload, we configure 150 Spark executors, each with 1 core and 8 GB of memory. The Spark application therefore occupies 150 cores and 1.2 TB of memory in total. We generate an initial TPC-H dataset of 7 TB in total. We continue to adhere to the configuration settings detailed in Table 4.1 as follows:

- MMEM only: We allocate 50 Spark executors and 400 GB on each of the three servers. In this case, no data is spilled to disk because each executor has a sufficient amount of memory.
- MMEM/CXL interleaving: We distribute the same number of executors (150) across the **two** CXL servers, which together have 1 TB of CXL memory (512 GB from each of the two CXL cards) plus 1 TB of MMEM (512 GB each). For example, in a configuration where MMEM and CXL memory usage is balanced (1:1 ratio), we allocate 75 Spark executors to use 600 GB of MMEM and another 75 Spark executors to use 600 GB of CXL memory. In this case, a negligible amount of data is spilled to disk.

- Spill to SSD: To simulate conditions where executors run out of memory and need to spill data to SSD storage, we restrict the memory allocation of the Spark executors to either 80% or 60% of the entire 1.2 TB of MMEM. In this case, around 320 GB and 500 GB of data are spilled to disk, respectively.
- Hot-Promote: same as prior experiment (§4.5.1).
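The executor sizing in the configurations above follows from simple arithmetic. The sketch below only reproduces the memory budgets stated in the text (it does not model how much data Spark actually spills); the variable names are illustrative and not taken from any Spark configuration.

```python
# Memory budgets implied by the Spark configurations described above.
EXECUTORS = 150
GB_PER_EXECUTOR = 8

# Total memory occupied by the Spark application (1.2 TB).
total_gb = EXECUTORS * GB_PER_EXECUTOR

# MMEM only: executors are split evenly across three servers.
mmem_only_per_server_gb = (EXECUTORS // 3) * GB_PER_EXECUTOR

# 1:1 interleaving: half of the executors use MMEM, half use CXL memory.
interleave_mmem_gb = (EXECUTORS // 2) * GB_PER_EXECUTOR
interleave_cxl_gb = total_gb - interleave_mmem_gb

# Spill to SSD: executor memory restricted to 80% or 60% of the total.
spill_budgets_gb = [total_gb * pct // 100 for pct in (80, 60)]

print(total_gb, mmem_only_per_server_gb, interleave_mmem_gb, spill_budgets_gb)
```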

We chose four specific queries (Q5, Q7, Q8, and Q9) from the TPC-H benchmark [205], recognized for their intensive data shuffling demands from prior studies [204], to evaluate our setup. Importantly, our measurements focused solely on the time to execute these queries, excluding any data preparation or server setup durations. We disabled SNC on all servers.

#### Analysis.

Figure 4.7 illustrates variations in total execution time across different configurations. To provide a clear comparison, we normalize the total execution time against the best-case scenario, which involves running the entire workload in MMEM. Similar to the KeyDB experiments, the interleaving approach still exhibits a performance slowdown, ranging from 1.4x to 9.8x compared to the optimal MMEM-only scenario, while using fewer servers. This performance degradation worsens as a larger proportion of memory is allocated to CXL. Nevertheless, even with this slowdown, the interleaving approach remains significantly faster than spilling data to SSDs. Figure 4.7(b) shows that shuffling dominates the total execution time as data spilling intensifies.

A notable difference between the KeyDB and Spark experiments is the performance of Hot-Promote. While it performs better in KeyDB, the Spark SQL experiment shows a

| Year           | CPU                     | Max vCPU per server | Memory channels per socket | Max memory (TB) | Required memory at 1:4 (TB) |
|----------------|-------------------------|---------------------|----------------------------|-----------------|-----------------------------|
| 2021           | IceLake-SP [206]        | 160                 | 8xDDR4-3200                | 4               | 0.64                        |
| 2022 (delayed) | Sapphire Rapids [207]   | 192                 | 8xDDR5-4800                | 4               | 0.768                       |
| 2023 (delayed) | Emerald Rapids [208]    | 256                 | 8xDDR5-6400                | 4               | 1                           |
| 2024+          | Sierra Forest [209]     | 1152                | 12                         | 4               | 4.5                         |
| 2025+          | Clearwater Forest [210] | 1152                | TBD                        | 4               | 4.5                         |

Table 4.2: Intel Processor Series.

more than 34% slowdown compared to MMEM. Unlike the Zipfian distribution in which the hottest keys are moved from CXL to DDR, there is a considerable amount of thrashing behavior within the kernel in the Spark SQL tests. We identify the root cause after thoroughly investigating the kernel patch implementation. In the initial version of the hot page selection patch [?], a sysctl knob "kernel.numa\_balancing\_promote\_rate\_limit\_MBps" is used to control the maximum promoting/demoting throughput. Subsequent versions introduced an automatic threshold adjustment feature to this patch, aiming to strike a balance between the speed of promotion and migration costs. Nevertheless, this automatic adjustment mechanism appears to fall short in our Spark SQL evaluations. The TPC-H workload on Spark, which demonstrates reduced data locality, challenges the kernel's efficiency in promoting frequently accessed pages. This finding aligns with similar issues highlighted in prior research [172].

#### Insights.

Our research indicates that utilizing CXL memory expansion offers a cost-efficient approach for data-center applications. We postpone our detailed theoretical examination of the Abstract Cost Model to §4.7. Concurrently, although the hot-promote patch demonstrates significant advantages in key-value store workloads, its performance is notably lacking in Spark experiments. As system developers begin to enhance software support for CXL within the kernel, it is crucial to proceed with caution. System-wide policies can have varied impacts on applications, depending on their unique characteristics.

#### 4.5.3 Spare Cores for Virtual Machine

One widely used application within Infrastructure-as-a-Service (IaaS) is Elastic Computing [211]. Here, cloud service providers (CSPs) offer computational resources to users through virtual machines or container instances. Given the diverse needs of users, CSPs traditionally offer a variety of instance types, each characterized by different configurations of CPU cores, memory, disk, and network capacities. Generally, an "optimal" CPU-to-memory ratio, often cited as 1:4, is employed to balance computational and memory requirements (as per AWS guidelines [212,213]). For example, an instance with 128 vCPUs would typically feature 512 GB of DDR memory. Advancements in server processor architecture and chiplet technology have spurred rapid increases in the number of cores available in a single processor package, driven in large part by the CSPs' aim to lower per-core costs. Consequently, 2-socket servers have seen their vCPU counts grow from 160 to 256 within the past two years (Table 4.2). This trend is projected to continue, reaching as many as 1152 vCPUs per server by 2025.

The surge in vCPUs exacerbates memory capacity bottlenecks, constrained by DDR slot limits, DRAM density, and the cost of high-density DIMMs. Intel's Sierra Forest Xeon, for example, supports 1152 vCPUs but is limited by motherboard design to less than 4 TB of memory, falling short of the typical 4.5 TB needed for VM provisioning [214]. This discrepancy makes maintaining a cost-effective vCPU-to-memory ratio challenging, resulting in underutilized vCPUs and lost revenue for CSPs. CXL memory expansion provides a solution by enabling memory capacity to scale beyond DDR limitations, ensuring optimal vCPU utilization and mitigating revenue losses for CSPs.

#### Methodology and Software Configurations.

To assess the performance impact when an application operates exclusively on CXL memory, we replicate the KeyDB configuration from previous experiments (§4.5.1). We utilize numactl to allocate the KeyDB instance exclusively to MMEM or CXL memory. For our evaluation, the workload employed is YCSB-C, characterized by 1 KB key-value pairs and a total dataset size of 100 GB. SNC is disabled.



Fig. 4.8: KeyDB Performance with YCSB-C on CXL/MMEM.

#### Analysis.

The CDF of read latency (Fig. 4.8(a)) indicates that applications running on CXL experience a latency penalty of 9%-27%, which is less than the raw data-fetching penalty in our previous measurements in §4.4. This is due to the processing latency within Redis. The throughput of running the entire workload on CXL memory is around 12.5% lower than on MMEM, as shown in Fig. 4.8(b).

Now consider a server operating at a sub-optimal vCPU-to-memory ratio of 1:3. (1) Due to inadequate memory, only 75% of the vCPUs can be sold at the optimal 1:4 ratio, resulting in a 25% revenue loss. Implementing CXL memory expansion enables the CSP to sell the remaining 25% of vCPUs at the optimal ratio. (2) Our benchmarks indicate that instances running on CXL memory perform 12.5% slower than those on DDR for common workloads such as Redis. Assuming a 20% price discount on such instances, CSPs could still recover approximately 80% of the lost revenue, equating to a 27% improvement in total revenue (20/75 = 26.67%).
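The revenue estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch under the simplifying assumption that revenue is proportional to the number of vCPUs sold; the function and parameter names are illustrative only.

```python
def revenue_recovery(mem_gb_per_vcpu, target_gb_per_vcpu, cxl_discount):
    """Return (sellable, recovered, uplift) as fractions of full-server revenue.

    sellable:  share of vCPUs that can be sold at the target ratio with DDR only
    recovered: revenue regained by selling the stranded vCPUs on discounted CXL
    uplift:    recovered revenue relative to the DDR-only revenue
    """
    sellable = mem_gb_per_vcpu / target_gb_per_vcpu
    stranded = 1.0 - sellable
    recovered = stranded * (1.0 - cxl_discount)
    uplift = recovered / sellable
    return sellable, recovered, uplift


# A 1:3 server, a 1:4 target ratio, and a 20% discount on CXL-backed instances.
sellable, recovered, uplift = revenue_recovery(3, 4, 0.20)
print(f"sellable={sellable:.0%} recovered={recovered:.0%} uplift={uplift:.0%}")
```

The uplift is computed relative to the DDR-only revenue (75% of the full-server revenue), which is why recovering 20 percentage points corresponds to roughly a 27% improvement.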

#### Insights.

Given the sheer scale of Elastic Computing Service (ECS) applications in public clouds, the potential benefits of CXL memory expansion could be substantial. However, the challenge of maintaining an optimal virtual CPU (vCPU) to memory ratio, traditionally at 1:4, becomes more complex with the rapid increase in processor cores. This ratio, although standard, is under scrutiny for its applicability in future cloud computing paradigms. Notably, ByteDance's Volcano Engine Cloud [215] illustrates the variability in resource allocation by offering different ratios: 1:4 for general purposes, 1:2 for compute-intensive tasks, and 1:8 for memory and storage-intensive workloads. The impact of CXL memory expansion and pooling on these established ratios presents an intriguing avenue for exploration, raising questions about the adaptability of cloud providers to evolving hardware capabilities and the subsequent effect on resource allocation standards.

# 4.6 Memory Bandwidth-Bound Applications

The other advantage of CXL memory expansion is its extra memory bandwidth. We use Large Language Model inference as an example to showcase how this can benefit real-world applications.

Recent work on LLMs [216] shows that LLM inference is hungry for memory capacity and bandwidth. The limited capacity of GPU memory restricts the batch size of the LLM inference job and reduces computing efficiency since LLM models are memory-demanding. On the other hand, while CPU memory is high in capacity, it has lower bandwidth than GPU memory. The extra bandwidth and capacity offered by CXL memory make it a promising option for alleviating this bottleneck. For example, a CPU-based LLM inference job can benefit from the extra bandwidth brought by CXL memory, and a CXL-enabled GPU device can also use the extra memory capacity from a disaggregated memory pool. Due to the lack of CXL support in current GPU devices, we experiment with LLM inference on CPU to study the implications of CXL memory's extra bandwidth. We also note that as LLM inference applications are agnostic to the underlying memory technologies, the findings and implications from our experiments are also applicable to the upcoming CXL 2.0/3.0 devices.

**LLM Inference Framework.** Mainstream Large Language Model (LLM) inference frameworks, such as vLLM [217] and LightLLM [218], do not support CPU inference. Recently, Intel introduced an LLM model named Q8chat [219], trained using their 4th Generation Intel Xeon® Scalable Processors. However, the inference code for Q8chat is not yet publicly available. To address this gap, we have developed our inference framework based on the open-source LightLLM framework [218] by replacing the backend with a CPU inference



Fig. 4.9: **LLM inference framework.** The HTTPserver receives requests and forwards the tokenized requests to the CPU inference backend. The CPU inference backend serves the requests and replies with the next token.



(a) LLM inference serving rate vs. number of threads

(b) Memory bandwidth vs. number of threads for a single backend



(c) Memory bandwidth vs. KVcache size for a single backend

Fig. 4.10: CPU LLM inference.

backend. Figure 4.9 illustrates our implementation. In our framework, the HTTPserver frontend receives LLM inference requests and forwards the tokenized requests to a router. The router is responsible for distributing these requests to different CPU backend instances. Each CPU backend instance is equipped with a Key-Value (KV) cache [220], a widely used technique in large language model inference. It's worth noting that KV caching, despite its name, differs from the traditional 'key-value store' in system architecture. KV caching occurs during multiple token generation steps, specifically within the decoder. During the decoding process, the model starts with a sequence of tokens, predicts the next token, appends it to the input, and repeats this generation process. This is how models like GPT [216] generate responses. The KV cache stores key and value projections used as intermediate data within this decoding process to avoid recomputation for each token generation. Prior research [220] has shown that KV caching is typically memory-bandwidth bound, as it is unique for each sequence in the batch, and different requests typically do not share the KV cache since the sequences are stored in separate contiguous memory spaces [221].
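To make the role of the KV cache concrete, the following toy sketch compares decoding with and without cached key/value projections. It is not the inference backend used in our framework: it uses a single attention head, the `project` function is a hypothetical stand-in for learned projection matrices, and the token vectors are supplied up front rather than sampled from the model.

```python
import math


def project(token_vec, scale):
    """Hypothetical stand-in for a learned key or value projection."""
    return [scale * x for x in token_vec]


def attend(query, keys, values):
    """Single-head softmax attention of one query over all given positions."""
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(len(query))
        for key in keys
    ]
    peak = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]


def decode_with_cache(token_vecs):
    """Each token's K/V projection is computed once and appended to the cache."""
    keys, values, outputs = [], [], []
    for vec in token_vecs:
        keys.append(project(vec, 0.5))
        values.append(project(vec, 2.0))
        outputs.append(attend(vec, keys, values))
    return outputs


def decode_without_cache(token_vecs):
    """K/V projections of the whole prefix are recomputed at every step."""
    outputs = []
    for step in range(1, len(token_vecs) + 1):
        prefix = token_vecs[:step]
        keys = [project(v, 0.5) for v in prefix]
        values = [project(v, 2.0) for v in prefix]
        outputs.append(attend(prefix[-1], keys, values))
    return outputs


tokens = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
cached = decode_with_cache(tokens)
recomputed = decode_without_cache(tokens)
print(cached == recomputed)
```

Both variants produce the same outputs; the cached variant trades memory (the growing `keys` and `values` lists, which are read in full at every step) for avoided recomputation, which is why the technique stresses memory bandwidth.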

#### 4.6.1 Methodology and Software Configurations

To investigate the benefits of CXL memory extension for applications with high memory bandwidth demands and limited MMEM bandwidth availability, we employ the SNC-4 configuration to divide a single CPU into four sub-NUMA nodes. Each node is equipped with two DDR5-4800 memory channels, facilitating an early memory bandwidth saturation of 67 GB/s (§4.4). We examine three distinct interleaving policies (3:1, 1:1, 1:3), detailed in Table 4.1. The CPU inference backend is configured with 12 CPU threads, and memory allocation is strictly bound to a single sub-NUMA domain. This domain includes two DDR5-4800 channels and a 256 GB A1000 CXL memory expansion module via PCIe. By binding allocations to a single node, we ensure the initial saturation of the DDR5 channels. Our experiments utilize the Alpaca 7B model [222], an advancement of the LLaMA 7B model, requiring 4.1GB of memory. The workload, derived from the LightLLM framework [218], includes a wide range of chat-oriented questions. A single-threaded client machine on a baseline server sends HTTP requests with various LLM queries to mimic real-world conditions. The client ensures continuous operation of the CPU inference backends by maintaining a constant stream of requests. The prompt context is set to 2048 bytes to guarantee a minimum inference response size. We progressively increase the CPU inference backend count to monitor the LLM inference serving rate (in tokens/s).

#### 4.6.2 Analysis

Fig. 4.10(a) displays the inference serving rates across various memory configurations as the thread count, i.e., the number of CPU inference backends, increases. Initially, the serving rate improves almost linearly with available memory bandwidth. However, at 48 threads, MMEM bandwidth saturation limits the serving rate, whereas the interleaving configurations leverage additional CXL bandwidth for continued scaling. With a significant number of inference threads (60), an MMEM:CXL = 3:1 interleaving significantly surpasses the MMEM-only approach by 95%.

Interestingly, among the interleaving policies, configurations with a higher proportion of data in main memory demonstrate superior inference performance. Contrary to expectations, we observe that operating entirely on main memory is 14% less effective than an MMEM:CXL ratio of 1:3 beyond 64 threads. This outcome is notable given CXL's inherently higher latency and reduced memory bandwidth (§4.4). Fig. 4.10(b) charts the memory bandwidth utilization, as measured by the Intel Performance Counter Monitor (PCM) [197], with increasing CPU thread counts within a single CPU inference backend. Initially, bandwidth utilization grows linearly with thread count, plateauing at 24.2 GB/s for 24 threads. This trend allows us to estimate a bandwidth of approximately 63 GB/s at 60 threads, reaching 82% of the theoretical maximum. Our microbenchmark findings, as detailed in §4.4, indicate that this level of bandwidth utilization may lead to significant latency spikes. These results corroborate the hypothesis that bandwidth contention plays a crucial role in the observed performance degradation.
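The 82% figure can be checked against the theoretical peak of the two DDR5-4800 channels in one sub-NUMA domain. The sketch below assumes the standard 64-bit (8-byte) data path per DDR5 channel; the 63 GB/s estimate is taken from the text above, and the helper name is illustrative.

```python
def ddr_peak_gb_per_s(transfer_rate_mt_s, channels, bus_bytes=8):
    """Theoretical peak bandwidth in GB/s (assumes an 8-byte data bus per channel)."""
    return transfer_rate_mt_s * 1e6 * bus_bytes * channels / 1e9


# Two DDR5-4800 channels per sub-NUMA domain under SNC-4.
peak = ddr_peak_gb_per_s(4800, channels=2)

# Estimated bandwidth demand at 60 inference threads (from the text).
estimated_demand = 63.0
utilization = estimated_demand / peak

print(f"peak={peak:.1f} GB/s utilization={utilization:.0%}")
```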

Bandwidth contention may stem from either loading the LLM model or accessing the KV cache. Adjusting the prompt context to infinity enables the LLM model to continuously generate new tokens for storage in the KV cache. Fig. 4.10(c) illustrates the correlation between KV cache size and memory bandwidth consumption. The initial memory bandwidth of approximately 12 GB/s originates from I/O threads loading the model from memory. When storing information for a larger sequence of tokens in the KV cache, memory usage initially increases linearly. However, bandwidth utilization stops increasing beyond roughly 21 GB/s.

| Parameter      | Description                                                                                                                                                                  |
|----------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| $P_s$          | Throughput when (almost) entire working set is spilled to SSD on a server. Normalized to 1 in the cost model.                                                                |
| $R_d$          | Relative throughput when the entire working set is in main memory on a server, normalized to $P_s$.                                                                          |
| $R_c$          | Relative throughput when the entire working set is in CXL memory on a server, normalized to $P_s$.                                                                           |
| $D$            | The MMEM capacity allocated to each server. For completeness only, not used in cost model.                                                                                   |
| $C$            | The ratio of main memory to CXL capacity on a CXL server. E.g. 2 means the server has 2x MMEM capacity than CXL memory.                                                      |
| $N_{baseline}$ | Number of servers in the baseline cluster.                                                                                                                                   |
| $N_{cxl}$      | Number of servers in the cluster with CXL memory to deliver the same performance as the baseline.                                                                            |
| $R_t$          | Relative TCO comparing a server equipped with CXL memory vs. baseline server. E.g. If a server with CXL memory costs 10% more than the baseline server, this parameter is 1.1. |

Table 4.3: Parameters of our Abstract Cost Model.

#### 4.6.3 Insights

Interestingly, existing tiered memory management in the kernel does not consider memory bandwidth contention. Consider a workload that already uses a high fraction of main memory bandwidth (e.g., 70%): the existing page migration policy (§4.3) tends to move data from the slower memory tier (CXL) into MMEM, provided that there is still enough memory capacity. As more data is placed in main memory, its bandwidth utilization continues to increase (e.g., to 90%). In this case, the access latency grows exponentially, resulting in an actual slowdown of the workload. This scenario will not be uncommon, especially for memory-bandwidth-bound applications (e.g., LLM inference). Therefore, the definition of tiered memory requires rethinking.

# 4.7 Cost Implications

Our comprehensive analysis in prior sections (§4.5, §4.6) reveals that the adoption of CXL memory expansion offers substantial benefits for data center applications, including comparable performance with operational cost savings. However, a significant hurdle in embracing such innovative technology as CXL lies in determining its Return on Investment (ROI). Despite having access to detailed technical specifications and benchmark performance results, accurately forecasting the Total Cost of Ownership (TCO) savings remains challenging. The complexity of simulating benchmarks at production scale, compounded by the limited availability of CXL hardware, exacerbates this issue. Traditional cost models in prior work [183], which could offer such forecasts, demand extensive internal and sensitive information that is often inaccessible. To overcome this barrier, we propose an Abstract Cost Model designed to estimate TCO savings independently of internal or sensitive data. This model leverages a select set of metrics obtainable through microbenchmarks, alongside a handful of empirical values that are simpler to approximate or access, providing a viable means to evaluate the economic viability of CXL technology implementation.

We use a capacity-bound application (Spark SQL) as an example to demonstrate how we develop our Abstract Cost Model, but our methodology can be extended to other types of workloads as well. For Spark SQL applications, the additional capacity enabled by CXL memory reduces the amount of data spilled to SSD and results in higher performance (throughput). This means fewer servers will be needed to meet the same performance target.

Given that the workload maintains a relatively consistent memory footprint (the size of the active dataset) during execution, we can approximate the execution time of the workload by dividing it into three distinct segments: (1) The segment processed using data stored in MMEM, (2) The segment processed using data stored in CXL memory, and (3) The segment processed using data that has been offloaded to SSD storage.

We first make these measurements from microbenchmarks on a single server:

- Baseline performance  $(P_s)$ : Measure the throughput when (almost) the entire working set is spilled to SSD. The absolute number is not used in our cost model. Instead, we normalize it to 1.
- Relative performance when the entire working set is in MMEM  $(R_d)$ : Using the same workload, we measure the throughput when the entire working set is in MMEM and normalize it to  $P_s$  to get the relative performance (i.e., how much faster compared to the baseline).
- Relative performance when the entire working set is in CXL memory  $(R_c)$ : Using the same workload, we measure the throughput when the entire working set is in CXL memory, and normalize it to  $P_s$  to get the relative performance.

We then formulate our cost model using the parameters outlined in Table 4.3. For a working set size of $W$, the execution time of the baseline cluster can be approximated as the sum of two segments: (1) the segment that is executed with data in MMEM; (2) the segment that is executed with data spilled onto SSD.

$$T_{baseline} = \frac{N_{baseline}D}{R_d} + (W - N_{baseline}D)$$

The execution time of the cluster with CXL memory could be approximated in a similar way. It includes the segment that is executed with data in main memory, in CXL memory, and spilled to SSD respectively.

$$T_{cxl} = \frac{N_{cxl}D}{R_d} + \frac{N_{cxl}D}{CR_c} + (W - N_{cxl}D - \frac{N_{cxl}D}{C})$$

To meet the same performance target,  $T_{baseline} = T_{cxl}$ :

$$\frac{N_{baseline}D}{R_d} - N_{baseline}D = \frac{N_{cxl}D}{R_d} + \frac{N_{cxl}D}{CR_c} - N_{cxl}D - \frac{N_{cxl}D}{C}$$

With some simple transformations, we get the ratio between  $N_{cxl}$  and  $N_{baseline}$ :

$$\frac{N_{cxl}}{N_{baseline}} = \frac{CR_c(R_d - 1)}{R_cR_d(C + 1) - CR_c - R_d}$$
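For completeness, the intermediate steps can be written out as follows. Dividing the equality above by $D$ and factoring out the server counts gives

$$N_{baseline}\left(\frac{1}{R_d} - 1\right) = N_{cxl}\left(\frac{1}{R_d} + \frac{1}{CR_c} - 1 - \frac{1}{C}\right)$$

Multiplying both sides by $-CR_cR_d$ then yields

$$N_{baseline}\,CR_c(R_d - 1) = N_{cxl}\left(R_cR_d(C + 1) - CR_c - R_d\right)$$

from which the ratio follows directly. Note that both $W$ and $D$ cancel, so the ratio depends only on $R_d$, $R_c$, and $C$.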

TCO saving can then be formulated as follows.

$$TCO_{saving} = 1 - \frac{TCO_{cxl}}{TCO_{baseline}} = 1 - \frac{N_{cxl}R_t}{N_{baseline}}$$

For example, suppose  $R_d = 10$ ,  $R_c = 8$ , C = 2, we get  $\frac{N_{cxl}}{N_{baseline}} = 67.29\%$  from the cost model. This means that by using CXL memory, we may reduce the number of servers by 32.71%. And if we further assume  $R_t = 1.1$  (a server with CXL memory costs 10% more than the baseline server), the TCO saving is estimated to be 25.98%.
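The worked example can be reproduced with a short script. This is a minimal sketch of the cost model as formulated above; the function names, and the values chosen for $W$, $D$, and $N_{baseline}$ in the consistency check, are illustrative assumptions.

```python
def server_ratio(r_d, r_c, c):
    """N_cxl / N_baseline needed to match the baseline execution time."""
    numerator = c * r_c * (r_d - 1)
    denominator = r_c * r_d * (c + 1) - c * r_c - r_d
    return numerator / denominator


def tco_saving(r_d, r_c, c, r_t):
    """1 - TCO_cxl / TCO_baseline, with r_t the relative per-server TCO."""
    return 1 - server_ratio(r_d, r_c, c) * r_t


def t_baseline(n, d, w, r_d):
    """Execution time of the baseline cluster (MMEM segment + SSD segment)."""
    return n * d / r_d + (w - n * d)


def t_cxl(n, d, w, r_d, r_c, c):
    """Execution time of the CXL cluster (MMEM + CXL + SSD segments)."""
    return n * d / r_d + n * d / (c * r_c) + (w - n * d - n * d / c)


ratio = server_ratio(r_d=10, r_c=8, c=2)
saving = tco_saving(r_d=10, r_c=8, c=2, r_t=1.1)
print(f"N_cxl/N_baseline = {ratio:.2%}, TCO saving = {saving:.2%}")

# Consistency check with illustrative values: 100 baseline servers,
# D = 1 unit of MMEM per server, working set W = 1000 units.
n_base, d, w = 100, 1.0, 1000.0
time_base = t_baseline(n_base, d, w, r_d=10)
time_cxl = t_cxl(n_base * ratio, d, w, r_d=10, r_c=8, c=2)
print(abs(time_base - time_cxl) < 1e-6)
```

The final check confirms that, with the derived server ratio, the two clusters reach the same approximate execution time for the chosen illustrative inputs.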

Our Abstract Cost Model provides an easy and accessible way to estimate the benefit from using CXL memory, providing important guidance to the design of the next-generation infrastructure.

**Extending Cost Model for more realistic scenarios.** In line with previous research [183], our Abstract Cost Model is designed to be adaptable, allowing for the inclusion of additional practical infrastructure expenses such as the cost of CXL memory controllers, CXL switches (applicable in CXL 2.0/3.0 versions), PCBs, cables, etc., as fixed constants. However, a notable constraint of our current model is its focus on only one type of application at a time. This becomes a challenge when a data center provider seeks to evaluate cost savings for multiple distinct applications, each with unique characteristics, especially in environments where resources are shared (for instance, through CXL memory pools). This scenario introduces complexity and presents an intriguing challenge, which we acknowledge as an area for future investigation.

## Chapter 5

## Future Work

#### 5.1 Introduction

Autoregressive large language models generate output tokens sequentially, where the generation of each token involves computing attention with key-value (KV) data of all the preceding tokens [223–225]. This sequential dependency makes LLM inference both compute- and memory-intensive [216]. LLM inference typically has two stages: (1) the prefill stage, where all input tokens are processed (in parallel) to generate the initial output token, and (2) the decode stage, where the rest of the output tokens are generated one by one until the model generates an end-of-sequence token.

For applications such as chatbots and coding assistants, LLM serving systems aim to minimize the time to finish the prefill stage, or time to first token (TTFT). In production, the service-level objective (SLO) for TTFT is typically 400 ms [226]. To meet such an SLO, LLM serving systems often cache the previously computed KV data of preceding tokens (i.e., the prefix) in GPU memory, to avoid recomputing them for future requests that have the same prefix [226–228]. Storing the KV cache reduces the overall computational load and significantly improves throughput by trading memory for computation.

In production chatbot applications that support large context windows, the demand for KV cache storage grows rapidly with the number of inference requests from users and cannot be accommodated solely by the limited, expensive GPU memory. Researchers have thus developed techniques to offload the KV cache to CPU memory, leveraging the larger CPU memory capacity to reduce GPU memory pressure [227–229]. However, as larger LLMs and support for long-context inference requests continue to emerge, the approach of offloading to CPU memory becomes more limited. For example, in LLaMA-2-7B, the KV cache of a single token in FP32 precision is 1024 KB, and the KV cache of a single request with 4096 tokens (the maximum context length) is 4 GB [230]. The memory demand from serving many concurrent long-context requests can easily overwhelm even high-end memory servers [227,231].
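The per-token and per-request KV cache sizes quoted above follow from the model dimensions. The sketch below assumes the commonly reported LLaMA-2-7B configuration (32 transformer layers, hidden size 4096, standard multi-head attention so that keys and values each span the full hidden size); these dimensions are assumptions of the sketch rather than values stated in the text.

```python
def kv_bytes_per_token(num_layers, hidden_size, bytes_per_value):
    """Bytes needed to store one token's keys and values across all layers."""
    return 2 * num_layers * hidden_size * bytes_per_value  # 2 = key + value


# Assumed LLaMA-2-7B dimensions and FP32 storage.
NUM_LAYERS = 32
HIDDEN_SIZE = 4096
FP32_BYTES = 4
MAX_CONTEXT_TOKENS = 4096

per_token = kv_bytes_per_token(NUM_LAYERS, HIDDEN_SIZE, FP32_BYTES)
per_request = per_token * MAX_CONTEXT_TOKENS

print(per_token // 1024, "KB per token")
print(per_request // 1024**3, "GB per full-context request")
```

Under these assumptions, a few hundred concurrent full-context requests would already require on the order of a terabyte of KV cache storage.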

Practitioners increasingly turn to more scalable memory architectures, such as Compute Express Link (CXL) memory [37, 232, 233], to address the growing memory demands of large-scale systems. CXL expands memory capacity by connecting additional DRAM to servers via PCIe, while maintaining low-latency access. It offers a promising solution to the KV cache storage demand in LLM serving.

In this paper, we propose to leverage CXL memory for offloading KV cache, with the goal to improve serving throughput while retaining SLO on TTFT, and reduce memory pressure for the upper-level LLM serving system. This paper makes the following contributions:

- We present the first measurement of CXL-GPU interconnect and evaluate its feasibility for KV cache offloading. We show that XXX.
- We describe our initial design of CXL-based KV cache offloading interface and present its performance evaluation for LLM serving, on our hardware platform that is the first to successfully integrate ASIC-CXL device and GPU.
- We examine the trade-offs in using CXL for KV cache offloading (§??) with Return on Investment (ROI) modeling, and identify promising areas for future research and development.

#### 5.2 CXL-based KV Cache Offloading

We now present the design and implementation of our CXL-based KV cache offloading interface for LLM serving. We also describe the hardware platform used to evaluate our design.

**Design and implementation.** Our goal is to develop a CXL storage interface, named *KVExpress*, which can be integrated into existing LLM serving systems for saving and loading the KV cache of inference requests. *KVExpress* provides two external APIs to its upper-level serving system: **save** and **load**. **save** takes a unique identifier of a token chunk as input and copies its KV cache from GPU to CXL memory. **load** takes a unique identifier of a token chunk as input and checks whether its KV cache exists in CXL memory; if so, it copies the KV cache from CXL memory to GPU. A token chunk can consist of one or more tokens. The unique identifier of a token chunk  $t_i$  for a sequence is the hash of the content of  $t_i$  and the hash of its prefix  $\langle t_0, ..., t_{i-1} \rangle$ . If the prefix of a sequence of a current request has been computed and saved into CXL, *KVExpress* will load the KV cache of the prefix from CXL and use it when computing for this request [227].

To avoid calling **save** and **load** too frequently and incurring unnecessary overhead to the upper-level serving system, **save** is called only when a request finishes, so the KV cache of all the tokens for that request is saved at once; **load** is called for a request prior to its prefill computation.
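The interface described above can be illustrated with a small, self-contained sketch. It is not the *KVExpress* implementation: a Python dictionary stands in for CXL memory, byte strings stand in for KV tensors, and the chained SHA-256 construction is one assumed way of combining a chunk's content with the hash of its prefix. All class and function names are hypothetical.

```python
import hashlib
from typing import Dict, List, Optional, Sequence


def chunk_ids(tokens: Sequence[int], chunk_size: int) -> List[str]:
    """Return one identifier per token chunk, chaining in the prefix hash."""
    ids: List[str] = []
    prefix_hash = ""  # hash of the (empty) prefix before the first chunk
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        payload = prefix_hash + "|" + ",".join(str(t) for t in chunk)
        prefix_hash = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        ids.append(prefix_hash)
    return ids


class ToyKVStore:
    """Dictionary standing in for CXL memory; values stand in for KV tensors."""

    def __init__(self) -> None:
        self._cxl: Dict[str, bytes] = {}

    def save(self, chunk_id: str, kv: bytes) -> None:
        """Stand-in for copying a chunk's KV cache from GPU to CXL memory."""
        self._cxl[chunk_id] = kv

    def load(self, chunk_id: str) -> Optional[bytes]:
        """Return the chunk's KV cache if present, otherwise None."""
        return self._cxl.get(chunk_id)


def cached_prefix_chunks(store: ToyKVStore, tokens: Sequence[int],
                         chunk_size: int) -> int:
    """Count the leading chunks whose KV cache is already stored."""
    hits = 0
    for cid in chunk_ids(tokens, chunk_size):
        if store.load(cid) is None:
            break
        hits += 1
    return hits


store = ToyKVStore()
first_request = [1, 2, 3, 4, 5, 6, 7, 8]
# Save once, after the request has finished.
for cid in chunk_ids(first_request, chunk_size=4):
    store.save(cid, b"kv-placeholder")

# A later request that shares only the first chunk as its prefix.
second_request = [1, 2, 3, 4, 9, 9, 9, 9]
hits = cached_prefix_chunks(store, second_request, chunk_size=4)
print(hits)
```

Because each identifier folds in the hash of everything before it, two chunks with identical content but different prefixes receive different identifiers, so a lookup can only hit when the entire prefix matches.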

We implement our design of *KVExpress* in gpt-fast [234], a simple and low-latency text generation system with support for a number of widely used inference optimizations and LLMs. We further modify gpt-fast to support our evaluation on batched inference.

**Hardware platform.** Our single-socket server is equipped with Intel Xeon Platinum processors [235], 1TB of 4800 MHz DDR5 memory, an NVIDIA H100 GPU with 96GB HBM, and a CXL memory expansion card with 256 GB of DDR5 memory at 4800 MHz. While prior works [236–238] have explored utilizing CXL for accelerators, to our knowledge, our work is the first implementation to successfully integrate a real ASIC-CXL device and a GPU within a single inference server.

#### 5.3 Performance Evaluation

In Section 5.3.1, we measure the latency and bandwidth of the CXL-GPU interconnect for data transfer to assess the feasibility of storing KV cache on CXL devices. In Section 5.3.2, we compare the TTFT of KV re-compute, prefix caching with CXL, and prefix caching with GPU to understand whether *KVExpress* can achieve a TTFT similar to that of existing approaches for prefill requests under varying context lengths. In Section 5.3.3, we compare the maximum batch size that KV re-compute and prefix caching with CXL can sustain while meeting a given SLO on TTFT.



Fig. 5.1: Experiment results.

#### 5.3.1 Measurements on CXL-GPU interconnect performance

KV cache storage requires low-latency access (e.g., from host memory to GPU memory). Although prior studies [232, 233] show that accessing CXL memory from the host CPU is over 2× slower than accessing local memory, none of their measurements involve the GPU. We therefore characterize the CXL-GPU interconnect by measuring the latency and bandwidth of copying data from CXL memory to the GPU; transfers in the reverse direction yield similar results. Since CXL memory devices are exposed to the system as CPU-less NUMA nodes by default [233], we allocate a set of host buffers on the CXL NUMA node and use cudaMemcpyAsync to copy data between these host buffers and GPU device buffers allocated via the CUDA API. We evaluate transfer sizes ranging from 1 KB to 256 MB.

Figure 5.1(a) shows our experiment results: the performance of the CXL-GPU interconnect is *unexpectedly* on par with traditional CPU-GPU memory transfers, exhibiting no significant slowdown. Latency remains low for smaller access sizes but increases exponentially once the size exceeds 64KB. Meanwhile, bandwidth increases almost linearly with data size and saturates around 4MB. This indicates that, while the CPU oversees the data transfer, the data path actually bypasses the host's local memory, flowing directly from CXL memory to GPU buffers via PCIe. Our results demonstrate that the CXL-GPU interconnect operates efficiently with minimal latency overhead, positioning it as a promising storage expansion in addition to CPU memory for KV cache offloading.

#### 5.3.2 Evaluation on TTFT under varying input context length

Given that the CXL-GPU interconnect performs nearly the same as the CPU-GPU interconnect, we further study whether CXL-based KV cache offloading can achieve a TTFT similar to that of existing approaches when completing the prefill stage of an inference request. We evaluate the following approaches:

- KV re-compute: compute KV data of all input tokens for the request with GPU.
- Prefix caching with CXL: load KV cache of the prefix tokens for the request from CXL to GPU.
- Prefix caching with GPU: store and use KV cache in GPU for the prefix tokens for the request.
- Prefix caching with SSD: load KV cache of the prefix tokens for the request from SSD to GPU.

We measure the TTFT of the aforementioned approaches on conversation requests with input lengths ranging from 256 to 2048 tokens from the ShareGPT-Vicuna-Unfiltered dataset [239]. We use LLaMA-2-7B as the underlying model for all our experiments. Figure 5.1(b) shows the TTFT (y-axis, log scale) achieved by the evaluated approaches for requests of varying input context length (x-axis).

Compared to the other approaches, prefix caching with GPU (denoted as "PC-GPU" in Figure 5.1(b)) consistently achieves the smallest TTFT (XXX ms to XXX ms) across input context lengths. This is expected, as there is no data transfer latency and KV data needs to be computed only for the tokens after the prefix. This approach is an ideal baseline that is, however, difficult to achieve in practice due to the limited memory capacity of existing GPUs. Prefix caching with SSD (denoted as "PC-SSD") yields the worst performance, with TTFT ranging from XXX ms to XXX ms, because loading the KV cache takes over a second due to the slow access speed of SSDs.

Comparing prefix caching with CXL (denoted as "PC-CXL") against KV re-compute, prefix caching with CXL performs at least as well as computing KV data on the GPU from scratch. Prefix caching with CXL achieves a TTFT ranging from XXX ms to XXX ms, with a slight increase in latency as the input length grows. The small performance gap between loading the prefix KV cache from CXL memory and full KV re-computation indicates an opportunity to reduce GPU compute cost by adopting CXL devices for memory capacity expansion in LLM inference.

#### 5.3.3 Evaluation on serving throughput while adhering to SLO

By storing the KV cache of the inference request prefix in CXL memory and thus reducing re-computation during the prefill stage, we can effectively reduce the computational load on the GPU. The saved GPU compute can be re-allocated to handle a larger number of concurrent inference requests. In other words, the LLM serving system can achieve a higher serving throughput, by handling a larger batch size of inference requests using the saved GPU compute, while maintaining the same SLO on TTFT [226].

Figure 5.1(c) shows the TTFT achieved by KV re-compute and prefix caching with CXL under varying batch sizes. The horizontal red dashed line indicates our SLO limit: the maximum TTFT that can be tolerated in production. We use 400 ms, a typical SLO for LLaMA-2 [240]. As shown in Figure 5.1(c), with KV re-compute, the evaluated serving system (§5.2) can handle a maximum batch size of 44 before hitting the SLO limit. In contrast, when leveraging CXL to store the KV cache, the system can handle a maximum batch size of 57, a roughly 30% increase over KV re-compute. Our initial evaluation of SLO-adhering serving throughput highlights the performance benefits of utilizing CXL memory for KV cache offloading, particularly in scenarios that require efficient scaling under strict latency requirements.
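As a quick check of the reported gain, using only the two maximum batch sizes stated above:

$$\frac{57 - 44}{44} = \frac{13}{44} \approx 0.295,$$

that is, an increase of about 29.5%, which rounds to the 30% quoted in the text.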

### 5.4 Cost-Efficiency Modeling

We develop a model to estimate the Return on Investment (ROI) of deploying *KVExpress* in production. Conceptually, each prefill request consists of two distinct parts: 1) Loading KV cache data for the prefix (i.e., the history context) from CXL memory; 2) Performing computation on the new prompt (i.e., the follow-up prompt in multi-round conversations).

By replacing computation with memory accesses, we reduce the overall computational load, thereby lowering the demand for FLOP/s while still meeting the same SLO. This results in significant cost savings for LLM inference (Figure 5.2):

- **Assumption:** Assume a GPU has a computational power of 100 TFLOP/s, an average prefill request requires 25 TFLOP of computation, and the SLO for prefill is 0.5 seconds.
- **Baseline:** To complete the prefill request within the SLO, each request demands 50 TFLOP/s (25 TFLOP / 0.5 s), meaning a single GPU can serve 2 prefill requests.
- **KVExpress:** By spending 0.1 s loading KV cache data of the history context, we reduce the computational demand to 2.5 TFLOP (assuming the new prompt accounts for 10%). To meet the same SLO, the remaining computation must be finished within 0.4 s, requiring 6.25 TFLOP/s (2.5 TFLOP / 0.4 s). In this case, a single GPU can serve 16 prefill requests, yielding an 8× improvement over the baseline.

Fig. 5.2: Example of ROI modeling: replace computation with memory access. (Baseline: 100% compute. History KV saved in CXL: load + compute.)

This allows for a reduction of 87.5% in the number of GPUs required for the same prefill SLO, resulting in substantial cost savings for LLM inference applications (more details in Appendix [?]).

Table 5.1: ROI Modeling

| Symbol     | Description |
|------------|-------------|
| $C_0$      | Avg. FLOPs needed by a prefill request in an initial request. Can be estimated as $C_0 = 2ML$, where $M$ is the number of model parameters and $L$ is the avg. sequence length. |
| $C_1$      | FLOPs needed by the new prompt in a follow-up request. Can be estimated as $C_1 = rC_0$, where $r$ is the avg. ratio of the new prompt (e.g., $10\%$). |
| $T_{slo}$  | SLO of prefill (e.g., 0.5s). |
| $T_{load}$ | Avg. time to load KV cache from memory (e.g., 0.1s). |
| $P$        | Computation power (FLOP/s) of the GPU. |
| $P_0$      | FLOP/s needed for the initial request. $P_0 = C_0/T_{slo}$ |
| $P_1$      | FLOP/s needed for the new prompt. $P_1 = C_1/(T_{slo} - T_{load})$ |
| $R_{gpu}$  | Requests per second (RPS) a single GPU can support. $R_{gpu} = P/(P_0(1-h) + P_1h)$, where $h$ is the ratio of multi-round requests. |
| $N_{cxl}$  | Number of GPUs needed using our CXL memory scheme. $N_{cxl} = \lceil R/R_{gpu} \rceil$ |
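The model in Table 5.1 can be evaluated with a few lines of code. The sketch below is only an illustration of the formulas: the struct and function names are hypothetical, and it assumes that $R$, which the table does not define, denotes the total request rate to be served. Plugging in the example values from the text (100 TFLOP/s, 25 TFLOP per prefill, $r = 10\%$, $T_{slo} = 0.5$ s, $T_{load} = 0.1$ s) gives 2 requests per GPU when no request is multi-round ($h = 0$) and 16 when all are ($h = 1$).

```cpp
#include <cassert>
#include <cmath>

// Inputs of the ROI model in Table 5.1 (field names are illustrative).
struct RoiParams {
    double P;       // GPU compute power (FLOP/s)
    double C0;      // avg. FLOPs of a prefill request in an initial request
    double r;       // avg. ratio of the new prompt in a follow-up request
    double T_slo;   // prefill SLO (seconds)
    double T_load;  // avg. time to load the KV cache (seconds)
    double h;       // ratio of multi-round requests, in [0, 1]
};

// R_gpu = P / (P0 * (1 - h) + P1 * h)
double requests_per_gpu(const RoiParams& p) {
    assert(p.T_slo > p.T_load);  // otherwise no time is left for computation
    const double C1 = p.r * p.C0;                   // FLOPs of the new prompt
    const double P0 = p.C0 / p.T_slo;               // FLOP/s, initial request
    const double P1 = C1 / (p.T_slo - p.T_load);    // FLOP/s, follow-up request
    return p.P / (P0 * (1.0 - p.h) + P1 * p.h);
}

// N_cxl = ceil(R / R_gpu); R is assumed to be the total request rate.
long gpus_needed(double R, const RoiParams& p) {
    return static_cast<long>(std::ceil(R / requests_per_gpu(p)));
}
```

Intermediate values of $h$ interpolate between the two extremes, which makes it easy to see how the saving depends on the share of multi-round conversations in the workload.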

#### 5.5 Conclusion

Storing KV caches in GPU memory for large language model (LLM) inference can quickly lead to memory saturation, limiting scalability and performance. To address this, we propose leveraging CXL memory, which offers expanded capacity with low-latency access, as a solution for offloading KV caches.

Our preliminary results show that data transfers between CXL memory and the GPU perform on par with traditional CPU-GPU memory transfers in terms of latency and bandwidth, while CXL memory also supports larger workloads. Specifically, using CXL memory for KV cache storage increased the maximum batch size by 30%, from 44 to 57, while still meeting the same SLO. This demonstrates the potential for CXL memory to significantly reduce the computational burden on GPUs by replacing KV re-computation with memory accesses.

Looking ahead, future work will explore the integration of CXL memory with multi-GPU systems, focusing on maintaining cache coherence across GPUs. This could further enhance the scalability and efficiency of LLM inference, unlocking new possibilities for large-scale AI workloads.

## Appendix A

## Appendix

### Supplementary Materials

## A. Multiplexing M + N Iterator Executions for Maximizing Pipeline Utilization

We claimed in §3.4.1 that if  $t_c = \eta \cdot t_d$  for all offloaded iterator executions, it is always possible to multiplex m+n concurrent iterator executions and fully utilize all memory and logic pipelines. We prove our claim by providing a staggered scheduling algorithm (Algorithm 1) that ensures such multiplexing across m+n iterator executions. The scheduler processes m+n iterator execution requests, assigning each a memory pipeline, a logic pipeline, and staggered start times. These requests are then executed in the respective memory pipelines. Through this staggered scheduling approach, PULSE fully utilizes the n memory pipelines and m logic pipelines, ensuring no resources are wasted. Note that this algorithm is a simplified version to illustrate the potential for full pipeline saturation under the given condition. PULSE's scheduler implements a real-time algorithm to multiplex incoming requests on the fly.

#### Algorithm 1 Staggered-Scheduling

- 1:  $m, n \leftarrow$  number of logic, memory pipelines
- 2:  $L_i, M_j \leftarrow i^{th}$  logic pipeline,  $j^{th}$  memory pipeline
- 3:  $t_d \leftarrow \text{data fetch time per pointer traversal iteration}$
- 4: while true do
- 5: Dequeue n + m requests from network stack
- 6: **for**  $i \leftarrow 1$  **to** m + n **do**
- 7: Assign request  $R_i$  to  $(M_{i \bmod n}, L_{i \bmod m})$
- 8: Schedule  $R_i$  to start at time  $(i-1) \cdot \frac{t_d}{n}$
- 9: Start requests as scheduled at memory pipelines
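One round of Algorithm 1 can be written out as a short sketch. This is an illustration of the assignment and staggering rule only, not PULSE's real-time scheduler; the `Slot` struct and the function name are hypothetical, and pipelines are indexed from 0 so that the `mod` results can be used directly as indices.

```cpp
#include <cassert>
#include <vector>

// Placement of one request: which pipelines it uses and when it starts.
struct Slot {
    int memory_pipeline;  // index of the assigned memory pipeline (i mod n)
    int logic_pipeline;   // index of the assigned logic pipeline (i mod m)
    double start_time;    // start offset, in the same unit as t_d
};

// Places m + n requests as in lines 6-8 of Algorithm 1.
// m: number of logic pipelines, n: number of memory pipelines,
// t_d: data fetch time per pointer traversal iteration.
std::vector<Slot> staggered_schedule(int m, int n, double t_d) {
    assert(m > 0 && n > 0);
    std::vector<Slot> slots;
    for (int i = 1; i <= m + n; ++i) {
        // Request R_i starts (i - 1) * t_d / n after the first request.
        slots.push_back({i % n, i % m, (i - 1) * t_d / n});
    }
    return slots;
}
```

For example, with one logic pipeline, four memory pipelines, and $t_d = 8$ time units, the five requests start at offsets 0, 2, 4, 6, and 8.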

#### A.1 PULSE Empirical Analysis

Prior studies have shown that real-world data-centric cloud applications spend a significant fraction of time traversing pointers, as summarized in Fig. A.1.

### B. Pulse Supported Data Structures

We adapt 13 data structures across 4 popular open-source libraries to PULSE's iterator abstraction (§3.4). In particular, we outline how the data structure implementations for certain operations can be expressed using init(), next(), and end(). For simplicity and readability, (i) we assume that the data structure developer defines a macro, SP\_PTR\_(variable\_name), that expands to the address at which the variable resides on the scratch\_pad, and (ii) we omit obvious type conversions for de-referenced pointers.

We analyze two widely used categories of data structures: lists and trees. In our analysis, we find that the top-level data structure APIs (i.e., the APIs used by applications) use the same base function under the hood. For instance, list and forward list in the STL library share the same internal function, std::find(). We summarize our findings in Table A.1, including the data structure libraries, their category, the top-level data structure APIs, and the internal base function.

| Application      | % of time spent in pointer traversal |
|------------------|--------------------------------------|
| GraphChi [84]    | $\sim 93\%$                          |
| MonetDB [129]    | 70% - 97%                            |
| GC in Spark [61] | $\sim 72\%$                          |
| VoltDB [130]     | Up to 49.55%                         |
| MemC3 [131]      | Up to 21.15%                         |
| DBx1000 [132]    | $\sim 9\%$                           |
| Memcached [125]  | $\sim 7\%$                           |

(a) Survey from prior studies

Fig. A.1: Time cloud applications spend in pointer traversals based on prior studies

**List structures.** Our surveyed list structures already follow the execution flow of the PULSE iterator: init(), next(), and end().

These data structures generally have compute-intensive end() functions that check multiple termination conditions, while their next() function simply dereferences a single pointer to the next node. Listing A.1 and Listing A.2 demonstrate a linked list with two termination conditions: (i) the value is found or (ii) the search reaches the end. To indicate which condition is met, a special flag (e.g., KEY\_NOT\_FOUND) is written on the scratch\_pad. Listing A.3 and Listing A.4 describe a bimap that uses a hashtable internally, where colliding entries are stored in linked lists within the same bucket. As such, its PULSE iterator interface closely resembles that of std::list.
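The init()/next()/end() flow for a list search can be emulated on the host as follows. This is a minimal sketch and not PULSE code: the `ScratchPad` layout, the `ListFind` class, and the `run_find` driver are hypothetical stand-ins, with a plain struct playing the role of the scratch\_pad and a boolean playing the role of the KEY\_NOT\_FOUND flag.

```cpp
#include <cassert>
#include <cstddef>

struct Node {
    int value;
    Node* next;
};

// Host-side stand-in for the accelerator scratchpad (layout is hypothetical).
struct ScratchPad {
    int value;           // search value written by init()
    bool key_not_found;  // set when the search reaches the end of the list
};

class ListFind {
public:
    void init(int value, Node* first) {
        sp_.value = value;
        sp_.key_not_found = false;
        cur_ = first;
    }
    // next(): a single pointer dereference to the following node.
    void next() { cur_ = cur_->next; }
    // end(): checks both termination conditions.
    bool end() {
        if (cur_ == nullptr) {            // (ii) search reached the end
            sp_.key_not_found = true;
            return true;
        }
        return cur_->value == sp_.value;  // (i) value found
    }
    Node* current() const { return cur_; }
    bool key_not_found() const { return sp_.key_not_found; }

private:
    ScratchPad sp_{};
    Node* cur_ = nullptr;
};

// Driver loop, emulating what would run per offloaded request.
Node* run_find(ListFind& it, int value, Node* first) {
    it.init(value, first);
    while (!it.end()) it.next();
    return it.current();
}
```

The split mirrors the observation above: all comparisons live in end(), while next() stays a single dereference.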

**Tree-like data structures.** Compared to list structures, tree data structures require more computation in the next() function, as the next pointer is determined based on the value in the child node. For instance, in Btree (Listing A.5, A.6), the next() function iterates through internal node keys, comparing them to the search key. Interestingly, std::map (Listing A.7, A.8) and Boost AVL trees (Listing A.9, A.10) share the same offload function structure, with only minor implementation and naming differences.

Table A.1: Additional data structures supported by PULSE.

| Data structure                       | Category | Library | Data structure API           | Internal function                        | Original code | PULSE code   |
|--------------------------------------|----------|---------|------------------------------|------------------------------------------|---------------|--------------|
| List, Forward list                   | List     | STL     | std::find(start, end, value) | std::find(start, end, value)             | Listing A.1   | Listing A.2  |
| Bimap, Unordered map, Unordered set  | List     | Boost   | find(key, hash)              | find(key, hash)                          | Listing A.3   | Listing A.4  |
| Btree                                | Tree     | Google  | find(&key)                   | internal_locate_plain_compare(key, iter) | Listing A.5   | Listing A.6  |
| Map, Set, Multimap, Multiset         | Tree     | STL     | find(&key)                   | _M_lower_bound(x, y, key)                | Listing A.7   | Listing A.8  |
| AVL tree, Splay tree, Scapegoat tree | Tree     | Boost   | find(&key)                   | lower_bound_loop(x, y, key)              | Listing A.9   | Listing A.10 |

# B.1 List data structure in STL library

**Listing A.1:** C++ STL realization for std::find()

```
struct node {
    value_type value;
    struct node* next;
};

node* find(node* first, node* last, const value_type& value)
{
    // Walk the list until the value is found or the end is reached.
    for (; first != last; first = first->next)
        if (first->value == value)
            return first;
    return last;
}
```

**Listing A.2:** PULSE realization for std::find()

```
class list_find : chase_iterator {
    init(void *value, void* first) {
        *SP_PTR_VALUE = value;
        cur_ptr = first;
```

# B.2 List data structure in Boost library

**Listing A.3:** Boost realization for bimap::find()

```
struct node {
    key_type key;
    struct node* next;
    value_type value;
};

void* find(const key_type& key, const hash_type& hash) const
{
    // The bucket start pointer can be pre-computed before offloading
    std::size_t buc = buckets.position(hash(key));
    node_ptr start = buckets.at(buc);
    // Scan the collision chain of the bucket.
    for (node_ptr x = start; x != NULL; x = x->next) {
        if (key == x->key) {
            return x;
        }
    }
    return NULL;
}
```

**Listing A.4:** PULSE realization for bimap::find()

```
class bimap_find : chase_iterator {
public:
    key_type key;
    init(void *key, void* start) {
        *SP_PTR_KEY = key;
        cur_ptr = start;
    }
    void* next() {
        return cur_ptr->next;
    }
```

## B.3 Tree data structure in Google library

**Listing A.5:** Google realization for btree::internal_locate_plain_compare()

```
#define kNodeValues 8
struct btree_node {
    bool is_leaf;
    int num_keys;
    key_type keys[kNodeValues];
    btree_node* child[kNodeValues + 1];
};

IterType btree::internal_locate_plain_compare(const key_type &key,
                                              IterType iter) const
{
    for (;;) {
        // Find the first key in the node that is not smaller than the search key.
        int i;
        for (i = 0; i < iter.node->num_keys; i++) {
            if (key <= iter.node->keys[i]) {
                break;
            }
        }
        if (iter.node->is_leaf) {
            break;
        }
        // Descend into the child that covers the search key.
        iter.node = iter.node->child[i];
    }
    return iter;
}
```

**Listing A.6:** PULSE realization for btree::internal_locate_plain_compare()

# B.4 Tree data structure in STL library

**Listing A.7:** C++ STL realization for map::find()

```
struct node {
    key_type key;
    node* left;
    node* right;
};

node* _M_lower_bound(node* x, node* y, const key_type& key)
{
    while (x != 0) {
        if (!(x->key < key)) {
            // x->key >= key: x is a candidate, continue in the left subtree
            y = x;
            x = x->left;
        } else {
            x = x->right;
        }
    }
    return y;
}
```

**Listing A.8:** PULSE realization for map::find()

```
class map_find : chase_iterator {
    init(void *key, void* x, void* y) {
        *SP_PTR_KEY = key;
        *SP_PTR_Y = y;
        cur_ptr = x;
    }
    void* next() {
        if (!(cur_ptr->key < *SP_PTR_KEY)) {
            *SP_PTR_Y = cur_ptr;
            cur_ptr = cur_ptr->left;
        } else {
            cur_ptr = cur_ptr->right;
        }
```

# B.5 Tree data structure in Boost library

**Listing A.9:** Boost realization for avltree::find()

```
static node_ptr lower_bound_loop(node_ptr x, node_ptr y, const KeyType &key)
{
    while (x) {
        if (x->key < key) {
            x = x->right;
        }
        else {
            // x->key >= key: x is a candidate, continue in the left subtree
            y = x;
            x = x->left;
        }
    }
    return y;
}
```

**Listing A.10:** PULSE realization for avltree::find()

```
class avltree_find : chase_iterator {
public:
    key_type key;
    void* y;
    init(void *key, void* x, void* y) {
        *SP_PTR_KEY = key;
        *SP_PTR_Y = y;
        cur_ptr = x;
    }

    void* next() {
        if (cur_ptr->key < *SP_PTR_KEY) {
            cur_ptr = cur_ptr->right;
        }
        else {
            *SP_PTR_Y = cur_ptr;
            cur_ptr = cur_ptr->left;
```



Fig. A.2: Network and memory bandwidth utilization. PULSE and RPC utilize over 90% of the available memory bandwidth, while the cache-based approach suffers from swap system overhead. In Webservice, the network bandwidth becomes the bottleneck due to large 8 KB data transfers.

#### C. Pulse Additional Evaluation Results

In this section, we provide additional evaluation results for PULSE.

#### C.1 Traditional Core Architecture vs. Pulse

We evaluate the impact of the PULSE architectural design (§3.4.1) by comparing PULSE against PULSE-CORE, an in-order processor built on PULSE's components. We denote by $C_x$ a tightly coupled core architecture with $x$ cores, and by $P_{x\_y}$ the PULSE architecture with $x$ logic pipelines and $y$ memory pipelines. Table A.2 shows the power, performance, and area usage of various configurations. The performance metrics are obtained by executing the WebService application with each configuration. In PULSE's disaggregated architecture, when the number of logic and memory pipelines is equal to that of a traditional core architecture, power and area usage are higher due to additional logic and buffering in the interconnect and scheduler. However, due to the nature of pointer traversal operations (§3.4.1), PULSE requires fewer logic pipelines to achieve similar performance. For example, to fully saturate the memory bandwidth of a single node, PULSE uses only one logic pipeline and four memory pipelines, while a traditional core architecture requires four cores. As a result, PULSE saves 20.12% in power with only a 7.2% latency overhead, primarily due to the additional scheduler and data movement between workspaces.

| Config | Pwr (W) | LUT % | BRAM % | Tpt (Mops/s) | Lat (us) |
|--------|---------|-------|--------|--------------|----------|
| C_1    | 67.76   | 14.73 | 14.57  | 0.41         | 33.25    |
| C_2    | 75.47   | 20.46 | 18.73  | 0.63         | 33.73    |
| C_3    | 84.57   | 28.66 | 31.83  | 0.87         | 34.66    |
| C_4    | 89.77   | 37.10 | 34.17  | 1.20         | 35.11    |
| P_1_1  | 56.74   | 11.76 | 16.34  | 0.51         | 37.57    |
| P_1_2  | 59.47   | 14.87 | 18.38  | 0.73         | 36.74    |
| P_1_3  | 64.78   | 16.64 | 22.37  | 1.01         | 38.46    |
| P_1_4  | 72.47   | 18.37 | 25.84  | 1.24         | 38.37    |
| P_2_1  | 67.37   | 17.73 | 20.37  | 0.48         | 40.27    |
| P_2_2  | 77.37   | 21.38 | 22.38  | 0.76         | 39.47    |
| P_2_3  | 81.21   | 26.22 | 26.76  | 0.99         | 41.37    |
| P_3_3  | 86.15   | 37.21 | 30.12  | 1.03         | 40.98    |
| P_2_4  | 83.21   | 30.13 | 31.21  | 1.19         | 40.37    |
| P_4_4  | 95.64   | 46.42 | 39.84  | 1.21         | 41.47    |

Table A.2: Comparison between traditional core architecture and Pulse architecture.



Fig. A.3: (a) PULSE latency is up to  $1.3 \times$  lower for skewed than uniform access patterns due to caching. (b) Offloaded allocations in PULSE improve the WebService request latencies as the proportion of writes increases by reducing the number of round trips per allocation.



Fig. A.4: Sensitivity to traversal length and the number of memory pipelines. (a) PULSE latency scales linearly with the length of traversal. (b) PULSE accelerator can saturate memory bandwidth with just two PULSE memory pipelines.

#### C.2 Network and Memory Bandwidth Utilization

We evaluate the network and memory bandwidth utilization of the three applications in Fig. A.2. For WiredTiger, PULSE and RPC utilize over 90% of the available memory bandwidth, while the Cache-based approach suffers from low network bandwidth and memory utilization due to swap system overhead. For WebService, the large 8 KB data transfers saturate the maximum bandwidth that the DPDK stack can sustain [156]. As a result, network bandwidth becomes the bottleneck, reducing PULSE and RPC memory bandwidth utilization under 3 and 4 memory nodes. The memory bandwidth is normalized, where 1.0 corresponds to 25 GB/s per node.

#### C.3 PULSE Sensitivity Analysis

We evaluate PULSE's sensitivity to workload characteristics and system parameters: access pattern, data structure modifications, traversal length, allocation policy, and the number of PULSE memory pipelines.

**Impact of access pattern.** While our evaluation so far has been confined to Zipfian workloads, we now evaluate the impact of access-pattern skew on PULSE performance for all three applications by comparing skewed (Zipfian) and uniform workloads. Our setup comprises a single 32 GB memory node with a 2 GB CPU node cache. Figure A.3(a) shows that caching at the CPU node reduces the number of iterator requests offloaded to the PULSE accelerator for the skewed workload, improving PULSE performance for such workloads by up to 1.33× relative to uniform ones.



Fig. A.5: Allocation policy. PULSE performs better with the partitioned allocation since it minimizes cross-node traversals.



Fig. A.6: Application performance using workload with uniform distribution.

**Impact of data structure modifications.** Operations that modify data structures can require new memory allocations during traversal. Instead of returning control to the CPU node for allocations, PULSE populates the scratchpad for every request with a fixed number of pre-allocated memory regions. When a new allocation is initiated at the PULSE accelerator, it uses a pre-allocated memory region on the scratchpad. If all such regions (16 in our implementation) are used up in a single request, the traversal is interrupted and control returns to the CPU node. PULSE periodically replenishes pre-allocated entries, ensuring that allocation-triggered traversal interruptions are rare.
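The pre-allocation scheme can be sketched as a small fixed-capacity pool. This is an illustrative host-side model rather than PULSE's implementation: the class and method names are hypothetical, and only the region count of 16 comes from the text. A `nullptr` result models the case in which the traversal must be interrupted and handed back to the CPU node.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kRegions = 16;  // pre-allocated regions per request

class PreallocPool {
public:
    // CPU node side: hand over a fresh set of pre-allocated regions.
    void replenish(const std::array<void*, kRegions>& regions) {
        regions_ = regions;
        used_ = 0;
    }
    // Accelerator side: take the next region, or return nullptr when all
    // regions are used up and the traversal has to return to the CPU node.
    void* allocate() {
        if (used_ == kRegions) return nullptr;
        return regions_[used_++];
    }
    std::size_t remaining() const { return kRegions - used_; }

private:
    std::array<void*, kRegions> regions_{};
    std::size_t used_ = kRegions;  // empty until first replenished
};
```

Because allocation is a counter increment on memory already present in the scratchpad, it adds no network round trips in the common case.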

We evaluate the impact of data structure modifications in PULSE (§??) by increasing the proportion of writes for the WebService application on a single memory node. Figure A.3(b) shows that as the proportion of writes increases, PULSE without offloaded allocations experiences higher latencies (up to  $1.4\times$ ) since each new node allocation requires two additional round trips; offloaded allocations reduce the allocation overhead to < 1.1%.

**Length of traversal.** For simplicity, we evaluate traversal queries on a single linked list with varying numbers of nodes traversed per query. As expected, Fig. A.4(a) shows that the end-to-end execution latency for a linked list traversal scales linearly with the number of nodes traversed.

**Allocation policy.** We find that the allocation policy used for a data structure has a significant impact on application performance, specifically for distributed traversals (Figs. A.5(a) and A.5(b)). We evaluated the WiredTiger and BTrDB workloads (which employ a B+-Tree as their underlying data structure) with two allocation policies: one that partitions allocations so that all nodes in one half of the tree are placed on one memory node and the other half on another, and one that allocates memory uniformly across the two nodes (as in the glibc allocator). The average latency for uniformly distributed allocations is $3.7$–$10.8\times$ higher than for partitioned allocation since it incurs significantly more cross-node traversals. This shows that while uniformly distributed allocations can enable better system-wide resource utilization, it may be preferable to exploit application-specific partitioned allocations for workloads where performance is the primary concern.

**Number of PULSE memory pipelines.** We evaluate the number of PULSE memory access pipelines required to saturate PULSE's memory bandwidth on a single memory node. We use the same linked list as in our traversal-length experiment due to its relatively low $\eta$ value ($\sim$0.06), which allows us to stress the memory access pipeline without saturating the logic pipeline. Fig. A.4(b) shows that just 2 memory pipelines can saturate PULSE's per-node memory bandwidth of 25 GB/s. We note that our 25 GB/s limit does not match the hardware-specified memory channel bandwidths; this is primarily due to our use of the vendor-supplied memory interconnect IP, required to connect all memory pipelines to all memory channels. Indeed, if we remove the IP and measure memory bandwidth when each memory pipeline is connected to a dedicated memory channel, PULSE can achieve a memory bandwidth of up to 34 GB/s (shown as PULSE w/o Interconnect in Fig. A.4(b)).

**PULSE performance with uniform workload.** As illustrated in Fig. A.6, all approaches follow a trend similar to that under the Zipfian distribution but experience higher latency due to the ineffectiveness of caching. PULSE provides lower (vs. Cache-based, RPC-ARM, and Cache+RPC) or comparable (vs. RPC) latency for a single memory node and 2.2–29% lower latency (vs. RPC) for multiple memory nodes.

## **Bibliography**

- [1] S.-s. Lee, Y. Yu, Y. Tang, A. Khandelwal, L. Zhong, and A. Bhattacharjee. MIND: In-Network Memory Management for Disaggregated Data Centers. In *SOSP*, 2021.
- [2] Y. Shan, Y. Huang, Y. Chen, and Y. Zhang. LegoOS: A Disseminated, Distributed OS for Hardware Resource Disaggregation. In OSDI, 2018.
- [3] P. X. Gao, A. Narayan, S. Karandikar, J. Carreira, S. Han, R. Agarwal, S. Ratnasamy, and S. Shenker. Network Requirements for Resource Disaggregation. In *OSDI*, 2016.
- [4] K. Asanović. FireBox: A Hardware Building Block for 2020 Warehouse-Scale Computers. 2014.
- [5] S. Novakovic, A. Daglis, E. Bugnion, B. Falsafi, and B. Grot. Scale-out NUMA. In ASPLOS, 2014.
- [6] L. Liu, W. Cao, S. Sahin, Q. Zhang, J. Bae, and Y. Wu. Memory Disaggregation: Research Problems and Opportunities. In ICDCS, 2019.
- [7] K. Lim, J. Chang, T. Mudge, P. Ranganathan, S. K. Reinhardt, and T. F. Wenisch. Disaggregated Memory for Expansion and Sharing in Blade Servers. In *ISCA*, 2009.
- [8] K. Lim, Y. Turner, J. R. Santos, A. AuYoung, J. Chang, P. Ranganathan, and T. F. Wenisch. System-level Implications of Disaggregated Memory. In HPCA, 2012.
- [9] A. Samih, R. Wang, C. Maciocco, M. Kharbutli, and Y. Solihin. *Collaborative Memories in Clusters: Opportunities and Challenges*. 2014.
- [10] Compute Express Link (CXL). https://www.computeexpresslink.org/.

- [11] Y. Tang, P. Zhou, W. Zhang, H. Hu, Q. Yang, H. Xiang, T. Liu, J. Shan, R. Huang, C. Zhao, C. Chen, H. Zhang, F. Liu, S. Zhang, X. Ding, and J. Chen. Exploring performance and cost optimization with asic-based cxl memory. In *Proceedings of the Nineteenth European Conference on Computer Systems*, EuroSys '24, page 818–833, New York, NY, USA, 2024. Association for Computing Machinery.
- [12] P. S. Rao and G. Porter. Is memory disaggregation feasible? a case study with spark sql. In *Proceedings of the 2016 Symposium on Architectures for Networking and Communications Systems*, ANCS '16, page 75–80, New York, NY, USA, 2016. Association for Computing Machinery.
- [13] A. Verma, L. Pedrosa, M. Korupolu, D. Oppenheimer, E. Tune, and J. Wilkes. Large-scale cluster management at google with borg. In *Proceedings of the Tenth European Conference on Computer Systems*, EuroSys '15, New York, NY, USA, 2015. Association for Computing Machinery.
- [14] C. Reiss, A. Tumanov, G. R. Ganger, R. H. Katz, and M. A. Kozuch. Heterogeneity and dynamicity of clouds at scale: Google trace analysis. In *Proceedings of the Third ACM Symposium on Cloud Computing*, SoCC '12, New York, NY, USA, 2012. Association for Computing Machinery.
- [15] H. S. Gunawi, M. Hao, R. O. Suminto, A. Laksono, A. D. Satria, J. Adityatama, and K. J. Eliazar. Why does the cloud stop computing? lessons from hundreds of service outages. In *Proceedings of the Seventh ACM Symposium on Cloud Computing*, SoCC '16, page 1–16, New York, NY, USA, 2016. Association for Computing Machinery.
- [16] T. Benson, A. Akella, and D. A. Maltz. Network traffic characteristics of data centers in the wild. In *Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement*, IMC '10, page 267–280, New York, NY, USA, 2010. Association for Computing Machinery.
- [17] M. R. Hines, A. Gordon, M. Silva, D. Da Silva, K. Ryu, and M. Ben-Yehuda. Applications know best: Performance-driven memory overcommit with ginkgo. In 2011 IEEE Third International Conference on Cloud Computing Technology and Science, pages 130–137, 2011.
- [18] D. Gupta, S. Lee, M. Vrable, S. Savage, A. C. Snoeren, G. Varghese, G. M. Voelker, and A. Vahdat. Difference engine: Harnessing memory redundancy in virtual machines. In 8th USENIX Symposium on Operating Systems Design and Implementation (OSDI 08), San Diego, CA, December 2008. USENIX Association.
- [19] P. Bodík, I. Menache, M. Chowdhury, P. Mani, D. A. Maltz, and I. Stoica. Surviving failures in bandwidth-constrained datacenters. SIGCOMM Comput. Commun. Rev., 42(4):431-442, aug 2012.
- [20] A. Ghodsi, M. Zaharia, B. Hindman, A. Konwinski, S. Shenker, and I. Stoica. Dominant resource fairness: Fair allocation of multiple resource types. In 8th USENIX Symposium on Networked Systems Design and Implementation (NSDI 11), Boston, MA, March 2011. USENIX Association.
- [21] R. Grandl, G. Ananthanarayanan, S. Kandula, S. Rao, and A. Akella. Multi-resource packing for cluster schedulers. SIGCOMM Comput. Commun. Rev., 44(4):455–466, aug 2014.
- [22] E. Amaro, S. Wang, A. Panda, and M. K. Aguilera. Logical memory pools: Flexible and local disaggregated memory. In Proceedings of the 22nd ACM Workshop on Hot Topics in Networks, pages 25–32, 2023.
- [23] C. Hu, H. Huang, J. Hu, J. Xu, X. Chen, T. Xie, C. Wang, S. Wang, Y. Bao, N. Sun, et al. Memserve: Context caching for disaggregated llm serving with elastic memory pool. arXiv preprint arXiv:2406.17565, 2024.
- [24] Redis. https://redis.io/.
- [25] J. Ousterhout, P. Agrawal, D. Erickson, C. Kozyrakis, J. Leverich, D. Mazières, S. Mitra, A. Narayanan, G. Parulkar, M. Rosenblum, et al. The Case for RAMClouds: Scalable High-performance Storage Entirely in DRAM. SIGOPS OSR, 2010.

- [26] X. Zhang, U. Khanal, X. Zhao, and S. Ficklin. Making sense of performance in in-memory computing frameworks for scientific data analysis: A case study of the spark system. J. Parallel Distrib. Comput., 120(C):369–382, oct 2018.
- [27] A. Dragojević, D. Narayanan, O. Hodson, and M. Castro. FaRM: Fast Remote Memory. In NSDI, 2014.
- [28] Z. Ruan, M. Schwarzkopf, M. K. Aguilera, and A. Belay. AIFM: High-Performance, Application-Integrated far memory. In OSDI, 2020.
- [29] Q. Wang, Y. Lu, and J. Shu. Sherman: A write-optimized distributed b+tree index on disaggregated memory. In SIGMOD, 2022.
- [30] P. Zuo, J. Sun, L. Yang, S. Zhang, and Y. Hua. One-sided RDMA-Conscious extendible hashing for disaggregated memory. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 15–29, 2021.
- [31] E. Amaro, C. Branner-Augmon, Z. Luo, A. Ousterhout, M. K. Aguilera, A. Panda, S. Ratnasamy, and S. Shenker. Can Far Memory Improve Job Throughput? In EuroSys, 2020.
- [32] J. Gu, Y. Lee, Y. Zhang, M. Chowdhury, and K. G. Shin. Efficient Memory Disaggregation with Infiniswap. In *NSDI*, 2017.
- [33] I. Calciu, M. T. Imran, I. Puddu, S. Kashyap, H. A. Maruf, O. Mutlu, and A. Kolli. Rethinking software runtimes for disaggregated memory. In *Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems*, pages 79–92, 2021.
- [34] C. Wang, H. Ma, S. Liu, Y. Li, Z. Ruan, K. Nguyen, M. D. Bond, R. Netravali, M. Kim, and G. H. Xu. Semeru: A Memory-Disaggregated managed runtime. In 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20), pages 261–280, 2020.
- [35] A. Khandelwal, Y. Tang, R. Agarwal, A. Akella, and I. Stoica. Jiffy: Elastic far-memory for stateful serverless analytics. In *Proceedings of the Seventeenth European Conference on Computer Systems*, EuroSys '22, page 697–713, New York, NY, USA, 2022. Association for Computing Machinery.
- [36] CHASE: Accelerating Distributed Pointer-Traversals on Disaggregated Memory. https://arxiv.org/pdf/2305.02388.pdf, 2023.
- [37] H. Li, D. S. Berger, S. Novakovic, L. R. Hsu, D. Ernst, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, I. Agarwal, M. D. Hill, M. Fontoura, and R. Bianchini. Pond: Cxl-based memory pooling systems for cloud platforms. Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, 2022.
- [38] Y. Tang, P. Zhou, W. Zhang, H. Hu, Q. Yang, H. Xiang, T. Liu, J. Shan, R. Huang, C. Zhao, C. Chen, H. Zhang, F. Liu, S. Zhang, X. Ding, and J. Chen. Exploring performance and cost optimization with asic-based cxl memory. In *Proceedings of the Nineteenth European Conference on Computer Systems*, EuroSys '24, page 818–833, New York, NY, USA, 2024. Association for Computing Machinery.
- [39] A. Klimovic, Y. Wang, P. Stuedi, A. Trivedi, J. Pfefferle, and C. Kozyrakis. Pocket: Elastic ephemeral storage for serverless analytics. In USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 427–444, 2018.
- [40] M. Perron, R. Castro Fernandez, D. DeWitt, and S. Madden. Starling: A scalable query engine on cloud functions. In *Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data*, SIGMOD '20, page 131–141, New York, NY, USA, 2020. Association for Computing Machinery.
- [41] Q. Pu, S. Venkataraman, and I. Stoica. Shuffling, fast and slow: Scalable analytics on serverless infrastructure. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 193–206, Boston, MA, February 2019. USENIX Association.
- [42] J. Carreira, P. Fonseca, A. Tumanov, A. Zhang, and R. Katz. Cirrus: A serverless framework for end-to-end ml workflows. In *Proceedings of the ACM Symposium on Cloud Computing*, SoCC '19, page 13–24, New York, NY, USA, 2019. Association for Computing Machinery.
- [43] M. Vuppalapati, J. Miron, R. Agarwal, D. Truong, A. Motivala, and T. Cruanes. Building an elastic query engine on disaggregated storage. In 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20), pages 449–462, Santa Clara, CA, February 2020. USENIX Association.
- [44] K. Mahajan, M. Chowdhury, A. Akella, and S. Chawla. Dynamic query Re-Planning using QOOP. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 253–267, Carlsbad, CA, October 2018. USENIX Association.
- [45] Intel Rack Scale Design: Just what is it? https://www.datacenterdynamics.com/en/opinions/intel-rack-scale-design-just-what-is-it/.
- [46] Rack-scale Computing. https://www.microsoft.com/en-us/research/project/rack-scale-computing/.
- [47] Terabit Ethernet: The New Hot Trend in Data Centers. https://www.lanner-america.com/blog/terabit-ethernet-new-hot-trend-data-centers/, 2019.
- [48] Intel. Barefoot Networks Unveils Tofino 2, the Next Generation of the World's First Fully P4-Programmable Network Switch ASICs, 2018. https://bit.ly/3gmZkBG.
- [49] EX9200 Programmable Network Switch Juniper Networks. https://www.juniper.net/us/en/products-services/switching/ex-series/ex9200/.
- [50] Disaggregation and Programmable Forwarding Planes. https://www.barefootnetworks.com/blog/disaggregation-and-programmable-forwarding-planes/.
- [51] Intel Ethernet Switch FM6000 Series. https://www.intel.com/content/dam/www/public/us/en/documents/product-briefs/ethernet-switch-fm6000-series-brief.pdf.

- [52] A. Sivaraman, S. Subramanian, M. Alizadeh, S. Chole, S.-T. Chuang, A. Agrawal, H. Balakrishnan, T. Edsall, S. Katti, and N. McKeown. Programmable Packet Scheduling at Line Rate. In SIGCOMM, 2016.
- [53] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. In SIGCOMM, 2013.
- [54] MSI Protocol. https://en.wikipedia.org/wiki/MSI\_protocol.
- [55] L. Abraham, J. Allen, O. Barykin, V. Borkar, B. Chopra, C. Gerea, D. Merl, J. Metzler, D. Reiss, S. Subramanian, J. L. Wiener, and O. Zed. Scuba: Diving into data at facebook. *PVLDB*, 6(11), 2013.
- [56] B. Berg, D. S. Berger, S. McAllister, I. Grosof, S. Gunasekar, J. Lu, M. Uhlar, J. Carrig, N. Beckmann, M. Harchol-Balter, and G. R. Ganger. The CacheLib caching engine: Design and experiences at scale. In OSDI, 2020.
- [57] N. Bronson, Z. Amsden, G. Cabrera, P. Chakka, P. Dimov, H. Ding, J. Ferris, A. Giardullo, S. Kulkarni, H. Li, M. Marchukov, D. Petrov, L. Puzar, Y. J. Song, and V. Venkataramani. TAO: Facebook's distributed data store for the social graph. In ATC, 2013.
- [58] R. Nishtala, H. Fugal, S. Grimm, M. Kwiatkowski, H. Lee, H. C. Li, R. McElroy, M. Paleczny, D. Peek, P. Saab, et al. Scaling memcache at facebook. In NSDI, 2013.
- [59] X. Shi, S. Pruett, K. Doherty, J. Han, D. Petrov, J. Carrig, J. Hugg, and N. Bronson. FlightTracker: Consistency across Read-Optimized online stores at facebook. In OSDI, 2020.
- [60] J. Yang, Y. Yue, and K. V. Rashmi. A large scale analysis of hundreds of in-memory cache clusters at twitter. In OSDI, 2020.
- [61] Apache Spark. Unified engine for large-scale data analytics. https://spark.apache.org/.

- [62] U. Kang, H.-S. Yu, C. Park, H. Zheng, J. Halbert, K. Bains, S. Jang, and J. S. Choi. Co-architecting controllers and dram to enhance dram process scaling. In *The memory forum*, volume 14, 2014.
- [63] S.-H. Lee. Technology scaling challenges and opportunities of memory devices. In International Electron Devices Meeting (IEDM), 2016.
- [64] S. Shiratake. Scaling and performance challenges of future dram. In *International Memory Workshop (IMW)*, 2020.
- [65] C. Reiss. Understanding Memory Configurations for In-Memory Analytics. PhD thesis, EECS Department, University of California, Berkeley, 2016.
- [66] M. K. Aguilera, N. Amit, I. Calciu, X. Deguillard, J. Gandhi, P. Subrahmanyam, L. Suresh, K. Tati, R. Venkatasubramanian, and M. Wei. Remote Memory in the Age of Fast Networks. In SoCC, 2017.
- [67] D. Jevdjic, S. Volos, and B. Falsafi. Die-stacked dram caches for servers: Hit ratio, latency, or bandwidth? have it all with footprint cache. ACM SIGARCH Computer Architecture News, 41(3):404-415, 2013.
- [68] D. Jevdjic, G. H. Loh, C. Kaynak, and B. Falsafi. Unison cache: A scalable and effective die-stacked dram cache. In 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, pages 25–37. IEEE, 2014.
- [69] V. Young, C. Chou, A. Jaleel, and M. Qureshi. Accord: Enabling associativity for gigascale dram caches by coordinating way-install and way-prediction. In 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), pages 328–339. IEEE, 2018.
- [70] RoCE vs. iWARP Competitive Analysis. https://www.mellanox.com/related-docs/whitepapers/WP\_RoCE\_vs\_iWARP.pdf, 2017.
- [71] MySQL: Adaptive Hash Index. https://dev.mysql.com/doc/refman/8.0/en/innodb-adaptive-hash.html.

- [72] SQLServer: Hash Indexes. https://docs.microsoft.com/en-us/sql/database-engine/hash-indexes?view=sql-server-2014.
- [73] Teradata: Hash Indexes. https://docs.teradata.com/reader/RtERtp\_2wVEQWNxcM3k88w/HmFinSvPP6cTIT6o9F8ZAg.
- [74] R. Agarwal, A. Khandelwal, and I. Stoica. Succinct: Enabling Queries on Compressed Data. In USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2015.
- [75] N. Askitis and R. Sinha. HAT-trie: A Cache-conscious Trie-based Data Structure for Strings. In ACSC, 2007.
- [76] R. Bayer and E. McCreight. Organization and Maintenance of Large Ordered Indices. In ACM-SIGMOD Workshop on Data Description, Access and Control, 1970.
- [77] A. Braginsky and E. Petrank. A lock-free b+tree. In SPAA, 2012.
- [78] S. Heinz, J. Zobel, and H. E. Williams. Burst tries: a fast, efficient data structure for string keys. *TOIS*, 2002.
- [79] A. Khandelwal, R. Agarwal, and I. Stoica. Blowfish: Dynamic storage-performance tradeoff in data stores. In NSDI, 2016.
- [80] D. R. Morrison. PATRICIA Practical Algorithm To Retrieve Information Coded in Alphanumeric. JACM, 1968.
- [81] H. Zhang, H. Lim, V. Leis, D. G. Andersen, M. Kaminsky, K. Keeton, and A. Pavlo. Surf: Practical range query filtering with fast succinct tries. In SIGMOD, 2018.
- [82] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In OSDI, 2012.
- [83] J. E. Gonzalez, R. S. Xin, A. Dave, D. Crankshaw, M. J. Franklin, and I. Stoica. GraphX: Graph Processing in a Distributed Dataflow Framework. In OSDI, 2014.
- [84] A. Kyrola, G. E. Blelloch, and C. Guestrin. GraphChi: Large-Scale Graph Computation on Just a PC. In *OSDI*, 2012.

- [85] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, 1999.
- [86] J. Ahn, S. Hong, S. Yoo, O. Mutlu, and K. Choi. A scalable processing-in-memory accelerator for parallel graph processing. In *Proceedings of the 42nd Annual International Symposium on Computer Architecture*, pages 105–117, 2015.
- [87] H. Asghari-Moghaddam, Y. H. Son, J. H. Ahn, and N. S. Kim. Chameleon: Versatile and practical near-dram acceleration architecture for large memory systems. In 2016 49th annual IEEE/ACM international symposium on Microarchitecture (MICRO), pages 1–13. IEEE, 2016.
- [88] G. Dai, T. Huang, Y. Chi, J. Zhao, G. Sun, Y. Liu, Y. Wang, Y. Xie, and H. Yang. Graphh: A processing-in-memory architecture for large-scale graph processing. *IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems*, 38(4):640-653, 2018.
- [89] F. Schuiki, M. Schaffner, F. K. Gürkaynak, and L. Benini. A scalable near-memory architecture for training deep neural networks on large in-memory datasets. *IEEE Transactions on Computers*, 68(4):484–497, 2018.
- [90] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun. Processing data where it makes sense: Enabling in-memory computation. *Microprocessors and Microsystems*, 67:28–41, 2019.
- [91] E. Lockerman, A. Feldmann, M. Bakhshalipour, A. Stanescu, S. Gupta, D. Sanchez, and N. Beckmann. Livia: Data-centric computing throughout the memory hierarchy. In Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 417–433, 2020.
- [92] F. Tu, Y. Wang, Z. Wu, L. Liang, Y. Ding, B. Kim, L. Liu, S. Wei, Y. Xie, and S. Yin. Redcim: Reconfigurable digital computing-in-memory processor with unified fp/int pipeline for cloud ai acceleration. *IEEE Journal of Solid-State Circuits*, 58(1):243–255, 2022.

- [93] A. Devic, S. B. Rai, A. Sivasubramaniam, A. Akel, S. Eilert, and J. Eno. To PIM or not for emerging general purpose processing in DDR memory systems. In ISCA, pages 231–244, 2022.
- [94] Z. Wang, J. Weng, S. Liu, and T. Nowatzki. Near-stream computing: General and transparent near-cache acceleration. In *HPCA*, pages 331–345, 2022.
- [95] X. Xie, P. Gu, Y. Ding, D. Niu, H. Zheng, and Y. Xie. Mpu: Memory-centric simt processor via in-dram near-bank computing. ACM Transactions on Architecture and Code Optimization, 20(3):1–26, 2023.
- [96] O. Mutlu, S. Ghose, J. Gómez-Luna, and R. Ausavarungnirun. A modern primer on processing in memory. In *Emerging Computing: From Devices to Systems: Looking Beyond Moore and Von Neumann*, pages 171–243. Springer, 2022.
- [97] G. F. Oliveira, J. Gómez-Luna, S. Ghose, A. Boroumand, and O. Mutlu. Accelerating neural network inference with processing-in-dram: from the edge to the cloud. *IEEE Micro*, 42(6):25–38, 2022.
- [98] C. Eckert, A. Subramaniyan, X. Wang, C. Augustine, R. Iyer, and R. Das. Eidetic: An in-memory matrix multiplication accelerator for neural networks. *IEEE Transactions on Computers*, 2022.
- [99] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie. Prime: A novel processing-in-memory architecture for neural network computation in reram-based main memory. ACM SIGARCH Computer Architecture News, 44(3):27–39, 2016.
- [100] V. Seshadri and O. Mutlu. Simple operations in memory to reduce data movement. In Advances in Computers, volume 106, pages 107–166. Elsevier, 2017.
- [101] Y. Kwon, Y. Lee, and M. Rhu. TensorDIMM: A practical near-memory processing architecture for embeddings and tensor operations in deep learning. In MICRO, pages 740–753, 2019.

- [102] A. Boroumand, S. Ghose, M. Patel, H. Hassan, B. Lucia, R. Ausavarungnirun, K. Hsieh, N. Hajinazar, K. T. Malladi, H. Zheng, and O. Mutlu. CoNDA: Efficient cache coherence support for near-Data accelerators. In ISCA, pages 629–642, 2019.
- [103] B. Y. Cho, Y. Kwon, S. Lym, and M. Erez. Near data acceleration with concurrent host access. In ISCA, pages 818–831, 2020.
- [104] L. Ke, U. Gupta, B. Y. Cho, D. Brooks, V. Chandra, U. Diril, A. Firoozshahian, K. Hazelwood, B. Jia, H.-H. S. Lee, M. Li, B. Maher, D. Mudigere, M. Naumov, M. Schatz, M. Smelyanskiy, X. Wang, B. Reagen, C.-J. Wu, M. Hempstead, and X. Zhang. RecNMP: Accelerating personalized recommendation with near-memory processing. In ISCA, pages 790–803, 2020.
- [105] Z. Wang, J. Weng, J. Lowe-Power, J. Gaur, and T. Nowatzki. Stream floating: Enabling proactive and decentralized cache optimizations. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 640–653. IEEE, 2021.
- [106] X. Xie, Z. Liang, P. Gu, A. Basak, L. Deng, L. Liang, X. Hu, and Y. Xie. Spacea: Sparse matrix vector multiplication on processing-in-memory accelerator. In 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), pages 570-583. IEEE, 2021.
- [107] L. Ke, X. Zhang, J. So, J.-G. Lee, S.-H. Kang, S. Lee, S. Han, Y. Cho, J. H. Kim, Y. Kwon, et al. Near-memory processing in action: Accelerating personalized recommendation with axdimm. *IEEE Micro*, 42(1):116–127, 2021.
- [108] G. Singh, M. Alser, D. S. Cali, D. Diamantopoulos, J. Gómez-Luna, H. Corporaal, and O. Mutlu. Fpga-based near-memory acceleration of modern data-intensive applications. *IEEE Micro*, 41(4):39–48, 2021.
- [109] A. Olgun, J. G. Luna, K. Kanellopoulos, B. Salami, H. Hassan, O. Ergin, and O. Mutlu. Pidram: A holistic end-to-end fpga-based framework for processing-in-dram. ACM Transactions on Architecture and Code Optimization, 20(1):1–31, 2022.

- [110] G. Dai, Z. Zhu, T. Fu, C. Wei, B. Wang, X. Li, Y. Xie, H. Yang, and Y. Wang. Dimmining: pruning-efficient and parallel graph mining on near-memory-computing. In Proceedings of the 49th Annual International Symposium on Computer Architecture, pages 130–145, 2022.
- [111] P. Gu, X. Xie, Y. Ding, G. Chen, W. Zhang, D. Niu, and Y. Xie. ipim: Programmable in-memory image processing accelerator using near-bank architecture. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 804–817. IEEE, 2020.
- [112] J. Gómez-Luna, Y. Guo, S. Brocard, J. Legriel, R. Cimadomo, G. F. Oliveira, G. Singh, and O. Mutlu. Evaluating machine learning workloads on memory-centric computing systems. In 2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), pages 35–49. IEEE, 2023.
- [113] O. Kocberber, B. Grot, J. Picorel, B. Falsafi, K. Lim, and P. Ranganathan. Meet the walkers: Accelerating index traversals for in-memory databases. In *Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO-46, page 468-479, New York, NY, USA, 2013. Association for Computing Machinery.
- [114] K. Hsieh, S. Khan, N. Vijaykumar, K. K. Chang, A. Boroumand, S. Ghose, and O. Mutlu. Accelerating pointer chasing in 3d-stacked memory: Challenges, mechanisms, evaluation. In *International Conference on Computer Design (ICCD)*, 2016.
- [115] A. Bhardwaj, C. Kulkarni, and R. Stutsman. Adaptive placement for in-memory storage functions. In *ATC*, 2020.
- [116] C. Kulkarni, S. Moore, M. Naqvi, T. Zhang, R. Ricci, and R. Stutsman. Splinter: Bare-Metal extensions for Multi-Tenant Low-Latency storage. In OSDI, 2018.
- [117] J. You, J. Wu, X. Jin, and M. Chowdhury. Ship Compute or Ship Data? Why Not Both? In NSDI, pages 633–651, 2021.

- [118] S. Novakovic, Y. Shan, A. Kolli, M. Cui, Y. Zhang, H. Eran, B. Pismenny, L. Liss, M. Wei, D. Tsafrir, and M. Aguilera. Storm: A Fast Transactional Dataplane for Remote Data Structures. In SYSTOR, page 97–108, 2019.
- [119] Q. Zhang, X. Chen, S. Sankhe, Z. Zheng, K. Zhong, S. Angel, A. Chen, V. Liu, and B. T. Loo. Optimizing data-intensive systems in disaggregated data centers with TELEPORT. In SIGMOD, pages 1345–1359, 2022.
- [120] Z. Guo, Y. Shan, X. Luo, Y. Huang, and Y. Zhang. Clio: A hardware-software co-designed disaggregated memory system. In ASPLOS, 2022.
- [121] D. Sidler, Z. Wang, M. Chiosa, A. Kulkarni, and G. Alonso. Strom: Smart remote memory. In *EuroSys*, 2020.
- [122] E. Amaro, Z. Luo, A. Ousterhout, A. Krishnamurthy, A. Panda, S. Ratnasamy, and S. Shenker. Remote memory calls. In *Proceedings of the 19th ACM Workshop on Hot Topics in Networks*, pages 38–44, 2020.
- [123] W. Reda, M. Canini, D. Kostić, and S. Peter. RDMA is turing complete, we just did not know it yet! In NSDI, 2022.
- [124] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, I. Jeong, R. Wang, and N. S. Kim. Demystifying cxl memory with genuine cxl-ready systems and devices, 2023.
- [125] MemCached. http://www.memcached.org.
- [126] WiredTiger Storage Engine. https://www.mongodb.com/docs/manual/core/wiredtiger/.
- [127] Xilinx Runtime Library (XRT). https://www.xilinx.com/products/design-tools/vitis/xrt.html.
- [128] Running Average Power Limit RAPL. https://01.org/blogs/2014/running-average-power-limit-%E2%80%93-rapl.

- [129] S. Idreos, F. Groffen, N. Nes, S. Manegold, S. Mullender, and M. Kersten. Monetdb: Two decades of research in column-oriented database architectures. *IEEE Data Eng. Bull.*, 35, 01 2012.
- [130] VoltDB. http://voltdb.com/downloads/datasheets\_collateral/technical\_overview.pdf.
- [131] B. Fan, D. G. Andersen, and M. Kaminsky. Memc3: Compact and concurrent memcache with dumber caching and smarter hashing. In *Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation*, nsdi'13, page 371–384, USA, 2013. USENIX Association.
- [132] X. Yu, G. Bezerra, A. Pavlo, S. Devadas, and M. Stonebraker. Staring into the abyss: An evaluation of concurrency control with one thousand cores. *Proceedings of the VLDB Endowment*, 8, 11 2014.
- [133] M. P. Andersen and D. E. Culler. BTrDB: Optimizing storage system design for timeseries processing. In 14th USENIX Conference on File and Storage Technologies (FAST 16), pages 39–52, Santa Clara, CA, February 2016. USENIX Association.
- [134] S.-Y. Tsai, Y. Shan, and Y. Zhang. Disaggregating persistent memory and controlling them remotely: An exploration of passive disaggregated key-value stores. In ATC, 2020.
- [135] J. Shen, P. Zuo, X. Luo, T. Yang, Y. Su, Y. Zhou, and M. R. Lyu. FUSEE: A fully Memory-Disaggregated Key-Value store. In USENIX FAST, 2023.
- [136] P. Li, Y. Hua, P. Zuo, Z. Chen, and J. Sheng. ROLEX: A scalable RDMA-oriented learned Key-Value store for disaggregated memory systems. In 21st USENIX Conference on File and Storage Technologies (FAST 23), pages 99–114, Santa Clara, CA, February 2023. USENIX Association.
- [137] H. An, F. Wang, D. Feng, X. Zou, Z. Liu, and J. Zhang. Marlin: A concurrent and write-optimized b+-tree index on disaggregated memory. In *Proceedings of the 52nd International Conference on Parallel Processing*, ICPP '23, page 695–704, New York, NY, USA, 2023. Association for Computing Machinery.
- [138] X. Min, K. Lu, P. Liu, J. Wan, C. Xie, D. Wang, T. Yao, and H. Wu. Sephash: A write-optimized hash index on disaggregated memory via separate segment structure. Proc. VLDB Endow., 17(5):1091-1104, 2024.
- [139] J. Shen, P. Zuo, X. Luo, Y. Su, J. Gu, H. Feng, Y. Zhou, and M. R. Lyu. Ditto: An elastic and adaptive memory-disaggregated caching system. In *Proceedings of the 29th Symposium on Operating Systems Principles*, SOSP '23, page 675–691, New York, NY, USA, 2023. Association for Computing Machinery.
- [140] D. Korolija, T. Roscoe, and G. Alonso. Do OS abstractions make sense on FPGAs? In OSDI, 2020.
- [141] Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, and X. Jin. NetLock: Fast, Centralized Lock Management Using Programmable Switches. In SIGCOMM, 2020.
- [142] Standard containers. https://cplusplus.com/reference/stl/.
- [143] Boost library. https://www.boost.org/.
- [144] Java iterator. https://www.w3schools.com/java/java\_iterator.asp.
- [145] C++ std::iterator. https://en.cppreference.com/w/cpp/iterator/iterator.
- [146] The LLVM Compiler Infrastructure. https://llvm.org/.
- [147] A. Rivitti, R. Bifulco, A. Tulumello, M. Bonola, and S. Pontarelli. ehdl: Turning ebpf/xdp programs into hardware designs for the nic. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, pages 208–223, 2023.
- [148] DPDK. https://www.dpdk.org/.
- [149] C. Lattner and V. Adve. Llvm: a compilation framework for lifelong program analysis & transformation. In *International Symposium on Code Generation and Optimization*, 2004. CGO 2004., pages 75–86, 2004.

- [150] LLVM's Analysis and Transform Passes. https://llvm.org/docs/Passes.html#introduction.
- [151] K. Koukos, D. Black-Schaffer, V. Spiliopoulos, and S. Kaxiras. Towards more efficient execution: A decoupled access-execute approach. In *Proceedings of the 27th International ACM Conference on International Conference on Supercomputing*, ICS '13, page 253–262, New York, NY, USA, 2013. Association for Computing Machinery.
- [152] J. Gandhi, V. Karakostas, F. Ayar, A. Cristal, M. D. Hill, K. S. McKinley, M. Nemirovsky, M. M. Swift, and O. S. Ünsal. Range translations for fast virtual memory. IEEE Micro, 36(3):118–126, 2016.
- [153] Xilinx Content Addressable Memory (CAM). https://www.xilinx.com/products/intellectual-property/ef-di-cam.html.
- [154] XUP Vitis Network Example (VNx). https://github.com/Xilinx/xup\_vitis\_network\_example.
- [155] AXI4 Protocol Burst size. https://bit.ly/3Bxh35b.
- [156] A. Kalia, M. Kaminsky, and D. Andersen. Datacenter RPCs can be general and fast. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 1–16, Boston, MA, February 2019. USENIX Association.
- [157] Intel Xeon Gold 6240 Processor datasheet. https://ark.intel.com/content/www/us/en/ark/products/192443/intel-xeon-gold-6240-processor-24-75m-cache-2-60-ghz.html.
- [158] Intel(R) RDT Software Package. https://github.com/intel/intel-cmt-cat.
- [159] NVIDIA MELLANOX BLUEFIELD-2. https://network.nvidia.com/files/doc-2020/pb-bluefield-2-smart-nic-eth.pdf.
- [160] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking Cloud Serving Systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing, SoCC '10, page 143–154, New York, NY, USA, 2010. Association for Computing Machinery.
- [161] WiredTiger storage engine. https://docs.mongodb.com/manual/core/wiredtiger/.
- [162] Y. Zhong, H. Li, Y. J. Wu, I. Zarkadas, J. Tao, E. Mesterhazy, M. Makris, J. Yang, A. Tai, R. Stutsman, and A. Cidon. XRP: In-Kernel storage functions with eBPF. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 375–393, Carlsbad, CA, July 2022. USENIX Association.
- [163] A Multi-Resolution Plotter that is compatible with BTrDB. https://github.com/BTrDB/mr-plotter.
- [164] E. M. Stewart, A. Liao, and C. Roberts. Open $\mu$PMU: A real world reference distribution micro-phasor measurement unit data set for research and application development. 2016.
- [165] A. Ousterhout, J. Fried, J. Behrens, A. Belay, and H. Balakrishnan. Shenango: Achieving high CPU efficiency for latency-sensitive datacenter workloads. In NSDI, pages 361–378, Boston, MA, February 2019. USENIX Association.
- [166] armv8registers. https://developer.arm.com/documentation/100095/0002/system-control/aarch64-register-summary/aarch64-performance-monitors-registers.
- [167] CZ120 memory expansion module. https://www.micron.com/products/memory/cxl-memory.
- [168] I. Kuon and J. Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, page 21–30, New York, NY, USA, 2006. Association for Computing Machinery.
- [169] H. Li, D. S. Berger, S. Novakovic, L. Hsu, D. Ernst, P. Zardoshti, M. Shah, I. Agarwal, M. Hill, M. Fontoura, et al. First-generation memory disaggregation for cloud platforms. arXiv preprint arXiv:2203.00241, 2022.

- [170] A. Li, S. L. Song, J. Chen, J. Li, X. Liu, N. R. Tallent, and K. J. Barker. Evaluating modern gpu interconnect: Pcie, nvlink, nv-sli, nvswitch and gpudirect. IEEE Transactions on Parallel and Distributed Systems, 31(1):94-110, 2019.
- [171] A. Cho, A. Saxena, M. Qureshi, and A. Daglis. A case for cxl-centric server processors, 2023.
- [172] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, I. Jeong, R. Wang, and N. S. Kim. Demystifying cxl memory with genuine cxl-ready systems and devices, 2023.
- [173] Intel Corporation. Intel Agilex® 7 FPGA and SoC FPGA I-Series. https://www.intel.com/content/www/us/en/products/details/fpga/agilex/7/i-series.html.
- [174] V. Viswanathan, K. Kumar, and T. Willhalm. "Intel® Memory Latency Checker v3.10". https://www.intel.com/content/www/us/en/developer/articles/tool/intelr-memory-latency-checker.html.
- [175] H. A. Maruf, H. Wang, A. Dhanotia, J. Weiner, N. Agarwal, P. Bhattacharya, C. Petersen, M. Chowdhury, S. Kanaujia, and P. Chauhan. Tpp: Transparent page placement for cxl-enabled tiered-memory. In Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3, ASPLOS 2023, page 742–755, New York, NY, USA, 2023. Association for Computing Machinery.
- [176] D. Blanchfield. The cloud native convergence: A new era of data-intensive applications. https://elnion.com/2023/06/05/the-cloud-native-convergence-a-new-era-of-data-intensive-applications/.
- [177] A. Abulila, V. S. Mailthody, Z. Qureshi, J. Huang, N. S. Kim, J. Xiong, and W.-m. Hwu. Flatflash: Exploiting the byte-accessibility of ssds within a unified memory-storage hierarchy. In Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '19, page 971–985, New York, NY, USA, 2019. Association for Computing Machinery.

- [178] S.-P. Yang, M. Kim, S. Nam, J. Park, J. yong Choi, E. H. Nam, E. Lee, S. Lee, and B. S. Kim. Overcoming the memory wall with CXL-Enabled SSDs. In 2023 USENIX Annual Technical Conference (USENIX ATC 23), pages 601–617, Boston, MA, July 2023. USENIX Association.
- [179] Leo Memory Connectivity Platform for CXL 1.1 and 2.0. https://www.asteralabs.com/wp-content/uploads/2022/08/Astera\_Labs\_Leo\_Aurora\_Product\_FINAL.pdf.
- [180] D. S. Berger, D. Ernst, H. Li, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, L. Hsu, I. Agarwal, M. D. Hill, and R. Bianchini. Design tradeoffs in cxl-based memory pools for public cloud platforms. *IEEE Micro*, 43(2):30–38, 2023.
- [181] D. Gouk, S. Lee, M. Kwon, and M. Jung. Direct access, High-Performance memory disaggregation with DirectCXL. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 287–294, Carlsbad, CA, July 2022. USENIX Association.
- [182] K. Kim, H. Kim, J. So, W. Lee, J. Im, S. Park, J. Cho, and H. Song. Smt: Software-defined memory tiering for heterogeneous computing systems with cxl memory expander. *IEEE Micro*, 43(2):20–29, 2023.
- [183] D. S. Berger, D. Ernst, H. Li, P. Zardoshti, M. Shah, S. Rajadnya, S. Lee, L. Hsu, I. Agarwal, M. D. Hill, and R. Bianchini. Design tradeoffs in cxl-based memory pools for public cloud platforms. *IEEE Micro*, 43(2):30–38, 2023.
- [184] What Are PCIe 4.0 and 5.0? https://www.intel.com/content/www/us/en/gaming/resources/what-is-pcie-4-and-why-does-it-matter.html.
- [185] D. D. Sharma, R. Blankenship, and D. S. Berger. An introduction to the compute express link (cxl) interconnect, 2023.
- [186] Intel Corporation. Intel launches 4th gen xeon scalable processors, max series cpus. https://www.intel.com/content/www/us/en/newsroom/news/.
- [187] AMD Unveils Zen 4 CPU Roadmap: 96-Core 5nm Genoa in 2022, 128-Core Bergamo in 2023. https://wccftech.com/intel-clearwater-forest-e-core-xeon-cpus-up-to-288-cores-higher-ipc-more-cache/.

- [188] Montage Technology. Cxl memory expander controller (mxc). https://www.montage-tech.com/MXC, accessed in 2023.
- [189] M. Ahn, A. Chang, D. Lee, J. Gim, J. Kim, J. Jung, O. Rebholz, V. Pham, K. Malladi, and Y. S. Ki. Enabling cxl memory expansion for in-memory database management systems. In *Proceedings of the 18th International Workshop on Data Management on New Hardware*, DaMoN '22, New York, NY, USA, 2022. Association for Computing Machinery.
- [190] I. Kuon and J. Rose. Measuring the gap between fpgas and asics. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, page 21–30, New York, NY, USA, 2006. Association for Computing Machinery.
- [191] J. Weiner. [PATCH] mm: mempolicy: N:M interleave policy for tiered memory nodes. https://lore.kernel.org/linux-mm/YqD0%2FtzFwXvJ1qK6@cmpxchg.org/T/.
- [192] NUMA balancing: optimize memory placement for memory tiering system. https://lore.kernel.org/linux-mm/20220221084529.1052339-1-ying.huang@intel.com/.
- [193] Transparent Page Placement for Tiered-Memory. https://lore.kernel.org/all/cover.1637778851.git.hasanalmaruf@fb.com/.
- [194] David L Mulnix. Intel® Xeon® Processor Scalable Family Technical Overview. https://www.intel.com/content/www/us/en/developer/articles/technical/xeon-processor-scalable-family-technical-overview.html.
- [195] A. Cho et al. A Case for CXL-Centric Server Processors. https://arxiv.org/abs/2305.05033.
- [196] J. Yi, B. Dong, M. Dong, R. Tong, and H. Chen. MT2: Memory bandwidth regulation on hybrid NVM/DRAM platforms. In 20th USENIX Conference on File and Storage Technologies (FAST 22), pages 199–216, Santa Clara, CA, February 2022. USENIX Association.

- [197] Intel Corporation. Intel® Performance Counter Monitor (Intel® PCM). https://github.com/intel/pcm.
- [198] Intel Corporation. Intel Unveils Future-Generation Xeon with Robust Performance and Efficiency Architectures. https://www.intel.com/content/www/us/en/newsroom/news/intel-unveils-future-generation-xeon.html.
- [199] B. F. Cooper, A. Silberstein, E. Tam, R. Ramakrishnan, and R. Sears. Benchmarking cloud serving systems with YCSB. In *Proceedings of the 1st ACM Symposium on Cloud Computing*, SoCC '10, page 143–154, New York, NY, USA, 2010. Association for Computing Machinery.
- [200] Tecton.ai. Managing your Redis Cluster. https://docs.tecton.ai/docs/0.5/setting-up-tecton/setting-up-other-components/managing-your-redis-cluster.
- [201] Google Cloud. Memory management best practices. https://cloud.google.com/memorystore/docs/redis/memory-management-best-practices.
- [202] Redis enterprise. https://redis.io/docs/about/redis-enterprise/, 2023.
- [203] Auto Tiering Extend Redis Enterprise databases beyond DRAM limits. https://redis.com/redis-enterprise/technology/auto-tiering/#:~:text=Redis%20Enterprise's%20auto%20tiering%20lets,compared%20to%20only%20DRAM%20deployments.
- [204] C. Zou, H. Zhang, A. A. Chien, and Y. Seok Ki. Psacs: Highly-parallel shuffle accelerator on computational storage. In 2021 IEEE 39th International Conference on Computer Design (ICCD), pages 480–487, 2021.
- [205] TPC-H is a Decision Support Benchmark. https://www.tpc.org/tpch/.
- [206] Ice Lake SP: Overview and technical documentation. (n.d.). Intel. https://www.intel.com/content/www/us/en/products/platforms/details/ice-lake-sp.html.
- [207] 4th Gen Intel Xeon Processor Scalable Family, sapphire rapids. (n.d.). Intel. https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html#gs.3m5uv2.

- [208] McDowell, S. (2023, December 18). Intel launches 5th generation "Emerald Rapids" Xeon processors. Forbes. https://www.forbes.com/sites/stevemcdowell/2023/12/17/intel-launches-5th-generation-emerald-rapids-xeon-processors/.
- [209] Kennedy, Patrick. "Intel Shows Granite Rapids and Sierra Forest Motherboards at OCP Summit 2023." ServeTheHome, 26 Oct. 2023, www.servethehome.com/intel-shows-granite-rapids-and-sierra-forest-motherboards-at-ocp-summit-2023-qct-wistron.
- [210] Mujtaba, H. (2023, December 1). Intel Clearwater Forest E-Core Only Xeon CPUs to offer up to 288 cores. https://wccftech.com/intel-clearwater-forest-e-core-xeon-cpus-up-to-288-cores-higher-ipc-more-cache/.
- [211] S. Yi, D. Kondo, and A. Andrzejak. Reducing costs of spot instances via checkpointing in the amazon elastic compute cloud. In 2010 IEEE 3rd International Conference on Cloud Computing, pages 236–243, 2010.
- [212] Amazon EC2 M7a Instances. https://aws.amazon.com/ec2/instance-types/m7a/, 2023.
- [213] Amazon EC2 M7i Instances. https://aws.amazon.com/ec2/instance-types/m7i/, 2023.
- [214] Intel Shows Granite Rapids and Sierra Forest Motherboards at OCP Summit 2023. https://www.servethehome.com/intel-shows-granite-rapids-and-sierra-forest-motherboards-at-ocp-summit-2023-qct-wistron/.
- [215] Elastic Compute Service, Volcano Engine, Bytedance. https://www.volcengine.com/product/ecs.
- [216] G. W. D. Patel. GPT-4 Architecture, Infrastructure, Training Dataset, Costs, Vision, MoE. https://www.semianalysis.com/p/gpt-4-architecture-infrastructure, 2023.
- [217] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention, 2023.

- [218] Lightllm A Light and Fast Inference Service for LLM. https://github.com/ModelTC/lightllm.
- [219] Julien Simon. Smaller is Better: Q8-Chat LLM is an Efficient Generative AI Experience on Intel® Xeon® Processors. https://www.intel.com/content/www/us/en/developer/articles/case-study/q8-chat-efficient-generative-ai-experience-xeon.html.
- [220] R. Pope, S. Douglas, A. Chowdhery, J. Devlin, J. Bradbury, A. Levskaya, J. Heek, K. Xiao, S. Agrawal, and J. Dean. Efficiently scaling transformer inference, 2022.
- [221] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In *Proceedings of the 29th Symposium on Operating Systems Principles*, SOSP '23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
- [222] Alpaca: A Strong, Replicable Instruction-Following Model. https://crfm.stanford.edu/2023/03/13/alpaca.html.
- [223] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
- [224] L. Floridi and M. Chiriatti. Gpt-3: Its nature, scope, limits, and consequences. *Minds and Machines*, 30:681–694, 2020.
- [225] K. Ethayarajh. How contextual are contextualized word representations? comparing the geometry of bert, elmo, and gpt-2 embeddings. arXiv preprint arXiv:1909.00512, 2019.
- [226] Y. Zhong, S. Liu, J. Chen, J. Hu, Y. Zhu, X. Liu, X. Jin, and H. Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, Santa Clara, CA, July 2024. USENIX Association.

- [227] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. Gonzalez, H. Zhang, and I. Stoica. Efficient memory management for large language model serving with pagedattention. In *Proceedings of the 29th Symposium on Operating Systems Principles*, SOSP '23, page 611–626, New York, NY, USA, 2023. Association for Computing Machinery.
- [228] C. Hu, H. Huang, J. Hu, J. Xu, X. Chen, T. Xie, C. Wang, S. Wang, Y. Bao, N. Sun, and Y. Shan. Memserve: Context caching for disaggregated llm serving with elastic memory pool, 2024.
- [229] J. Yao, H. Li, Y. Liu, S. Ray, Y. Cheng, Q. Zhang, K. Du, S. Lu, and J. Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge fusion, 2024.
- [230] Meta. Llama 2. https://huggingface.co/meta-llama/Llama-2-7b, 2024.
- [231] P. Lienhart. LLM Inference Series: 4. KV caching, a deeper look. https://medium.com/@plienhar/llm-inference-series-4-kv-caching-a-deeper-look-4ba9a77746c8, 2024.
- [232] Y. Sun, Y. Yuan, Z. Yu, R. Kuper, C. Song, J. Huang, H. Ji, S. Agarwal, J. Lou, I. Jeong, R. Wang, J. H. Ahn, T. Xu, and N. S. Kim. Demystifying cxl memory with genuine cxl-ready systems and devices. In *Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture*, MICRO '23, page 105–121, New York, NY, USA, 2023. Association for Computing Machinery.
- [233] Y. Tang, P. Zhou, W. Zhang, H. Hu, Q. Yang, H. Xiang, T. Liu, J. Shan, R. Huang, C. Zhao, C. Chen, H. Zhang, F. Liu, S. Zhang, X. Ding, and J. Chen. Exploring performance and cost optimization with asic-based cxl memory. In *Proceedings of the Nineteenth European Conference on Computer Systems*, EuroSys '24, page 818–833, New York, NY, USA, 2024. Association for Computing Machinery.
- [234] PyTorch Labs. GPT-fast: Simple and efficient pytorch-native transformer text generation. https://github.com/pytorch-labs/gpt-fast.git, 2024.

- [235] Intel Corporation. Intel® Xeon® Platinum Processor. https://www.intel.com/content/www/us/en/products/details/processors/xeon/scalable/platinum.html, 2024.
- [236] M. Arif, A. Maurya, and M. M. Rafique. Accelerating performance of gpu-based workloads using cxl. In Proceedings of the 13th Workshop on AI and Scientific Computing at Scale Using Flexible Computing, FlexScience '23, page 27–31, New York, NY, USA, 2023. Association for Computing Machinery.
- [237] S. Sano, Y. Bando, K. Hiwada, H. Kajihara, T. Suzuki, Y. Nakanishi, D. Taki, A. Kaneko, and T. Shiozawa. Gpu graph processing on cxl-based microsecond-latency external memory. In *Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis*, SC-W '23, page 962–972, New York, NY, USA, 2023. Association for Computing Machinery.
- [238] D. Gouk, S. Kang, H. Bae, E. Ryu, S. Lee, D. Kim, J. Jang, and M. Jung. Breaking barriers: Expanding gpu memory with sub-two digit nanosecond latency cxl controller. In Proceedings of the 16th ACM Workshop on Hot Topics in Storage and File Systems, HotStorage '24, page 108–115, New York, NY, USA, 2024. Association for Computing Machinery.
- [239] anon8231489123. ShareGPT Vicuna unfiltered dataset. https://huggingface.co/datasets/anon8231489123/ShareGPT\_Vicuna\_unfiltered, 2024.
- [240] Meta. Llama 2 70B: An MLPerf Inference Benchmark for Large Language Models. https://mlcommons.org/2024/03/mlperf-llama2-70b, 2024.