# TAO: Re-Thinking DL-based Microarchitecture Simulation

SANTOSH PANDEY, Rutgers University, USA AMIR YAZDANBAKHSH, Google DeepMind, USA HANG LIU, Rutgers University, USA

Microarchitecture simulators are indispensable tools for microarchitecture designers to validate, estimate, optimize, and manufacture new hardware that meets specific design requirements. While the quest for a fast, accurate and detailed microarchitecture simulation has been ongoing for decades, existing simulators excel and fall short in different aspects: (i) Although execution-driven simulation is accurate and detailed, it is extremely slow and requires expert-level experience to design. (ii) Trace-driven simulation reuses the execution traces in pursuit of fast simulation but faces accuracy concerns and fails to achieve significant speedup. (iii) Emerging deep learning (DL)-based simulations are remarkably fast and have acceptable accuracy, but fail to provide adequate low-level microarchitectural performance metrics such as branch mispredictions or cache misses, which are crucial for microarchitectural bottleneck analysis. Additionally, they introduce substantial overheads from trace regeneration and model re-training when simulating a new microarchitecture.

Re-thinking the advantages and limitations of the aforementioned three mainstream simulation paradigms, this paper introduces Tao, which redesigns DL-based simulation with three primary contributions. First, we propose a new training dataset design such that the subsequent simulation (i.e., inference) only needs functional traces as input, which can be rapidly generated and reused across microarchitectures. Second, to increase the detail of the simulation, we redesign the input features and the DL model using self-attention to support predicting various performance metrics of interest. Third, we propose techniques to train a microarchitecture agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and effectively reduces the re-training overhead of conventional DL-based simulators. Tao can predict various performance metrics of interest, significantly reduce the simulation time, and maintain simulation accuracy similar to that of state-of-the-art DL-based endeavors. Our extensive evaluation shows that Tao can reduce the overall training and simulation time by 18.06× over the state-of-the-art DL-based endeavors.

CCS Concepts: • Computing methodologies -> Transfer learning; Modeling methodologies.

Additional Key Words and Phrases: computer architecture simulation; multi-task learning; program embeddings

## **ACM Reference Format:**

Santosh Pandey, Amir Yazdanbakhsh, and Hang Liu. 2024. TAO: Re-Thinking DL-based Microarchitecture Simulation. *Proc. ACM Meas. Anal. Comput. Syst.* 8, 2, Article 28 (June 2024), 25 pages. https://doi.org/10.1145/3656012

## 1 INTRODUCTION

Since their inception, microarchitecture simulators have rapidly become the most commonly used tools in computer architecture-related research (see the report [65]). Today, computer architecture simulation is the textbook standard and is used in virtually all architecture explorations, e.g., design

Authors' addresses: Santosh Pandey, santosh.pandey@rutgers.edu, Rutgers University, New Brunswick, NJ, USA; Amir Yazdanbakhsh, ayazdan@google.com, Google DeepMind, Mountain View, CA, USA; Hang Liu, hl1097@soe.rutgers.edu, Rutgers University, New Brunswick, NJ, USA.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2024 Copyright held by the owner/author(s). Publication rights licensed to ACM.

ACM 2476-1249/2024/6-ART28




Fig. 1. Mainstream simulation mechanisms vs. our effort, i.e., Tao.

space exploration [34, 39, 40, 73], microarchitectural bottleneck analysis [8, 27], and workload characterization [31, 56], among many others [50, 74]. As a common practice, architecture researchers often use popular software architecture simulators to incorporate their radical new ideas. The updated simulator is then used to execute the programs of interest (i.e., benchmarks). The simulation yields a range of metrics that characterize the execution of benchmarks, with the level of detail in the simulation dictating the specificity of these metrics. Such output metrics provide feedback to the researcher for further exploration and/or decision making. Because of this significance, many simulators have been built over the decades with different abstractions, where each abstraction provides a tradeoff between speed, accuracy and detail (please see [2, 12, 18, 69] for more details).

#### 1.1 Related work and motivations

The quest for a *fast, accurate* and *detailed* cycle-level architecture simulation has never stopped. Researchers have mainly dedicated their efforts to three prominent paradigms, i.e., execution-driven simulation [1, 7, 11, 36, 55, 57, 61, 74], trace-driven simulation [4, 5, 21, 37, 38, 60], and recently DL-based simulation [48, 54, 59, 67] (see [10, 12, 18, 68] for more types of architecture simulations). Below, we briefly discuss these simulation methodologies along with the motivations for our work, i.e., Tao.

**Motivation (i).** Execution-driven simulation offers the most detailed and accurate framework, although this comes at the cost of extremely slow speed and high maintenance overhead. Figure 1(a) presents the workflow of this paradigm. Taking as input an executable for program 1, this simulation approach uses software components to simulate the functionality and timing information of all components of a target processor. The output statistics contain the runtime behaviors of the hardware when the executable runs through the simulator, including CPI, branch mispredictions, cache misses and many other performance metrics. These performance metrics aid in microarchitectural bottleneck analysis and hardware design exploration. The milestone projects in this line of effort include SimpleScalar [7], SESC [57], and gem5 [11]. The simulation throughput of a detailed execution-driven simulation is often around five orders of magnitude lower than that of a real processor. Various statistical approaches have been proposed to accelerate microarchitecture simulation by simulating only representative instructions [64] or a fraction of the instructions [72] from the benchmark.

**Motivation (ii).** Trace-driven simulation faces accuracy concerns in pursuit of higher throughput than execution-driven simulations [25, 28]. Figure 1(b) illustrates this method. For program 1, trace-driven simulation derives the detailed trace of the program on a particular microarchitecture. Subsequently, it simulates that trace on a different microarchitecture and derives the microarchitecture statistics. Shade [21], MacSim [38] and MASE [41] are some of the popular trace-driven simulators. Trace-driven simulation is mostly used to explore the design of specific microarchitecture components like cache and memory. The reference trace captures the events related to the CPU core, cache, and memory on $\mu$Arch A. During simulation on a different $\mu$Arch B, the events related to CPU cores are rapidly replayed, while the events related to memory and cache are simulated in detail for $\mu$Arch B. Replaying the events related to the CPU core is significantly faster than simulating them. Further, because simulating the CPU core-related instructions dominates the overall simulation cost [16], trace-driven simulation provides a decent speedup over its execution-driven counterpart. However, reusing the same trace for different microarchitectures leads to accuracy concerns, as the execution order of different memory instructions can vary (see [44]).

**Motivation (iii).** Emerging DL-based simulations are remarkably fast and can provide comparable cycle-level accuracy, but they hit three roadblocks: limited output metrics, expensive microarchitecture specific trace generation, and restricted microarchitecture support. Notably, for a microarchitecture on which its DL model has already been trained, the state-of-the-art DL simulator, i.e., SimNet [48, 59], can achieve >1,000× higher throughput than the execution-driven simulator, i.e., gem5. The substantial increase in throughput is attributed to replacing the highly irregular and heterogeneous simulation with DL models that are accelerator-friendly and parallelizable. DL-based simulators, illustrated in Figure 1(c), typically follow a two-step process to model the performance of a program: Step (i), a DL model specific to $\mu$Arch A is trained using a detailed trace of program 1 generated through simulation on $\mu$Arch A. Step (ii), this trained model is generalized to predict output metrics for the traces of unseen programs, i.e., program 2, for the same $\mu$Arch A. However, existing DL-based simulators only predict one output metric, i.e., CPI. Further, because they need low-level microarchitectural performance metrics like branch mispredictions and cache misses as input, one needs to generate microarchitecture specific traces for simulating the same program on different architectures. When the overhead of generating the trace and training the model is accounted for, DL-based simulators can be up to 17× slower than the execution-driven simulator gem5 when simulating 1 billion instructions.

In summary, the mainstream cycle-level microarchitecture simulations excel and fall short at different aspects, i.e., speed, accuracy, and/or simulation details (i.e., # of output metrics). Particularly, (i) although execution-driven simulation offers the desired accuracy and detail, its slow throughput limits its application scenarios. (ii) Trace-driven simulation achieves decent acceleration by reusing traces but sacrifices accuracy and fails to provide significant speedups. (iii) DL-based simulations promise impressive throughput but fail to provide crucial microarchitectural performance metrics and introduce substantial overheads from trace regeneration and model re-training.

#### 1.2 Contributions

Departing from the designs and goals of the aforementioned three paradigms, this paper redesigns the DL-based cycle-level microarchitecture simulator. In particular, we take functional and detailed traces as input to train a DL-based simulator that supports a set of desired performance metrics and fast microarchitecture exploration, achieving accuracy comparable to execution-driven simulation and an order of magnitude higher throughput than the state-of-the-art DL-driven simulator, i.e., SimNet. Figure 1(d) illustrates the workflow of our system, which encompasses the following three contributions:

- First, we introduce a unique training dataset design so that the subsequent simulation (i.e., inference) only needs lightweight and reusable functional traces as input.
- Second, for predicting a variety of performance metrics of interest, we propose a DL model with separate embedding and self-attention based performance prediction layers.
- Third, we introduce transfer learning techniques to rapidly explore various microarchitectures. This includes architecture agnostic embedding layers and judicious training dataset selection.

## 2 BACKGROUND

#### 2.1 Architecture simulation

**Architecture simulations by level of detail.** Architecture simulators can be divided into functional and detailed simulations based on the level of detail. (i) *Functional simulation*. Functional simulators are designed to model the functionality of a microarchitecture rather than its detailed implementation. They primarily validate hardware functions and generate execution traces for specific workloads. They do not simulate the microarchitecture in detail, so they typically do not produce timing information. However, their lack of detail allows them to operate one to two orders of magnitude faster than a detailed simulation. (ii) *Detailed simulation*. Detailed simulators simulate the processor by performing all the operations of each component at cycle granularity. They model in detail how the processor works to capture the dynamic behavior of the microarchitecture, which impacts performance. They provide a meticulous analysis of performance characteristics, enabling researchers to explore how different microarchitectural elements interact and impact overall performance. While detailed simulation offers higher accuracy, it comes at the cost of increased computational overhead, making it more time-consuming.

**Execution trace.** This paper extensively uses execution trace, which refers to the stream of instructions generated by functional or detailed simulation. The gem5 simulator is modified to generate execution traces capturing various static instruction properties and dynamic performance metrics. Functional trace refers to the microarchitecture agnostic trace generated with functional simulation using *AtomicSimpleCPU* model. We use the terms functional trace and microarchitecture agnostic trace interchangeably. It only contains static properties like opcode, registers, and other instruction flags. Detailed trace refers to the trace generated with the *O3CPU* model. It captures various microarchitecture specific performance metrics like data access misses, instruction cache misses, branch mispredictions, speculative instructions and latency of individual instructions.

#### 2.2 Deep learning (DL)-based microarchitecture simulation

In DL-based approaches, the simulated processor is abstracted as a whole, eliminating the need to simulate individual components within the processor. As DL excels at deriving the sophisticated rules that govern various complex functions, recent work shows that it can capture microarchitecture simulation in a similar way. State-of-the-art DL-based simulations, e.g., SimNet [48] and Ithemal [54], model the performance of a program at the instruction level, generally in two steps: (i) A DL model is trained to capture the complex and dynamic relationships between instructions and the hardware based on the instruction properties and the performance metrics. The performance prediction problem can be defined as below:

$$Y = f(x_0, x_1, ..., x_n), \tag{1}$$

where *Y* is the desired output performance metric of the instruction, typically including the cycles required to execute the instruction.  $x_0, x_1 \dots x_n$  represent the input features used by the models.

The input features include the properties of the current and earlier instructions (i.e., context instructions). Context instructions are used to model the dependencies and resource contentions among the instructions. The instruction features of existing DL-based simulators include static features like opcode, registers used by the instruction, branch predictions and data access level. The performance metric and input features are gathered from detailed traces by simulating various programs. $f(\cdot)$ is a microarchitecture specific function that the model learns. Earlier state-of-the-art works adopt long short-term memory (LSTM) or convolutional neural networks (CNN) for learning $f(\cdot)$. (ii) During inference, this trained DL model can be used to predict the performance metrics of various unseen programs for the same microarchitecture at the instruction level. The required instruction input features are collected from dynamic profiling or simulation for a specific microarchitecture. As the processor is abstracted as a whole, any change in the microarchitecture requires re-training of the DL model with microarchitecture specific training datasets.

## 3 DESIGN PRINCIPLE, CHALLENGE AND OVERVIEW

**Design Principle #1.** We advocate that (i) the input to the DL model should only capture the instruction execution sequence and (ii) the DL model should capture the hardware features, for the following reasons. First, if the input to the DL model only needs to capture the execution sequence, the DL model captures all the microarchitecture features. That is, a trained DL model can be used to predict the crucial low-level microarchitectural performance metrics (i.e., CPI, branch mispredictions, cache misses) of any benchmark. Second, generating the microarchitecture agnostic instruction execution sequence of a particular benchmark is significantly faster than generating the traces with architectural information (referred to as the detailed trace in this paper). Third, if a microarchitecture practitioner would like to change the microarchitecture of a particular hardware design, the inputs to the DL model can be reused.

**Design Principle #2.** A DL-based microarchitecture simulator should (i) report various performance metrics during the architecture simulation and (ii) support rapid exploration of different architecture configurations. First, existing work primarily provides cycles as the only output metric from the DL prediction model. One needs to rely on conventional simulation for the rest of the metrics. This limits the application of existing DL-based simulators for microarchitectural bottleneck analysis. Further, learning from more metrics helps the model learn more complex program and hardware interactions, improving simulation detail and accuracy. Earlier work has shown that performing multi-metric prediction improves the accuracy of the prediction model [67]. Second, conventional efforts require re-training of the model from scratch. Simulating the whole design would incur huge costs just for the training. This discourages the use of DL-based architecture simulation. We leverage the proposed microarchitecture agnostic embedding layers for transfer learning and fast adaptation across different microarchitecture designs with a relatively small training dataset.

**Challenges.** Tao faces three grand challenges: (i) For the training dataset, we need to associate the microarchitecture impacts with each executed instruction in the functional trace. (ii) Reporting various performance metrics demands us to derive sufficiently powerful DL models that can capture the impacts of various hardware components. (iii) Training microarchitecture-agnostic program embeddings presents difficulties because the embeddings are biased towards the architecture they are trained on. These three challenges motivate the design of Tao.

**Overview.** Section 4 unveils Tao, our multi-modal DL architecture for microarchitecture simulation. Our approach adheres to design principle #1 by proposing a workflow to construct training datasets from detailed and functional traces which attributes the differences in these two traces to performance metrics, allowing the reuse of functional traces for varying microarchitectures. For design principle #2, we propose multi-metric predictions with feature engineering with a self-attention model to increase the simulation detail. Further, we propose techniques to train microarchitecture agnostic embedding layers that enable fast transfer learning which significantly reduces the re-training overhead of DL-based microarchitecture training and simulation.

## 4 TAO: A FAST AND DETAILED DL MICROARCHITECTURE SIMULATOR

This section discusses three components of Tao. First, we present our workflow of training dataset generation. Since we perform supervised learning, we associate microarchitecture agnostic input with microarchitecture specific performance metrics. Our workflow derives this training dataset. Second, we introduce our multi-metric ML architecture that takes as input the microarchitecture agnostic inputs and outputs various user-requested performance metrics. Third, we present our microarchitecture agnostic embedding construction for fast transfer learning.

#### 4.1 Training Dataset Construction

Tao uses the functional trace as input to the model, and the output (i.e., label) can be various performance metrics. This permits the subsequent simulation (i.e., inference) to only require functional traces as input, which can be rapidly generated and reused across microarchitectures. For the output, the metrics are instruction latency, branch misprediction, data cache misses, instruction cache misses, and translation lookaside buffer (TLB) misses of each instruction [6, 42, 78]. For brevity, we use three major performance metrics, i.e., latency, branch misprediction, and data cache misses, to explain how we process the detailed and functional traces to arrive at the training dataset. However, it is important to note that Tao can potentially support other performance metrics. Eventually, we introduce an automatic workflow to generate the training dataset for any benchmark.

Functional and detailed traces output instructions in a similar sequence order, which permits us to associate each instruction of a functional trace with its counterpart in a detailed trace. However, the challenge is that the difference in the number of instructions between detailed and functional traces is nontrivial. Table 1 shows the difference in instruction counts for detailed and functional simulations of the 531.deepsjeng\_r SPEC 2017 benchmark [15] for a base ARM microarchitecture. As the table shows, for simulation with 1M and 10M as the specified instruction count with gem5, the instruction counts of the functional and detailed traces differ by 5.2% and 4.8%, respectively.

| Specified instruction count | Detailed trace (O3CPU) | Functional trace (AtomicSimpleCPU) |
|-----------------------------|------------------------|------------------------------------|
| 1M                          | 2,655,925              | 2,528,617                          |
| 10M                         | 26,689,939             | 25,469,667                         |

Table 1. Number of instructions in the detailed vs. functional trace for the 531.deepsjeng\_r benchmark.

Detailed trace generally differs from functional trace in the following two aspects. First, a detailed trace includes various performance metrics introduced earlier for individual instructions. Second, a detailed trace includes two types of additional dynamic instructions during execution that are missing in the functional trace. Specifically, the detailed trace contains incorrect speculative and stall instructions. Incorrect speculative instructions are the wrongly executed instructions squashed based on branch prediction. Stall instructions are used to stall the pipeline by inserting a no-operation (nop) instruction in the pipeline when any other instructions cannot be executed.

Our key idea is that both types of additional instructions can be converted into numerical performance differences and attributed to specific instructions from the functional trace. Using the stall instructions from the detailed trace as an example, one can project the timing impact of these instructions to the latency of the subsequent instructions. Below, we discuss how we model the impact of speculative and pipeline stall instructions.

**Squashed speculative instructions.** Instructions are speculatively executed following the prediction of whether a conditional branch instruction will be taken or not. If the predicted branch path is correct, speculatively executed instructions will be correct, thus the instruction streams of detailed and functional traces will be identical. When a speculative path is wrong due to branch misprediction, speculatively executed instructions should be squashed. This case leads to a distinction between functional and detailed traces. Having squashed speculative instructions in the detailed trace avoids separately modeling the total impact of a branch misprediction.

The total impact of branch misprediction can be accounted for in the functional trace with the fetch timing information obtained from the detailed trace. If a branch is mispredicted, it will delay the fetch of the next correct instruction. In a detailed trace, the fetch latency of the correct instruction does not include the speculation or branch resolution overhead. To include the misprediction overhead, we remove the squashed instructions from the detailed trace, take the difference in the fetch clock as the fetch latency, and add it to the subsequent instruction.

**Pipeline stalls.** Stall instructions can be handled similarly to squashed speculative instructions. When no instruction can be executed in the pipeline due to dependency or resource contention, nop instructions are filled. Similar to squashed speculative instructions, we remove and project the latency impact of nop instructions to the subsequent instruction. We use the fetch clock from the detailed trace to determine the additional fetch latency delay.



Fig. 2. Training dataset construction illustrated via the trace snippets for 531.deepsjeng\_r benchmark.

Figure 2 exemplifies that the training dataset resembles the functional trace, with modifications only to the performance metrics (see red dashed line arrows). In the detailed trace, the first branch instruction (b.ls 0x455cb4) is mispredicted, and two consecutive instructions are speculatively executed until the branch is resolved. The fetch latency of the next correct instruction (subs x1, 0xff455) is 10, whereas the total overhead of the branch misprediction is 18 cycles. To model the total impact in the training dataset, we remove the speculative instructions and change the fetch latency of (subs x1, 0xff455) from 10 to 18 cycles. With the new fetch latency, the ML model can be trained to predict the total impact of misprediction without squashed speculative instructions. Similarly, for stall instructions, the nop instruction is removed from the detailed trace, and then the fetch clock is used to derive the fetch latency for the following instruction. The fetch latency for (ld x3, [ureg0]) is updated from 3 to 4. With our workflow, the total cycles remain the same for the detailed trace and the adjusted trace, i.e., 25 cycles. In the evaluation, we study the differences between detailed and functional traces, mainly focusing on speculative and nop instructions for various benchmarks and microarchitectures.
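The adjustment walked through above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (`DetailedRecord`), the function name `build_training_rows`, and the absolute fetch clock values are hypothetical. The clock values are invented so that the resulting latencies agree with the numbers quoted for Figure 2. They are not taken from an actual gem5 trace.

```python
from dataclasses import dataclass


@dataclass
class DetailedRecord:
    asm: str                 # instruction text (label only)
    fetch_clock: int         # absolute cycle at which the instruction was fetched
    squashed: bool = False   # wrongly speculated instruction, later squashed
    is_nop: bool = False     # filler inserted while the pipeline is stalled


def build_training_rows(detailed, start_clock=0):
    """Drop squashed and nop records; fold their time into the next kept instruction."""
    rows = []
    prev_clock = start_clock
    for rec in detailed:
        if rec.squashed or rec.is_nop:
            continue  # removed: its delay is absorbed by the next kept instruction
        # Fetch latency = fetch clock difference to the previous *kept* instruction.
        rows.append((rec.asm, rec.fetch_clock - prev_clock))
        prev_clock = rec.fetch_clock
    return rows


# Hypothetical fetch clocks, chosen to reproduce the latencies quoted for Figure 2.
trace = [
    DetailedRecord("conditional branch (mispredicted)", 3),
    DetailedRecord("speculative instruction 1", 7, squashed=True),
    DetailedRecord("speculative instruction 2", 11, squashed=True),
    DetailedRecord("subs x1, 0xff455", 21),   # 10 cycles after the last squashed fetch
    DetailedRecord("nop", 22, is_nop=True),
    DetailedRecord("load", 25),               # 3 cycles after the nop
]
rows = build_training_rows(trace)
```

With these assumed clocks, the kept instructions receive fetch latencies of 3, 18 and 4 cycles, so the total of 25 cycles is preserved even though three records were removed.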

#### 4.2 Multi-Metric DL Model Design

In this section, we delve into feature engineering from the microarchitecture agnostic execution trace and the DL model architecture that predicts various crucial low-level performance metrics using the input features as demonstrated with instruction latencies, branch mispredictions and data access level prediction.

Fig. 3. Feature engineering

**Feature engineering.** We propose new techniques to build cross-instruction features, in addition to the per-instruction features from the state-of-the-art [48]. Figure 3 shows the process of gathering the input features for the model. The input should be representative enough that the DL model can learn to map the interplay between the instruction features and the microarchitecture to predict various performance metrics. We extract four key instruction properties from the microarchitecture agnostic execution trace: the opcode, registers, data access address and PC address. Opcode and registers derive the per-instruction features. For opcode, we employ an integer mapping for each unique opcode in the dataset. Regarding registers, since the instructions can involve multiple registers, we create a bitmap vector with a size equal to the total number of registers. If an instruction uses the $i$-th register, the $i$-th index in the vector will be set to 1 (0 otherwise). Both source and destination registers are included in the bitmap vector.
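A minimal sketch of the two per-instruction features follows, assuming registers are identified by small integer indices. The function names and the register file size of 8 are illustrative choices, not values from the paper.

```python
def build_opcode_vocab(opcodes):
    """Assign each unique opcode an integer id in order of first appearance."""
    vocab = {}
    for op in opcodes:
        vocab.setdefault(op, len(vocab))
    return vocab


def register_bitmap(src_regs, dst_regs, num_regs):
    """Bitmap with one slot per register; 1 if the instruction reads or writes it."""
    bitmap = [0] * num_regs
    for reg in list(src_regs) + list(dst_regs):
        bitmap[reg] = 1
    return bitmap


vocab = build_opcode_vocab(["subs", "ldr", "subs", "b.ls"])
bits = register_bitmap(src_regs=[1], dst_regs=[3], num_regs=8)
```

Here `vocab` maps `subs`, `ldr` and `b.ls` to 0, 1 and 2, and `bits` has ones at indices 1 and 3 only. Source and destination registers share one bitmap, as described above.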

Cross-instruction features, crucial for predicting branch misprediction and data access level, are derived from the PC and memory addresses. We use the branch history as input to model the outcome of conditional branch instructions. This history, indicating the outcomes of prior branch instructions, is employed by existing branch predictors to predict whether the branch will be taken [30]. For a given input feature size, storing the outcome of each branch in a separate queue will limit the number of unique branches. To address this, we employ a hash table to store the outcomes of branch instructions. Hashing effectively controls the input feature size while maintaining relevant outcomes history for each branch (exemplified in Figure 4).

Fig. 4. Input for branch instruction.

Figure 4 shows an example of retrieving the branch input feature with a sample program execution trace. We construct a hash table with $N_b$=3 buckets and $N_q$=2. The table is populated as we go through each instruction. When a conditional branch instruction is encountered, we retrieve the (PC address%4$N_b$) bucket and use that as the branch input features. Subsequently, we update the outcome of that branch to the respective bucket before the next instruction. To retrieve the branch input features for the last instruction of the program execution trace, i.e., (00A0: b.ls #5), we first determine the hash bucket, i.e., $(00A0\%4N_b)=B_0$. The retrieved branch input feature will be [0, -]. The input contains the earlier outcome of the same instruction. The hash table effectively separates the outcomes of other PC addresses like (00A8: b.le eax, edx, LOOP3) and (00B4: b.le rax, eax LOOP2), which may be unrelated. It is also important to note that this design permits different branches that are hashed to the same bucket to jointly offer a global history for future predictions.
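The lookup-then-update discipline described above can be sketched as follows. Several details are assumptions made for illustration only: the class name, the plain modulo hash `pc % n_buckets` (the exact hash used in Figure 4 is not reproduced here), the most-recent-first ordering, and the use of `None` for an empty history slot (shown as "-" in the text).

```python
from collections import deque


class BranchHistoryTable:
    """Hash table with n_buckets buckets, each holding the last n_outcomes outcomes."""

    def __init__(self, n_buckets, n_outcomes):
        self.n_buckets = n_buckets
        self.n_outcomes = n_outcomes
        # A bounded deque silently drops the oldest outcome when it is full.
        self.buckets = [deque(maxlen=n_outcomes) for _ in range(n_buckets)]

    def _bucket(self, pc):
        return pc % self.n_buckets  # assumed hash function, for illustration

    def lookup(self, pc):
        """Return the history feature for this branch, most recent outcome first."""
        history = list(reversed(self.buckets[self._bucket(pc)]))
        return history + [None] * (self.n_outcomes - len(history))

    def update(self, pc, taken):
        """Record the branch outcome (1 = taken, 0 = not taken) after the lookup."""
        self.buckets[self._bucket(pc)].append(1 if taken else 0)


table = BranchHistoryTable(n_buckets=3, n_outcomes=2)
first = table.lookup(0x00A0)        # nothing recorded yet
table.update(0x00A0, taken=False)
second = table.lookup(0x00A0)       # one earlier outcome, one empty slot
```

Because the feature width is fixed at `n_outcomes` regardless of how many distinct branches the program contains, the input size stays bounded. Branches that collide in a bucket simply share one history, which matches the global-history remark above.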

To model the data access level, we calculate the access distance, which is the difference between the current memory access and the previous $N_m$ memory accesses, and use that as the input to the model. Data access level is used to derive the cache misses. Intuitively, if the access distance between the current and earlier memory access is smaller, the current access is more likely to be in the cache. Access distance is similar to reuse [22, 24, 35, 79] or stack distance [3, 17, 51] histograms in earlier analytical models but cheaper to calculate. We use a memory context queue to track the access distance of $N_m$ memory accesses. Figure 3 illustrates how access distance is calculated for memory instructions. In the case of (subs x1, 0xff455), being the first memory access, the access distance is zero. The address is added to the memory context queue. For the second memory access, the difference in memory address with the first instruction is 463412-463408=4. With $N_m$=4, the access distance will be [4,0,0,0]. The optimal values of $N_b$, $N_q$ and $N_m$ are empirically derived based on the simulation error across test benchmarks (Section 5.6).
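The memory context queue can be sketched as a bounded queue of recent addresses. The class name is hypothetical, and two details are assumptions: the distance is taken as the signed difference (current minus earlier address), and missing history is padded with zeros, which is consistent with the [4,0,0,0] example above.

```python
from collections import deque


class MemoryContext:
    """Tracks the last n_m memory addresses and reports access distances."""

    def __init__(self, n_m):
        self.n_m = n_m
        self.queue = deque(maxlen=n_m)  # oldest address is dropped when full

    def access_distance(self, addr):
        # Distance to each earlier access, most recent first (signed: assumption).
        dists = [addr - prev for prev in reversed(self.queue)]
        dists += [0] * (self.n_m - len(dists))  # pad when history is short
        self.queue.append(addr)
        return dists


ctx = MemoryContext(n_m=4)
d1 = ctx.access_distance(463408)   # first access: no history
d2 = ctx.access_distance(463412)   # 463412 - 463408 = 4
```

With the two addresses from the text, `d1` is `[0, 0, 0, 0]` and `d2` is `[4, 0, 0, 0]`. Each access costs time proportional to $N_m$, which is why this is cheaper than maintaining a full reuse or stack distance histogram.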

**DL model architecture.** Figure 5 exemplifies our DL model design. The model first generates instruction embeddings from input features with two-level embedding layers and then uses multi-headed self-attention to perform multi-metric prediction. We use a sequence of N+1 instructions as input to the model. Here, N is the number of earlier instructions that can influence the performance of the current instruction, i.e., the context instructions. Unlike prior approaches [48, 59] that manage a context instruction queue, adding or removing context instructions based on fetch cycles, our approach relies on the self-attention layer to autonomously learn which earlier instructions significantly impact the current instruction.



Fig. 5. Our initial DL model architecture.

The embedding layers generate instruction embeddings in two steps. Initially, embeddings are created independently for each category of input. This separate generation facilitates enhanced representation learning for each category. Specifically, for opcode, a trainable lookup table based embedding layer is employed. For the remaining categories, distinct linear embedding layers are utilized. The individual instruction embedding is obtained by combining categorical embeddings through a linear layer. Note that the embedding layers generate instruction embeddings independently for the current instruction and the N context instructions. Similar to SimNet, we set N to the maximum reorder buffer (ROB) size in the design space, in this case, 128.

Following the generation of instruction embeddings, the prediction layers employ multi-head self-attention to determine the performance metrics. Considering the impact of microarchitecture, this approach allows attention layers to model the interaction between current and earlier instructions. Using self-attention obviates the need for manually tracking context instructions, enhancing efficiency. Employing multiple heads enables each head to learn unique hardware-instruction interplay. The output from each head is concatenated and passed through a linear layer.
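To make the role of attention concrete, the following is a minimal single-head sketch of scaled dot-product self-attention over a sequence of instruction embeddings. It is an illustration only: it omits the learned query, key, and value projections and the multiple heads of the actual model, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(embeddings):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the embeddings themselves
    (identity projections); a trained model would learn these projections.
    Returns (outputs, weights), where weights[i][j] is how much instruction j
    contributes to the representation of instruction i.
    """
    d = len(embeddings[0])
    outputs, weights = [], []
    for q in embeddings:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in embeddings]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(w_j * v[i] for w_j, v in zip(w, embeddings)) for i in range(d)])
    return outputs, weights


# Three toy instruction embeddings of dimension 2 (made-up values).
outs, attn = self_attention([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
print(attn[2])  # attention of the last instruction over all three
```

Each row of the weight matrix sums to one, so the attention weights can be read as a learned replacement for an explicitly managed context queue.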

We use different operators to predict different performance metrics based on the output of the last linear layer: (i) The fetch and execution cycles are directly predicted from the linear layer. (ii) An additional sigmoid layer is incorporated for branch prediction to predict whether the branch will be mispredicted. (iii) We use a softmax layer for the data access level, as the output can be multiple categories. (iv) More performance metrics like instruction cache miss and TLB miss can be predicted through a sigmoid layer. During training, a loss is computed from each performance metric and combined with a linear ratio in backpropagation. To obtain the total cycle of all instructions, we use the retire clock of instructions. Retire clock is computed as current clock + fetch latency + execution latency. The retire clock of the last instruction of a benchmark determines the total cycles.
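The retire-clock bookkeeping can be sketched as follows. The text does not spell out how the current clock advances; this sketch assumes it advances by each instruction's predicted fetch latency, in the style of SimNet. The function name and the latency values are hypothetical.

```python
def total_cycles(fetch_latencies, exec_latencies):
    """Total cycles = retire clock of the last instruction.

    retire clock = current clock + fetch latency + execution latency.
    Assumption: the current clock advances by the fetch latency of each
    instruction; this update rule is not stated in the text.
    """
    clock = 0
    retire = 0
    for fetch, execute in zip(fetch_latencies, exec_latencies):
        retire = clock + fetch + execute
        clock += fetch  # assumed clock update
    return retire


# Three instructions with made-up predicted latencies.
cycles = total_cycles([1, 0, 2], [3, 5, 4])
print(cycles, cycles / 3)  # total cycles and the resulting CPI
```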

**Intuitive explanation on supporting a set of performance metrics.** Multi-metric prediction exploits the relatedness of performance metrics. With the attention model and microarchitecture agnostic input, our design allows us to output various performance metrics of interest. It can capture the relationship between each performance metric and the specific input features that impact the metric. This allows all metrics to be derived from the same hidden layers. We demonstrate the validity of this idea by accurately predicting three performance metrics in Section 5.4. Multi-metric prediction has two benefits. First, it increases the output details of the simulation. Second, individual loss from data access level and branch prediction helps the model relate the cycle prediction with memory and branch behavior during training.

# 4.3 Fast Transfer Learning via Microarchitecture Agnostic Embeddings

Figure 6 illustrates our fast transfer learning process, which rapidly enables Tao for a new, unseen microarchitecture, i.e.,  $\mu$ Arch C, by employing microarchitecture agnostic embedding layers and fine-tuning. Initially, shared embedding layers are trained with two carefully selected microarchitectures, i.e.,  $\mu$ Arch A and  $\mu$ Arch B. During training for  $\mu$ Arch C, the parameters of the shared embedding layers are frozen, i.e., we do not update them during backpropagation. The parameters of the prediction layers and the embedding adaptation layer are fine-tuned with the training dataset for  $\mu$ Arch C.

**Microarchitecture agnostic embedding design.** The shared embedding layers generate an embedding for each individual instruction, and microarchitecture specific prediction layers predict the performance labels. The prediction layers of each microarchitecture compute the gradients for the embedding layers separately. We propose to combine them to update the shared embedding layers.

Such designs that combine gradients to update shared layers can face two critical issues: *negative transfer* and *imbalance in gradient magnitude* for shared layers: (i) Negative transfer [52, 77] occurs when the shared layers receive gradients from different microarchitectures that are opposite to each other. (ii) Imbalance in gradient magnitude [20] arises when one microarchitecture is too dominant during training, inducing gradients with relatively large magnitudes. These issues impact convergence and generalization [20, 77].

Fig. 6. Overview of transfer learning process for microarchitecture  $\mu$ Arch C.

Figure 7 compares our multi-architecture training paradigm with two existing projects, Granite [67] and GradNorm [20]. We use two microarchitectures, A and B, to illustrate the techniques. Although GradNorm is proposed for multi-task learning, we compare its effectiveness for generating microarchitecture agnostic embedding layers. In the figure, each prediction network predicts microarchitecture specific output labels ( $Y_A$  and  $Y_B$ ) and losses ( $L_A$  and  $L_B$ ).

In Granite, Figure 7(a), to derive the gradients for shared embedding layers, the gradients from the prediction layers of each  $\mu Arch$  are averaged (i.e.,  $\frac{G_A+G_B}{2}$ ). Just averaging the gradients may resolve neither the negative transfer nor gradient imbalance problem [76]. Using gradient imbalance as an example, if the gradient of one task is larger in magnitude than the other, the larger one will dominate the average gradients.

GradNorm, Figure 7(b), addresses the imbalance in gradient magnitude for multi-task learning by using learnable combination weights ( $w_A$  and  $w_B$ ) to combine the losses from each task. This indirectly controls the magnitude of the gradients. The underlying rationale is to dynamically adjust the combination weights in response to the gradient magnitudes of shared layers, ensuring they become neither excessively large nor too small. The process begins with the computation of a combined loss $L$ as a weighted sum of the microarchitecture specific losses, i.e.,  $L = L_A w_A + L_B w_B$ . Subsequently, a standard backward pass generates gradients  $G_A$  and  $G_B$  for the respective prediction layers using $L$.  $G_A$  and  $G_B$  are essentially weighted loss gradients, i.e.,  $w_A \nabla L_A$  and  $w_B \nabla L_B$ , respectively.  $G_A$  and  $G_B$  are averaged to compute the gradients of the embedding layers. The combination weights are updated based on $L$,  $G_A$ , and a learning rate  $r_A$ , as described in [20]. In this way, GradNorm indirectly balances the magnitude of gradients by updating the loss weights, i.e.,  $w_A$  and  $w_B$ , for the various tasks.

Fig. 7. Comparison of multi-architecture training paradigm.

While GradNorm can effectively address gradient magnitude imbalance, it cannot adequately address negative transfer issues that arise from conflicting gradient directions. Of note, conflicting gradients may appear when the performances of two different microarchitectures are opposite for the same instruction. Modifying the magnitude of gradients may not effectively change gradient direction in joint training [76]. Hence, it may not fully mitigate the adversarial effect of gradients.

Figure 7(c) illustrates our design that tackles negative transfer and gradient imbalance. In contrast to GradNorm which relies on reactive approaches of projecting conflicting gradients to a different plane [76] or finding common direction [62] to mitigate negative transfer, we adopt a proactive solution. We add an individual embedding adaptation layer, i.e.,  $W_A$  for  $\mu$ Arch A, similarly  $W_B$  for  $\mu$ Arch B, between the embedding and performance network, see Figure 7(c). The linear layer  $W_A$  projects the shared embedding (i.e., Green layers) into microarchitecture specific spaces (i.e.,  $\mu$ Arch prediction layers) during forward propagation.

Adding this linear projection layer resolves the negative transfer issue as follows: during back-propagation, to compute the gradients for the linear projection layer, we multiply the gradients from the earlier layer  $G_A$  with the transpose of the weight matrix  $W_A$ , i.e.,  $G_AW_A^T$ , based on the chain rule. In most cases, this operation rotates the gradients in the gradient space, changing their direction. In the very rare case when all the columns of  $G_A$  are eigenvectors of  $W_A^T$ , this linear projection does not change the direction. For our method to be rendered in vain, all the columns of both  $G_A$  and  $G_B$  would have to be eigenvectors of  $W_A^T$  and  $W_B^T$ , respectively. We regard this scenario as extremely rare. In this paper, training for 200 epochs across three different microarchitectures also suggests that we do not encounter such a case.
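The rotation argument can be made explicit with a short worked derivation. It is a sketch under an assumed row-vector convention, where $e$ denotes the shared embedding and $h_A$ the adapted embedding fed to the  $\mu$ Arch A prediction layers; the symbols $e$, $h_A$, and $\lambda$ are introduced here for illustration only.

$$
h_A = e\,W_A, \qquad G_A = \frac{\partial L_A}{\partial h_A} \quad\Longrightarrow\quad \frac{\partial L_A}{\partial e} = G_A W_A^T .
$$

The gradient that reaches the shared embedding stays on the same line as $G_A$ only if $G_A W_A^T = \lambda\, G_A$ for some scalar $\lambda$, i.e., when $G_A$ is aligned with an eigenvector of $W_A^T$. Otherwise, $G_A W_A^T$ points in a different direction than $G_A$.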

To tackle the gradient imbalance concern, we normalize the gradients for the embedding layers based on their magnitude, yielding  $\overline{G_A W_A^T}$  and  $\overline{G_B W_B^T}$ , to reduce any existing gradient magnitude imbalance. We adopt a typical normalization method: we first compute the mean  $X_{mean}$  of a gradient matrix $X$. Then, we subtract the mean from the gradient matrix ( $X - X_{mean}$ ). The difference is divided by the range of the values in the gradient matrix, i.e.,  $\frac{X-X_{mean}}{X_{max}-X_{min}}$ . We perform this normalization individually for each gradient matrix. This normalization ensures that both gradients have a similar scale. The average of the normalized gradients, i.e.,  $\frac{\overline{G_A W_A^T} + \overline{G_B W_B^T}}{2}$ , is used to update the shared embedding layers.

# Algorithm 1 Training workflow for shared embedding layers with Tao

Initialize model weights, along with  $W_A$  and  $W_B$ 

- 1: for each epoch do
- 2: Compute  $L_A$  and  $L_B$  ▶ Standard forward pass
- 3: Compute gradients  $G_A$  and  $G_B$  for the prediction layers
- 4: Compute gradients  $G_A W_A^T$  and  $G_B W_B^T$  for the linear projection layer
- 5: Normalize the gradients:  $\overline{G_A W_A^T} \leftarrow \text{normalize}(G_A W_A^T)$ , and  $\overline{G_B W_B^T} \leftarrow \text{normalize}(G_B W_B^T)$
- 6: Compute average of normalized gradients  $\frac{\overline{G_A W_A^T} + \overline{G_B W_B^T}}{2}$
- 7: Use the average of normalized gradients to update embedding layers and update model parameters
- 8: end for

Algorithm 1 explains the workflow. First, the microarchitecture specific losses  $L_A$  and  $L_B$  are computed. The gradients for each performance prediction layer,  $G_A$  and  $G_B$ , are calculated based on  $L_A$  and  $L_B$ , respectively. Then, we calculate the gradients for the linear projection layer as  $G_A W_A^T$  and  $G_B W_B^T$ . For gradient normalization, we normalize both gradients individually, obtaining  $\overline{G_A W_A^T}$  and  $\overline{G_B W_B^T}$ . The final gradients for the embedding layers are the average of the normalized gradients,  $\frac{\overline{G_A W_A^T} + \overline{G_B W_B^T}}{2}$ . Finally, we update the gradients for the embedding layers and continue the backward pass.
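Steps 5 and 6 of Algorithm 1 can be sketched in plain Python on small nested-list matrices. This is an illustrative sketch, not the training code: the function names are hypothetical, the inputs stand in for $G_A W_A^T$ and $G_B W_B^T$, and the handling of a constant matrix (zero range) is an added assumption.

```python
def normalize(grad):
    """Mean-center a gradient matrix and divide by its value range:
    (X - X_mean) / (X_max - X_min)."""
    flat = [v for row in grad for v in row]
    mean = sum(flat) / len(flat)
    value_range = max(flat) - min(flat)
    if value_range == 0:
        # Assumption: a constant matrix carries no direction, so map it to zeros.
        return [[0.0 for _ in row] for row in grad]
    return [[(v - mean) / value_range for v in row] for row in grad]


def combine(grad_a, grad_b):
    """Average the individually normalized gradients of two microarchitectures."""
    norm_a, norm_b = normalize(grad_a), normalize(grad_b)
    return [[(x + y) / 2 for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(norm_a, norm_b)]


# Made-up gradients: the second is 100x larger but has the same shape.
g_a = [[0.0, 2.0], [4.0, 6.0]]
g_b = [[0.0, 200.0], [400.0, 600.0]]
print(normalize(g_a))
print(combine(g_a, g_b))
```

Because each matrix is rescaled by its own range, the 100x larger gradient no longer dominates the average, which is the imbalance that plain averaging suffers from.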

**Training dataset**. Tao uses only two microarchitectures, selected based on their performance variations, to train the model efficiently with the desired accuracy. This is significantly more efficient than training general embedding layers with random microarchitectures. To achieve the accuracy and efficiency goal, we define metrics to measure the architectural variations and select the two microarchitectures that differ the most. Below are our designs:

To measure the microarchitecture variations, we select four performance metrics, i.e., CPI, L1 cache miss, L2 cache miss, and branch misprediction rate. Of note, since our embedding is a performance embedding, we tie the microarchitecture variation to performance metrics. We choose these four performance metrics because they can capture the processor, cache, memory, and branch behaviors of a program. Together, these metrics explain the impact of key microarchitecture components on overall performance. The choice is also supported by earlier projects [27, 42, 64], which use solely these metrics to perform microarchitectural bottleneck analysis and hardware design space exploration.

We measure the performance metrics difference of different microarchitectures with Mahalanobis distance [53] instead of Euclidean or Cosine distance for two reasons: (i) Euclidean distance is sensitive to a larger value of one metric, and Cosine distance ignores the value difference. (ii) The other two distances do not consider the correlation among the performance metrics or their scales during distance computation. Mahalanobis distance measures the distance between two points in a multi-dimensional space. For two vectors X and Y, Mahalanobis distance is defined as  $D_{MD}(X,Y) = \sqrt{(X-Y)^T \cdot S^{-1} \cdot (X-Y)}$ , where  $S^{-1}$  represents the inverse of the covariance matrix of performance metrics from all designs. The covariance matrix represents how the performance metrics vary together across the designs. Using the inverse of the covariance matrix, Mahalanobis distance normalizes the data and accounts for the correlation between dimensions. This normalization makes it less sensitive to a larger metric value.
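The distance definition above can be sketched with the standard library alone. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the covariance uses the sample (n-1) estimator, and the product $S^{-1}(X-Y)$ is obtained by solving a linear system instead of forming the inverse explicitly.

```python
import math


def covariance(samples):
    """Sample covariance matrix of metric vectors (one vector per design)."""
    n, d = len(samples), len(samples[0])
    means = [sum(s[j] for s in samples) / n for j in range(d)]
    return [[sum((s[i] - means[i]) * (s[j] - means[j]) for s in samples) / (n - 1)
             for j in range(d)] for i in range(d)]


def solve(matrix, rhs):
    """Solve matrix @ z = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [list(row) + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular; sample more designs")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def mahalanobis(x, y, cov):
    """sqrt((x - y)^T S^{-1} (x - y)) for covariance matrix S = cov."""
    diff = [a - b for a, b in zip(x, y)]
    z = solve(cov, diff)  # z = S^{-1} (x - y)
    return math.sqrt(sum(d * v for d, v in zip(diff, z)))


# With an identity covariance the distance reduces to the Euclidean distance.
print(mahalanobis([3.0, 4.0], [0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # 5.0
# A metric with variance 4 is scaled down, so a gap of 2 counts as 1.
print(mahalanobis([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))  # 1.0
```

The second call shows the normalization effect: a metric with a larger spread contributes less per unit of difference. Note that the covariance matrix is only invertible when more designs are sampled than there are metrics.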

Sample design space:

| Parameters       | Range         |
|------------------|---------------|
| Branch predictor | Local, BiMode |
| L1 cache         | 16KB, 32KB    |
| L2 cache         | 256KB, 1MB    |

Performance metrics after random design selection and simulation:

| Design | CPI  | L1 cache miss rate | L2 cache miss rate | Branch mispred. |
|--------|------|--------------------|--------------------|-----------------|
| A      | 1.23 | 34%                | 21%                | 14%             |
| B      | 1.15 | 25%                | 14%                | 12%             |
| C      | 1.11 | 23%                | 12%                | 21%             |

Mahalanobis distance:

|   | A    | B    | C |
|---|------|------|---|
| A |      |      |   |
| B | 0.36 |      |   |
| C | 0.48 | 0.33 |   |

Fig. 8. Selecting training dataset.

Figure 8 shows an example of the overall process. First, we randomly select N designs from the design space. Here, we select three designs A, B and C. We perform detailed simulations of those designs using gem5 to gather the performance metrics. Each performance metric is averaged across the benchmarks. Then, we calculate the Mahalanobis distance between all pairs of designs, resulting in a 3×3 matrix. Based on the Mahalanobis distance, we select the two designs with the largest distance. Here, as the distance between A and C is the largest among all pairs, i.e., 0.48, we select A and C for microarchitecture agnostic embedding construction. Of note, since selecting a training dataset is a one-time cost, the overhead can be considered as preprocessing time.
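The final selection step amounts to picking the pair with the largest pairwise distance. A minimal sketch, using the three distances shown in Figure 8 (the function name and the dictionary layout are illustrative choices):

```python
def most_distant_pair(pairwise):
    """Return the pair of designs with the largest pairwise distance."""
    return max(pairwise, key=pairwise.get)


# Mahalanobis distances from Figure 8.
fig8_distances = {("A", "B"): 0.36, ("A", "C"): 0.48, ("B", "C"): 0.33}
print(most_distant_pair(fig8_distances))  # ('A', 'C')
```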

## 5 EVALUATION

|          | Datasets                                               | Abbr.              |
|----------|--------------------------------------------------------|--------------------|
| Training | 531.deepsjeng_r, 654.roms_s, 544.nab_r, 641.leela_s    | dee, rom, nab, lee |
| Testing  | 605.mcf_s, 523.xalancbmk_r, 621.wrf_s, 507.cactuBSSN_r | mcf, xal, wrf, cac |

Table 2. SPEC CPU2017 benchmarks used for training and testing.

**Benchmarks.** We use the widely adopted SPEC CPU2017 [15] benchmark suite to evaluate TAO. The benchmark suite contains various benchmarks designated for "speed" or "rate" for INT and FLOAT workloads, resulting in a diverse and complex range of applications like 3D rendering, image manipulation, compression, etc. We select a subset of benchmarks from each category to train and evaluate the model. Instead of randomly selecting train/test benchmarks, we select unique representative benchmarks based on the performance variations as suggested by [58] (see Table V). This allows the model to be trained by diverse instructions from various benchmarks and helps generalize TAO over new benchmarks. Table 2 shows our training and testing datasets. For a fair comparison, this train/test dataset is used for all related evaluations.

**Training dataset.** To construct the training dataset, we first generate detailed and functional traces with 100 million instructions from each training benchmark with default test workloads using the gem5 O3CPU and AtomicSimpleCPU model, respectively. Of note, we skip the first 100 million instructions as adopted by earlier projects to avoid the common program initialization phase [23, 64]. The benchmarks are compiled and simulated in ARM Instruction Set Architecture (ISA). For preprocessing, we remove the duplicate samples from the dataset and generate input features with workflow as discussed in Section 4.1. After preprocessing, the resulting training dataset contains around 180 million instructions across four training benchmarks (See Table 2).

**Design space.** Table 3 shows the overall design space and microarchitecture designs used for evaluations in the paper. We vary microarchitecture parameters related to the pipeline, cache and branch predictors, similar to those explored in earlier evaluations [23, 26, 33, 70]. We select nine design parameters with varying ranges for a single-core superscalar CPU. For example, the ROB has a minimum size of 32 entries and a maximum size of 128 entries. For evaluating simulation accuracy and throughput, we select three microarchitecture designs ( $\mu$ Arch A,  $\mu$ Arch B and  $\mu$ Arch C) with large variations from Table 3 to demonstrate the robustness of our approach. The microarchitecture parameters for each design are also shown in the table. Each microarchitecture is evaluated on the test benchmarks in Table 2. A separate DL model is trained for each microarchitecture design with transfer learning (see Figure 6).

| Components   | Design parameters | Range                                | μArch A | μArch B | μArch C    |
|--------------|-------------------|--------------------------------------|---------|---------|------------|
| Pipeline     | Fetch width       | 2, 3, 4                              | 2       | 3       | 4          |
| Pipeline     | ROB size          | 32, 64, 96, 128                      | 32      | 96      | 128        |
| Branch pred. | Algorithm         | Local, BiMode, TAGE_SC_L, Tournament | Local   | BiMode  | Tournament |
| L1 Dcache    | Associativity     | 2, 4, 6, 8                           | 2       | 4       | 8          |
| L1 Dcache    | Size              | 16KB, 32KB, 64KB, 128KB              | 16KB    | 32KB    | 64KB       |
| L1 Icache    | Associativity     | 2, 4, 6, 8                           | 2       | 4       | 8          |
| L1 Icache    | Size              | 8KB, 16KB, 32KB                      | 8KB     | 16KB    | 32KB       |
| L2 Dcache    | Associativity     | 2, 4, 6, 8                           | 2       | 4       | 8          |
| L2 Dcache    | Size              | 256KB, 512KB, 1MB, 2MB, 4MB          | 256KB   | 1MB     | 4MB        |

Table 3. Microarchitectural design space parameters choices.

**Simulation study criteria.** We study the simulation error for CPI, branch prediction, and memory access levels, as well as the throughput, in this section. In particular, simulation error represents the absolute CPI prediction error for each benchmark and is defined as  $\frac{|CPI_{pred}-CPI_{truth}|}{CPI_{truth}} \times 100\%$ . CPI is calculated as the total predicted cycles of all instructions divided by the total count of instructions.  $CPI_{pred}$  and  $CPI_{truth}$  represent the CPI derived from Tao and gem5, respectively. For evaluating cache misses and branch misprediction accuracy, we use misses per kilo instructions (MPKI). We use {<microarchitecture name>- <benchmark name>} notation in the plots to represent the outcome of benchmark <benchmark name> on microarchitecture <microarchitecture name>. Simulation throughput is measured in million instructions per second (MIPS). For evaluating Tao, we generate a functional trace with 100 million instructions for each test benchmark using the gem5 AtomicSimpleCPU model, similar to earlier studies [23, 48].
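These evaluation metrics translate directly into small helper functions. The sketch below uses hypothetical function names, and the CPI and miss-count arguments in the usage lines are made-up values; only the gem5 timing (10 billion instructions in 14.01 hours) comes from Table 4.

```python
def cpi(total_cycles, instruction_count):
    """Cycles per instruction."""
    return total_cycles / instruction_count


def simulation_error(cpi_pred, cpi_truth):
    """Absolute CPI prediction error in percent."""
    return abs(cpi_pred - cpi_truth) / cpi_truth * 100.0


def mpki(misses, instruction_count):
    """Misses (or mispredictions) per kilo instructions."""
    return misses / instruction_count * 1000.0


def mips(instruction_count, seconds):
    """Simulation throughput in million instructions per second."""
    return instruction_count / seconds / 1e6


print(simulation_error(cpi_pred=1.05, cpi_truth=1.00))  # made-up CPI values
print(mpki(misses=250, instruction_count=100_000))      # made-up miss count
print(mips(10e9, 14.01 * 3600))                         # gem5 timing from Table 4
```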

**System.** For training and simulation, we evaluate our work on a server with four A100 GPUs (80 GB) and an Intel(R) Xeon(R) Silver 4309Y 32-core CPU. We use GPUs for the DL model inference as they provide significantly higher throughput than CPUs. Other hardware accelerators like FPGAs can also be used for model inference [45]. For comparison with the state-of-the-art work, i.e., SimNet, we use the CNN (C3 hybrid) model as described in the paper [48] and GitHub [47]. We use the same datasets to train SimNet and Tao. The models are trained with PyTorch 2.1.0.

# 5.1 Comparison with the State-of-the-Art



Fig. 9. Simulation accuracy comparison with the state-of-the-art.

Proc. ACM Meas. Anal. Comput. Syst., Vol. 8, No. 2, Article 28. Publication date: June 2024.

This section compares the simulation accuracy and overall simulation time of Tao with the state-of-the-art DL-based simulator SimNet.

Figure 9 compares the simulation error for the three selected microarchitectures and four test benchmarks. The x-axis represents the simulation error relative to gem5, and the y-axis represents benchmarks from different microarchitectures. For most microarchitectures and benchmarks, Tao closely matches the simulation error of SimNet. On average, SimNet and Tao exhibit simulation errors of 5.11% and 5.23%, respectively. The slightly higher simulation error of Tao can be attributed to prediction errors for branch mispredictions and cache misses. Interestingly, Tao performs relatively better on the mcf and cac benchmarks. The improvement in mcf can be attributed to its relatively higher share of arithmetic instructions, for which Tao provides better predictions with an embedding representation of instructions, whereas SimNet uses a numerical representation. cac has a relatively higher number of memory stores and fewer branch instructions. SimNet incurs higher errors for memory store instructions due to limited input features relating to them. For the evaluated microarchitectures and benchmarks, Tao demonstrates a minimum and maximum simulation error of 3.7% and 7.4%, respectively. Benchmark cac has a relatively higher simulation error than other benchmarks for both SimNet and Tao, likely due to its distinctive memory access behavior among all the benchmarks. Notably, Tao maintains accuracy similar to that of SimNet, which does not provide these performance metrics, across benchmarks and microarchitecture designs.

| Phase      | Step             | Tao        | SimNet      | Speedup vs. SimNet (per step) | Speedup vs. SimNet (per phase) | gem5        | Speedup vs. gem5 |
|------------|------------------|------------|-------------|-------------------------------|--------------------------------|-------------|------------------|
| Training   |                  | 1.9 hours  | 54.2 hours  | 28.52×                        | 28.52×                         | -           | -                |
| Simulation | Trace generation | 0.53 hours | 13.22 hours | 24.94×                        | 7.81×                          | 14.01 hours | 7.26×            |
| Simulation | Inference        | 1.41 hours | 1.93 hours  | 1.37×                         | 7.81×                          | 14.01 hours | 7.26×            |
| Overall    |                  | 3.84 hours | 69.35 hours | 18.06×                        | 18.06×                         | 14.01 hours | 3.66×            |

Table 4. Simulation time comparison with the state-of-the-art DL-based simulator SimNet and traditional simulator gem5 for 10 billion instructions.

Table 4 compares the overall time for training and simulation of SimNet vs Tao. Both SimNet and Tao are trained until the error during training is under 6%. It takes 54.2 hours to train a CNN SimNet model. Meanwhile, with microarchitecture agnostic embeddings and transfer learning (Section 5.5), Tao can train a model with similar accuracy in merely 1.9 hours. It improves the training time by 28.52×. Overhead associated with transfer learning for Tao is discussed in Section 5.5.

For simulation, SimNet requires 13.22 hours to generate an input trace with 10 billion instructions. In contrast, by utilizing a microarchitecture independent trace, which does not simulate any microarchitecture component, Tao reduces the trace generation time significantly to 0.53 hours. Both Tao and SimNet perform parallel simulation to provide highly scalable simulation throughput. We follow the parallel simulation technique described in [59]. The program traces are partitioned into subtraces and simulated in parallel. For SimNet, it takes 1.93 hours to simulate 10 billion instructions with a simulation throughput of 1.46 MIPS. On the other hand, Tao completes the simulation in 1.41 hours with a throughput of 1.98 MIPS. This speedup results from two aspects: (i) Tao only needs to generate a functional trace for simulation, which is 24.94× faster. Of note, the functional trace is architecture agnostic, which implies we can potentially avoid regenerating the trace when moving from one microarchitecture to another. The input trace of SimNet requires simulation of the cache and branch predictor along with additional simulation of the pipeline. Furthermore, for each microarchitecture change, the trace needs to be regenerated. (ii) During inference, the relatively slow throughput of SimNet is attributed to history context simulation involving frequent CPU-GPU data movements.

The simulation throughput for Tao can be further improved with various self-attention optimization techniques [19, 29] and further simulation optimizations discussed in [59]. Leveraging functional trace and efficient DL-based simulation workflow, the simulation process is accelerated by 7.81×. Overall, Tao demonstrates a remarkable speed advantage for simulating a new microarchitecture, being 18.06× faster than SimNet. This speedup is linearly scaled with the number of microarchitecture designs and benchmarks used to simulate. gem5 provides a simulation throughput of 0.198 MIPS and takes 14.01 hours to simulate 10 billion instructions. Tao provides 7.26× speedup for simulation against gem5. Even including the training time, Tao provides a speedup of 3.66×. TAO provides further speedup when simulating more instructions or using more GPUs for simulation.
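The SimNet-relative speedups follow from the hour figures in Table 4. The short sketch below recomputes them; the dictionary layout and function name are illustrative, and because the inputs are the rounded hours from the table, the results may differ from the reported values in the last digit.

```python
# Hours reported in Table 4 for 10 billion instructions.
tao = {"training": 1.9, "trace": 0.53, "inference": 1.41}
simnet = {"training": 54.2, "trace": 13.22, "inference": 1.93}


def speedup(baseline_hours, new_hours):
    """How many times faster the new approach is than the baseline."""
    return baseline_hours / new_hours


training = speedup(simnet["training"], tao["training"])
simulation = speedup(simnet["trace"] + simnet["inference"],
                     tao["trace"] + tao["inference"])
overall = speedup(sum(simnet.values()), sum(tao.values()))

print(f"training:   {training:.2f}x")
print(f"simulation: {simulation:.2f}x")
print(f"overall:    {overall:.2f}x")
```

The overall figure is dominated by training and trace generation, which is why the per-step inference speedup of 1.37× contributes comparatively little.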

## 5.2 Detailed Trace and Functional Trace for Training Dataset

Figure 10(a) analyzes the ratio of instruction differences in the detailed trace compared to the functional trace used for training dataset construction. The y-axis represents the instruction ratio of each instruction type, and the x-axis represents the microarchitectures and benchmarks. For 100 million simulated instructions from each benchmark, the additional instructions in the detailed trace are, on average, 96.98% squashed pipeline instructions and 3.02% nop instructions. The remaining instructions are the same across microarchitectures. The variation in the count of speculative instructions across benchmarks comes from the different branch predictors and their respective accuracy.





(a) Instruction differences in percentage for speculative and nop instructions.

(b) Trace generation throughput comparison for detailed vs functional traces.

Fig. 10. Choice of the context size and branch configuration.

Figure 10(b) compares the trace generation throughput for the detailed trace and the functional trace used by Tao for simulation. On average, the trace generation throughput for detailed and functional traces is 0.21 and 5.29 MIPS, respectively. The functional traces utilized by Tao exhibit a remarkable speed advantage, being generated 25.19× faster than their detailed counterparts. The slower throughput of detailed traces can be attributed to the intricate modeling of various hardware components, such as memory, cache, and branch predictor. Notably, the functional trace throughput remains consistent across different microarchitectures for the same benchmarks.  $\mu$ Arch A, characterized by a higher occurrence of branch mispredictions, exhibits an elevated number of speculative instructions. Consequently, the average trace throughput for  $\mu$ Arch A (0.19 MIPS) is marginally lower compared to  $\mu$ Arch B (0.21 MIPS) and  $\mu$ Arch C (0.23 MIPS).

## 5.3 Phase Level Behavior

Fig. 11. Phase behavior for test benchmarks.

Figure 11 shows the phase level behavior for benchmarks mcf, xal, wrf and cac. We compare the CPI, L1 misses and branch misprediction for each benchmark against the ground truth generated from the gem5 simulation for microarchitecture  $\mu$ Arch A. While SimNet only captures the phase level behavior of CPI, Tao can also capture the behavior of instruction cache misses and branch mispredictions. Hence, we only compare against CPI for SimNet. The y-axis for L1 Dcache misses and branch mispredictions in the figure represents MPKI. We compute the average CPI, L1 misses, and branch mispredictions per ten million instructions. The x-axis shows the number of instructions in millions.
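The per-window aggregation behind such phase plots can be sketched as follows. This is an illustration under assumptions: the function name is hypothetical, the inputs are per-instruction predictions (a cycle contribution and 0/1 flags for an L1 miss and a branch misprediction), and the paper's window is ten million instructions, whereas the usage line uses a window of two for brevity.

```python
def phase_metrics(cycles, l1_miss, br_mispred, window=10_000_000):
    """Average CPI and MPKI over consecutive windows of instructions.

    Assumption: cycles[i] is the cycle contribution of instruction i, and
    l1_miss[i] / br_mispred[i] are 0/1 flags for that instruction.
    """
    phases = []
    for start in range(0, len(cycles), window):
        end = start + window
        n = len(cycles[start:end])  # the last window may be shorter
        phases.append({
            "cpi": sum(cycles[start:end]) / n,
            "l1_mpki": sum(l1_miss[start:end]) / n * 1000,
            "branch_mpki": sum(br_mispred[start:end]) / n * 1000,
        })
    return phases


# Four instructions with made-up predictions, aggregated in windows of two.
print(phase_metrics([1, 2, 1, 2], [0, 1, 0, 0], [1, 0, 0, 0], window=2))
```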

Our evaluation reveals that Tao adeptly captures the dynamic behavior of the program for each performance metric during execution. For CPI, Tao accurately captures performance variation across different phases of program execution. Notably, Tao shows slightly better phase level prediction for cac benchmark (Figure 11(j)), attributed to its enhanced accuracy in predicting latency for store instructions within the benchmark. For L1 Dcache misses, Tao precisely captures the behavior for most of the benchmarks. On average, Tao shows slightly higher prediction error for branch MPKI than CPI and L1 Dcache MPKI. Nevertheless, it still effectively captures the trend in branch MPKI over the course of program execution. Benchmark cac does not show much variation in branch MPKI due to a lower count of branch instructions (Figure 11(l)).



Fig. 12. Choice of the context size and branch configuration.

# 5.4 Multi-Metric Prediction Study

This section compares the prediction accuracy of L1 Dcache MPKI and branch MPKI with varying input features for their respective category. We train prediction models for each parameter on training benchmarks and compare the simulation accuracy across test benchmarks to determine the best value for the parameters.

**Data access level input.** Figure 12(a) varies the queue size of memory accesses  $(N_m)$  from 32 to 256 and evaluates average accuracy of test benchmarks for each microarchitecture design. The results indicate a general improvement in accuracy with increasing queue size for all microarchitecture configurations. However, beyond a queue size of 64, the accuracy improvements are marginal. Consequently, we select  $N_m = 64$  for generating input features related to data access level.

**Branch misprediction input.** Figure 12(b) investigates the influence of varying the combination of hash buckets  $N_b$  (1k, 2k) and queue size  $N_q$  (16,32) on the input features representing the history of branch PC address. The combination (1k,32) demonstrates favorable accuracy for predicting the branch MPKI, with no significant improvement observed by further increasing these parameters. Therefore, we opt for  $N_b = 1k$  and  $N_q = 32$  to generate the input features specifically tailored for branch instructions.

# 5.5 Evaluation on Transfer Learning via Microarchitecture Agnostic Embeddings

This section evaluates the effectiveness of proposed techniques for microarchitecture agnostic embeddings construction and transfer learning.



Fig. 13. Number of epochs vs test error (log scale).

Figure 13 compares the test error during training of microarchitecture agnostic embedding layers for Granite, GradNorm and Tao. For Tao, we compare the performance with and without embedding adaptation layers (Tao w/o embed). The y-axis represents the average prediction error from $\mu$Arch A and $\mu$Arch B for the test dataset during the training. The x-axis represents the number of epochs for training. While training for 200 epochs, Granite and GradNorm converge with a test error of 7.5% and 7%, respectively. Granite exhibits the highest prediction error, attributed to its challenges in handling gradient imbalance and negative transfer. GradNorm, adept at balancing gradients from each microarchitecture prediction layer, achieves a lower error than Granite. However, it falls short of further error reduction due to negative transfer. Without using embedding adaptation layers, Tao demonstrates a slight improvement over Granite, achieving a test error of 7.18% with gradient normalization, but it falls short of surpassing GradNorm. Notably, leveraging an embedding adaptation layer with gradient normalization, Tao reduces prediction error further to 5.5%.



Fig. 14. Training dataset selection.

**Training dataset.** Figure 14 evaluates the effectiveness of Mahalanobis distance against random selection and Euclidean distance for selecting the training microarchitectures used to construct microarchitecture agnostic embeddings. We report simulation error rather than training error because it reflects how well the embeddings perform with transfer learning. The y-axis represents the average simulation error for the test microarchitectures and benchmarks. For random selection, we randomly pick one to six different microarchitectures to construct reusable embeddings, excluding the test $\mu$Arch A, B and C. The simulation error starts converging after five microarchitectures and does not decrease further due to adversarial gradients from different microarchitectures. For Mahalanobis and Euclidean distance, we measure the performance variations of 20 designs randomly sampled from Table 3 and select two microarchitectures for training. Euclidean distance based selection yields a simulation error of 7.5%, compared with 8.5% for random selection. Mahalanobis distance based selection achieves a simulation error of 6.34%, slightly lower than random selection of six microarchitectures. The generated embeddings are more robust because the microarchitectures selected by Mahalanobis distance exhibit more variation.
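The distance-based selection can be sketched as below: given one row of performance metrics per sampled design, compute the pairwise Mahalanobis distance and keep the most distant pair. This is a minimal sketch under stated assumptions. The function name is hypothetical, the covariance is estimated from the sampled designs themselves, and the four two-metric rows are toy coordinates rather than measured metrics.

```python
from itertools import combinations

import numpy as np


def select_most_distant_pair(metrics):
    """Return (i, j, distance) for the pair of designs with the largest
    Mahalanobis distance. `metrics` has shape (n_designs, n_metrics)."""
    metrics = np.asarray(metrics, dtype=float)
    # Covariance of the metrics across designs; pinv tolerates a
    # singular covariance matrix (e.g. perfectly correlated metrics).
    covariance = np.atleast_2d(np.cov(metrics, rowvar=False))
    covariance_inv = np.linalg.pinv(covariance)

    best = (None, None, -1.0)
    for i, j in combinations(range(len(metrics)), 2):
        diff = metrics[i] - metrics[j]
        squared = max(float(diff @ covariance_inv @ diff), 0.0)
        distance = squared ** 0.5
        if distance > best[2]:
            best = (i, j, distance)
    return best


# Toy coordinates for illustration only (two strongly correlated metrics).
toy_metrics = [[0, 0], [1, 0], [0, 1], [5, 5]]
first, second, distance = select_most_distant_pair(toy_metrics)
print(first, second, distance)
```

In this toy input the largest Euclidean distance is between designs 0 and 3, which lie along the direction in which the two metrics vary together. The Mahalanobis distance discounts that shared direction and instead selects designs 1 and 2, which differ against the correlation.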

**Transfer learning.** Table 5 compares the effectiveness of transfer learning to train a DL model for an unseen microarchitecture. For both TAO and SimNet, we train a DL model until the error during training is close to 6%.

Here, the first approach, scratch, represents a model for an unseen microarchitecture trained from scratch without any transfer learning. Training a model from scratch takes 56 and 54 hours, respectively, for TAO and SimNet. In the second approach, direct fine-tuning, all parameters of the model are initialized from an earlier trained model. It takes 38 and 41 hours, respectively, for TAO and SimNet. The model proposed by TAO shows better transfer learning speed due to separated program embeddings and prediction layers. Although fine-tuning reduces the training time, the reduction is not significant.

| Techniques                      | TAO (in hours) | SimNet (in hours) |
|---------------------------------|----------------|-------------------|
| Scratch                         | 56             | 54                |
| Direct fine-tuning              | 38             | 41                |
| Shared embeddings + fine-tuning | 1.9            | -                 |

Table 5. Training time.

The third approach, shared embeddings + fine-tuning, is proposed by Tao and not directly applicable to SimNet. We use the embeddings constructed from microarchitecture agnostic embedding construction to train a model for unseen microarchitecture. The prediction layers are initialized from earlier trained models and fine-tuned. We only use 20 million instructions for fine-tuning the prediction layers. shared embeddings + fine-tuning further reduces the training time to only 1.9 hours. The resulting speedup comes from a reduced number of epochs for training, less inference time with shared embeddings and less per epoch time due to reduced datasets.
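The idea of reusing shared embeddings and fine-tuning only the prediction layers can be illustrated with a deliberately tiny model. The sketch below is not Tao's architecture: it assumes a lookup-table embedding and a single linear prediction head trained with plain gradient descent, and all names and sizes are hypothetical. It only shows the mechanics, namely that the embedding table is read but never updated, while the head starts from previously trained weights.

```python
import numpy as np


def mse(embeddings, head, token_ids, targets):
    """Mean squared error of a linear head on top of looked-up embeddings."""
    predictions = embeddings[token_ids] @ head
    return float(np.mean((predictions - targets) ** 2))


def fine_tune_head(embeddings, head, token_ids, targets, lr=0.02, epochs=500):
    """Gradient descent on the head only; the embeddings stay frozen."""
    features = embeddings[token_ids]  # looked up once, never modified
    head = head.copy()                # start from the earlier trained head
    for _ in range(epochs):
        residual = features @ head - targets
        head -= lr * (2.0 / len(targets)) * (features.T @ residual)
    return head


rng = np.random.default_rng(0)
shared_embeddings = rng.normal(size=(8, 4))  # stands in for shared embeddings
embeddings_before = shared_embeddings.copy()

old_head = rng.normal(size=4)                # head from an earlier model
new_true_head = old_head + 0.5 * rng.normal(size=4)  # "unseen" target
token_ids = rng.integers(0, 8, size=64)
targets = shared_embeddings[token_ids] @ new_true_head

error_before = mse(shared_embeddings, old_head, token_ids, targets)
new_head = fine_tune_head(shared_embeddings, old_head, token_ids, targets)
error_after = mse(shared_embeddings, new_head, token_ids, targets)
print(error_before, error_after)
```

Because only the small head is updated, each step touches far fewer parameters than full fine-tuning, which is consistent with the reduced per-epoch cost described above.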

Table 6 shows the preprocessing overhead of one-time microarchitecture agnostic embedding construction. Constructing the embeddings involves training dataset selection and training shared embeddings (refer to Section 4.3). To collect the training dataset, we first randomly sample 16 microarchitectures from the design space outlined in Table 3, which has 184,320 total possible designs. For each sampled microarchitecture, we simulate 10 million instructions for all training benchmarks with gem5. It takes 0.35 hours to simulate and gather the performance metrics for each microarchitecture. Then, we select two microarchitectures for training the embeddings based on the Mahalanobis distance among the performance metrics of the 16 microarchitectures. We use a Python script to compute the distances, which takes only 0.1 min, and select the two microarchitectures with the maximum distance. The microarchitecture agnostic embeddings are trained from the training dataset of the two selected microarchitectures, which takes around 71 hours.

| Training dataset selection: random design selection and simulation | Training dataset selection: distance-based selection | Training embeddings |
|--------------------------------------------------------------------|------------------------------------------------------|---------------------|
| 0.35 hours                                                         | 0.1 min                                              | 71 hours            |

Table 6. Overhead of architecture agnostic embedding construction.
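A short worked calculation shows how quickly the one-time overhead is amortized. It combines the overheads reported here with the per-microarchitecture training times of Table 5, under two assumptions that the text does not confirm: the 0.35 hours applies to each of the 16 sampled designs and the simulations run sequentially, and the time to collect the fine-tuning dataset for each new microarchitecture is ignored.

```python
import math

# Values taken from the text and Tables 5 and 6.
SIMULATION_HOURS_PER_DESIGN = 0.35
NUM_SAMPLED_DESIGNS = 16
SELECTION_HOURS = 0.1 / 60          # 0.1 minutes
EMBEDDING_TRAINING_HOURS = 71
FINE_TUNE_HOURS = 1.9               # shared embeddings + fine-tuning
SCRATCH_HOURS = 56                  # training from scratch

# Assumption: sampled designs are simulated one after another.
one_time_hours = (
    SIMULATION_HOURS_PER_DESIGN * NUM_SAMPLED_DESIGNS
    + SELECTION_HOURS
    + EMBEDDING_TRAINING_HOURS
)


def total_with_shared_embeddings(num_designs):
    return one_time_hours + num_designs * FINE_TUNE_HOURS


def total_from_scratch(num_designs):
    return num_designs * SCRATCH_HOURS


# Smallest number of new microarchitectures for which sharing pays off.
break_even = math.ceil(one_time_hours / (SCRATCH_HOURS - FINE_TUNE_HOURS))
print(round(one_time_hours, 1), break_even)
```

Under these assumptions the one-time cost is about 76.6 hours and is recovered from the second new microarchitecture onward. If the 0.35 hours instead covers all 16 designs, the one-time cost drops to about 71.4 hours and the break-even point is unchanged.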

# 5.6 Hardware Design Space Exploration

This section aims to determine if TAO can be used for microarchitectural design space exploration. For evaluation, we vary the L1 Dcache size (16KB, 32KB, 64KB, 128KB) and the branch predictors (Local, Tournament, BiMode, and TAGE\_SC\_L).



Fig. 15. Hardware design space exploration of L1 Dcache misses and branch mispredictions with TAO.

Figure 15(a) compares the average cache MPKI across four test benchmarks obtained while varying the L1 Dcache size for gem5 simulation and Tao. The simulated cache MPKI decreases as the cache size increases from 16KB to 128KB. The cache MPKI predicted by Tao aligns with the simulated result from gem5, with both showing that a cache size of 128KB results in the lowest MPKI.

In Figure 15(b), we compare the average branch MPKI across four test benchmarks using different branch predictors for gem5 and Tao. The simulated result from gem5 indicates the highest branch MPKI for Local and the lowest for TAGE\_SC\_L. The branch MPKI predicted by Tao also aligns with the simulated result from gem5. The prediction error is lower for simpler branch predictors like Local, with only a marginal increase for relatively complex branch predictors like TAGE\_SC\_L. Nonetheless, Tao maintains the relative accuracy across the spectrum of branch predictors. Overall, Tao's predictions align with the simulated results from gem5 for hardware design exploration of L1 Dcache and branch predictors.
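For design space exploration, what matters most is whether the predictor orders candidate designs the same way as the reference simulator. The sketch below shows how MPKI (misses per thousand instructions) and such a ranking check could be computed. The helper names are hypothetical and the miss counts are placeholders chosen for illustration, not values from Figure 15.

```python
def mpki(event_count, instruction_count):
    """Misses (or mispredictions) per thousand instructions."""
    return 1000.0 * event_count / instruction_count


def rank_designs(mpki_by_design):
    """Design names ordered from lowest to highest MPKI."""
    return sorted(mpki_by_design, key=mpki_by_design.get)


def agrees_on_ranking(reference, predicted):
    """True if both sources order the designs identically."""
    return rank_designs(reference) == rank_designs(predicted)


INSTRUCTIONS = 10_000_000

# Placeholder miss counts for illustration only; not measured values.
placeholder_reference_misses = {
    "16KB": 400_000, "32KB": 310_000, "64KB": 250_000, "128KB": 220_000,
}
placeholder_predicted_misses = {
    "16KB": 390_000, "32KB": 320_000, "64KB": 245_000, "128KB": 228_000,
}

reference_mpki = {
    name: mpki(count, INSTRUCTIONS)
    for name, count in placeholder_reference_misses.items()
}
predicted_mpki = {
    name: mpki(count, INSTRUCTIONS)
    for name, count in placeholder_predicted_misses.items()
}

print(rank_designs(reference_mpki))
print(agrees_on_ranking(reference_mpki, predicted_mpki))
```

A check of this kind tolerates small absolute errors in the predicted MPKI as long as the relative ordering of designs is preserved.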

#### 6 GENERALITY OF TAO

**Unseen benchmarks.** Tao can be generalized across a wide variety of unseen benchmarks. The generality of Tao comes from the fact that the deep learning model is trained at the instruction level. We use multiple diverse training benchmarks to train over a variety of instructions. That allows TAO to predict performance metrics for each instruction across different benchmarks accurately. Our evaluation confirms that Tao maintains good accuracy over unseen benchmarks.

**Unseen architectures.** TAO is designed to simulate single-core out-of-order superscalar processors. To simulate an unseen microarchitecture, we gather a training dataset through gem5 simulation and train the DL model with transfer learning (see Figure 1(d)). TAO can accommodate changes in ISAs similarly to microarchitecture changes, with some additional feature engineering for ISA-specific opcodes and registers. Tao cannot be directly used to simulate multi-core CPU and GPU architectures. However, the techniques proposed in this paper, i.e., microarchitecture agnostic trace, embeddings, and multi-metric prediction, establish a framework for rapid DL-based simulation and are transferable to other architectures.

#### 7 RELATED WORK

Along with traditional simulation approaches, there have been parallel efforts to use ML/DL approaches for building performance models. In addition to the related work in Section 1.1, this section discusses ML and DL-based performance models.

**ML-based performance models.** ML-based performance models opt to build a performance model that can extrapolate the performance to unseen microarchitecture designs by simulating a few designs. Of note, these performance models aid in design exploration, unlike ML-based performance models that model the runtime of applications on CPUs [46, 78], data centers [66] or GPUs [9, 13, 49, 71, 75]. These works use linear regression [34], artificial neural networks (ANN) [32], spline functions [42, 43], and radial basis function networks [33] to build the performance model. These models take the program features and/or architecture configurations as input for predictions. [34] applies linear regression to obtain estimates of performance modeled as a weighted sum of predictor variables like cache size and branch predictor. [42, 43] and [33] add spline functions and radial basis functions, respectively, to better model the non-linearity between the design parameters and performance. [23] further studies the correlation of microarchitecture parameters to the best configuration using microarchitecture specific parameters. [32] uses an ANN to automatically learn a prediction model instead of building models with domain knowledge. [63] proposes a GNN-based learned cost model to estimate performance metrics of ML hardware accelerators. However, these ML-based models fail to accurately model the complex dynamic run-time interaction between the program and hardware.

**DL-based detailed performance models.** DL-based prediction models overcome the limitations of ML-based performance models by modeling at the instruction level. Ithemal [54] and Granite [67] are two recent works performing basic block prediction. These models first gather training datasets by collecting the basic blocks with tools like DynamoRIO [14]. The models predict the latency of each block separately. The input features are constructed based on each instruction and their structure in the basic blocks. Ithemal uses LSTM to construct the embeddings for each basic block hierarchically. Meanwhile, Granite leverages the structure and dependency graph of instructions within the basic block and GNN models for throughput prediction. They are mostly used to assist compilers. Basic block throughput prediction models are limited to static basic block prediction, ignoring the impact of caches and branch prediction.

#### 8 CONCLUSION

This paper introduces a DL-based simulator, Tao, that supports detailed, accurate and fast microarchitecture explorations. Tao includes three contributions: First, we propose a workflow for training dataset construction that allows the reuse of execution traces for simulation. Second, to increase the detail of the simulation, we redesign the input features and the DL model using self-attention layers to support predicting a set of performance metrics of interest. Third, we propose techniques to train a microarchitecture agnostic embedding layer that enables fast transfer learning between different microarchitectural configurations and effectively reduces the re-training overhead of conventional DL-based simulators. Tao is the first DL simulator, to the best of our knowledge, that supports detailed architecture metrics, and it is 18.06× faster than the state-of-the-art simulator, i.e., SimNet.

#### 9 ACKNOWLEDGEMENT

We thank the anonymous reviewers and shepherd Derek Eager for their helpful suggestions and feedback. We also would like to extend our gratitude towards Cliff Young and James Laudon for their feedback on the early draft of this work. We also appreciate the support from the extended team at Google DeepMind. This work was in part supported by the NSF CRII Award No. 2331536, CAREER Award No. 2326141, and NSF Awards 2212370, 2319880, 2328948, 2319975 and 233130.

#### REFERENCES

- [1] Jung Ho Ahn, Sheng Li, O Seongil, and Norman P Jouppi. 2013. McSimA+: A manycore simulator with application-level+ simulation and detailed microarchitecture modeling. In 2013 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 74–85.
- [2] Ayaz Akram and Lina Sawalha. 2019. A survey of computer architecture simulation techniques and tools. *IEEE Access* 7 (2019), 78120–78145.
- [3] George Almási, Călin Cașcaval, and David A Padua. 2002. Calculating stack distances efficiently. In *Proceedings of the 2002 workshop on Memory system performance*. 37–43.
- [4] Abdullah Alomar, Pouya Hamadanian, Arash Nasr-Esfahany, Anish Agarwal, Mohammad Alizadeh, and Devavrat Shah. 2023. CausalSim: A Causal Framework for Unbiased Trace-Driven Simulation. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). 1115–1147.
- [5] Yehia Arafa, Abdel-Hameed Badawy, Ammar ElWazir, Atanu Barai, Ali Eker, Gopinath Chennupati, Nandakishore Santhi, and Stephan Eidenbenz. 2021. Hybrid, scalable, trace-driven performance modeling of GPGPUs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. 1–15.
- [6] Newsha Ardalani, Clint Lestourgeon, Karthikeyan Sankaralingam, and Xiaojin Zhu. 2015. Cross-architecture performance prediction (XAPP) using CPU code to predict GPU performance. In *Proceedings of the 48th International Symposium on Microarchitecture*. 725–737.
- [7] Todd Austin, Eric Larson, and Dan Ernst. 2002. SimpleScalar: An infrastructure for computer system modeling. *Computer* 35, 2 (2002), 59–67.
- [8] Chen Bai, Jiayi Huang, Xuechao Wei, Yuzhe Ma, Sicheng Li, Hongzhong Zheng, Bei Yu, and Yuan Xie. 2023. ArchExplorer: Microarchitecture exploration via bottleneck analysis. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 268–282.
- [9] Ioana Baldini, Stephen J Fink, and Erik Altman. 2014. Predicting gpu performance from cpu runs using machine learning. In 2014 IEEE 26th International Symposium on Computer Architecture and High Performance Computing. IEEE, 254–261.
- [10] Fabrice Bellard. 2005. QEMU, a fast and portable dynamic translator. In USENIX annual technical conference, FREENIX Track, Vol. 41. California, USA, 46.
- [11] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R Hower, Tushar Krishna, Somayeh Sardashti, et al. 2011. The gem5 simulator. ACM SIGARCH computer architecture news 39, 2 (2011), 1–7.

- [12] Hadi Brais, Rajshekar Kalayappan, and Preeti Ranjan Panda. 2020. A survey of cache simulators. *ACM Computing Surveys (CSUR)* 53, 1 (2020), 1–32.
- [13] Lorenz Braun, Sotirios Nikas, Chen Song, Vincent Heuveline, and Holger Fröning. 2020. A simple model for portable and fast prediction of execution time and power consumption of GPU kernels. ACM Transactions on Architecture and Code Optimization (TACO) 18, 1 (2020), 1–25.
- [14] Derek Bruening, Qin Zhao, and Saman Amarasinghe. 2012. Transparent dynamic instrumentation. In Proceedings of the 8th ACM SIGPLAN/SIGOPS conference on Virtual Execution Environments. 133–144.
- [15] James Bucek, Klaus-Dieter Lange, and Jóakim v. Kistowski. 2018. SPEC CPU2017: Next-generation compute benchmark. In Companion of the 2018 ACM/SPEC International Conference on Performance Engineering. 41–42.
- [16] Anastasiia Butko, Rafael Garibotti, Luciano Ost, Vianney Lapotre, Abdoulaye Gamatie, Gilles Sassatelli, and Chris Adeniyi-Jones. 2015. A trace-driven approach for fast and accurate simulation of manycore architectures. In *The 20th Asia and South Pacific Design Automation Conference*. IEEE, 707–712.
- [17] Călin Cașcaval and David A Padua. 2003. Estimating cache misses and locality using stack distances. In *Proceedings of the 17th annual international conference on Supercomputing*. 150–159.
- [18] Trevor E Carlson, Wim Heirman, and Lieven Eeckhout. 2011. Sniper: Exploring the level of abstraction for scalable and accurate parallel multi-core simulation. In *Proceedings of 2011 International Conference for High Performance Computing*, *Networking*, *Storage and Analysis*. 1–12.
- [19] Shiyang Chen, Shaoyi Huang, Santosh Pandey, Bingbing Li, Guang R Gao, Long Zheng, Caiwen Ding, and Hang Liu. 2021. Et: re-thinking self-attention for transformer models on gpus. In Proceedings of the international conference for high performance computing, networking, storage and analysis. 1–18.
- [20] Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In *International conference on machine learning*. PMLR, 794–803.
- [21] Bob Cmelik and David Keppel. 1994. Shade: A fast instruction-set simulator for execution profiling. In Proceedings of the 1994 ACM SIGMETRICS conference on Measurement and modeling of computer systems. 128–137.
- [22] Chen Ding and Yutao Zhong. 2003. Predicting whole-program locality through reuse distance analysis. In Proceedings of the ACM SIGPLAN 2003 conference on Programming language design and implementation. 245–257.
- [23] Christophe Dubach, Timothy Jones, and Michael O'Boyle. 2007. Microarchitectural design space exploration using an architecture-centric approach. In 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007). IEEE, 262–271.
- [24] Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero, and Alexander V Veidenbaum. 2012. Improving cache management policies using dynamic reuse distances. In 2012 45th annual IEEE/ACM international symposium on microarchitecture. IEEE, 389–400.
- [25] Muhammad ES Elrabaa, Ayman Hroub, Muhamed F Mudawar, Amran Al-Aghbari, Mohammed Al-Asli, and Ahmad Khayyat. 2017. A very fast trace-driven simulation platform for chip-multiprocessors architectural explorations. IEEE Transactions on Parallel and Distributed Systems 28, 11 (2017), 3033–3045.
- [26] Stijn Eyerman, Lieven Eeckhout, Tejas Karkhanis, and James E Smith. 2009. A mechanistic performance model for superscalar out-of-order processors. ACM Transactions on Computer Systems (TOCS) 27, 2 (2009), 1–37.
- [27] Brian A Fields, Rastislav Bodik, Mark D Hill, and Chris J Newburn. 2003. Using interaction costs for microarchitectural bottleneck analysis. In Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36. IEEE, 228–239.
- [28] Stephen R Goldschmidt and John L Hennessy. 1993. The accuracy of trace-driven simulations of multiprocessors. ACM SIGMETRICS Performance Evaluation Review 21, 1 (1993), 146–157.
- [29] Cong Guo, Bo Yang Hsueh, Jingwen Leng, Yuxian Qiu, Yue Guan, Zehuan Wang, Xiaoying Jia, Xipeng Li, Minyi Guo, and Yuhao Zhu. 2020. Accelerating sparse dnn models without hardware-support via tile-wise sparsity. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15.
- [30] John L Hennessy and David A Patterson. 2011. Computer architecture: a quantitative approach (fifth ed.). Elsevier.
- [31] Kenneth Hoste and Lieven Eeckhout. 2007. Microarchitecture-independent workload characterization. *IEEE micro* 27, 3 (2007), 63–72.
- [32] Engin İpek, Sally A McKee, Rich Caruana, Bronis R de Supinski, and Martin Schulz. 2006. Efficiently exploring architectural design spaces via predictive modeling. ACM SIGOPS Operating Systems Review 40, 5 (2006), 195–206.
- [33] PJ Joseph, Kapil Vaswani, and Matthew J Thazhuthaveetil. 2006. A predictive performance model for superscalar processors. In 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO'06). IEEE, 161–170.
- [34] PJ Joseph, Kapil Vaswani, and Matthew J Thazhuthaveetil. 2006. Construction and use of linear regression models for processor performance analysis. In *The Twelfth International Symposium on High-Performance Computer Architecture*, 2006. IEEE, 99–108.
- [35] Mahmut Kandemir, Hui Zhao, Xulong Tang, and Mustafa Karakoy. 2015. Memory row reuse distance and its role in optimizing application performance. In Proceedings of the 2015 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems. 137–149.
- [36] Svilen Kanev, Gu-Yeon Wei, and David Brooks. 2012. XIOSim: power-performance modeling of mobile x86 cores. In *Proceedings of the 2012 ACM/IEEE international symposium on Low power electronics and design.* 267–272.
- [37] Mahmoud Khairy, Zhesheng Shen, Tor M Aamodt, and Timothy G Rogers. 2020. Accel-Sim: An extensible simulation framework for validated GPU modeling. In 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). IEEE, 473–486.
- [38] Hyesoon Kim, Jaekyu Lee, Nagesh B Lakshminarayana, Jaewoong Sim, Jieun Lim, and Tri Pho. 2012. Macsim: A cpu-gpu heterogeneous simulation framework user guide. *Georgia Institute of Technology* (2012).
- [39] Ryan Gary Kim, Janardhan Rao Doppa, and Partha Pratim Pande. 2018. Machine learning for design space exploration and optimization of manycore systems. In 2018 IEEE/ACM International Conference on Computer-Aided Design (ICCAD). IEEE, 1–6.
- [40] Srivatsan Krishnan, Amir Yazdanbakhsh, Shvetank Prakash, Jason Jabbour, Ikechukwu Uchendu, Susobhan Ghosh, Behzad Boroujerdian, Daniel Richins, Devashree Tripathy, Aleksandra Faust, et al. 2023. Archgym: An open-source gymnasium for machine learning assisted architecture design. In Proceedings of the 50th Annual International Symposium on Computer Architecture. 1–16.
- [41] Eric Larson, Saugata Chatterjee, and Todd M Austin. 2001. MASE: a novel infrastructure for detailed microarchitectural modeling. In ISPASS, Vol. 1. 9.
- [42] Benjamin C Lee and David M Brooks. 2006. Accurate and efficient regression modeling for microarchitectural performance and power prediction. ACM SIGOPS operating systems review 40, 5 (2006), 185–194.
- [43] Benjamin C Lee and David M Brooks. 2007. Illustrative design space studies with microarchitectural regression models. In 2007 IEEE 13th International Symposium on High Performance Computer Architecture. IEEE, 340–351.
- [44] Kiyeon Lee, Shayne Evans, and Sangyeun Cho. 2009. Accurately approximating superscalar processor performance from traces. In 2009 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 238–248.
- [45] Bingbing Li, Santosh Pandey, Haowen Fang, Yanjun Lyv, Ji Li, Jieyang Chen, Mimi Xie, Lipeng Wan, Hang Liu, and Caiwen Ding. 2020. Ftrans: energy-efficient acceleration of transformers using fpga. In Proceedings of the ACM/IEEE International Symposium on Low Power Electronics and Design. 175–180.
- [46] Jiangtian Li, Xiaosong Ma, Karan Singh, Martin Schulz, Bronis R de Supinski, and Sally A McKee. 2009. Machine learning based online performance prediction for runtime parallelization and task scheduling. In 2009 IEEE international symposium on performance analysis of systems and software. IEEE, 89–100.
- [47] Lingda Li. [n. d.]. Lingda-li/simnet. https://github.com/lingda-li/simnet
- [48] Lingda Li, Santosh Pandey, Thomas Flynn, Hang Liu, Noel Wheeler, and Adolfy Hoisie. 2022. SimNet: Accurate and High-Performance Computer Architecture Simulation Using Deep Learning. Proc. ACM Meas. Anal. Comput. Syst. 6, 2, Article 25 (jun 2022), 24 pages. https://doi.org/10.1145/3530891
- [49] Ying Li, Yifan Sun, and Adwait Jog. 2023. Path Forward Beyond Simulators: Fast and Accurate GPU Execution Time Prediction for DNN Workloads. In Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 380–394.
- [50] Jieun Lim, Nagesh B Lakshminarayana, Hyesoon Kim, William Song, Sudhakar Yalamanchili, and Wonyong Sung. 2014. Power modeling for GPU architectures using McPAT. ACM Transactions on Design Automation of Electronic Systems (TODAES) 19, 3 (2014), 1–24.
- [51] Sooyoung Lim and Dongchul Park. 2022. Efficient Stack Distance Approximation Based on Workload Characteristics. *IEEE Access* 10 (2022), 59792–59805.
- [52] Shengchao Liu, Yingyu Liang, and Anthony Gitter. 2019. Loss-balanced task weighting to reduce negative transfer in multi-task learning. In Proceedings of the AAAI conference on artificial intelligence, Vol. 33. 9977–9978.
- [53] Geoffrey J McLachlan. 1999. Mahalanobis distance. Resonance 4, 6 (1999), 20–26.
- [54] Charith Mendis, Alex Renda, Saman Amarasinghe, and Michael Carbin. 2019. Ithemal: Accurate, portable and fast basic block throughput estimation using deep neural networks. In *International Conference on machine learning*. PMLR, 4505–4515.
- [55] Jason E Miller, Harshad Kasture, George Kurian, Charles Gruenwald, Nathan Beckmann, Christopher Celio, Jonathan Eastep, and Anant Agarwal. 2010. Graphite: A distributed parallel simulator for multicores. In HPCA-16 2010 The Sixteenth International Symposium on High-Performance Computer Architecture. IEEE, 1–12.
- [56] Hashem H Najaf-Abadi and Eric Rotenberg. 2008. Configurational workload characterization. In ISPASS 2008-IEEE International Symposium on Performance Analysis of Systems and software. IEEE, 147–156.
- [57] Pablo Montesinos Ortego and Paul Sack. 2004. SESC: SuperESCalar simulator. In 17th Euromicro conference on real time systems (ECRTS'05). Citeseer, 1–4.
- [58] Reena Panda, Shuang Song, Joseph Dean, and Lizy K John. 2018. Wait of a decade: Did spec cpu 2017 broaden the performance horizon?. In 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 271–282.

- [59] Santosh Pandey, Lingda Li, Thomas Flynn, Adolfy Hoisie, and Hang Liu. 2022. Scalable deep learning-based microarchitecture simulation on GPUs. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 1–15.
- [60] Alejandro Rico, Alejandro Duran, Felipe Cabarcas, Yoav Etsion, Alex Ramirez, and Mateo Valero. 2011. Trace-driven simulation of multithreaded applications. In (IEEE ISPASS) IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 87–96.
- [61] Daniel Sanchez and Christos Kozyrakis. 2013. ZSim: Fast and accurate microarchitectural simulation of thousand-core systems. ACM SIGARCH Computer architecture news 41, 3 (2013), 475–486.
- [62] Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. Advances in neural information processing systems 31 (2018).
- [63] Kiran Seshadri, Berkin Akin, James Laudon, Ravi Narayanaswami, and Amir Yazdanbakhsh. 2022. An evaluation of edge tpu accelerators for convolutional neural networks. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 79–91.
- [64] Timothy Sherwood, Erez Perelman, Greg Hamerly, and Brad Calder. 2002. Automatically characterizing large scale program behavior. *ACM SIGPLAN Notices* 37, 10 (2002), 45–57.
- [65] Kevin Skadron, Margaret Martonosi, David I August, Mark D Hill, David J Lilja, and Vijay S Pai. 2003. Challenges in computer architecture evaluation. Computer 36, 8 (2003), 30–36.
- [66] Gagan Somashekar, Karan Tandon, Anush Kini, M Das, Petr Husak, CC Chang, R Bhagwan, N Natarajan, and A Gandhi. 2024. Oppertune: Post-deployment configuration tuning of services made easy. In 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). USENIX Association.
- [67] Ondřej Sýkora, Phitchaya Mangpo Phothilimthana, Charith Mendis, and Amir Yazdanbakhsh. 2022. GRANITE: A Graph Neural Network Model for Basic Block Throughput Estimation. In 2022 IEEE International Symposium on Workload Characterization (IISWC). IEEE, 14–26.
- [68] Rafael Ubal, Byunghyun Jang, Perhaad Mistry, Dana Schaa, and David Kaeli. 2012. Multi2Sim: A simulation framework for CPU-GPU computing. In Proceedings of the 21st international conference on Parallel architectures and compilation techniques. 335–344.
- [69] Richard A Uhlig and Trevor N Mudge. 1997. Trace-driven memory simulation: A survey. ACM Computing Surveys (CSUR) 29, 2 (1997), 128–170.
- [70] Sam Van den Steen, Stijn Eyerman, Sander De Pestel, Moncef Mechri, Trevor E Carlson, David Black-Schaffer, Erik Hagersten, and Lieven Eeckhout. 2016. Analytical processor performance and power modeling using micro-architecture independent characteristics. IEEE Trans. Comput. 65, 12 (2016), 3537–3551.
- [71] Gene Wu, Joseph L Greathouse, Alexander Lyashevsky, Nuwan Jayasena, and Derek Chiou. 2015. GPGPU performance and power estimation using machine learning. In 2015 IEEE 21st international symposium on high performance computer architecture (HPCA). IEEE, 564–576.
- [72] Roland E Wunderlich, Thomas F Wenisch, Babak Falsafi, and James C Hoe. 2003. SMARTS: Accelerating microarchitecture simulation via rigorous statistical sampling. In Proceedings of the 30th annual international symposium on Computer architecture. 84–97.
- [73] Amir Yazdanbakhsh, Christof Angermueller, Berkin Akin, Yanqi Zhou, Albin Jones, Milad Hashemi, Kevin Swersky, Satrajit Chatterjee, Ravi Narayanaswami, and James Laudon. 2021. Apollo: Transferable architecture exploration. arXiv preprint arXiv:2102.01723 (2021).
- [74] Wu Ye, Narayanan Vijaykrishnan, Mahmut Kandemir, and Mary Jane Irwin. 2000. The design and use of simplepower: a cycle-accurate energy estimation tool. In *Proceedings of the 37th Annual Design Automation Conference*. 340–345.
- [75] Fuxun Yu, Shawn Bray, Di Wang, Longfei Shangguan, Xulong Tang, Chenchen Liu, and Xiang Chen. 2021. Automated runtime-aware scheduling for multi-tenant dnn inference on gpu. In 2021 IEEE/ACM International Conference On Computer Aided Design (ICCAD). IEEE, 1–9.
- [76] Tianhe Yu, Saurabh Kumar, Abhishek Gupta, Sergey Levine, Karol Hausman, and Chelsea Finn. 2020. Gradient surgery for multi-task learning. Advances in Neural Information Processing Systems 33 (2020), 5824–5836.
- [77] Xiangyun Zhao, Haoxiang Li, Xiaohui Shen, Xiaodan Liang, and Ying Wu. 2018. A modulation module for multi-task learning with applications in image retrieval. In *Proceedings of the European Conference on Computer Vision (ECCV)*. 401–416.
- [78] Xinnian Zheng, Lizy K John, and Andreas Gerstlauer. 2016. Accurate phase-level cross-platform power and performance estimation. In *Proceedings of the 53rd Annual Design Automation Conference*. 1–6.
- [79] Yutao Zhong, Xipeng Shen, and Chen Ding. 2009. Program locality analysis using reuse distance. ACM Transactions on Programming Languages and Systems (TOPLAS) 31, 6 (2009), 1–39.

Received January 2023; revised April 2024; accepted April 2024