# The **Champ**ionship **Sim**ulator: Architectural Simulation for Education and Competition

Nathan Gober\*, Gino Chacon\*, Lei Wang\*, Paul V. Gratz\*, Daniel A. Jiménez\*, Elvira Teran<sup>†</sup>, Seth Pugsley<sup>‡</sup>, Jinchun Kim\*

\* Texas A&M University

{ngober, ginochacon, wilsonwang2019}@tamu.edu, pgratz@gratz1.com, djimenez@acm.org, cienlux@gmail.com

† Texas A&M International University
elviraterangarcia@gmail.com

‡ Intel
sethpugsley@gmail.com

Abstract—Recent years have seen a dramatic increase in the microarchitectural complexity of processors. This increase in complexity presents a twofold challenge for the field of computer architecture. First, no individual architect can fully comprehend the complexity of the entire microarchitecture of the core. This leads to increasingly specialized architects, who treat parts of the core outside their particular expertise as black boxes. Second, with increasing complexity, the field becomes decreasingly accessible to new students of the field. When learning core microarchitecture, new students must first learn the big picture of how the system works in order to understand how the pieces all fit together. The tools used to study microarchitecture experience a similar struggle. As with the microarchitectures they simulate, an increase in complexity reduces accessibility to new users.

In this work, we present ChampSim. ChampSim uses a modular design and configurable structure to achieve a low barrier to entry into the field of microarchitectural simulation. ChampSim has shown itself to be useful in multiple areas of research, competition, and education. In this way, we seek to promote access and inclusion despite the increasing complexity of the field of computer architecture.

## I. INTRODUCTION

Computer architecture has developed over the decades into a mature field with strong specialization. Processors have grown in complexity, gaining more and more features as transistor density continues to increase, and as each new iterative innovation builds upon previous solutions. An increasingly lengthy historical perspective is required to fully appreciate the state of the leading edge of processors. Also, the trend of software tools used to assist microarchitecture development has been towards large, complex, and monolithic simulation environments.

Simultaneously, computer architecture education has become a core component of undergraduate computer engineering and computer science curricula. A basic understanding of computer architecture topics such as cache replacement and branch prediction is no longer restricted to graduate-level courses, but can easily find a place in an undergraduate research project or classroom assignment. While precise tools used by experts provide a high level of correctness, students need tools appropriate for their level of knowledge, experience, and ability. A large gap exists between the need for arcane tools used by experts and the need for accessible tools that can be used by novices. In this gap should stand a tool that strikes a balance between correctness and accessibility.

Computer hardware development is impeded by the long production pipelines required to realize a design. If a hardware design must traverse its entire life cycle in order to be evaluated, it may take many years to produce a well-tested design. It is furthermore infeasible to tape out many iterations of a design in order to evaluate their comparative performance. In order to reduce costs and mitigate long production pipeline times, hardware architects turn to higher-level software simulation of their designs. Simulation bridges the gap by enabling rapid comparisons between the designs, giving an approximation of the system's performance. Software simulation of designs is, by its nature, far slower than the hardware designs being evaluated, but allows for a much shorter design-to-evaluation time.

The advancement of the field of computer architecture depends on accurate and rapid simulation. Simulator authors make tradeoffs between evaluation time and simulation accuracy. The landscape of simulators today is wide, but few see broad usage. Some simulators are less concerned with evaluation time, and are more concerned with simulation accuracy. Others place greater emphasis on model performance over a more detail-oriented approach. Such simulators find their balance in a short evaluation time.

In this work, we present ChampSim, a simulator designed to promote innovative research, inclusive education, and healthy competition in the field of computer architecture. The ChampSim simulator is derived from the simulation tools used in the Second Data Prefetching Championship, which was held in conjunction with ISCA 2015. ChampSim simulates a heterogeneous multicore system with an arbitrary memory hierarchy, where each out-of-order core can be configured arbitrarily. It is intended to promote access in the rapidly-growing field of computer architecture. To accomplish this, we keep three key principles in mind: low startup time, broad applicability, and design configurability. These design principles have led to a simulator that favors usability and rapid iteration.

This paper will discuss in detail the purpose, design, and effectiveness of the ChampSim simulator, beginning with a discussion of the guiding principles of the design in Section II, details on the key features of ChampSim's modular architecture in Section III, a history of ChampSim's application in the field and a vision of its continuing place in Section IV, and a list of future development plans in Section VI.

## II. THE CHAMPSIM ARCHITECTURAL SIMULATOR

The ChampSim simulator has its roots in the simulation environment used for the Second Data Prefetching Championship [1]. In a competition environment, it is valuable to have an easy-to-use environment to encourage wide participation and a variety of novel submissions. These accessibility principles have continued through the development of ChampSim and have emerged into three guiding design principles: low startup time, broad applicability, and high configurability.

## A. Low startup time

While failure is an important part of learning, wrangling with a highly complex simulator before even beginning the implementation of a new idea can frustrate the learning process of any student or beginning researcher. With this insight in mind, we aim for a new user to be able to download and compile ChampSim in a few minutes, create their first design in a few hours, and perform new and meaningful computer architecture research within a few weeks. In each of the applications for which ChampSim is intended, it is valuable that a user is able to begin using the simulator quickly. Furthermore, the runtime of the simulation should be short enough to provide quick feedback for a novice user.

Many general-purpose processors have a lot in common. Designs are pipelined, usually with a decoupled, in-order front end and an out-of-order back end. Many researchers are not seeking to modify these basic design aspects, but have a particular element of the design in mind that they intend to study or improve. ChampSim presents a selection of areas that commonly see research activity as configurable modules: branch predictors, cache replacement policies, branch target buffers, and both instruction and data prefetchers. These modules provide an intuitive interface into a larger system, allowing designers to test new designs quickly and effectively, while affording them the opportunity to not have to worry about the parts of the system they are not studying.

Reference implementations of legacy modules, such as the GShare branch predictor [2] or the next-line prefetcher, are included with the simulator. These reference implementations can be used as starting points for new designs or as placeholders if the user is not interested in modifying those particular modules.

## B. Broad Applicability

A researcher seeking to perform hardware research should not be expected to be broadly familiar with a variety of programming languages or to track their changes over time. Therefore, for the sake of inclusion, we aim for a user to need only an entry-level understanding of C++, the language in which ChampSim is written, to perform research using ChampSim. The interface to a simulator should be simple and present the user with a few meaningful choices that map well onto their experience. Each module is fundamentally only the implementation of a few functions.

ChampSim is trace-driven, meaning that simulation is performed in two stages. First, the workload to be simulated is instrumented and run offline. The tracing instrumentation produces a digest of the program's activity, called a trace. The tracing step can be performed offline from the simulation step and the trace can be stored in a repository and made available to users. To test a simulated design, the user selects a trace file as input. The trace is streamed into the program as a stand-in for actual program execution. This strategy sacrifices a modest amount of accuracy, particularly in how the operating system interacts with the program, in favor of ease in reproducing results and of speed of the model. It is simpler for ChampSim to read a decoded trace file than to execute an external program, and given established, compiled repositories of traces, it is a helpful abstraction to the user to remove another step of environment setup.

Users are able to generate their own program traces with the "tracer" tracing tool included in the ChampSim package. The included tracer is built upon Intel PIN [3], a well-documented tool for instrumenting programs at runtime, though other tool sets, such as DynamoRIO [4], can be used. Alternately, instruction traces can be dumped from execution-driven simulators such as gem5 or QFlex [5]-[7]. The tracer inspects every instruction the program runs and encapsulates each instruction into a decoded format that includes the instruction pointer, branching behavior, and which registers and memory locations form the input and output operands for the instruction. The concatenation of many instructions forms the entire trace. This trace format permits ChampSim to run with low memory requirements, since the trace does not need to be held in memory but can be streamed off the disk after inline decompression.
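The streaming property described above can be illustrated with a minimal sketch. The record layout below (field names, operand counts, and field widths) is an assumption made for illustration and is not ChampSim's actual trace format; `TraceRecord`, `read_record`, and `count_taken_branches` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>  // only needed when feeding an in-memory trace, e.g. in tests

// Hypothetical fixed-size trace record. Field names and operand counts are
// illustrative assumptions, not ChampSim's actual on-disk layout.
struct TraceRecord {
    std::uint64_t ip = 0;                     // instruction pointer
    std::uint8_t is_branch = 0;               // nonzero if the instruction is a branch
    std::uint8_t branch_taken = 0;            // nonzero if the branch was taken
    std::array<std::uint8_t, 2> dest_regs{};  // output registers
    std::array<std::uint8_t, 4> src_regs{};   // input registers
    std::array<std::uint64_t, 2> dest_mem{};  // memory locations written
    std::array<std::uint64_t, 4> src_mem{};   // memory locations read
};

// Read the next record from the stream. Returns false at end of stream.
bool read_record(std::istream& in, TraceRecord& rec) {
    in.read(reinterpret_cast<char*>(&rec), static_cast<std::streamsize>(sizeof rec));
    const std::streamsize got = in.gcount();
    // Sketch only: a partial record means a truncated trace; a real reader
    // would report this as an error instead of asserting.
    assert(got == 0 || got == static_cast<std::streamsize>(sizeof rec));
    return got == static_cast<std::streamsize>(sizeof rec);
}

// Walk the trace one record at a time, so memory use stays constant
// regardless of the trace length.
std::uint64_t count_taken_branches(std::istream& in) {
    TraceRecord rec;
    std::uint64_t taken = 0;
    while (read_record(in, rec)) {
        if (rec.is_branch != 0 && rec.branch_taken != 0) ++taken;
    }
    return taken;
}
```

Because only one record is resident at a time, the same loop works whether the stream is a file on disk or the output of a decompressor.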

## C. Design configurability

ChampSim is capable of modeling a large variety of commodity processors. A configuration file specifies many aspects of the modeled CPU core, including frequency, cache configuration, re-order buffer size, load and store queue sizes, widths for instruction fetch, decode, execution and retire, and a variety of latencies for different components. In addition, ChampSim includes a DRAM system that models bank and bus contention. The trace format includes only virtual memory addresses, so ChampSim simulates a page table and TLB hierarchy with arbitrary mappings of virtual to physical pages.
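Since traces carry only virtual addresses, some mapping to physical pages must be invented by the simulator. The sketch below shows one simple way this could be done, assigning physical pages in first-touch order. It is an assumption for illustration only: the class name `PageMapper` and the allocation order are hypothetical and do not describe ChampSim's actual page allocation.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

// Hypothetical first-touch page mapper: each virtual page receives the next
// unused physical page number the first time it is translated.
class PageMapper {
public:
    explicit PageMapper(unsigned log2_page_size) : shift_(log2_page_size) {
        assert(log2_page_size < 64);
    }

    std::uint64_t translate(std::uint64_t vaddr) {
        const std::uint64_t vpn = vaddr >> shift_;
        // try_emplace inserts only if the virtual page has not been seen before.
        auto [it, inserted] = table_.try_emplace(vpn, next_ppn_);
        if (inserted) ++next_ppn_;
        const std::uint64_t offset = vaddr & ((std::uint64_t{1} << shift_) - 1);
        return (it->second << shift_) | offset;  // page offset is preserved
    }

private:
    unsigned shift_;
    std::uint64_t next_ppn_ = 0;
    std::unordered_map<std::uint64_t, std::uint64_t> table_;
};
```

Two accesses to the same virtual page always yield the same physical page, which is the property the TLB hierarchy and page table walker rely on.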

Each cache must be configured with a prefetcher (to simulate no prefetcher, there is an included "do-nothing" prefetcher) and a replacement policy. The cache interfaces with a read queue, a write queue, and a prefetch queue. Prefetches originating from the cache level are placed in its prefetch queue and serviced with a lower priority than the read queue. When a prefetch misses the cache, it is by default sent to the lower level's prefetch queue, though this is configurable. All TLB structures function identically to caches.

| Branch Predictor Template | Branch Target Predictor Template | Prefetcher Template | Replacement Policy Template |
|---|---|---|---|
| initialize branch predictor | initialize btb | prefetcher initialize | initialize replacement |
| predict branch | btb prediction | prefetcher cache operate | find victim |
| last branch result | update btb | prefetcher cycle operate | update replacement state |
| | | prefetcher cache fill | replacement final stats |
| | | prefetcher final stats | |

Fig. 1. ChampSim defines four types of modules. Each module requires a specific set of functions, or hooks, to be defined to allow the module to interact with the underlying model.

Each core model is configured with an instruction prefetcher (which may be a do-nothing prefetcher), a branch predictor, and a branch target buffer. Each core must be configured with an instruction TLB and cache, plus a data TLB and cache.

The cache architecture is flexible. The sink of all TLB requests must be configured to be a core's hardware page table walker, whose lower level is, in turn, the same core's L1 data cache, and the ultimate sink of all instruction or data memory requests must be configured to be the physical memory. Beyond that, the cache hierarchy can be arbitrary. Caches can be shared between cores, and the hierarchy need not have a uniform depth.
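The constraints above amount to a graph in which every component names its lower level and every path ends at physical memory. A minimal sketch of that idea follows; `MemLevel` and `path_to_memory` are hypothetical names used for illustration and are not part of ChampSim.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical description of one component in the memory hierarchy.
struct MemLevel {
    std::string name;
    const MemLevel* lower = nullptr;  // nullptr marks physical memory, the final sink
};

// Follow the lower-level links from a starting component down to memory.
std::vector<std::string> path_to_memory(const MemLevel& start) {
    constexpr std::size_t kMaxDepth = 64;  // arbitrary bound for this sketch
    std::vector<std::string> path;
    for (const MemLevel* lvl = &start; lvl != nullptr; lvl = lvl->lower) {
        assert(path.size() < kMaxDepth);  // guards against an accidental cycle
        path.push_back(lvl->name);
    }
    return path;
}
```

In such a description, one core may reach a shared last-level cache through a private L2 while another core's L1 connects to it directly, which is one way to picture a hierarchy of non-uniform depth.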

## III. MODULARITY

Typically in computer architecture, the student or researcher approaches their research with the goal of improving only a small part of a large processor. A researcher rarely attempts to develop an entirely new processor from scratch. Therefore, there is a need for a highly modular simulator, whose operative elements are easily substituted for alternative elements. A modular design permits convenient comparisons between competing component designs, as the underlying simulation environment is otherwise identical. This ease of comparison is important in both an academic and an industrial environment because it simplifies communication with interested audiences. It is far easier to communicate the benefit of a change when it is readily evident to the audience that the performance gain is attributable to the design change, and not a peculiarity of the underlying system.

A modular design also enables easier collaboration between researchers. A design can be easily shared when the essence of the design is limited to a few files of source code. ChampSim's modular design minimizes the amount of repetitive code between modules.



Fig. 2. ChampSim is capable of simulating multicore systems, an example of which is given here. Each core has its own branch target predictor and branch prediction module that may be configured uniquely from one another. Each core has a private cache hierarchy and virtual memory hardware support. Every core's memory hierarchy shares a last level cache, which is subsequently connected to the main memory model.

ChampSim defines four types of modules: branch predictors, branch target predictors (BTPs), memory prefetchers, and cache replacement policies. Each cache and TLB must define a prefetcher and a replacement policy. Prefetchers and replacement policies can be heterogeneous across cache and TLB levels. Each core defines a branch predictor and BTP that also can be heterogeneous between cores. ChampSim detects all required modules at configuration time and links a component's source as needed.

Each module implements a set of predefined functions. These functions are called when certain events occur in the underlying simulator model, and are generally referred to as "hooks." The hooks are implemented as member functions of either the core or cache models to allow modules access to all structures within their respective model. This structure permits designs that may have unrealistic visibility into other parts of the model to allow for extending a design's capabilities or to collect statistics. Any attempt by a module to edit the internal state of the core or cache model is considered undefined behavior and may adversely affect the simulator's operation. More information is provided to the modules than some may consider reasonably constructible in a processor. Such information is not provided as an implication that it should be possible, but to permit designs that make such a case. We leave the evaluation of whether a design is constructible to the discretion of the researcher and the committees responsible to review the design. Figure 1 lists the required hooks for each module in ChampSim. All of the listed hooks must be implemented, even if they are empty functions and perform no action.

Fig. 3. The virtual memory hardware support structures and private caches are instantiated for each core model. Each TLB and cache hierarchy level has interchangeable prefetcher and replacement policies. Each core sends requests for virtual-to-physical address translations to its TLBs. Each core sends instruction or data memory requests to its private caches. On a Second Level TLB miss, the Page Table Walker navigates the page table through a series of memory reads to find the translation. Page table entries are cached in the private and shared caches, just like any other data.

ChampSim is designed to be flexible and permit a wide variety of core and cache parameters to be defined within a JSON configuration file, with reasonable defaults as a fallback. Before compilation, a configuration step is required in which the configuration script examines a user-specified configuration file to determine how to build the simulator. The core, caches, and other structures are sized individually with values extracted from the configuration file. This allows ChampSim to simulate not only homogeneous systems but heterogeneous multicore systems with each core having different sized internal and memory structures. The modules are built with compiler flags that instruct the preprocessor to substitute the simulator hook function names with unique, procedurally generated names. The configuration script generates discriminator code that ensures that only hooks corresponding to the correct module are called at run time by each cache and core model. ChampSim is capable of simulating multicore systems with individual private caches and virtual memory structures such as the one shown in Figure 2. The virtual memory and cache hierarchy is illustrated in Figure 3. This is only an example of ChampSim's default configuration and may be reconfigured based on a user's needs.
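The name substitution and discriminator described above can be pictured with a single-file sketch. Here `#define` and `#undef` stand in for per-module compiler flags of the form `-Dfind_victim=<unique name>`. The module names (`last_way`, `first_way`), the generated function names, and the `dispatch_find_victim` function are all hypothetical and chosen for illustration; they are not the names that ChampSim's configuration script generates.

```cpp
#include <cassert>
#include <cstdint>

// --- Module "last_way", as if compiled with -Dfind_victim=last_way_find_victim ---
#define find_victim last_way_find_victim
std::uint32_t find_victim(std::uint32_t num_ways) { return num_ways - 1; }
#undef find_victim

// --- Module "first_way", as if compiled with -Dfind_victim=first_way_find_victim ---
#define find_victim first_way_find_victim
std::uint32_t find_victim(std::uint32_t /*num_ways*/) { return 0; }
#undef find_victim

// Which module a given cache was configured with (hypothetical enumeration).
enum class ReplModule { last_way, first_way };

// Discriminator: routes the generic hook call to the configured module's
// uniquely named implementation.
std::uint32_t dispatch_find_victim(ReplModule which, std::uint32_t num_ways) {
    assert(num_ways > 0);
    switch (which) {
        case ReplModule::last_way:  return last_way_find_victim(num_ways);
        case ReplModule::first_way: return first_way_find_victim(num_ways);
    }
    return 0;  // not reached for valid enumerators
}
```

Both modules are written against the same hook name, yet they can be linked into one executable, which is what allows different caches in the same simulation to use different policies.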

To simulate multi-programmed workloads, at run time, the user provides one trace file for each core, statically assigned to that core for the duration of the simulation. An upcoming version of ChampSim will allow more flexibility in the trace assignments, as well as multi-threaded trace simulation (see Section VI).

## A. Branch Predictors

A hardware branch predictor is commonly included in contemporary processors to mitigate the performance impact of branches. If a given branch is predicted incorrectly, the subsequent wrong path instructions must be flushed from the pipeline. ChampSim's traces do not include any wrong path instructions, so it instead models this with a fixed-latency penalty following the resolution of incorrectly predicted branches. In ChampSim, the branch predictor module hooks are member functions of the core model. These hooks allow for designs to create prediction structures that have access to recent branch information and branch outcomes. The branch predictor's hooks are as follows:

- initialize\_branch\_predictor(): Each core calls this hook at the beginning of the simulation after the core has been fully initialized. This hook can be used to initialize structures in cases where C++ static initialization would occur too early. Furthermore, as a member function of the core model, the initialization hook can observe the size of internal core structures.
- predict\_branch(): This hook is called for each branch instruction in the execution path of the program. It is responsible for predicting whether the branch is taken or not, and returns a boolean value.
- last\_branch\_result(): This hook presents the branch predictor's opportunity to train on the true outcome of each branch instruction. The same information is provided to the branch predictor as predict\_branch(), but updated with the actual branch outcome.
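To make the three hooks above concrete, the following is a minimal sketch of a bimodal predictor built from two-bit saturating counters. It is written as a stand-alone struct so that it compiles on its own; in ChampSim the hooks are member functions of the core model, and the parameter lists shown here are simplified assumptions, not the simulator's actual signatures.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Stand-alone bimodal predictor. Method names mirror the hooks listed above;
// the parameter lists are simplified assumptions for illustration.
struct BimodalPredictor {
    static constexpr std::size_t kTableSize = 4096;  // arbitrary table size
    static constexpr int kCounterMax = 3;            // two-bit saturating counter
    std::array<int, kTableSize> counters{};

    static std::size_t index(std::uint64_t ip) { return ip % kTableSize; }

    // Start every counter in the "weakly not taken" state.
    void initialize_branch_predictor() { counters.fill(1); }

    // Predict taken when the counter is in its upper half.
    bool predict_branch(std::uint64_t ip) const { return counters[index(ip)] >= 2; }

    // Train on the resolved outcome of the branch.
    void last_branch_result(std::uint64_t ip, bool taken) {
        int& c = counters[index(ip)];
        assert(c >= 0 && c <= kCounterMax);
        if (taken && c < kCounterMax) ++c;
        else if (!taken && c > 0) --c;
    }
};
```

A predictor of this shape needs no knowledge of the rest of the core, which reflects the intent that a module is fundamentally the implementation of a few functions.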

## B. Branch Target Predictors

The branch target predictor (BTP) works in concert with the branch direction predictor to attempt to fully predict the outcome and target of branch instructions. The BTP provides the next instruction pointer after a taken branch. Incorrect predictions of branch targets cause wrong path execution in real processors. As with branch direction misprediction, ChampSim models these mispredictions with fixed-size delays. All BTP module hooks are member functions of the core model. Similar to the branch predictor module, these hooks allow for the BTP to receive information about observed branches and their outcome. These hooks are named based on branch target buffer (BTB) operations but can be used to design other forms of BTPs.

- initialize\_btb(): Each core calls this hook at the beginning of the simulation, after the core has been fully initialized. This hook can be used to initialize structures within the BTP in cases where C++ static initialization would occur too early. Furthermore, as a member function of the core model, the initialization hook can observe the size of internal core structures.
- btb\_prediction(): This hook is called for each branch instruction in the execution path of the program. The BTP is given the instruction pointer of the branch instruction and its branch type. It must attempt to predict the byte address of the branch target and whether the branch is always taken, returning these values as a tuple. The BTP may indicate that the branch is predicted not taken by returning a target address of 0. If this disagrees with the branch direction prediction, the module predicting not-taken receives priority.
- update\_btb(): This hook presents the BTP's opportunity to train on the true outcome of each branch instruction. This hook provides the branch instruction pointer, the true branch target, whether the branch was taken, and the branch type to train its internal structures to make better predictions.
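The BTP hooks above can be sketched as a small direct-mapped, tagged branch target buffer. As before, this is a stand-alone illustration under stated assumptions: the `BranchType` enumeration, the table size, and the parameter lists are hypothetical simplifications, and the convention of returning a target of 0 for a predicted not-taken branch follows the description above.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>

// Hypothetical branch categories; the simulator's own set may differ.
enum class BranchType { conditional, direct_jump, indirect, call, ret };

class SimpleBTB {
public:
    void initialize_btb() { entries_.fill(Entry{}); }

    // Returns {predicted target, always taken}. A target of 0 means the
    // branch is predicted not taken (here: no matching entry was found).
    std::pair<std::uint64_t, bool> btb_prediction(std::uint64_t ip, BranchType /*type*/) const {
        const Entry& e = entries_[index(ip)];
        if (!e.valid || e.tag != ip) return {std::uint64_t{0}, false};
        return {e.target, e.always_taken};
    }

    // Train on the resolved branch. Only taken branches install a target.
    void update_btb(std::uint64_t ip, std::uint64_t target, bool taken, BranchType type) {
        if (!taken) return;
        assert(target != 0);  // 0 is reserved to mean "not taken" in this sketch
        entries_[index(ip)] = Entry{true, ip, target, type != BranchType::conditional};
    }

private:
    struct Entry {
        bool valid = false;
        std::uint64_t tag = 0;     // full branch address, to detect aliasing
        std::uint64_t target = 0;
        bool always_taken = false; // true for unconditional branch types
    };
    static constexpr std::size_t kEntries = 1024;  // arbitrary size
    static std::size_t index(std::uint64_t ip) { return ip % kEntries; }
    std::array<Entry, kEntries> entries_{};
};
```

Storing the full branch address as a tag keeps two branches that map to the same entry from receiving each other's targets, at the cost of evicting the older entry.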

## C. Data Prefetchers

Memory prefetchers are implemented in caches to reduce miss rates by anticipating future demand accesses. Prefetchers require information regarding cache accesses and the accesses that result in cache fills. This information allows the prefetcher to observe cache behavior, generate predictions, and receive feedback regarding predictions. In particular, memory prefetchers are assumed to operate asynchronously with the operation of the underlying cache model, precipitating a need for the prefetch\_line() callback function. While other modules supply their information as return values for function hooks, a prefetcher may operate many cycles following the initiating event. All data prefetcher hooks are member functions of the cache model, allowing for prefetcher developers to extend the prefetcher's visibility of the cache's behavior as they see fit.

- prefetcher\_initialize(): Each cache calls this hook at the beginning of the simulation after the cache has been fully initialized. This hook can be used to initialize prefetcher structures if static initialization would occur too early.
- prefetcher\_cache\_operate(): The cache calls this hook when a demand request occurs. By default, this hook is only called for loads and prefetches from upper cache levels, though this is configurable. A prefetcher uses this hook to examine the access patterns at its cache level on a per-address basis to predict future cache accesses. As the primary function that communicates these access patterns to the prefetcher, this function also provides the prefetcher with the memory access's address, the instruction address that caused the memory access, whether the access resulted in a hit in the cache, and the type of memory access. The prefetcher also receives metadata from the incoming access that may be provided by a prefetcher in an upper cache level. Likewise, the prefetcher may return metadata to the cache to embed within the memory access's packet to communicate with lower levels of the cache hierarchy.
- prefetcher\_cache\_fill(): This hook is called each time a block is filled in the cache. Its parameters include information about which blocks were evicted and filled. Prefetchers can use this hook to evaluate their own accuracy, estimate cache miss latency, or to make future prefetching decisions about evicted blocks.
- prefetcher\_cycle\_operate(): This hook is called once every cycle on the same clock frequency as the underlying cache. This hook enables developers to precisely simulate the pipelining of complex prefetchers, if desired. Memory prefetch timing is often crucial to avoid misses, so complex prefetchers may use this function to emulate the delay between a prefetch and its initiating event.
- prefetcher\_final\_stats(): Once all instructions are executed, this hook prints user-defined statistics recorded throughout the simulation to the simulation's output.
- prefetch\_line(): This hook allows for the asynchronous operation of the prefetcher. Unlike other hook functions, this function is never called by the underlying simulator but must be called by the user to perform a prefetch. The prefetcher might call this in response to a demand request, a fill, or on a particular cycle, and can be called as many times as the user wants. The user can choose whether to fill into either the prefetcher's current cache level or a lower cache level.
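The relationship between the event hooks and the prefetch\_line() callback can be sketched with a next-line prefetcher. In this stand-alone illustration the callback is injected as a `std::function`, whereas in ChampSim it is provided by the cache model. The parameter lists, the boolean return value of the callback, and the class name are assumptions for illustration only.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <utility>

class NextLinePrefetcher {
public:
    // Hypothetical callback shape: address to prefetch, and whether to fill
    // this cache level (true) or a lower one (false). Returns true if the
    // request was accepted.
    using PrefetchFn = std::function<bool(std::uint64_t addr, bool fill_this_level)>;

    NextLinePrefetcher(unsigned log2_block_size, PrefetchFn prefetch_line)
        : shift_(log2_block_size), prefetch_line_(std::move(prefetch_line)) {
        assert(prefetch_line_ != nullptr);
    }

    // Called on each observed access; requests the block after the one accessed.
    // The incoming metadata is passed through unchanged.
    std::uint32_t prefetcher_cache_operate(std::uint64_t addr, std::uint64_t /*ip*/,
                                           bool /*cache_hit*/, std::uint32_t metadata_in) {
        const std::uint64_t next_block = ((addr >> shift_) + 1) << shift_;
        if (prefetch_line_(next_block, true)) ++issued_;
        return metadata_in;
    }

    std::uint64_t issued() const { return issued_; }

private:
    unsigned shift_;
    PrefetchFn prefetch_line_;
    std::uint64_t issued_ = 0;
};
```

Because the prefetch is issued through a callback instead of a return value, a more elaborate design could defer the call to a later cycle or issue several prefetches per access.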

## D. Instruction Prefetchers

Unlike the data prefetcher modules, the instruction prefetcher module hooks are members of the core model and are connected to the L1 instruction cache. Most instruction prefetcher hooks are called by the cache when certain events occur in the cache. Instruction prefetchers have the same hooks as the data prefetchers, and are called on the same events. The only exception is the prefetch\_line() function, which is replaced by prefetch\_code\_line(). Furthermore, instruction prefetchers have one additional hook:

- prefetcher\_branch\_operate(): This hook is called when a branch instruction is read from the trace. This hook can give the instruction prefetcher information about the boundaries of basic blocks.

## E. Replacement Policies

Cache replacement policies choose which cache line to evict from a cache set. Many do this by using a heuristic to predict the future utility of cache lines, and evict the line with the lowest expected utility. The information provided to the replacement policy allows it to view the cache's demand access stream. All of the hooks of a replacement policy are member functions of the cache model.

- initialize\_replacement(): Each cache calls this
  hook at the beginning of the simulation, after the cache
  has been fully initialized. This hook can be used to
  initialize structures specific to the replacement policy
  where static initialization would occur too early.
- find\_victim(): The cache calls this hook when a
  cache set requires a victim cache block. The module
  returns the cache way within the set to be replaced. Cache
  bypassing is also an option, and is indicated by returning
  a value equal to the maximum number of ways.
- update\_replacement\_state(): The cache calls
  this hook to update the replacement policy on a change in
  the cache's state due to a cache hit or fill. The information
  provided to the replacement policy enables the replacement structures to receive feedback on whether their
  victim selection was harmful to the system's performance.
- replacement\_final\_stats(): This hook is called at the end of the simulation after all instructions are executed. This hook is used to print user-defined statistics recorded during the simulation.
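As an illustration of how find\_victim() and update\_replacement\_state() cooperate, the following is a minimal sketch of a least-recently-used policy. It is a stand-alone class with simplified, assumed parameter lists; in ChampSim these hooks are member functions of the cache model and receive more information than is shown here. This sketch never bypasses the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class LruPolicy {
public:
    LruPolicy(std::uint32_t sets, std::uint32_t ways)
        : ways_(ways), last_used_(static_cast<std::size_t>(sets) * ways, 0) {
        assert(sets > 0 && ways > 0);
    }

    // Return the way in the set with the oldest timestamp. Ties resolve to
    // the lowest-numbered way.
    std::uint32_t find_victim(std::uint32_t set) const {
        const std::size_t base = static_cast<std::size_t>(set) * ways_;
        std::uint32_t victim = 0;
        for (std::uint32_t way = 1; way < ways_; ++way) {
            if (last_used_[base + way] < last_used_[base + victim]) victim = way;
        }
        return victim;
    }

    // Record a hit or fill by stamping the touched way with a fresh timestamp.
    void update_replacement_state(std::uint32_t set, std::uint32_t way, bool /*hit*/) {
        assert(way < ways_);  // a bypassed fill (way == number of ways) is not recorded
        last_used_[static_cast<std::size_t>(set) * ways_ + way] = ++clock_;
    }

private:
    std::uint32_t ways_;
    std::vector<std::uint64_t> last_used_;  // one timestamp per (set, way)
    std::uint64_t clock_ = 0;
};
```

A policy that predicts future utility would replace the timestamp with its own per-line state, but the division of labor between the two hooks stays the same.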

## IV. APPLICATION

ChampSim has already found a place in the ecosystem of computer architecture simulation as a lightweight tool to introduce ideas to the community. It has promoted new and innovative research, fueled in part by industry-sponsored competitions. Furthermore, it shows promise as a tool for classroom use, striking a balance between being simple and accessible, and being a feature-rich simulator.

## A. Research

The rapid prototyping and evaluation of computer hardware designs is essential to the continuing development of computer architecture. Established industry heavyweights design and implement simulators for internal use. These proprietary simulators are rarely disclosed, since doing so may reveal valuable trade secrets that will hinder them competitively. Therefore, the research community must itself develop and support open-source simulators to conduct independent computer architecture research.

ChampSim has already been in use among the community for some time. Due to its exposure through computer architecture competitions, several high-impact publications have founded their results on ChampSim. It has been key in demonstrating the excellence of many memory prefetchers, both of data [8]–[17] and instructions [18]. It has found use in evaluating cache replacement policies, particularly in the last level cache [19]–[22]. ChampSim's view of a TLB as functionally identical to a cache has led to research into prefetching in TLBs [23], [24]. The modularity of ChampSim's branch predictor has shown benefits in novel works [25]. The branch target buffer has also been studied with ChampSim [26].

| Competition | Submissions |
|-------------|-------------|
| DPC-2 [1]   | 13          |
| CRC-2 [31]  | 15          |
| DPC-3 [32]  | 14          |
| IPC-1 [33]  | 16          |
| MLDPC [34]  | 4           |
| Total       | 62          |

TABLE I. THE NUMBER OF SUBMISSIONS TO A SELECTION OF COMPETITIONS. EACH OF THESE SUBMISSIONS IS A UNIQUE PROPOSAL IN THE SPACE OF COMPUTER ARCHITECTURE.

As a fast, lightweight simulator, ChampSim's relatively small code footprint has proven to be easy to understand and modify. While some have leveraged the module system to develop novel works, others have additionally modified the underlying model itself [27]–[30]. Users can furthermore modify the existing program tracer to generate traces of different formats to meet particular needs.

In only a few years, ChampSim has proven to be a valuable tool for academic research. As of this writing, research performed using ChampSim has already received 352 citations. Development of ChampSim is still ongoing, and new features are continually being added. ChampSim is likely to foster further work in these areas as the simulator's development continues.

## B. Competition

There has been a rise in industry-sponsored contests to promote memory system research. ChampSim has been featured in:

- The Second Data Prefetching Championship (2015) [1]
- The Second Cache Replacement Championship (2017) [31]
- The Third Data Prefetching Championship (2019) [32]
- The First Instruction Prefetching Championship (2020) [33]
- The ML-Based Data Prefetching Competition (2021) [34]

In these competitions, a call for submissions solicits innovative designs for a particular module. The submissions are compared and ranked by metrics such as instructions per cycle or cache misses. These contests encourage many designs to be compared and rewarded for their merits. They have increased the rate of publication in the field of memory prefetching and cache replacement. There have been many submissions to these competitions, listed in Table I, which have produced many follow-up publications. ChampSim has been the simulator of choice in these competitions due to its availability (it is open-source), its ease of use (it requires minimal dependencies), and its speed (it evaluates billions of instructions per hour).

ChampSim was designed from the beginning to make these kinds of comparative contests simple to organize, evaluate, and participate in. Contestants typically are given 2 months to ramp up with ChampSim and produce novel work. The high rate of participation in each contest is evidence that ChampSim is accessible and easy to use. The modular design creates a competition workflow wherein contestants need only submit the source code for the module under evaluation. The competition organizer can then easily compile, run, and evaluate the competitors' submissions. We invite any others who would like to organize competitions in any field of CPU architecture to contact us.

## C. Education

In a classroom setting, instructors are given approximately fourteen weeks to teach students about the tools needed to complete assignments while also teaching them about the topics at hand. It is accordingly difficult to use a heavyweight simulator in a classroom setting. Students may be asked to configure and install systems they do not understand for the purpose of only a few assignments. If this configuration and installation is difficult, it impedes the students' progress by increasing their frustration and decreasing their engagement with the classroom topics. As before, the tools used in an academic setting should, as much as possible, lower the barrier to entry for these young entrants to the field. In such a scenario, a lightweight, easy-to-understand simulator is preferred. ChampSim meets the needs of educators and allows students to implement microarchitecture techniques they learn in class.

ChampSim has been designed such that the only dependencies are up-to-date C++ and Python environments. Further dependencies could increase the likelihood that a student may encounter difficulties when initializing their simulation environment. Its development target includes compilers available natively in popular long-term supported operating systems that are commonly available to students. At the time of writing, this includes GCC 7, Clang 4, Microsoft Visual Studio 2017, and Python 3.6.

ChampSim has a small codebase, approximately 5000 lines of code, and so it is possible to comprehend in its entirety. The design of ChampSim does not necessarily attempt to precisely replicate all hardware functions, but to emulate them in a readable way. These design decisions allow a student to rapidly grasp its functionality, configure it, and implement or modify techniques they have learned about in class, all within the span of one academic semester. ChampSim has been successfully used in classroom settings at Texas A&M University at both the undergraduate and graduate levels.

In the undergraduate-level course, one of the students' first assignments was to download ChampSim and become familiar with each of its components. The students' final project was to optimize a last-level cache replacement policy and to show that their technique outperformed the base case (base cases were different for each student). All students were able to complete the first assignment, and at no point during the semester-long project did they have difficulty finding the computing resources to download and configure ChampSim. As the semester progressed, students were able to identify which module of the simulator modeled each of the components discussed in the course and were able to visualize the topics by studying and modifying the simulator. Students maintained a high level of interest during the lectures, were able to run simulations and correctly interpret the results, and showed a high level of mastery on the course exams.

ChampSim was also used in a graduate course at Texas A&M University. The focus of this course is out-of-order core and memory system microarchitecture. As this is an introductory course in microarchitecture research, the aim is to get the incoming students up and running on a simulator as soon as possible, so that they can spend most of the semester developing a course project. To this end, ChampSim was used in two early-semester assignments, one exploring the impact of different branch predictors on the performance of cores with different out-of-order (OoO) widths, and another exploring the impact of differing prefetch and replacement policies on the L2 cache. After this introduction, nearly all of the students chose to use ChampSim for their course projects, and those who did were more likely to achieve their project goals by the end of the semester.

For an undergraduate computer architecture course at Texas A&M International University, the students used ChampSim throughout the course for project work. At the beginning of the semester, students were asked to download and set up the simulator. All students were able to do so in a matter of minutes, despite none of them having previous experience with any other simulator. For their first assignment, students were asked to run simulations while changing the simulator's configuration (testing different cache sizes, different replacement policies, and other configurable aspects) and to reason about the results obtained. As the semester progressed and topics were covered during lecture, being able to plainly read the source code of ChampSim helped the students understand the concepts in the course and how the concepts could be implemented in real systems.

In the lecture on memory hierarchy, students were asked to implement their own replacement policy. While the policy was neither expected nor required to show a performance improvement, each student was required to justify their idea and implementation given what they had learned in class. The same kind of assignment was given for branch prediction and memory prefetching. These types of assignments challenged the students to go beyond the concepts presented in their textbook and to think outside the box about the implementation details of such components. ChampSim was shown to be a powerful supplementary teaching tool for this type of course.

We continue to advocate for its usefulness as a tool to invite new students into the domain of computer architecture research. The authors are aware that ChampSim has also been used at additional universities at both the undergraduate and the graduate level, each with reports of success and discovery on the part of the students. We hope that ChampSim continues to serve as a low-impedance tool that permits discovery and research with a low degree of frustration.

## V. RELATED WORKS

This section provides a brief overview of selected prior works in architectural simulation and evaluation with regard to the key design principles discussed in Section II. Simulation and emulation are the main tools used by computer architects in academia to evaluate the performance impact of new microarchitectural designs. As a result, simulation methodology and simulator design form an exceedingly broad area of study and development [35], [36]. An extremely large number of simulators have been proposed over the years; here we discuss a subset of the most popular current simulators.

Scarab [37] is a recently developed simulator capable of performing cycle-accurate execution-driven simulation by using Intel's Pin to execute application binaries passed by a user. For faster simulation times, Scarab also supports trace-driven simulation. The simulator models an out-of-order pipelined processor and a full cache hierarchy, and interfaces with Ramulator [38] to provide an accurate and highly configurable main memory model. Scarab is a promising simulation framework with a low start-up time, but users are required to use Intel-based systems to use the Pintool-based execution-driven mode. Scarab is suitable for more advanced architects' use and may not be appropriate for a classroom setting with students of varying expertise.

Gem5 is a popular event-driven simulator with a wide variety of features to enable the exploration of most areas of microarchitectural design. A researcher who finds that gem5 does not meet their needs will likely find another work that extends gem5 for their purpose [39]-[42]. Gem5's full system mode allows users to emulate a complete computing system with devices and to run an OS on the emulated system. For designs that do not require OS support, gem5 provides a syscall emulation mode in which a user may provide a binary for the simulator to execute. The simulator supports most available ISAs and is extremely modular, permitting researchers to plug-and-play system components, interchange cache hierarchy organizations, experiment with various network-on-chip topologies, and explore various core models. This high degree of configurability allows gem5 to meet most architectural research needs. Still, this robustness and configurability lead to very high complexity, which heavily increases the barrier to entry and startup time for new students and researchers.

Zsim [43] is a multicore simulator that breaks the simulation down into multiple phases, relying on Pin to perform instrumentation of an input binary to relieve the burden of highly accurate timing models. Each core modeled in the system is executed in a separate thread based on the instrumentation phase to allow for highly scalable system simulation. The simulator is configurable and extendable to allow researchers to represent a wide variety of memory hierarchies and heterogeneous systems. Like Scarab, zsim requires the host system to use an Intel processor, which in turn limits simulation to the x86 ISA. Due to the simulation methodology used, the host system may require specific configurations that a student may not have permission to modify.

Sniper [44] is a highly parallel multicore simulator that extends Graphite [45] with an interval model implementation. This methodology allows for faster simulation by abstracting the system model to events that affect timing instead of tracking individual instructions as they traverse the pipeline. The simulator supports heterogeneous core models and operating system execution. The developers provide Python-based scripts to control and monitor the simulation during runtime. As in Scarab and zsim, Pin dynamically executes user-supplied binaries. Users are required to request access to the latest version of Sniper's source code. Overall, the simulator is configurable and supports multiple system models, but is limited to x86 execution. The complexity of interval modeling makes the system difficult for novice architects and students to modify, resulting in a high barrier to entry.

Each of the simulators discussed here has been widely used for a variety of research applications and has resulted in a large number of publications. These tools provide configurable features but require extensive knowledge and ramp-up time to begin design exploration. ChampSim is not an attempt to replace these tools, but rather to fill a gap in the computer architecture tools available to architects. ChampSim is ideal for users who do not require heavyweight simulation for their research and for educators seeking to provide a meaningful curriculum for semester-long courses.

## VI. FUTURE WORK

ChampSim is a comparatively new tool and remains in very active development, with new features and bug fixes being regularly added, within the constraints of the design principles. Because of the growing adoption of ChampSim as a tool for education and research, backwards compatibility is a high priority in any changes that are made. For example, module hook signatures are unlikely to change (without sufficient cause) to ensure that already published artifacts remain compatible with newer versions of ChampSim.

Nevertheless, the trace format used imposes some restrictions on the correctness of ChampSim's simulation. The current trace format uses fixed-size arrays to represent register and memory location usage information. This is not only potentially inefficient, leaving gaps of wasted space when an instruction accesses fewer memory locations or registers than the arrays can hold, but it also carries implicit assumptions about which instruction set architecture is being traced. A planned future instruction trace format will use variable-sized arrays to represent this information and should also use compression as part of the trace specification.
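The trade-off between fixed-size and variable-sized records can be sketched as follows. This is not ChampSim's trace layout: the array limits, the field order, and the count-prefixed variable encoding are all hypothetical, and register operands are omitted for brevity. The sketch only shows why padding wastes space and why a fixed limit embeds an assumption about the traced instruction set.

```python
import struct

# Hypothetical limits on memory operands per instruction (not ChampSim's values)
MAX_SRC_MEM = 4
MAX_DST_MEM = 2

# Instruction pointer followed by two fixed-size arrays of 64-bit addresses
FIXED = struct.Struct(f"<Q{MAX_SRC_MEM}Q{MAX_DST_MEM}Q")


def encode_fixed(ip, src_mem, dst_mem):
    """Pad each operand list with zeros up to its fixed array size."""
    if len(src_mem) > MAX_SRC_MEM or len(dst_mem) > MAX_DST_MEM:
        # An ISA whose instructions touch more locations cannot be represented
        raise ValueError("instruction does not fit the fixed-size record")
    src = list(src_mem) + [0] * (MAX_SRC_MEM - len(src_mem))
    dst = list(dst_mem) + [0] * (MAX_DST_MEM - len(dst_mem))
    return FIXED.pack(ip, *src, *dst)


def encode_variable(ip, src_mem, dst_mem):
    """Store one count byte per list, then only the addresses actually used."""
    header = struct.pack("<QBB", ip, len(src_mem), len(dst_mem))
    count = len(src_mem) + len(dst_mem)
    return header + struct.pack(f"<{count}Q", *src_mem, *dst_mem)


def decode_variable(buf):
    """Inverse of encode_variable for a single record."""
    ip, n_src, n_dst = struct.unpack("<QBB", buf[:10])
    count = n_src + n_dst
    addrs = struct.unpack(f"<{count}Q", buf[10:10 + 8 * count])
    return ip, list(addrs[:n_src]), list(addrs[n_src:])


# A load with a single source address: most of the fixed record is padding
fixed_size = len(encode_fixed(0x400000, [0x1000], []))
variable_size = len(encode_variable(0x400000, [0x1000], []))
```

Under these assumed limits, the fixed record always occupies 56 bytes, while the variable record for the same load occupies 18 bytes. The cost of the variable form is that records can no longer be located by simple index arithmetic.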

As part of this new trace format effort, more information will be included. ChampSim currently views instructions as being in one of the following classes: branch, load, store or arithmetic. It handles all arithmetic instructions identically, with the same arbitrary delay imposed on each. In the future, more accurate simulation is possible with a more precise set of delays.
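To make the effect of a more precise set of delays concrete, the sketch below compares a single uniform arithmetic delay with a per-operation table. The operation names and cycle counts are hypothetical and chosen only for illustration; they are not taken from ChampSim or from any particular processor.

```python
# Hypothetical arithmetic sub-classes and cycle counts, for illustration only
PRECISE_LATENCY = {"add": 1, "multiply": 3, "divide": 20}
UNIFORM_LATENCY = 1  # one delay shared by every arithmetic instruction


def arithmetic_latency(kind, precise):
    """Return the execution delay, in cycles, of one arithmetic instruction."""
    if precise:
        # Unknown kinds fall back to the uniform delay
        return PRECISE_LATENCY.get(kind, UNIFORM_LATENCY)
    return UNIFORM_LATENCY


def chain_latency(kinds, precise):
    """Total delay of a chain of dependent arithmetic instructions."""
    return sum(arithmetic_latency(kind, precise) for kind in kinds)


chain = ["add", "multiply", "divide"]
uniform_cycles = chain_latency(chain, precise=False)
precise_cycles = chain_latency(chain, precise=True)
```

With these assumed values, the dependent chain takes 3 cycles under the uniform model and 24 cycles under the per-operation model, which shows how a uniform delay can understate the cost of long-latency operations.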

The current implementation of ChampSim assumes a workload of multiple, single-threaded programs. That is, each core is assumed not to interact with any other core's execution, except to create pressure on any shared caches. Lock contention is a frequent source of execution delays in multi-threaded workloads, and adding support for such contention will improve simulation accuracy. In the future, ChampSim will support multi-threaded workload simulation, adding support for synchronization constructs such as locks and barriers. In a new trace format, the interaction between threads can be encoded.

ChampSim currently makes no effort to maintain cache coherence and only models non-inclusive cache hierarchies. A future step will be to implement other forms of clusivity and to allow for cache coherence modeling, since the interaction of memory prefetching with various forms of each is the subject of an ongoing body of work. The modular design of ChampSim positions it well to examine the interactions of combinations of such works, since the individual components can be trivially substituted for one another.
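The difference between a non-inclusive and an inclusive hierarchy can be illustrated with a toy two-level model. This is a minimal sketch under simplifying assumptions (fully associative levels, LRU replacement, no writebacks, and hypothetical class and function names) and does not reflect how ChampSim implements its caches. In the inclusive case, evicting a block from the outer level also invalidates it in the inner level.

```python
from collections import OrderedDict


class Level:
    """A tiny fully associative LRU cache holding block addresses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()  # least recently used block first

    def touch(self, addr):
        """Insert or refresh addr; return the evicted address, if any."""
        if addr in self.blocks:
            self.blocks.move_to_end(addr)
            return None
        self.blocks[addr] = True
        if len(self.blocks) > self.capacity:
            victim, _ = self.blocks.popitem(last=False)
            return victim
        return None


def access(inner, outer, addr, inclusive):
    """Service one access; the outer level is consulted only on an inner miss."""
    if addr in inner.blocks:
        inner.touch(addr)
        return
    victim = outer.touch(addr)
    if inclusive and victim is not None:
        inner.blocks.pop(victim, None)  # back-invalidate to preserve inclusion
    inner.touch(addr)


def run(trace, inclusive):
    """Replay a trace and return the set of blocks left in the inner level."""
    inner, outer = Level(2), Level(3)
    for addr in trace:
        access(inner, outer, addr, inclusive)
    return set(inner.blocks)


trace = ["A", "B", "A", "C", "A", "D"]
non_inclusive_inner = run(trace, inclusive=False)
inclusive_inner = run(trace, inclusive=True)
```

In this trace, block A is hot in the inner level but looks cold to the outer level, which never sees the inner hits. The inclusive hierarchy therefore evicts A from both levels when D arrives, while the non-inclusive hierarchy keeps A in the inner level.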

Development has continued on the front end of ChampSim since the First Instruction Prefetching Championship [33] to make the front end increasingly fetch directed. The branch direction and target predictors may, as a part of this effort, become asynchronous like the memory prefetchers, allowing for prediction latencies longer than a single cycle.
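One way to picture a predictor with a latency longer than a single cycle is as a queue of in-flight requests that are delivered some cycles after they are issued. The sketch below is an assumption-laden illustration, not ChampSim's interface: the class name, its methods, and the trivial prediction function are all hypothetical.

```python
from collections import deque


class DelayedPredictor:
    """Wraps a prediction function so results arrive `latency` cycles later."""

    def __init__(self, predict_fn, latency):
        if latency < 1:
            raise ValueError("latency must be at least one cycle")
        self.predict_fn = predict_fn
        self.latency = latency
        self.cycle = 0
        self.pending = deque()  # (ready_cycle, ip, prediction), in issue order

    def request(self, ip):
        """Issue a prediction request for instruction pointer `ip`."""
        ready_cycle = self.cycle + self.latency
        self.pending.append((ready_cycle, ip, self.predict_fn(ip)))

    def tick(self):
        """Advance one cycle and return the predictions that became ready."""
        self.cycle += 1
        ready = []
        while self.pending and self.pending[0][0] <= self.cycle:
            _, ip, taken = self.pending.popleft()
            ready.append((ip, taken))
        return ready


# Hypothetical predictor: "taken" whenever the instruction pointer is even
predictor = DelayedPredictor(lambda ip: ip % 2 == 0, latency=3)
predictor.request(0x40)
results = [predictor.tick() for _ in range(3)]
```

Here the request issued in cycle 0 produces nothing for two cycles and is delivered on the third, so the front end must be able to continue fetching, or stall, while the prediction is outstanding.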

## VII. CONCLUSION

In this work, we have discussed ChampSim, an architectural simulation tool designed for ease of use and rapid simulation. We have presented it as a tool for research, competition, and education, with the specific aim of making these more accessible to an upcoming cohort of new architects. In this way, we hope to promote a broad, diverse, and inclusive community of computer architecture.

The design of the simulator focuses on a low startup time, such that the user is able to begin using ChampSim with minimal frustration; broad applicability, so that any new researcher is able to comprehend their place in the broader scope of the environment; and configurable design, where the user is able to simulate their work in a variety of contexts with ease. ChampSim accomplishes this with a modular design. Each element of the simulator is designed to be replaceable with an equivalent element. This allows for prompt and proper comparison between competing designs.

ChampSim has been used in a variety of research and teaching endeavors throughout its lifetime. It has been shown to have a strong impact on the community by promoting novel works in the field. The competitions that have been centered around ChampSim have prompted many submissions and led to many high-impact publications. ChampSim has also proven useful in the classroom, as a tool with which students can explore and develop their ideas. We believe that this work has already found a niche in the landscape of architectural simulation, and we hope that this work can continue to grow and to serve the community well.

#### REFERENCES

- [1] S. Pugsley, A. Alameldeen, H. Kim, and C. Wilkerson, "The second data prefetching championship," 2015.
- [2] S. McFarling, "Combining branch predictors," Citeseer, Tech. Rep., 1993.
- [3] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood, "Pin: Building customized program analysis tools with dynamic instrumentation," in ACM SIGPLAN Conference on Programming Language Design and Implementation. Chicago, IL, USA: Association for Computing Machinery, 2005, pp. 190–200. [Online]. Available: https://doi.org/10.1145/1065010.1065034
- [4] D. Bruening, E. Duesterwald, and S. Amarasinghe, "Design and implementation of a dynamic optimization framework for windows," in ACM Workshop on Feedback-Directed and Dynamic Optimization, 2000.
- [5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti *et al.*, "The gem5 simulator," *ACM SIGARCH computer architecture news*, vol. 39, no. 2, pp. 1–7, 2011.
- [6] J. Lowe-Power, A. M. Ahmad, A. Akram, M. Alian, R. Amslinger, M. Andreozzi, A. Armejach, N. Asmussen, S. Bharadwaj, G. Black et al., "The gem5 simulator: Version 20.0+," CoRR, vol. abs/2007.03152, 2020. [Online]. Available: https://arxiv.org/abs/2007.03152
- [7] B. Villalonga, S. Boixo, B. Nelson, C. Henze, E. Rieffel, R. Biswas, and S. Mandrà, "A flexible high-performance simulator for verifying and benchmarking quantum circuits implemented on real hardware," npj Quantum Information, vol. 5, no. 1, pp. 1–16, 2019.
- [8] N. S. Kalani and B. Panda, "Instruction criticality based energy-efficient hardware data prefetching," *IEEE Computer Architecture Letters*, vol. 20, no. 2, pp. 146–149, 2021.
- [9] C. Zhang, Y. Zeng, J. Shalf, and X. Guo, "RnR: A software-assisted record-and-replay hardware prefetcher," in *IEEE/ACM International Symposium on Microarchitecture*. Virtual Event: Association for Computing Machinery, Oct. 2020, pp. 609–621.
- [10] J. Kim, E. Teran, P. V. Gratz, D. A. Jiménez, S. H. Pugsley, and C. Wilkerson, "Kill the program counter: Reconstructing program behavior in the processor cache hierarchy," SIGPLAN Not., vol. 52, no. 4, pp. 737–749, Apr. 2017. [Online]. Available: https://doi.org/10.1145/3093336.3037701
- [11] J. Kim, S. H. Pugsley, P. V. Gratz, A. N. Reddy, C. Wilkerson, and Z. Chishti, "Path confidence based lookahead prefetching," in *IEEE/ACM International Symposium on Microarchitecture*. Association for Computing Machinery, Oct. 2016, pp. 1–12.
- [12] M. Bakhshalipour, M. Shakerinava, P. Lotfi-Kamran, and H. Sarbazi-Azad, "Bingo spatial data prefetcher," in *IEEE International Symposium on High-Performance Computer Architecture*, Feb. 2019, pp. 399–411.
- [13] E. Bhatia, G. Chacon, S. Pugsley, E. Teran, P. V. Gratz, and D. A. Jiménez, "Perceptron-based prefetch filtering," in 47th ACM/IEEE Annual International Symposium on Computer Architecture. Phoenix, Arizona: IEEE, Jun. 2019, pp. 1–13. [Online]. Available: https://doi.org/10.1145/3307650.3322207
- [14] H. Wu, K. Nathella, M. Pabst, D. Sunwoo, A. Jain, and C. Lin, "Practical temporal prefetching with compressed on-chip metadata," *IEEE Transactions on Computers*, 2021.
- [15] H. Wu, K. Nathella, J. Pusdesris, D. Sunwoo, A. Jain, and C. Lin, "Temporal prefetching without the off-chip metadata," in *IEEE/ACM International Symposium on Microarchitecture*. Columbus, OH, USA: Association for Computing Machinery, Oct. 2019, pp. 996–1008. [Online]. Available: https://doi.org/10.1145/3352460.3358300
- [16] H. Wu, K. Nathella, D. Sunwoo, A. Jain, and C. Lin, "Efficient metadata management for irregular data prefetching," in 47th ACM/IEEE Annual International Symposium on Computer Architecture. Phoenix, Arizona: IEEE, Jun. 2019, pp. 449–461. [Online]. Available: https://doi.org/10.1145/3307650.3322225
- [17] S. Pakalapati and B. Panda, "Bouquet of instruction pointers: Instruction pointer classifier-based spatial hardware prefetching," in 47th ACM/IEEE Annual International Symposium on Computer Architecture. Virtual Event: IEEE, May 2020, pp. 118–131.
- [18] A. Ros and A. Jimborean, "A cost-effective entangling prefetcher for instructions," in 48th ACM/IEEE Annual International Symposium on Computer Architecture. Valencia, Spain: IEEE, Jun. 2021, pp. 99–111.
- [19] S. Sethumurugan, J. Yin, and J. Sartori, "Designing a cost-effective cache replacement policy using machine learning," in *IEEE International Symposium on High-Performance Computer Architecture*, Feb. 2021, pp. 291–303.
- [20] P.-Y. Péneau, D. Novo, F. Bruguier, L. Torres, G. Sassatelli, and A. Gamatié, "Improving the performance of STT-MRAM LLC through enhanced cache replacement policy," in *Architecture of Computing Systems*. Springer International Publishing, 2018, pp. 168–180.
- [21] Y. Deng, J. Yue, Z. Lu, and Y. Zhu, "Efficient hardware-assisted outplace update for persistent memory," in *Design, Automation, and Test in Europe Conference*, Virtual Event, Feb. 2021, pp. 507–512.
- [22] A. Jain and C. Lin, "Rethinking Belady's algorithm to accommodate prefetching," in 46th ACM/IEEE Annual International Symposium on Computer Architecture. Los Angeles, California: IEEE, Jun. 2018, pp. 110–123.
- [23] G. Vavouliotis, L. Alvarez, B. Grot, D. Jiménez, and M. Casas, "Morrigan: A composite instruction TLB prefetcher," in *IEEE/ACM International Symposium on Microarchitecture*. Virtual Event: Association for Computing Machinery, Oct. 2021, pp. 1138–1153. [Online]. Available: https://doi.org/10.1145/3466752.3480049
- [24] G. Vavouliotis, L. Alvarez, V. Karakostas, K. Nikas, N. Koziris, D. A. Jiménez, and M. Casas, "Exploiting page table locality for agile TLB prefetching," in 48th ACM/IEEE Annual International Symposium on Computer Architecture. Valencia, Spain: IEEE, Jun. 2021, pp. 85–98.
- [25] C. Lin and S. J. Tarsa, "Branch prediction is not a solved problem: Measurements, opportunities, and future directions," *CoRR*, vol. abs/1906.08170, 2019. [Online]. Available: http://arxiv.org/abs/1906.08170
- [26] T. Asheim, B. Grot, and R. Kumar, "BTB-X: A storage-effective BTB organization," *IEEE Computer Architecture Letters*, vol. 20, no. 2, pp. 134–137, 2021.
- [27] Y. Ishii, J. Lee, K. Nathella, and D. Sunwoo, "Re-establishing fetch-directed instruction prefetching: An industry perspective," in *IEEE International Symposium on Performance Analysis of Systems and Software*, Virtual Event, Mar. 2021, pp. 172–182.
- [28] P. Kumar, C. S. Yashavant, and B. Panda, "DAMARU: A denial-of-service attack on randomized last-level caches," *IEEE Computer Architecture Letters*, vol. 20, no. 2, pp. 138–141, 2021.
- [29] E. C. Barboza, S. Jacob, M. Ketkar, M. Kishinevsky, P. Gratz, and J. Hu, "Automatic microprocessor performance bug detection," in *IEEE International Symposium on High-Performance Computer Architecture*, Feb. 2021, pp. 545–556.
- [30] Z. Shi, X. Huang, A. Jain, and C. Lin, "Applying deep learning to the cache replacement problem," in *IEEE/ACM International Symposium on Microarchitecture*. Columbus, OH, USA: Association for Computing Machinery, Oct. 2019, pp. 413–425. [Online]. Available: https://doi.org/10.1145/3352460.3358319
- [31] P. V. Gratz, J. Kim, G. Chacon, A. Alameldeen, C. Wilkerson, S. Pugsley, A. Jaleel, B. Falsafi, M. Sutherland, and J. Picorel, "The second cache replacement championship," 2017.

- [32] S. Pugsley, A. Alameldeen, and M. Ferdman, "The third data prefetching championship," 2019.
- [33] S. Pugsley, A. Alameldeen, M. Al-otoom, and H. Zhou, "The first instruction prefetching championship," 2020.
- [34] A. Jain and Q. Duong, "ML-based data prefetching competition," 2021.
- [35] A. Akram and L. Sawalha, "A survey of computer architecture simulation techniques and tools," *IEEE Access*, vol. 7, pp. 78120–78145, 2019.
- [36] H. Brais, R. Kalayappan, and P. R. Panda, "A survey of cache simulators," ACM Comput. Surv., vol. 53, no. 1, Feb. 2020. [Online]. Available: https://doi.org/10.1145/3372393
- [37] Hpsresearchgroup, "hpsresearchgroup/scarab: Joint hps and eth repository to work towards open sourcing scarab and ramulator." [Online]. Available: https://github.com/hpsresearchgroup/scarab
- [38] Y. Kim, W. Yang, and O. Mutlu, "Ramulator: A fast and extensible DRAM simulator," *IEEE Computer Architecture Letters*, vol. 15, no. 1, 2016.
- [39] A. A. Gubran and T. M. Aamodt, "Emerald: Graphics modeling for SoC systems," in 47th ACM/IEEE Annual International Symposium on Computer Architecture. Phoenix, Arizona: IEEE, Jun. 2019, pp. 169–182. [Online]. Available: https://doi.org/10.1145/3307650.3322221
- [40] Y. S. Shao, S. L. Xi, V. Srinivasan, G.-Y. Wei, and D. Brooks, "Co-designing accelerators and SoC interfaces using gem5-aladdin," in IEEE/ACM International Symposium on Microarchitecture. Association for Computing Machinery, Oct. 2016.
- [41] J. Power, J. Hestness, M. S. Orr, M. D. Hill, and D. A. Wood, "gem5-gpu: A heterogeneous CPU-GPU simulator," *IEEE Computer Architecture Letters*, vol. 14, no. 1, pp. 34–36, 2015.
- [42] Y. M. Qureshi, W. A. Simon, M. Zapater, K. Olcoz, and D. Atienza, "Gem5-X, a many-core heterogeneous simulation platform for architectural exploration and optimization," ACM Transactions on Architecture and Code Optimization, vol. 18, no. 4, pp. 1–27, Dec. 2021. [Online]. Available: https://doi.org/10.1145/3461662
- [43] D. Sanchez and C. Kozyrakis, "Zsim: Fast and accurate microarchitectural simulation of thousand-core systems," in 40th ACM/IEEE Annual International Symposium on Computer Architecture. Tel-Aviv, Israel: IEEE, Jun. 2013, pp. 475–486. [Online]. Available: https://doi.org/10.1145/2485922.2485963
- [44] T. E. Carlson, W. Heirman, and L. Eeckhout, "Sniper: Exploring the level of abstraction for scalable and accurate parallel multicore simulation," in *International Conference for High Performance Computing, Networking, Storage and Analysis.* Seattle, Washington: Association for Computing Machinery, 2011. [Online]. Available: https://doi.org/10.1145/2063384.2063454
- [45] J. E. Miller, H. Kasture, G. Kurian, C. Gruenwald, N. Beckmann, C. Celio, J. Eastep, and A. Agarwal, "Graphite: A distributed parallel simulator for multicores," in *IEEE International Symposium on High-Performance Computer Architecture*, Jan. 2010, pp. 1–12.