# Protocol Guided Analysis of Post-silicon Traces Under Limited Observability

by

# Yuting Cao

A thesis submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science, Department of Computer Science and Engineering, College of Engineering, University of South Florida

> Major Professor: Hao Zheng, Ph.D. Dmitry Goldgof, Ph.D. Yao Liu, Ph.D.

> > Date of Approval: To be determined

Keywords: silicon, validation, trace, analysis, observability, signal selection

Copyright © 2015, Yuting Cao

# ACKNOWLEDGMENTS

I am grateful to Dr. Hao Zheng for his precious and constant help on this project.

# TABLE OF CONTENTS

| Contents |
|----------|
| LIST OF TABLES |
| LIST OF FIGURES |
| ABSTRACT |
| CHAPTER 1 INTRODUCTION |
| 1.1 Silicon Validation |
| 1.2 Pre- and Post-silicon Validation |
| 1.3 Related Work |
| 1.4 Motivation |
| CHAPTER 2 BACKGROUND |
| 2.1 SoC Protocols and Post-silicon Trace Analysis |
| 2.2 Labeled Petri-Nets |
| CHAPTER 3 FLOW GUIDED TRACE INTERPRETATION |
| 3.1 Notations and formalization |
| 3.2 Flow Guided Trace Interpretation |
| CHAPTER 4 TRACE ANALYSIS UNDER PARTIAL OBSERVABILITY |
| 4.1 Single signal event |
| 4.2 Sequence of signal events |
| 4.3 Difficulties and solutions |
| 4.4 Trace Signal Selection |
| 4.5 Interactive Trace Interpretation |
| CHAPTER 5 IMPLEMENTATIONS AND RESULTS |
| 5.1 Simulation simple TLM SoC with GEM5 |
| 5.2 Simulation simple RTL SoC with VHDL |
| CHAPTER 6 CONCLUSION AND FUTURE WORKS |
| APPENDICES |
| Appendix A Flow specifications and protocols provided by GEM5 |
| A.1 Flow Specifications |
| A.2 Protocols |
| Appendix B RTL model in VHDL |
| B.1 Flow Specification |
| B.2 Protocol |
| LIST OF REFERENCES |

# LIST OF TABLES

| Table     | Caption                                                                                    |
|-----------|--------------------------------------------------------------------------------------------|
| Table 5.1 | Runtime Results of Trace analysis. Time is in seconds and memory usage is in MB.           |
| Table 5.2 | The number of flow instances derived by the trace analysis with the full observability.    |
| Table 5.3 | The number of flow instances derived by the trace analysis with certain monitors disabled. |
| Table 5.4 | The number of flow instances derived by the trace analysis with the full observability.    |

# LIST OF FIGURES

| Figure | Caption |
|--------|---------|
| Figure 2.1 | (a) A graphical representation of a SoC firmware load protocol [5]. (b) LPN formalization. Each event has a form of $\langle \mathtt{src}, \mathtt{dest}, \mathtt{cmd} \rangle$ where cmd is a command sent from a source component src to a destination component dest. The solid black places without outgoing edges are terminals, which indicate termination of protocols represented by the LPNs. |
| Figure 5.1 | SoC platform structure. |
| Figure 5.2 | SoC platform structure. |
| Figure A.1 | Flow sequence chart of write operation when requested data is not included in Dcache. ReadExRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2. |
| Figure A.2 | Flow sequence chart of write operation when XCache has the exclusive right of requested data. XCache can be instruction cache or data cache. This sequence chart is symmetric for CPU2. |
| Figure A.3 | Flow sequence chart of write operation when requested data is shared by another component. UpgradeRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2. |
| Figure A.4 | Flow sequence chart of read operation when XCache has the exclusive right of requested data. XCache can be instruction cache or data cache. This sequence chart is symmetric for CPU2. |
| Figure A.5 | Flow sequence chart of read operation when requested data is shared by another component. LoadLockedRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2. |
| Figure A.6 | Flow sequence chart of read operation when requested data is not present. StoreCondRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2. |
| Figure A.7 | Flow specification of a cache coherent write operation initiated from CPU1 to instruction cache. This flow is symmetric for CPU2. |
| Figure A.8 | Flow specification of a cache coherent read operation initiated from CPU1 to instruction cache. This flow is symmetric for CPU2. |
| Figure A.9 | Flow specification of a cache coherent read operation initiated from CPU1 to data cache. This flow is symmetric for CPU2. |
| Figure B.1 | CPU write when cache has exclusive right of the requested data. |
| Figure B.2 | CPU write when data only exist in the other CPU's cache |
| Figure B.3 | CPU write when requested data only reside in Memory |
| Figure B.4 | Cache send write back request to Memory |
| Figure B.5 | CPU read when cache has exclusive right of the requested data. |
| Figure B.6 | CPU read when data only exist in the other CPU's cache |
| Figure B.7 | CPU read when requested data only reside in Memory |
| Figure B.8 | Flow specification of a cache coherent read operation initiated from CPU1 to Cache. This flow is symmetric for CPU2. |
| Figure B.9 | Flow specification of a cache coherent write operation initiated from CPU1 to Cache. This flow is symmetric for CPU2. |
| Figure B.10 | Flow specification of a cache coherent read operation initiated from CPU1 to Cache. This flow is symmetric for CPU2. |

# **ABSTRACT**

We consider the problem of reconstructing system-level behavior of an SoC design from a partially observed signal trace. Solving this problem is a critical activity in post-silicon validation, and currently depends primarily on human creativity and insights. In this thesis, we provide an algorithm to automatically infer system-level transactions from incomplete, ambiguous, and noisy trace data. We demonstrate the approach on a multicore virtual platform developed within the GEM5 environment.

#### INTRODUCTION

In this thesis, Chapter 2 presents background on post-silicon trace analysis and labeled Petri-nets. Chapter 3 formalizes flow guided trace interpretation under full observability, and Chapter 4 extends the method to partial observability. Chapter 5 describes our implementations and results. Chapter 6 summarizes our work and discusses future work. All the flow specifications and protocols used in the implementations are explained in the appendices.

#### 1.1 Silicon Validation

Validation is the activity of ensuring that a product satisfies its specifications, is compatible with related software and hardware, and meets user expectations [2].

Silicon validation is needed because the number of processor bugs is growing, and bugs are becoming more diverse and complex.

#### 1.2 Pre- and Post-silicon Validation

The product development cycle often tends to be linear. It starts with planning and architecture, followed by RTL and schematic creation and by architectural and functional validation (pre-silicon validation), leading up to tape-out. Post-silicon validation starts when the first chip arrives. Eventually, when all specifications are met, the product is released to the market [2].

Pre-silicon validation aims to verify the architecture design before it is implemented on an actual chip. It is mainly done at the RTL level on simulators, FPGAs, or emulators. During pre-silicon validation, the cost of fixing a bug is relatively low: any bug can be fixed by modifying the RTL in a small amount of time. The downside is that pre-silicon validation is limited by its speed and coverage. Simulators are usually very slow and only suitable for small portions of the design. FPGAs are up to 3 orders of magnitude faster. Emulators can combine multiple FPGAs to work on a larger portion of the RTL design at the cost of limited speed. These limitations mean that pre-silicon validation can verify only part of the system.

Post-silicon validation makes use of pre-production silicon integrated circuits (ICs) to ensure that the fabricated system works as desired under actual operating conditions with real software. Since the silicon executes at target clock speed, post-silicon executions are billions of times faster than RTL simulations, and even provide a speed-up of several orders of magnitude over other pre-silicon platforms (e.g., FPGA, system-level emulation, etc.). This makes it possible to explore deep design states which cannot be exercised in pre-silicon, and to identify errors missed during pre-silicon validation and debug. However, limited pin-out and other architectural factors make it impossible to have full observability of the system. Only a limited number of signals can be observed and traced. This limitation makes both detecting bugs and debugging them challenging.

#### 1.3 Related Work

Paper [2] gives the definition of silicon validation and explains why it is important, together with the current techniques used for pre- and post-silicon validation.

Our work is closely related to communication-centric and transaction based debug. An early pioneering work is described in [8], which advocates the focus on observing activities on the interconnect network among IP blocks, and mapping these activities to transactions for better correlation between computations and communications. Therefore, the communication transactions, as a result of software execution, provide an interface between computation and communication, and facilitate system-level debug. This work is extended in [9, 10]. However, this line of work is focused on the network-on-chip (NoC) architecture for interconnect using the run/stop debug control method.

A similar transaction-based debug approach is presented in [11]. Furthermore, it proposes an automated extraction of state machines at transaction level from high level design models. From an observed failure trace, it performs backtracking on this transaction level state machine to derive a set of transaction traces that lead to the observed failure state. In the subsequent step, bounded model checking with the constraints on the internal variables is used to refine the set of transaction traces to remove the infeasible traces. This approach requires user inputs to identify impossible transaction sequences, and may not find the states causing the failure if the transaction traces leading to the observed failure state are long. Backtracking from the observed failure state requires pre-image computation, which can be computationally expensive. A transaction-based online debug approach is proposed in [12] to address these issues. This approach utilizes a transaction debug pattern specification language [13] to define properties that transactions should meet. These transaction properties are checked at runtime by programming debug units in the on-chip debug infrastructure, and the system can be stopped shortly after a violation is detected for any one of those properties. In this sense, it can be viewed as the hardware assertion approaches in [14] elevated to the transaction level.

In [15], a coherent workflow is described where the result from the pre-silicon validation stage can be carried over to the post-silicon stage to improve efficiency and productivity of post-silicon debug. This workflow is centered on a repository of system events and simple transactions defined by architects and IP designers. It spans across a wide spectrum of the post-silicon validation including DFx instrumentation, test generation, coverage, and debug. The DFx instruments are automatically inserted into the design RTL code driven by the defined transactions. This instrumentation is optimized for making a large set of events and transactions observable. Test generation is also optimized to generate only the necessary but sufficient tests to allow all defined transactions to be exercised. Moreover, coverage for post-silicon validation is now defined at the abstract level of events and transactions rather than the raw signals, and thus can be evaluated more efficiently. In [16], a model at an even higher-level of abstraction, *flows*, is proposed. Flows are used to specify more sophisticated cross-IP transactions such as power management, security, etc., and to facilitate reuse of the efforts of the architectural analysis to check HW/SW implementations.

#### 1.4 Motivation

Post-silicon validation is a critical component of the design validation life-cycle for modern microprocessors and SoC designs. Unfortunately, it is also a highly complex component, performed under aggressive schedules and accounting for more than 50% of the overall design validation cost [6]. Consequently, it is crucial to develop techniques for streamlining and automating post-silicon validation activities.

A key component of post-silicon validation of SoC designs is to correlate traces from silicon execution with the intended system-level transactions. An SoC design is typically composed of a large number of pre-designed hardware or software blocks (often referred to as "intellectual properties" or "IPs") that coordinate through complex protocols to implement the system-level behavior. Any execution trace of the system involves a large number of interleaved instances of these protocols. For example, consider a smartphone executing a usage scenario where the end-user browses the Web while listening to music and sending and receiving occasional text messages. Typical post-silicon validation use-cases involve exercising such scenarios.

An execution trace would involve activities from the CPU, audio controller, display controller, wireless radio antenna, etc., reflecting the interleaved execution of several communication protocols. On the other hand, due to observability limitations, only a small number of participating signals can be actually traced during silicon execution. Furthermore, due to electrical perturbations, silicon data can be noisy, lossy, and ambiguous. Consequently, it is non-trivial to identify all participating protocols and pinpoint the interleaving that results in an observed trace.

With the increasing complexity of modern SoC designs, debugging protocols inside IP blocks by themselves is no longer enough. The complexity of the SoC increasingly resides in the interactions between the IP blocks. Therefore, debug must be conducted at a higher system level, where the computation threads and communication threads interact. Because the interconnect implements the communication, and hence the synchronization, between the IP blocks, it is the natural focus for system-level debug [8].

In this thesis, we consider the problem of reconstructing protocol-level behavior from silicon traces in SoC designs. Given a collection of system-level communication protocols and a trace of (partially observed) hardware signals, our approach infers, with a certain measure of confidence, the protocol instances (and their interleavings) being exercised by the trace. The approach is based on a formalization of system-level transactions via labeled Petri-Nets, which are capable of describing sequencing, concurrency, and choices over system events. We develop algorithms to infer system-level transactions from traces with missing, noisy, and ambiguous signal values. We demonstrate our approach on a multicore virtual platform constructed within the GEM5 environment [7] and another RTL model written in VHDL.

#### BACKGROUND

#### 2.1 SoC Protocols and Post-silicon Trace Analysis

An SoC design involves integration of a number of IPs that communicate through complex protocols. Such system-level protocols are typically specified in architecture documents as message flow diagrams. In this thesis, we use the words "protocol" and "flow" interchangeably. Fig. 2.1(a) shows one diagram for a protocol to authenticate and load firmware during system boot for firmware upgrade. During validation, the system under debug (SUD) exercises some complex system-level use-case which involves interleaved execution of possibly a large number of such flows. A trace of a small number of hardware signals is then shipped off-chip for analysis. The off-chip analysis includes two broad phases: (1) trace abstraction, and (2) trace interpretation. Trace abstraction maps the hardware trace into higher-level architectural constructs, e.g., messages, operations, etc.: a message such as Authorization request may be implemented in hardware through a Boolean or temporal combination of specific hardware signals in the NoC fabric between Device and CE, e.g., as a sequence containing a header, a specific value of a sequence of data words, etc. We will refer to such architectural constructs as protocol events or flow events. Note that due to limited observability, it may not be possible to map a given set of (observed) hardware signals uniquely to a flow event. Finally, the trace may be a result of several instances of the same protocol executing concurrently, e.g., a firmware authentication protocol may be invoked when another instance of the protocol has not completed.



Figure 2.1. (a) A graphical representation of a SoC firmware load protocol [5]. (b) LPN formalization. Each event has a form of  $\langle src, dest, cmd \rangle$  where cmd is a command sent from a source component src to a destination component dest. The solid black places without outgoing edges are *terminals*, which indicate termination of protocols represented by the LPNs.

Trace interpretation entails mapping flow events created during trace abstraction to system-level protocols in order to identify the set of protocol instances (and interleavings) responsible for creating the observed behavior. The trace may identify a problem in the protocols themselves, e.g. an interleaving of some protocol executions may lead to an unexpected message being sent or cause the system to crash. More commonly, one finds a bug in the implementation of the protocol, i.e., a trace inconsistent with any possible interleaving of the protocol executions. Identifying these problems involves significant human expertise, and can often take days to weeks of effort.

#### 2.2 Labeled Petri-Nets

A labeled Petri-net (LPN) is a formalization of state transition systems that is capable of describing sequencing, concurrency, and choices. Fig. 2.1(b) illustrates how to use LPNs to formalize protocols. Formally, an LPN is a tuple $(P, T, init, E, L)$ where $P$ is a finite set of places, $T$ is a finite set of transitions, $init \subseteq P$ is the set of initially marked places, also referred to as the initial marking, $E$ is a finite set of events, and $L : T \rightarrow E$ is a labeling function that maps each transition $t \in T$ to an event $e \in E$. For each transition $t \in T$, its preset, denoted as $\bullet t \subseteq P$, is the set of places connected to $t$, and its postset, denoted as $t \bullet \subseteq P$, is the set of places that $t$ is connected to. A marking $s \subseteq P$ is a subset of places marked with tokens, and it is also referred to as a state of an LPN. The initial marking $init$ is also the initial state of the LPN.

#### FLOW GUIDED TRACE INTERPRETATION

In this chapter we formalize the trace interpretation problem in terms of labeled Petri-nets, and discuss our algorithms to address the problem. For pedagogical reasons, here we assume full observability of all hardware signals involved in the flow events. In the next chapter we will extend the approach to partial observability.

#### 3.1 Notations and formalization

The set of system flows is a collection  $\vec{F}$  of LPNs. A flow execution scenario is defined as a set  $\{(F_{i,j}, s_{i,j})\}$  where  $F_{i,j}$  is the jth instance of flow  $F_i \in \vec{F}$ , and  $s_{i,j}$  is a state of  $F_{i,j}$ . A flow execution scenario indicates which protocols are activated, how many instances of each protocol are activated, and the current state of each instance. Since we assume full observability, we view an observed trace  $\rho = e_1 e_2 \dots e_n$  as a sequence of events. Given an observed trace  $\rho$ , the goal of trace interpretation is to construct a set of candidate flow execution scenarios whose execution can create the sequence of events in  $\rho$ . We call such execution scenarios compliant with  $\rho$ . Let  $accept(F_{i,j}, s_{i,j}, e)$  be a function that determines if event e can be emitted by  $F_{i,j}$  in state  $s_{i,j}$ . Formally,  $accept(F_{i,j}, s_{i,j}, e)$  returns  $(F_{i,j}, s'_{i,j})$  where  $s'_{i,j} = (s_{i,j} - \bullet t) \cup t \bullet$  if there exists a transition t in  $F_i$  such that L(t) = e and  $\bullet t \subseteq s_{i,j}$ . It returns  $\emptyset$  otherwise.
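The definition of $accept$ can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the class names `Transition` and `LPN`, the two-step `req_ack` flow, and its place and event names are all hypothetical. When several transitions match, the sketch simply takes the first one, and it returns `None` where the text returns $\emptyset$.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    preset: frozenset   # places that must be marked for t to fire (•t)
    postset: frozenset  # places marked after t fires (t•)
    label: str          # flow event L(t)


@dataclass(frozen=True)
class LPN:
    name: str
    transitions: tuple  # tuple of Transition
    init: frozenset     # initial marking


def accept(flow, state, event):
    """Return s' = (s - •t) ∪ t• if some t has L(t) = event and •t ⊆ s, else None."""
    for t in flow.transitions:
        if t.label == event and t.preset <= state:
            return (state - t.preset) | t.postset
    return None


# Hypothetical flow: p0 --req--> p1 --ack--> p2
req_ack = LPN(
    name="req_ack",
    transitions=(
        Transition(frozenset({"p0"}), frozenset({"p1"}), "req"),
        Transition(frozenset({"p1"}), frozenset({"p2"}), "ack"),
    ),
    init=frozenset({"p0"}),
)

after_req = accept(req_ack, req_ack.init, "req")   # marking {p1}
rejected = accept(req_ack, req_ack.init, "ack")    # None: ack is not enabled yet
```

Markings are stored as frozensets so that the subset test and the set difference and union in the definition map directly onto Python set operators.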

#### 3.2 Flow Guided Trace Interpretation

Given an observed trace  $\rho$  and the set  $\vec{F}$  of LPNs, Algorithm 1 provides a basic procedure for computing a set of compliant flow execution scenarios. The algorithm operates by keeping track (in variable Scen) of a set of candidate flow execution scenarios compliant with each prefix of  $\rho$ . At each iteration, for each event  $e_h$  in the observed trace, we update Scen by either extending a member of Scen or initiating a new protocol instance for each  $scen \in Scen$  with respect to  $e_h$  in every possible way. If no scenario can be updated in this way, then we report that the trace is inconsistent, i.e., there is no interleaving of the protocol instances from  $\vec{F}$  that is compliant with  $\rho$ .

To illustrate the basic idea, consider the system flow in Fig. 2.1(b), which we will call  $F_1$ . Suppose that the following flow trace is abstracted from an observed signal trace.

$$t_1 t_2 t_1 t_2 t_3 t_3 t_4 t_5 t_5 t_4 \dots$$

Here transition names in the LPN are used to represent the flow events in the trace. The first four events result in the following flow execution scenario.

$$\{(F_{1,1}, \{p_3\}), (F_{1,2}, \{p_3\})\}.$$

The first event  $t_3$  results in the two execution scenarios below, depending on which flow instance emits  $t_3$ .

$$\{(F_{1,1}, \{p_4\}), (F_{1,2}, \{p_3\})\}$$

$$\{(F_{1,1}, \{p_3\}), (F_{1,2}, \{p_4\})\}.$$

After handling the next event  $t_3$ , the above two execution scenarios are reduced to the one shown below.

$$\{(F_{1,1}, \{p_4\}), (F_{1,2}, \{p_4\})\}.$$

Using Algorithm 1 to handle the remaining four events, the following execution scenario is derived.

$$\{(F_{1,1}, \{p_5, p_6\}), (F_{1,2}, \{p_5, p_6\})\}$$

```
Create an empty scenario scen
Scen \leftarrow \{scen\}
foreach h, 1 \le h \le n do
     found \leftarrow false
     Scen' \leftarrow \emptyset
     foreach scen \in Scen do
          foreach (F_{i,j}, s_{i,j}) \in scen do
               if accept(F_{i,j}, s_{i,j}, e_h) = (F_{i,j}, s'_{i,j}) then
                    Let scen' be a copy of scen
                    scen' \leftarrow (scen' - \{(F_{i,j}, s_{i,j})\}) \cup \{(F_{i,j}, s'_{i,j})\}
                    Scen' \leftarrow Scen' \cup \{scen'\}
                    found \leftarrow true
               end
          end
          foreach F_i \in \vec{F} do
               create a new instance F_{i,j+1}
               if accept(F_{i,j+1}, init_{i,j+1}, e_h) = (F_{i,j+1}, s'_{i,j+1}) then
                    Let scen' be a copy of scen
                    scen' \leftarrow scen' \cup \{(F_{i,j+1}, s'_{i,j+1})\}
                    Scen' \leftarrow Scen' \cup \{scen'\}
                    found \leftarrow true
               end
          end
     end
     if found = false then
          return Inconsistent
     end
     Scen \leftarrow Scen'
end
return Scen
```

**Algorithm 1:** Check-Compliance $(\vec{F}, \rho)$ 
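The procedure of Algorithm 1 can be sketched in Python as below. This is an illustrative sketch under stated assumptions rather than the thesis implementation: flows are plain dictionaries, a scenario is a frozenset of `(flow name, instance index, marking)` triples, and the single flow `F` with events `req` and `ack` is hypothetical. Storing scenarios in a set merges identical scenarios automatically, which mirrors the reduction of two scenarios to one in the example above. The function returns `None` where the algorithm returns Inconsistent.

```python
def accept(flow, state, event):
    """Successor marking if `flow` can emit `event` in `state`, else None."""
    for preset, postset, label in flow["transitions"]:
        if label == event and preset <= state:
            return (state - preset) | postset
    return None


def check_compliance(flows, trace):
    """Return the set of scenarios compliant with `trace`, or None if inconsistent."""
    scenarios = {frozenset()}  # start from one empty scenario
    for event in trace:
        next_scenarios = set()
        for scen in scenarios:
            # Case 1: an existing flow instance emits the event.
            for inst in scen:
                name, idx, state = inst
                new_state = accept(flows[name], state, event)
                if new_state is not None:
                    next_scenarios.add((scen - {inst}) | {(name, idx, new_state)})
            # Case 2: the event starts a new instance of some flow.
            for name, flow in flows.items():
                new_state = accept(flow, flow["init"], event)
                if new_state is not None:
                    idx = 1 + sum(1 for n, _, _ in scen if n == name)
                    next_scenarios.add(scen | {(name, idx, new_state)})
        if not next_scenarios:
            return None  # no scenario explains this event
        scenarios = next_scenarios
    return scenarios


fs = frozenset
# Hypothetical flow: p0 --req--> p1 --ack--> p2
flows = {
    "F": {
        "init": fs({"p0"}),
        "transitions": [
            (fs({"p0"}), fs({"p1"}), "req"),
            (fs({"p1"}), fs({"p2"}), "ack"),
        ],
    }
}

result = check_compliance(flows, ["req", "req", "ack", "ack"])
inconsistent = check_compliance(flows, ["ack"])
```

For the trace `req req ack ack`, the first `ack` yields two scenarios, depending on which instance emits it, and the second `ack` merges them into a single scenario in which both instances have reached `p2`.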

#### TRACE ANALYSIS UNDER PARTIAL OBSERVABILITY

In general, a signal trace of partial observability corresponds to a set of traces of flow events due to the ambiguous interpretation of signal events. In the following, we discuss two cases for trace abstraction on partial observability: mapping a single signal event to a flow event or mapping a sequence of signal events to a flow event. A signal event is defined as a state on or an assignment to a set of signals. Hereafter, the term flow traces is used to refer to traces of flow events.

#### 4.1 Single signal event

Consider the following example for the first case. Suppose that there are three flow events:  $e_1$ ,  $e_2$ , and  $e_3$ , which are implemented in hardware by the signal events shown in the list below. We use Boolean expressions to represent signal events for the discussion.

$$e_1: abc$$

$$e_2: \bar{a}bc$$

$$e_3: a\bar{b}c$$

Suppose that only signals b and c are observable, and we obtain the following trace:

$$bc \ bc \ \bar{b}c$$


During trace abstraction, each of the first two signal events  $bc$  can be mapped to  $\{e_1, e_2\}$  since a is not observable, and the last one  $\bar{b}c$  is mapped to  $\{e_3\}$ . Therefore, this signal trace is abstracted to four flow traces,  $\{e_1, e_2\} \times \{e_1, e_2\} \times \{e_3\}$ .
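The abstraction in this example can be reproduced with a short Python sketch. The encoding of signal events as dictionaries of signal values and the function names `candidates` and `abstract` are assumptions made for illustration only.

```python
from itertools import product

# Signal-level implementation of each flow event (1 = asserted, 0 = negated).
FLOW_EVENTS = {
    "e1": {"a": 1, "b": 1, "c": 1},  # abc
    "e2": {"a": 0, "b": 1, "c": 1},  # (not a) b c
    "e3": {"a": 1, "b": 0, "c": 1},  # a (not b) c
}


def candidates(observation):
    """Flow events that agree with the observation on every observed signal."""
    return [
        name
        for name, signals in FLOW_EVENTS.items()
        if all(signals[sig] == value for sig, value in observation.items())
    ]


def abstract(observed_trace):
    """All flow traces consistent with a partially observed signal trace."""
    per_step = [candidates(obs) for obs in observed_trace]
    return list(product(*per_step))


# Only b and c are observable: bc, bc, (not b) c
observed = [{"b": 1, "c": 1}, {"b": 1, "c": 1}, {"b": 0, "c": 1}]
flow_traces = abstract(observed)
```

Each unobserved signal widens the candidate set of a step, and the number of flow traces is the product of the candidate set sizes, here $2 \times 2 \times 1 = 4$.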

#### 4.2 Sequence of signal events

Next, we consider the case where a flow event is mapped from a sequence of signal events. Now suppose that two other flow events are implemented by sequences of signal events as defined in the list below.

 $e_4$ :  $abc \bar{a}bc$ 

 $e_5$ : abc abc abc  $\bar{a}bc$ 

Given an observed trace of the same observability shown below

bc bc bc bc,

it is abstracted to the following flow traces.

$$e_4e_4$$
,  $_{-}e_4$ ,  $_{-}e_4$ ,  $_{-}e_5$ 

where \_ denotes signal events that are not mapped to any flow events. Note that the above abstraction leads to three distinct flow traces as the middle three correspond to the same flow trace.
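The enumeration above can be sketched in Python as a recursive segmentation of the observed trace. The encoding and the names `project`, `segmentations`, and `PATTERNS` are hypothetical. One assumption is labeled in the code: the interpretation in which no signal event is mapped to any flow event is dropped, to match the listing in the text.

```python
OBSERVABLE = ("b", "c")  # signal a is not traced

ABC = {"a": 1, "b": 1, "c": 1}       # abc
NOT_A_BC = {"a": 0, "b": 1, "c": 1}  # (not a) b c

# Each flow event is implemented by a sequence of signal events.
PATTERNS = {
    "e4": [ABC, NOT_A_BC],
    "e5": [ABC, ABC, ABC, NOT_A_BC],
}


def project(signal_event):
    """Keep only the observable signals of a signal event."""
    return tuple(signal_event[sig] for sig in OBSERVABLE)


PROJECTED = {name: [project(ev) for ev in seq] for name, seq in PATTERNS.items()}


def segmentations(observed, start=0):
    """Yield every way to cover observed[start:] with flow events or unmapped '_'."""
    if start == len(observed):
        yield []
        return
    # Option 1: leave this signal event unmapped.
    for rest in segmentations(observed, start + 1):
        yield ["_"] + rest
    # Option 2: match a flow event pattern starting at this position.
    for name, pattern in PROJECTED.items():
        end = start + len(pattern)
        if observed[start:end] == pattern:
            for rest in segmentations(observed, end):
                yield [name] + rest


observed = [(1, 1)] * 4  # bc bc bc bc
# Assumption: drop the interpretation where nothing is mapped to a flow event.
mapped = [s for s in segmentations(observed) if any(x != "_" for x in s)]
distinct = {tuple(x for x in s if x != "_") for s in mapped}
```

Here `mapped` holds the five segmentations listed above, and removing the unmapped positions collapses them into the three distinct flow traces `e4 e4`, `e4`, and `e5`.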

#### 4.3 Difficulties and solutions

It is clear from the above that a partial trace is viewed as a set of flow traces, and Algorithm 1 can be suitably extended to work with flow traces to obtain the set of candidate flows. However, the applicability of the algorithm in practice can be limited because the number of potential flow execution scenarios generated under partial observability may be enormous.

Note that this is not a limitation of the algorithm; if the observability of critical events is poor there simply *are* too many flow execution scenarios compliant with the observed trace.

Nevertheless, we need to address the issue to make trace interpretation (whether automatic or not) practicable. There are two potential approaches: (1) better selection of post-silicon trace observability, and (2) use of system insights during validation. Trace signal selection itself is an important and orthogonal topic [17, 18], and finding an automated signal selection algorithm is one of our future research directions. In the next two sections, we briefly describe the impact of signal selection and how the debuggers' insights into a system's architecture can help to address the complexity issue in trace interpretation.

### 4.4 Trace Signal Selection

The examples in the previous sections showed that limited observability increases the number of scenarios generated during trace interpretation. The number of scenarios can grow by a factor of two or more compared with full observability. For a real-time simulation with millions of trace messages, this could lead to thousands of scenarios or more. Therefore, choosing a set of signals that can represent the largest number of flow messages is very important.

The increased number of scenarios not only adds to the analysis time but also makes debugging harder. When there are many final scenarios and we cannot tell which one reflects the real final state, we need to analyze all of them to find the possible root causes of errors. Therefore, our final goal is to choose the signals that lead to the fewest scenarios.

In this thesis, signal selection was done mainly manually: we tried different sets of signals to observe and chose the sets that led to the fewest final scenarios. Finding a method to automate the signal selection process is one of our future research directions. A lot of research has been done in this area, and most of it uses the state restoration ratio (SRR) as its metric. However, as argued in [1], SRR is not suitable for evaluating trace signal quality. We believe a better signal selection algorithm can be built with a suitable metric other than SRR.
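The manual trial of different signal sets can be illustrated with a small exhaustive search. This is only a sketch under stated assumptions: the event definitions in `EVENTS` are hypothetical (chosen so that observing only b and c makes $e_1$ and $e_2$ indistinguishable), and the number of indistinguishable event pairs is used as a cheap stand-in for the number of final scenarios, which in the thesis is obtained by running the full trace analysis.

```python
from itertools import combinations
from typing import Dict, List, Tuple

# Hypothetical flow events, each defined by the values of three signals.
EVENTS: Dict[str, Dict[str, int]] = {
    "e1": {"a": 1, "b": 1, "c": 1},
    "e2": {"a": 0, "b": 1, "c": 1},
    "e3": {"a": 0, "b": 0, "c": 1},
}


def ambiguity(observed: Tuple[str, ...]) -> int:
    """Count pairs of flow events that look identical on the observed signals."""
    clashes = 0
    for x, y in combinations(sorted(EVENTS), 2):
        if all(EVENTS[x][s] == EVENTS[y][s] for s in observed):
            clashes += 1
    return clashes


def best_signal_sets(budget: int) -> List[Tuple[str, ...]]:
    """Try every signal set of the given size and keep the least ambiguous ones."""
    signals = sorted(next(iter(EVENTS.values())))
    scored = [(ambiguity(sel), sel) for sel in combinations(signals, budget)]
    best = min(score for score, _ in scored)
    return [sel for score, sel in scored if score == best]


if __name__ == "__main__":
    print(best_signal_sets(2))
```

With a budget of two signals, observing a and b separates all three hypothetical events, while observing b and c leaves $e_1$ and $e_2$ ambiguous.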

# 4.5 Interactive Trace Interpretation

Post-silicon validation is performed by debuggers with deep knowledge about the system's architecture and micro-architecture, and the test environment. Two key insights are (1) the maximal number of instances of a flow activated in the test environment, and (2) the mutual relationship between two flows. For example, the test environment may not permit multiple instances of firmware authentication to operate concurrently, or a flow involving audio and Web browsing to initiate until the flows participating in boot are completed. Our framework permits incorporating such insights as constraints in trace analysis; flow execution scenarios that violate these constraints are ignored. These insights can lead to two advantages. First, they help to reduce the potentially large number of partial scenarios generated during the trace interpretation step, thus making the analysis more efficient. Second, they permit the debugger to quickly filter out uninteresting combinations of flows and focus on interesting interleavings.
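The two kinds of insight described above can be expressed as simple filters over candidate scenarios. The sketch below is illustrative only: a scenario is reduced to the list of flow names that are active together, and the function names and flow names (`fw_auth`, `boot`, `audio`) are hypothetical rather than taken from the framework.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def violates(
    scenario: Sequence[str],
    max_instances: Dict[str, int],
    exclusive_pairs: Sequence[Tuple[str, str]],
) -> bool:
    """Return True if the scenario breaks any debugger-supplied constraint."""
    counts = Counter(scenario)
    # Insight (1): upper bound on concurrently active instances of a flow.
    for flow, limit in max_instances.items():
        if counts[flow] > limit:
            return True
    # Insight (2): pairs of flows that must not be active at the same time.
    for first, second in exclusive_pairs:
        if counts[first] > 0 and counts[second] > 0:
            return True
    return False


def prune(
    scenarios: Sequence[Sequence[str]],
    max_instances: Dict[str, int],
    exclusive_pairs: Sequence[Tuple[str, str]],
) -> List[Sequence[str]]:
    """Keep only the scenarios that satisfy all constraints."""
    return [s for s in scenarios if not violates(s, max_instances, exclusive_pairs)]


if __name__ == "__main__":
    candidates = [["fw_auth", "boot"], ["fw_auth", "fw_auth"], ["boot", "audio"], ["audio"]]
    print(prune(candidates, {"fw_auth": 1}, [("boot", "audio")]))
```

Applying such a filter after every interpretation step, rather than only at the end, is what keeps the number of intermediate scenarios small.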

This approach is flexible in that it allows a debugger to analyze the observed traces in a trial-and-error manner if precise knowledge of the system (micro-)architecture is hard to come by. For instance, the debugger might initially make very restrictive assumptions on how the SUD executes a flow specification, and these assumptions can potentially lead to an empty set of flow execution scenarios. Depending on which of these assumptions are triggered during the trace interpretation step, the debugger can study them more carefully and relax some or all of them for the next run of the analysis. This iteration can be repeated as many times as necessary until results deemed meaningful are produced.

#### IMPLEMENTATIONS AND RESULTS

### 5.1 Simulating a Simple TLM SoC with GEM5

To determine the efficiency of the trace analysis method for a realistic example, a transaction level model of a SoC is constructed within the GEM5 environment [7]. This SoC model, as shown in Fig. 5.1, consists of two ARM Cortex-A9 cores, each of which contains two separate 16KB data and instruction caches. The caches are connected to a 1GB memory through a memory bus model. Components communicate with each other by sending and receiving various request and response messages. In order to observe and trace communications occurring inside this model during execution, monitors are attached to links connecting the components. These monitors record the messages flowing through the links they are attached to, and store them into output trace files.

For this model, we consider the flow specifications describing the cache coherence protocols supported in GEM5, which is used to build the model in Fig. 5.1. The GEM5 cache coherence protocols can be found at [19]. These flow specifications describe data/instruction read operations and data write operations initiated from CPUs. Three such flows describe the cache coherence protocols for each CPU. Since there are two CPUs, there are six flows in the model.

Figure 5.1. SoC platform structure.

Table 5.1. Runtime results of the trace analysis. Time is in seconds and memory usage is in MB.

|          | F-Obs. | P-Obs. (No Amb.) | P-Obs. (Amb. 1) | P-Obs. (Amb. 2) |
|----------|--------|------------------|-----------------|-----------------|
| Time (s) | 3      | 2.78             | 896             | < 1             |
| Mem (MB) | 12     | 10               | 420             | 9               |

We wrote two simple concurrent programs, one for each CPU, to exercise the flows. They read numbers from a file, perform some operations on these numbers, and store the results back to the file. It was unclear to us how GEM5 supports the execution of shared-memory multi-threaded programs. Therefore, no data are shared between the two caches in this test. Furthermore, GEM5 does not support true concurrency. When there are two programs running on the CPUs, GEM5 alternates the execution between the two CPUs. To simulate asynchronous concurrency with interleaving semantics, the two programs are instrumented with pseudo-blocking commands, one placed before each statement. A pseudo-blocking command consists of a random number generator that returns either 0 or 1 and a loop that exits only when the returned random number is 0.
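The pseudo-blocking command described above can be sketched in a few lines. This is an illustration in Python rather than the code that actually ran on the simulated CPUs; the function names `pseudo_block` and `run_instrumented` are hypothetical.

```python
import random
from typing import Callable, Sequence


def pseudo_block(rng: random.Random) -> int:
    """Loop until the generator returns 0; report how many draws were needed."""
    draws = 1
    while rng.randint(0, 1) != 0:  # randint(0, 1) returns either 0 or 1
        draws += 1
    return draws


def run_instrumented(statements: Sequence[Callable[[], None]], rng: random.Random) -> None:
    """Execute a program with one pseudo-blocking command before each statement."""
    for statement in statements:
        pseudo_block(rng)  # random delay that varies the interleaving between CPUs
        statement()


if __name__ == "__main__":
    log = []
    program = [lambda: log.append("read"), lambda: log.append("compute"), lambda: log.append("write")]
    run_instrumented(program, random.Random())
    print(log)
```

The order of statements within one program is preserved; only the relative timing between the two programs changes from run to run.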

After this model is executed with the simple concurrent programs, the trace analysis is applied to traces with different observabilities collected from this model. The runtime results are shown in Table 5.1. The F-Obs. column shows the results from analyzing the trace with full observability, while the three P-Obs. columns show the results from analyzing traces under different partial observability assumptions.

Table 5.2. The number of flow instances derived by the trace analysis with full observability.

| Flows                 | #Instances |
|-----------------------|------------|
| CPU1 Data Read        | 17582      |
| CPU1 Instruction Read | 4002       |
| CPU1 Write            | 3370       |
| CPU2 Data Read        | 17386      |
| CPU2 Instruction Read | 3955       |
| CPU2 Write            | 3308       |

In the first experiment, full observability is assumed. After the SoC model finishes executing the program, a total of 343581 messages are collected in the trace file. Not all of the messages are relevant to the flow specifications, as many are used by GEM5 to initialize its simulation environment. After removing those irrelevant messages, the number of messages in the trace file is reduced to 121138.

The time taken to remove the irrelevant messages from the trace is negligible. The total runtime and the peak memory taken by the trace analysis algorithm on the reduced trace are 3 seconds and 12MB, respectively. Only one flow execution scenario is extracted, and Table 5.2 shows the number of flow instances contained in that scenario for the six flows describing cache coherence operations initiated from both CPUs.

In the second experiment, partial observability is taken into account: the four monitors attached to the links between the two CPUs and their caches are disabled. The trace is then generated by the remaining five monitors while the SoC model executes the same program. The new trace contains 15089 messages. As before, only one flow execution scenario is extracted, and the numbers of flow instances contained in that execution scenario are shown in Table 5.3.

Table 5.3. The number of flow instances derived by the trace analysis with certain monitors disabled.

| Flows                 | #Instances |
|-----------------------|------------|
| CPU1 Data Read        | 829        |
| CPU1 Instruction Read | 169        |
| CPU1 Write            | 82         |
| CPU2 Data Read        | 803        |
| CPU2 Instruction Read | 190        |
| CPU2 Write            | 83         |

These results show that the numbers of flow instances drop significantly compared to those extracted from the trace with full observability, shown in Table 5.2. The difference arises because some communications that occur while executing the program involve only the CPUs and their corresponding caches, and the traffic on the links between the CPUs and their caches is not observable. Therefore, the instances of the flow specifications characterizing these communications do not appear in the trace. In other words, all extracted flow instances in Table 5.3 characterize communications that pass through the memory bus in the system model. The runtime and memory usage, shown in the No Amb. column of Table 5.1, are similar to those for analyzing the trace with full observability.

In the third experiment, further partial observability is taken into consideration. In this experiment, only the five links involving the memory bus are still considered. However, an assumption is made that all events passing the same link are not distinguishable due to the limited observability. The monitors are modified such that whenever an event is captured on one of the links, the monitor dumps the set of events passing through that link into the trace file. Therefore, each line of the trace file corresponds to a set of events. After applying the trace analysis to this trace, a total of 13944 flow execution scenarios are extracted. This large number, compared to the results from the first two experiments, is due to the ambiguous interpretation of the events with limited observability.

The whole experiment takes about 15 minutes and 420 MB to finish, as shown in the Amb. 1 column of Table 5.1, significantly more than the analysis of traces where there is no ambiguity in the observed events. This is because a trace of ambiguous events is in fact a set of traces of the original events, which leads to large numbers of execution scenarios either during or at the end of the analysis. In this experiment, the peak number of execution scenarios during the analysis is 70384, many of which are invalid and eventually removed. Controlling the number of intermediate execution scenarios during the trace analysis is therefore critical for the analysis to be tractable. Here, insights from validators could help, but they are not used in this experiment.

As shown above, the ambiguous interpretation of events can lead to large numbers of intermediate and final execution scenarios, which not only makes the trace analysis more time consuming but also makes it difficult to gain insight from the derived execution scenarios. Careful selection of what to observe can have a large impact on the results of the trace analysis. In the last experiment, we relax the assumption made in the previous experiment: the events passing through each link are partitioned into two groups, one for read operations and one for write operations. As in the previous experiment, events in the same group are assumed to be indistinguishable. The monitors are modified accordingly so that they output all events of a group into the trace file whenever an event from that group is captured. After the trace analysis of this new partially observed trace finishes, only one execution scenario is derived, and its distribution of flow instances is the same as that shown in Table 5.3. The peak number of execution scenarios encountered during the trace analysis is 4. The total runtime and memory usage are negligible, as shown in the last column of Table 5.1. Compared to the previous experiment, the precision and the performance of the trace analysis improve dramatically as a result of the careful selection of observable events.
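The difference between the two ambiguity assumptions can be made concrete with a small model of the modified monitors. This is a sketch only: the link name, the event names in `EVENTS`, and all function names are hypothetical, and a real monitor would operate on GEM5 messages rather than tuples.

```python
from collections import defaultdict
from typing import Callable, Dict, FrozenSet, Hashable, List, Set, Tuple

Event = Tuple[str, str, str]  # (link, kind, name)

# Hypothetical events that can pass through one link of the model.
EVENTS: List[Event] = [
    ("bus-mem", "read", "ReadReq"),
    ("bus-mem", "read", "ReadRes"),
    ("bus-mem", "write", "WriteReq"),
    ("bus-mem", "write", "WriteRes"),
]


def per_link(event: Event) -> Hashable:
    """Third experiment: all events on the same link are indistinguishable."""
    return event[0]


def per_link_and_kind(event: Event) -> Hashable:
    """Last experiment: events are only confused within a read or write group."""
    return (event[0], event[1])


def build_groups(events: List[Event], key: Callable[[Event], Hashable]) -> Dict[Hashable, Set[str]]:
    """Partition event names into groups of mutually indistinguishable events."""
    groups: Dict[Hashable, Set[str]] = defaultdict(set)
    for event in events:
        groups[key(event)].add(event[2])
    return groups


def observe(event: Event, groups: Dict[Hashable, Set[str]], key: Callable[[Event], Hashable]) -> FrozenSet[str]:
    """What the monitor writes to the trace file when it captures the event."""
    return frozenset(groups[key(event)])


if __name__ == "__main__":
    captured = EVENTS[0]
    print(sorted(observe(captured, build_groups(EVENTS, per_link), per_link)))
    print(sorted(observe(captured, build_groups(EVENTS, per_link_and_kind), per_link_and_kind)))
```

Each trace line is a set of candidate events, so halving the size of every set shrinks the number of candidate interpretations multiplicatively along the trace.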



Figure 5.2. SoC platform structure.

# 5.2 Simulating a Simple RTL SoC in VHDL

Another SoC model is constructed at the register transfer level (RTL). This model is more detailed than the previous one because register values are updated every cycle; it is therefore much slower than the GEM5 model. Because of the lack of similar research, we could not find an existing model, so this model was built from scratch. The desired protocols are implemented inside all the components. Because this is a simplified model, the CPU cannot run software programs. Instead, a test generator is implemented inside the CPU to provide simple functionality.

As shown in Fig. 5.2, this model consists of two CPU models, each with its own 1KB cache. The caches are connected to a 4MG memory through a memory bus model. At each rising clock edge, the model records the selected signal values into a VCD (value change dump) file.

Table 5.4. The number of flow instances derived by the trace analysis with the full observability.

| Flows           | #Instances |
|-----------------|------------|
| CPU1 Read       | 5090       |
| CPU1 Write      | 4910       |
| CPU1 Write Back | 1270       |
| CPU2 Read       | 4932       |
| CPU2 Write      | 5068       |
| CPU2 Write Back | 1211       |

This model implements six protocols, based on the protocols provided by GEM5 with some modifications. Three types of protocols (read, write, and write back) are implemented for each CPU. The write back protocol is new and is invoked when a cache needs to flush dirty data back to memory.

In every clock cycle, the test generator inside each CPU randomly generates a read or write operation. To better activate the cache coherence protocol, only the first 3 bits of the 16-bit request address are randomly generated, while the rest of the bits are predefined. Limiting the addresses to a small range makes it more likely that one CPU requests data that resides in the other CPU's cache.
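The request generation described above can be sketched as follows. This is a Python illustration rather than the VHDL test generator itself, and it rests on two assumptions: the three random bits are taken to be the least significant bits of the address, and the predefined part is the hypothetical constant `FIXED_PART`.

```python
import random
from typing import Tuple

ADDRESS_BITS = 16
RANDOM_BITS = 3
FIXED_PART = 0x1000  # hypothetical predefined value of the remaining address bits


def next_request(rng: random.Random) -> Tuple[str, int]:
    """Generate one random read or write request with a constrained address."""
    operation = rng.choice(("rd", "wt"))
    random_mask = (1 << RANDOM_BITS) - 1
    address_mask = (1 << ADDRESS_BITS) - 1
    # Keep the predefined bits and randomize only the low RANDOM_BITS bits.
    address = (FIXED_PART & ~random_mask & address_mask) | rng.getrandbits(RANDOM_BITS)
    return operation, address


if __name__ == "__main__":
    rng = random.Random()
    for _ in range(5):
        op, addr = next_request(rng)
        print(op, format(addr, "04x"))
```

With only eight distinct addresses shared by both generators, the two CPUs frequently touch the same locations, which is what exercises the snoop paths of the protocol.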

In this experiment, full observability is assumed. After the SoC model finishes 2000 flows for each CPU, a total of 122704 messages are collected in the trace file. The system takes 10 seconds to run and the peak memory used is 18MB.

Table 5.4 shows the number of flow instances contained in the extracted scenario for the six flows describing cache coherence operations initiated from both CPUs.

While building the system, we used our analysis algorithm as a debugging aid. The following types of errors were detected by the algorithm and fixed.

- Error one: the interconnect bus sends the same request to Memory more than once.

  During the trace analysis, our algorithm showed that the analysis always stopped when the interconnect sent a request to Memory, and that the message could not be mapped to any existing scenario. The offending message was identical to the message preceding it.

  We traced the bug to the interconnect and found that the request was not reset correctly.
- Error two: the command changes after the interconnect bus receives a snoop request from Cache.

  Our trace analysis algorithm found, in multiple runs, that a message from cache to memory could not be mapped to any existing scenario, and all the stacked messages were snoop responses. By debugging the cache component, we discovered that the cache coherence protocol was implemented incorrectly and fixed the error.

#### CONCLUSION AND FUTURE WORKS

This thesis presents a method for post-silicon validation by interpreting observed raw signal traces at the level of system flow specifications. The derived flow execution scenarios provide more structured information on system operations, which is more understandable to system validators. This information can help to locate design defects more easily, and also provides a measurement of validation coverage.

Due to partial observability, this approach may derive a large number of different flow execution scenarios for a given signal trace. Insights from system validators can help to eliminate some false scenarios due to the partial observability. An interesting future direction is formalization of the validators' insights using temporal logic on flows so that the validators can express their intents more precisely and concisely.

The trace analysis approach presented in this thesis needs to be iterated with different observations selected in different iterations in order to eliminate the false scenarios and to root cause system failures as quickly as possible. The observation selection and stitching signal traces of different observations together for the above goal will also be pursued in the future.

# APPENDICES

#### Appendix A Flow specifications and protocols provided by GEM5

## A.1 Flow Specifications



Figure A.1. Flow sequence chart of write operation when requested data is not included in Dcache. Read-ExRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2.



Figure A.2. Flow sequence chart of write operation when XCache has the exclusive right of requested data. XCache can be instruction cache or data cache. This sequence chart is symmetric for CPU2.



Figure A.3. Flow sequence chart of write operation when requested data is shared by another component. UpgradeRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2.



Figure A.4. Flow sequence chart of read operation when XCache has the exclusive right of requested data. XCache can be instruction cache or data cache. This sequence chart is symmetric for CPU2.



Figure A.5. Flow sequence chart of read operation when requested data is shared by another component. LoadLockedRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2.



Figure A.6. Flow sequence chart of read operation when requested data is not present. StoreCondRes can also be sent from Memory if Dcache2 doesn't have requested data. This sequence chart is symmetric for CPU2.

#### A.2 Protocols



| Message | Definition                 | Message | Definition                 |
|---------|----------------------------|---------|----------------------------|
| msg0    | (CPU1, writeReq, icache1)  | msg11   | (icache2, readExres, Bus)  |
| msg1    | (dcache1, readExreq, Bus)  | msg12   | (Bus, readExres, dcache1)  |
| msg2    | (Bus, readExreq, dcache2)  | msg13   | (icache1, writeRes, CPU1)  |
| msg3    | (dcache2, readExreq, cpu2) | msg14   | (icache1, writeRes, CPU1)  |
| msg4    | (Bus, readExreq, icache2)  | msg15   | (dcache1, UpgradeReq, Bus) |
| msg5    | (icache2, readExreq, cpu2) | msg16   | (Bus, UpgradeReq, icache2) |
| msg6    | (Bus, readExreq, icache1)  | msg17   | (Bus, UpgradeReq, Memory)  |
| msg7    | (dcache1, readExreq, cpu1) | msg18   | (icache2, UpgradeRes, Bus) |
| msg8    | (Bus, readExreq, Memory)   | msg19   | (Bus, UpgradeRes, dcache1) |
| msg9    | (true)                     | msg20   | (icache1, writeRes, CPU1)  |
| msg10   | (Memory, readExres, Bus)   |         |                            |

Figure A.7. Flow specification of a cache coherent write operation initiated from CPU1 to instruction cache. This flow is symmetric for CPU2.



```
msg0: (CPU1, ReadReq, icache1)          msg8:  (Bus, StoreCondreq, Memory)
msg1: (dcache1, StoreCondreq, Bus)      msg9:  (true)
msg2: (Bus, StoreCondreq, icache2)      msg10: (Memory, ReadRes, Bus)
msg3: (icache2, StoreCondreq, cpu2)     msg11: (icache2, ReadRes, Bus)
msg4: (Bus, StoreCondreq, dcache2)      msg12: (Bus, ReadRes, dcache1)
msg5: (dcache2, StoreCondreq, cpu2)     msg13: (icache1, ReadRes, CPU1)
msg6: (Bus, StoreCondreq, dcache1)      msg14: (icache1, ReadRes, CPU1)
msg7: (icache1, StoreCondreq, cpu1)
```

Figure A.8. Flow specification of a cache coherent read operation initiated from CPU1 to instruction cache. This flow is symmetric for CPU2.



```
msg0: (CPU1, ReadReq, dcache1)          msg8:  (icache1, LoadLockedreq, cpu1)
msg1: (dcache1, ReadRes, CPU1)          msg9:  (Bus, LoadLockedreq, Memory)
msg2: (icache1, LoadLockedreq, Bus)     msg10: (true)
msg3: (Bus, LoadLockedreq, dcache2)     msg11: (Memory, ReadRes, Bus)
msg4: (dcache2, LoadLockedreq, cpu2)    msg12: (icache2, ReadRes, Bus)
msg5: (Bus, LoadLockedreq, icache2)     msg13: (Bus, ReadRes, icache1)
msg6: (icache2, LoadLockedreq, cpu2)    msg14: (dcache1, ReadRes, CPU1)
msg7: (Bus, LoadLockedreq, dcache1)
```

Figure A.9. Flow specification of a cache coherent read operation initiated from CPU1 to data cache. This flow is symmetric for CPU2.

# Appendix B RTL model in VHDL

# **B.1** Flow Specification



Figure B.1. CPU write when cache has exclusive right of the requested data.



Figure B.2. CPU write when data only exist in the other CPU's cache



Figure B.3. CPU write when requested data only reside in Memory



Figure B.4. Cache send write back request to Memory



Figure B.5. CPU read when cache has exclusive right of the requested data.



Figure B.6. CPU read when data only exist in the other CPU's cache



Figure B.7. CPU read when requested data only reside in Memory

# **B.2** Protocol

There are three protocols in total: read, write, and write back.



```
msg1: (Cache1, wt, Bus)
msg2: (Bus, wt, Memory)
msg3: (Memory, wt, Bus)
```

Figure B.8. Flow specification of a write back operation initiated from Cache1. This flow is symmetric for CPU2.

All write operations are implemented in the protocol presented in Fig. B.9. When a request activates the cache coherence protocol, as in Fig. B.2, the flow ends in *state*17. The rest end in *state*9.

All read operations are implemented in the protocol presented in Fig. B.10. The specification in Fig. B.7 ends in *state*17. The remaining specifications, which do not activate the cache coherence protocol, end in *state*9.



```
msg1: (CPU1, wt, Cache1)        msg8:  (Bus, wt, Cache1)
msg2: (Cache1, wt, CPU1)        msg9:  (Cache1, wt, CPU1)
msg3: (Bus, snp, Cache2)        msg10: (Cache1, wt, CPU1)
msg4: (Cache2, snp, Bus)        msg11: (Cache1, wt, CPU1)
msg5: (Bus, wt, Memory)
msg6: (Memory, wt, Bus)
```

Figure B.9. Flow specification of a cache coherent write operation initiated from CPU1 to Cache. This flow is symmetric for CPU2.



```
msg1: (CPU1, rd, Cache1)        msg8:  (Bus, rd, Cache1)
msg2: (Cache1, rd, Bus)         msg9:  (Cache1, rd, CPU1)
msg3: (Bus, snp, Cache2)        msg10: (Cache1, rd, CPU1)
msg4: (Cache2, snp, Bus)        msg11: (Cache1, rd, CPU1)
msg5: (Bus, rd, Memory)
msg6: (Memory, rd, Bus)
```

Figure B.10. Flow specification of a cache coherent read operation initiated from CPU1 to Cache. This flow is symmetric for CPU2.

#### LIST OF REFERENCES

- [1] Sai Ma, Debjit Pal, Rui Jiang, Sandip Ray, and Shobha Vasudevan. Can't see the forest for the trees: State restoration's limitations in post-silicon trace signal selection. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, ICCAD '15, pages 1–8, Piscataway, NJ, USA, 2015. IEEE Press.
- [2] P. Patra. On the cusp of a validation wall. *IEEE Design & Test of Computers*, 24(2):193–196, March 2007.
- [3] Yael Abarbanel, Eli Singerman, and Moshe Y. Vardi. Validation of soc firmware-hardware flows: Challenges and solution directions. In *Proceedings of the 51st Annual Design Automation Conference*, DAC '14, pages 2:1–2:4, New York, NY, USA, 2014. ACM.
- [4] Dongyang Lin, Tianqi Hong, Yanjing Li, S Eswaran, Sudhakar Kumar, Farzan Fallah, Nagib Hakim, Donald S Gardner, and Subhasish Mitra. Effective post-silicon validation of system-on-chips using quick error detection. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 33(10):1573–1590, 2014.
- [5] S. Krstic, Jin Yang, D.W. Palmer, R.B. Osborne, and E. Talmor. Security of soc firmware load protocols. In *Hardware-Oriented Security and Trust (HOST)*, 2014 IEEE International Symposium on, pages 70–75, May 2014.
- [6] Priyadarsan Patra. On the cusp of a validation wall. IEEE Des. Test, 24(2):193–196, March 2007.

- [7] Nathan Binkert, Bradford Beckmann, Gabriel Black, Steven K. Reinhardt, Ali Saidi, Arkaprava Basu, Joel Hestness, Derek R. Hower, Tushar Krishna, Somayeh Sardashti, Rathijit Sen, Korey Sewell, Muhammad Shoaib, Nilay Vaish, Mark D. Hill, and David A. Wood. The gem5 simulator. SIGARCH Comput. Archit. News, 39(2):1–7, August 2011.
- [8] Kees Goossens, Bart Vermeulen, Remco van Steeden, and Martijn Bennebroek. Transaction-based communication-centric debug. In *Proceedings of the First International Symposium on Networks-on-Chip*, NOCS '07, pages 95–106, Washington, DC, USA, 2007. IEEE Computer Society.
- [9] Bart Vermeulen and Kees Goossens. A network-on-chip monitoring infrastructure for communication-centric debug of embedded multi-processor socs. In VLSI Design, Automation and Test, 2009. International Symposium on, VLSI-DAT '09, pages 183–186, 2009.
- [10] Kees Goossens, Bart Vermeulen, and Ashkan Beyranvand Nejad. A high-level debug environment for communication-centric debug. In *Proceedings of the Conference on Design, Automation and Test in Europe*, DATE '09, pages 202–207, Leuven, Belgium, 2009. European Design and Automation Association.
- [11] Amir Masoud Gharehbaghi and Masahiro Fujita. Transaction-based post-silicon debug of many-core system-on-chips. In *ISQED*, pages 702–708, 2012.
- [12] Mehdi Dehbashi and Görschwin Fey. Transaction-based online debug for noc-based multiprocessor socs. In Proceedings of the 2014 22nd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing, PDP '14, pages 400–404, Washington, DC, USA, 2014. IEEE Computer Society.
- [13] Amir Masoud Gharehbaghi and Masahiro Fujita. Transaction-based debugging of system-on-chips with patterns. In *Proceedings of the 2009 IEEE International Conference on Computer Design*, ICCD'09, pages 186–192, Piscataway, NJ, USA, 2009. IEEE Press.
- [14] Marc Boule, Jean-Samuel Chenard, and Zeljko Zilic. Assertion checkers in verification, silicon debug and in-field diagnosis. In *Proceedings of the 8th International Symposium on Quality Electronic Design*, ISQED '07, pages 613–620, Washington, DC, USA, 2007. IEEE Computer Society.
- [15] Eli Singerman, Yael Abarbanel, and Sean Baartmans. Transaction based pre-to-post silicon validation. In *Proceedings of the 48th Design Automation Conference*, DAC '11, pages 564–568, New York, NY, USA, 2011. ACM.
- [16] Yael Abarbanel, Eli Singerman, and Moshe Y. Vardi. Validation of soc firmware-hardware flows: Challenges and solution directions. In *Proceedings of the 51st Annual Design Automation Conference*, DAC '14, pages 2:1–2:4, New York, NY, USA, 2014. ACM.
- [17] Ho Fai Ko and N. Nicolici. Algorithms for state restoration and trace-signal selection for data acquisition in silicon debug. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 28(2):285–297, Feb. 2009.
- [18] K. Basu and P. Mishra. Efficient trace signal selection for post silicon validation and debug. In VLSI Design (VLSI Design), 2011 24th International Conference on, pages 352–357, Jan. 2011.
- [19] The gem5 simulator: A modular platform for computer-system architecture research. http://www.gem5.org/docs/html/gem5MemorySystem.html.