# COPYCAT: Controlled Instruction-Level Attacks on Enclaves for Maximal Key Extraction

Daniel Moghimi<sup>1</sup>, Jo Van Bulck<sup>2</sup>, Nadia Heninger<sup>3</sup>, Frank Piessens<sup>2</sup>, and Berk Sunar<sup>1</sup>

Worcester Polytechnic Institute, Worcester, MA, USA
 <sup>2</sup>imec-DistriNet, KU Leuven, Leuven, Belgium
 <sup>3</sup>University of California, San Diego, CA, USA

#### **Abstract**

The adversarial model presented by trusted execution environments (TEEs) has prompted researchers to investigate unusual attack vectors. One particularly powerful class of *controlled-channel* attacks abuses page-table modifications to reliably track enclave memory accesses at a page-level granularity. Crucially, in contrast to noisy microarchitectural timing leakages, this line of deterministic controlled-channel attacks abuses indispensable architectural interfaces and hence cannot be mitigated by tweaking microarchitectural resources.

We propose an innovative controlled-channel attack, named COPYCAT, that deterministically counts the number of instructions executed within a single enclave code page. We show that combining the instruction counts harvested by COPYCAT with traditional, coarse-grained page-level leakage allows to accurately reconstruct enclave control flow at a maximal instruction-level granularity. COPYCAT can identify intra-page and intra-cache line branch decisions which may ultimately differ only in a single instruction, highlighting that even extremely subtle control flow deviations can deterministically be leaked from secure enclaves. We demonstrate the improved resolution and practicality of COPYCAT on Intel SGX platforms in an extensive study of single-trace and deterministic attacks against side-channel hardened cryptographic libraries. As a result, we disclose multiple vulnerabilities in the the latest version of widely-used cryptographic libraries: WolfSSL, Libgerypt and OpenSSL, and propose novel algorithmic attacks to perform single-trace key extraction. Our findings mark the importance of stricter verification of cryptographic implementations, especially in the context of TEEs.

#### 1 Introduction

In the past years, we have seen a continuous stream of software-based side-channel attacks [4, 8, 27, 32, 34, 52, 75, 82, 83]. A first category of microarchitectural timing attacks commonly abuses optimizations in modern processors, where secret-dependent state is accumulated in various microarchitectural buffers during the victim's execution. If these buffers

are not flushed upon context switch to an attacker domain, victim secrets can be reconstructed by observing timing variations by the attacker. The success of these attacks critically relies on subtle timing differences, making them inherently non-deterministic and prone to measurement noise [32]. Moreover, this class of stateful attacks can be eliminated by isolating leaky microarchitectural resources [1, 26, 46, 50, 50, 71].

Orthogonal to the first class of microarchitectural timing attacks, recent research on controlled-channel attacks [35, 76, 77, 80] has abused the processor's privileged interface to extract fully deterministic, noise-free side-channel access patterns from enclave applications. While the operating system (OS) was traditionally not considered to be under attacker control, this assumption fundamentally changed with the rise of trusted execution environments (TEEs), such as Intel SGX. Prior research [76, 80] has identified page-table accesses and faults as privileged interfaces that can be exploited as no-noise controlled-channels to deterministically reveal enclave memory accesses at a 4 KiB page-level granularity. The paging channel has drawn considerable research attention since it abuses an intrinsic property of the x86 processor architecture without relying on microarchitectural state that can be straightforwardly flushed or partitioned. In particular, it has proven to be challenging to mitigate in a principled way, in spite of numerous defense proposals [22, 23, 29, 62, 67, 68, 70].

In this work, we show that the resolution of deterministic controlled-channel attacks extends well beyond the relatively coarse-grained 4 KiB page-level granularity. We contribute COPYCAT, an innovative *interrupt-counting* channel that can precisely reconstruct the intra-page control flow of a secure enclave at a maximal, instruction-level granularity. Our attack leverages the SGX-Step [74] framework to forcibly step into a victim enclave code exactly one instruction at a time. While high-frequency timer interrupts have previously been leveraged to boost microarchitectural timing attacks [36, 40, 48, 53, 75], we exploit the architectural interrupt interface itself as a deterministic controlled-channel. Our attacks rely on the key observation that interrupts can force the enclave to advance exactly one instruction at a time. Hence,

merely counting the number of steps immediately reveals the number of instructions executed in the victim enclave. We show that when combining our fine-grained interrupt-based counting technique with traditional, coarse-grained page-table access patterns [76, 77] as a secondary oracle, we can construct highly effective and deterministic attacks that track enclave control flow at a maximal, instruction-level granularity. Crucially, by overcoming the spatial resolution limitation of prior controlled-channel attacks with the improved temporal dimension of COPYCAT, we defeat a key assumption in some prior defenses [44, 68] that presume that adversaries can only deterministically monitor enclave memory accesses at a coarse-grained 4 KiB granularity. Further, in contrast to previous high-resolution SGX side-channels [4, 48, 52, 53, 75] that rely on timing differences from contention in some shared microarchitectural state, COPYCAT cannot be transparently mitigated by isolating microarchitectural resources.

Consequently, COPYCAT enables a class of side-channel cryptanalysis targeting secure enclaves that has previously been neglected. In an extensive study of cryptographic implementations within widely-used libraries, we discover that all of them have flaws that are deemed exploitable one way or another, when the leakage model matches the capability of COPYCAT. On top of this, we propose new algorithmic attacks that apply to multiple schemes, and an RSA key generation attack based on a branch-and-prune algorithm [37]. Our technique combined with COPYCAT results in multiple successful breaks of these implementations. We also show that when an implementation leaks partial information [31], COPYCAT maximizes the capability of the attacker as if the leakage in the implementation were introduced as a constant bias [54]. Our practical attacks on Intel SGX can trivially be reproduced without any additional assumptions or sampling noise. We hope that our findings will help developers and the community to improve protection against this class of vulnerabilities.

Contributions. In summary, our main contributions are:

- We propose COPYCAT: a novel deterministic controlledchannel attack to leak runtime control flow from Intel SGX enclaves without noise at an instruction-level granularity.
- We explore the impact of COPYCAT on non-crypto applications by defeating a state-of-the-art compiler hardening technique against branch shadowing attacks.
- We demonstrate several attacks on the latest Intel SGX hardware and empirically verify the practicality and capability of these attacks on an extensive case study on WolfSSL cryptographic library.
- We perform extensive side-channel cryptanalysis of widelyused cryptographic libraries including WolfSSL, Libgcrypt, and OpenSSL and report exploitable vulnerabilities.
- We propose new algorithmic attacks to exploit these vulnerabilities in DSA, ECDSA, and ElGamal, as well as RSA

- key generation, which result in a complete break of all these implementations in the context of Intel SGX.
- Finally, we outline requirements and pitfalls for countermeasures and mitigations in hardware and software.

Responsible Disclosure. We reported the weaknesses in WolfSSL in Nov. 2019 and provided guidelines for mitigations, tracked via CVEs 2019-1996{0,1,3} and CVE-2020-7960.<sup>1</sup>. We shared our attack with the Intel product security incident response team (iPSIRT), who acknowledged that COPYCAT leaks side-channel information, but re-iterated that protecting against side-channels requires the enclave developer to follow the constant-time coding best practices as advised by Intel [43].

# 2 Background

# 2.1 Public Key Cryptographic Schemes

**RSA.** Keys for the RSA public-key cryptographic scheme [59] are generated as the following:

- 1. Choose large prime numbers p and q, compute N = pq,
- 2. Compute the least common multiple  $\lambda(N) = lcm(p-1,q-1)$ ,
- 3. Choose *e* such that  $1 < e < \lambda(N)$  and  $gcd(e, \lambda(N)) = 1$ ,
- 4. Compute  $d = e^{-1} \mod \lambda(N)$ .

(N,e) are public and  $(p,q,\lambda(N),d)$  are private. Encryption and signature verification are computed as  $c=m^e \mod N$ , and decryption and signature generation calculate  $m=c^d \mod N$ . The Chinese remainder theorem (CRT) can further optimize RSA operations. In this case, key generation also includes computing  $d_P=d \mod (p-1)$ ,  $d_Q=d \mod (q-1)$ , and  $q_{inv}=q^{-1} \mod p$ , where  $(d_P,d_Q,q_{inv})$  are private.

**DSA and ElGamal.** In the digital signing algorithm (DSA)[30], the public parameters are a prime p, another prime n divisor of p-1 and the group generator g. The private key x is chosen randomly such that 1 < x < n-1, and the public key is  $y = g^x \pmod{p}$ . To sign a message m, the following steps are taken after computing h = hash(m) using a secure function:

- 1. Choose k randomly such that 1 < k < n 1,
- 2. Compute  $r = g^k \mod p \pmod{n}$ ,
- 3. Compute  $s = k^{-1}(h + r \cdot x) \pmod{n}$ .
- (r, s) is the ouput signature pair.

In the ElGamal signature scheme, an alternative to DSA, the first signature pair r is computed similarly, but the second pair is computed as  $s = k^{-1}(h - r \cdot x) \mod (p - 1)$ .

**ECDSA.** Elliptic-curve DSA (ECDSA) is similar to DSA. The public parameters are an elliptic curve E with scalar multiplication operation  $\times$ , a point G on the curve, and the integer order n of G over E. The secret key d is a random

 $<sup>^{\</sup>rm 1}$  As disclosure of OpenSSL and Libgcrypt is ongoing, some of our findings are censored in this preprint release.

integer satisfying 1 < d < n-1, and the public key is  $Q = d \times G$ . The signature generation for the hash h of the message m is as follows:

- 1. Choose k randomly such that 1 < k < n 1,
- 2. Compute  $(x, y) = k \times G$  and  $r = x \mod n$ ,
- 3. Compute  $s = k^{-1}(h + r \cdot d) \mod n$ .

(r,s) is the ouput signature pair.

#### 2.2 Intel SGX

Recent Intel processors include software guard extensions (SGX) [42] to allow trusted execution of critical code in so-called *enclaves* on top of a potentially compromised OS. SGX enclaves are isolated at runtime in a memory area that is transparently encrypted and can be remotely attested by the processor. Dedicated eenter and eexit instuctions switch the processor in and out of "enclave mode".

Importantly, while the confidentiality and integrity of enclaved execution is always safeguarded by the processor, traditionally privileged OS software remains in charge of availability concerns. SGX enclaves live in the virtual address space of a conventional, user-space process. To allow for demand-paging and oversubscription of the physically available encrypted memory, enclave page-table mappings are verified but remain under the explicit control of the untrusted OS. When delivering asynchronous interrupts or exceptions, the processor takes care to securely save and scrub CPU registers before exiting the enclave, which can be subsequently re-entered through the eresume instruction. Furthermore, in case of a page-fault event, the processor clears the lower bits representing the page offset in the reported address to ensure that the OS can only observe enclave memory accesses at a 4 KiB page-level granularity.

### 3 Related Work

#### 3.1 Side-Channel Attacks on Intel SGX

While Intel SGX provides strong architectural isolation, several studies have highlighted that enclave secrets may still leak through side-channel analysis. Table 1 summarizes how all previously demonstrated attacks fall into two categories<sup>2</sup>: (1) microarchitectural timing attacks, which may achieve a high granularity but are inherently prone to measurement noise, and (2) fully deterministic controlled-channel attacks that only offer a relatively coarse grained 4 KiB page-level granularity. COPYCAT proposes the only generally applicable controlled-channel attack that is both fully deterministic and offers a maximal, instruction-level granularity.

Microarchitectural Contention. Microarchitectural timing side-channel attacks exploit the fact that various re-

sources, such as caches [15, 33, 36, 53, 65], DRAM row buffers [77], branch predictors [27, 40, 48], dependency resolution logic [52], or execution ports [4] are competitively shared between sibling CPU threads or not flushed on enclave exit. This contention causes measurable timing differences in the attacker domain, allowing the attacker to infer the private control flow or data access pattern of the enclave with varying degrees of granularity. In the context of a TEE such as Intel SGX, such attacks can be mounted with less noise and improved resolution because the adversary controls the OS.

In particular, one line of work has developed interrupt-driven attacks [36, 48, 53, 74] that rely on frequent enclave preemptions to sample side-channel measurements at an improved temporal resolution. This technique has been demonstrated to amplify side-channel leakage from the cache [53], branch target buffer [48], and directional branch predictor [40]. Similar techniques have been applied to attack ARM Trust-Zone [60]. Nemesis [75] showed that while single stepping, the response time to service an interrupt may reveal which instruction is being executed in the pipeline. The SGX-Step framework [74] has been leveraged in several other microarchitectural attacks [3, 40, 64, 72, 75] to reliably single-step enclaves at a *maximal* temporal resolution by means of precise and short timer interrupt intervals.

Controlled-Channel Attacks. Xu et al. [80] first showed how privileged adversaries can revoke access rights on a specific enclave page and be deterministically notified by means of a page-fault signal when the enclave next accesses that page. They demonstrated several attacks on non-cryptographic applications by observing page-fault sequences to uniquely identify specific points in the victim's execution. Subsequent work [76, 77] developed further stealthy techniques to extract the same information by issuing inter-processor interrupts to force TLB shootdowns and querying "accessed" and "dirty" bits or monitoring the caching behavior of untrusted pagetable entries. Finally, Gyselinck et al. [35] demonstrated an alternative controlled-channel attack that abuses legacy IA32 segmentation faults. Their attack offers an improved, bytelevel granularity in the first MiB of the enclave address space, but only for the unusual case of a 32-bit enclave, and this behavior has since been fixed in recent microcode.

With COPYCAT, we contribute a novel attack technique to improve the resolution of existing controlled channels by precisely counting the number of executed instructions per code page. While prior work [77] has suggested using noisy wall-clock time as an additional dimension for page-access events to improve stealthiness and reduce the number of TLB shootdowns, we for the first time recognize instruction counting as a practical primitive to fully deterministically capture the execution trace on a single enclave code page. Recent work [47] on enclave control flow obfuscation has investigated a similar single-stepping technique in an SGX simulator to probabilistically identify software versions in an emulated enclave debug environment. However, in contrast to our work,

<sup>&</sup>lt;sup>2</sup> Transient-execution attacks [64, 72] that breach Intel SGX's confidentiality are out-of-scope for side-channel analysis, as they have been mitigated through microcode updates and recovery of the trusted computing base.

Table 1: Characterization of demonstrated Intel SGX microarchitectural side channels (top) and controlled channels (bottom). Our novel COPYCAT technique is highlighted at the bottom and combines noise-free interrupt counting measurements with deterministic page table accesses to reconstruct enclave-private control flow at a maximal, instruction-level granularity.

|                   | Attack                                                                                                                                                                                                             | Data                                                              | Granularity                                                                                                                                      | Noise                                                 |
|-------------------|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------------------|--------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------------------------------------|
| μ-arch contention | DRAM row buffer conflicts [77] PRIME+PROBE cache conflicts [15, 33, 36, 53, 65] Read-after-write false dependencies [52] Branch prediction history buffers [27, 40, 48] Interrupt latency [75] Port contention [4] | Code + data<br>Code + data<br>Data<br>Code<br>Code + data<br>Code | X Low (1-8 KiB) X Med (64-512 B cache line/set) High (4 B) High (branch instruction) High (instruction latency class) High (μ-op execution port) | X High  ~ Med X High  ~ Low X High X High             |
| Ctrl channel      | Page faults [80] and page table A/D bits [76, 77] IA-32 segmentation faults [35] Page table FLUSH+RELOAD [76] COPYCAT                                                                                              | Code + data<br>Code + data<br>Code + data<br>Code                 | <ul> <li>X Low (4 KiB )</li> <li>X Low/high (4 KiB; 1 B for enclaves ≤ 1 MiB)</li> <li>X Low (32 KiB)</li> <li>Y High (instruction)</li> </ul>   | ✓ Deterministic ✓ Deterministic ~ Low ✓ Deterministic |

they do not demonstrate actual instruction counting attacks and leave the implementation of a deterministic instruction counting primitive as purely hypothetical.

# 3.2 Cryptographic Side-Channel Leakage

Side-channel attacks have been proposed against cryptographic schemes [18, 56] and protocols [28, 45]. In standard public-key schemes, algorithms that execute highly variable operations for each bit of a secret input like the square-and-multiply algorithm for modular exponentiation and scalar multiplication based on Montgomery ladders are highly susceptible to side-channel leakage. Such algorithms have been exploited in naive attacks [51, 81, 82, 84] where the victim is triggered many times to compensate for potential sampling noise. These attacks generally conclude with the recovery of most of the secret bits. Nowadays, most implementations have adopted constant-time algorithms like the fixed-window scalar multiplication to mitigate such attacks [58].

Key Recovery using Partial Information. Key recovery from DSA and ECDSA with partial knowledge of the nonce can be modeled as an instance of the hidden number problem (HNP) [55]. Lattice-based algorithms can efficiently solve the HNP, as shown by Boneh and Venkatesan [13]. The HNP matches the case where partial information about the nonce k is leaked through some side channel, but the adversary can collect multiple sampling and signatures. Researchers have applied lattice-based attacks to non-constant time algorithms that leak some information about k [4, 9, 16, 57]. Ryan et al. [61] identified a pattern of vulnerability arising from leakage during modular reduction. Garcia et al. [31] demonstrate an attack that recovers the sequence of divisions and subtractions from the binary extended Euclidean algorithm (BEEA) for modular inversion operation. Their attack only uses the knowledge of the least significant bits of k for a lattice-based key recovery algorithm. In contrast, we use COPYCAT to demonstrate an attack that recovers the full key

with a single run of the DSA signing even for a compact implementation of BEEA (Section 5.2).

Even subtle implementation flaws that leak the bit length of k are sufficient for multi-trace lattice-based key recovery [17, 24, 25]. In these cases, while the algorithm was implemented with enough care to avoid secret-dependent conditional statements, they leak the bit length by skipping the most significant zero bits of k. In Section 5.4, we demonstrate a lattice-attack based on the leakage of the bit-length of k for an implementation that has protection against this known attack. In contrast to previous lattice-based attacks where the recovered samples may deviate with some noise from the leakage model, our leaked information exactly matches the leakage model of the implementation.

Single-trace Attacks. Recently, single-trace side-channel attacks on RSA key generation have been demonstrated that leak the sequence of divisions and subtractions from BEEA during the computation of coprime test  $gcd(e, p-1) \stackrel{?}{=}$ 1 [5, 78] or private parameter  $d = e^{-1} \mod \lambda(N)$  [20]. These attacks assume that recovering this sequence would be enough to recover the secrets (p-1) or lcm(p-1,q-1) since e is usually small<sup>3</sup> and can be brute forced. The proposed mitigation is to increase the order of the input e by masking it with a random variable that may be hardcoded [20]. In Section 5.3, using COPYCAT to recover all the branches from BEEA, not just the sequence of divisions and subtractions, we propose a novel attack that recovers the private factors pand q from  $e^{-1} \mod \lambda(N)$ . In comparison, our attack works even for large e, essentially thwarting mitigations based on increasing the input size.

We further show the impact of our algorithm on recovering the key from a modular inversion algorithm with multiple unknowns. We demonstrate a novel end-to-end single-trace attack on the CRT computation  $q^{-1} \mod p$ . In a concurrent and independent work, Aldaya et al. [3] outline a key recovery algorithm for  $q^{-1} \mod p$ . However, they use a different algo-

 $e^{3}$  is commonly chosen as  $2^{16} + 1 = 65537$ 

rithm that is not always successful. Our single-trace attacks on RSA in Section 5.3 use a branch-and-prune algorithm inspired by Heninger and Shacham [37]. Bernstein et al. applied a variant of this algorithm to recover RSA keys from the sequence of squares and multiplies in a sliding-window modular exponentiation implementation [10]. Similarly, Yarom et al. demonstrated an attack with intra-cache line granularity on a fixed window implementation of modular exponentiation that recover fraction of bits [83].

#### 4 COPYCAT Attack

This section introduces the adversary model and explains how a deterministic single-stepping interrupt primitive for SGX enclaves can be built, before illustrating the basic principle behind COPYCAT through toy examples. Finally, we highlight the impact of COPYCAT for non-cryptographic applications by defeating a state-of-the-art hardening technique against branch shadowing attacks.

#### 4.1 Attacker Model

We assume the standard Intel SGX root adversary model [42] with full control over the untrusted OS. In particular, following prior work, we assume a remote, software-only adversary who can configure the x86 APIC timer device to precisely interrupt the enclave [36, 48, 53, 74] and can modify page-table entries to learn enclaved memory accessed at a 4 KiB granularity [68, 76, 80]. Like previous attacks, we further assume knowledge of the victim application, either through source code or application binary. We consider the enclave code to be free from memory-safety vulnerabilities [73] and the Intel SGX platforms to be properly updated against transient-execution attacks [21, 64, 72].

The adversary's goal is to learn fine-grained control-flow decisions in the victim enclave. In contrast to noisy microarchitectural side-channel leakages [4, 15, 48, 52, 53, 75] we can also target victims who process a secret only *once* in a single run (as is the case in key generation) and hence victims who cannot be forced to repeatedly perform computations on the same secret over multiple invocations. Crucially, in contrast to prior controlled-channel attacks [76, 80], COPYCAT offers intra-page granularity and we assume that conditional control flow blocks in the victim enclave are aligned "to exist entirely within a single page" as officially recommended by Intel [44].

# 4.2 Building the Interrupt Primitive

Debug features like the x86 single-step trap flag are explicitly disabled by the Intel SGX design [42] while in enclave mode. Recent research, however, has demonstrated that root adversaries may abuse APIC timer interrupts to forcibly pause a victim enclave at fixed time intervals. We build our interrupt

primitive on top of the open-source SGX-Step [74] framework, which offers maximal temporal resolution by reliably interrupting the victim enclave at most one instruction at a time. SGX-Step comes in the form of a Linux kernel driver and runtime library to configure APIC timer interrupts and untrusted page-table entries directly from userspace.

**Deterministic Single-Stepping.** We first establish a suitable value for the platform-specific SGX\_STEP\_TIMER\_INTERVAL parameter using the SGX-Step benchmark tool on our target processor. This ensures that the victim enclave always executes at most one instruction at a time. Previous studies [40, 74, 75] have reported reliable single-stepping results with SGX-Step for enclaves with several hundred thousand instructions where in the vast majority of cases (> 97%) the timer interrupt arrives within the first enclave instruction after eresume, i.e., single-step, and in all other cases the interrupt arrives within eresume itself, i.e., zero-step before an enclave instruction is ever executed. Furthermore, zero-step events can be filtered out by observing that the "accessed" bit in the untrusted page-table entry mapping the enclave code page is only ever set by the processor when the interrupt arrived after eresume and the enclave instruction has indeed been retired [75]. Hence, to achieve noiseless and deterministic single-stepping for revealing code and data access at instruction-level granularity, we rely on the observation that a properly configured timer never causes a multi-step, and we discard any zero-step events by querying the "accessed" bit in the untrusted page-table entry mapping the current enclave code page. As an optimization, we first use a coarse-grained page-fault state machine to easily advance the enclaved execution to the targeted code page, before switching to singlestepping with COPYCAT to reveal instruction-level control flow within the code page of interest. Our experimental evaluation in Section 5 confirms that our single-stepping interrupt primitive indeed behaves fully deterministically when using COPYCAT to count several millions of enclave instructions.

The Curious Case of Macro Fusion. Interestingly, we found that COPYCAT can also unveil a hidden microarchitectural optimization in recent Intel Core processors, referred to as *macro fusion* [41]. The idea behind this optimization technique is to combine certain adjacent instruction pairs in the front-end into a single micro-op which executes with a single dispatch and hence frees up space in the processor pipeline. Intel documents that fusion only takes place for some, well-defined compare-and-branch instruction pairs [41]. To the best of our knowledge, COPYCAT for the first time contributes a methodology to independently research and reverse-engineer macro fusion optimizations in Intel processors. We consider a precise understanding of the macro fusion of particular importance for automated compile-time hardening techniques that statically balance conditional code paths (cf. Section 4.4).

We experimentally found that for fusible instruction pairs, COPYCAT consistently counts only *one* interrupt, even though the enclave-private program counter has been advanced with



Figure 1: Balanced if/else statement (top), compiled to assembly (left). Precise page-aligned, intra-cache line conditional control flow can be deterministically reconstructed with instruction-granular COPYCAT page access traces (right).

two assembly instructions forming the fused pair. Our experimental observations confirm Intel's documented limitations [41], e.g., test; jo can be fused (interrupted once) but not cmp; jo (interrupted twice); and fusible pairs that are split across an exact cache line boundary are not fused (interrupted twice). Importantly, we found that macro fusion does not impact the reliability of COPYCAT as a deterministic attack primitive. That is, we consistently observed in all of our attacks that macro fusion solely depends on the architectural program state, i.e., opcode types and their alignments, and hence a given code path always results in the same, deterministic number of interrupts.

# 4.3 Instruction-Level Page Access Traces

If/Else Statement. Conditional branches are pervasive in all applications [36, 38, 48, 80], but even side-channel-hardened cryptographic software may assume that carefully aligned if/else statements or tight loops cannot be reliably reconstructed (cf. Section 5). Figure 1 provides a minimal example of an if statement that has been hardened using a balancing else branch, e.g., as in the popular Montgomery Ladder algorithm. Note that the corresponding assembly code, as compiled by gcc, indeed only differs in a single x86 instruction which trivially fits entirely within the same page and cache line. This if branch is hence indistinguishable for a page-fault or cache side-channel adversary. While finer-grained, branch prediction side-channels may still be able to reconstruct the branch outcome, these attacks typically require several runs of the victim and can be trivially addressed by flushing the branch predictor on enclave exit. Furthermore, we show in Section 4.4 that the improved resolution of COPYCAT may even defeat state-of-the-art compiler defenses [38] that harden code against branch predictor leakage.

Figure 1 illustrates how COPYCAT can deterministically reconstruct the branch outcome by merely counting the number of instructions executed on the *P*0 code page containing the if branch before control flow is eventually transferred to the *P*1 code page containing the add function, as revealed by





Figure 2: Conditional data assignments in a page-aligned switch statement (left) deterministically leak through their relative positions in the precise, instruction-granular page access traces extracted by COPYCAT (right).

probing the "accessed" bit in the corresponding page-table entry. While traditional page-fault adversaries always see the same *page fault sequence* (*P*0,*P*1), independent of the secret, COPYCAT enriches this information with precise instruction counts, resulting in distinguishable *instruction-level page access traces* (*P*0,*P*0,*P*1) vs. (*P*0,*P*0,*P*0,*P*1). Phrased differently, COPYCAT provides intra-page granularity by complementing the 4 KiB spatial resolution of previous fault-driven attacks with a fully deterministic temporal dimension.

**Switch-Case Statement.** As a further example, Figure 2 illustrates precise control-flow recovery in a switch-case statement, where the code blocks again fall entirely within a single page and cache line and the same data is accessed in every case. While traditional page-fault adversaries always observe an identical, input-independent access sequence to the code and data pages, and the tight sequence of conditional jumps poses a considerable challenge for branch prediction adversaries [48], COPYCAT fully deterministically reveals control flow through the *relative* position of the data access in the instruction-granular page access traces.

# 4.4 Defeating Branch Shadowing Defenses

To highlight the importance of COPYCAT for non-cryptographic applications, we employ its improved resolution to defeat a state-of-the-art compiler defense [38] against branch predictor leakage. This again shows that COPYCAT changes the attack landscape and requires orthogonal mitigations when compared to microarchitectural side-channels.

Branch Shadowing Mitigation. Lee et al. [48] first proposed Zigzagger, an automated compile-time approach to defend against branch-shadowing attacks by rewriting conditional branches using cmov and a tight trampoline sequence of unconditional jump instructions. However, the security of their compiler transformation critically relies on the trampoline sequences being non-interruptible, and several proof-of-concept attacks on Zigzagger have been demonstrated using precise



Figure 3: Compiler mitigation [38] for branch prediction side channels. COPYCAT reveals control flow via the number of instructions executed on the trampoline page (red, dashed).

interrupt capabilities [35, 74, 75]. In response, Hosseinzadeh et al. [38] designed an improved compiler mitigation that employs runtime randomization to dynamically shuffle jump blocks on the trampoline area, thereby effectively hiding branch targets and making branch shadowing attacks probabilistically infeasible. Figure 3 illustrates how conditional branches are redirected through randomized jump locations ① on the trampoline page, while ensuring that all jumps ② outside of the trampoline are always executed in the same order. Finally, to protect against timing attacks, trampoline code is explicitly balanced with dummy instructions ③ to compensate for skipped blocks in the instrumented code.

Case-Study Attack. We evaluated COPYCAT on the opensource<sup>4</sup> release of the compiler hardening scheme [38] based on LLVM 6.0. First, we found that the dummy instruction balancing pass is not always entirely accurate and may result in execution paths that differ slightly by one or two instructions (cf. Appendix A.2). Crucially, while such subtle deviations would indeed very likely not be exploitable through timing, as originally envisioned by the mitigation, we experimentally validated that the unbalanced paths can be fully deterministically distinguished by COPYCAT adversaries. Second, even when the code paths are perfectly balanced, Figure 3 illustrates that merely counting the *number* of instructions executed on the trampoline page deterministically reveals whether the victim is executing balancing dummy code in a trampoline block or the actual if block on the instrumented code page. Note that the compiler carefully maintains a constant jump order when moving back and forth between the trampoline area and the instrumented code, ensuring that the execution remains oblivious to classical page-fault adversaries [68, 80] who will always observe the exact same sequence of pages regardless of the actual code blocks being executed.

# 5 Unleashing COPYCAT on WolfSSL

In an extensive case study on the WolfSSL cryptographic library, we show that COPYCAT enables attacks that were not previously possible without a deterministic and fine-grained leakage model. In Section 5.1, we outline our controlled-channel attack using COPYCAT to precisely recover the full execution trace of WolfSSL's implementation of the binary extended Euclidean algorithm (BEEA), which is used in vari-

ous security-critical operations in DSA, ECDSA, and RSA. The accuracy of recovering the full execution flow of BEEA enables new single-trace algorithmic attacks on both DSA signing and RSA key generation, as demonstrated in Sections 5.2 and 5.3, respectively. Finally, we apply COPYCAT to bypass insufficient side-channel mitigations and recover deterministic partial information on ECDSA signatures which is used in an optimal lattice-based attack.

**Experimental Setup.** Our experimental setup includes a desktop Intel Core i7-7700 CPU that supports Intel SGX and is updated with the latest microcode (0xca) running Ubuntu 16.04 with kernel 4.14.0-72-generic. We use the SGX-Step [74] framework v1.4.0 to implement our attacks on the latest stable WolfSSL version 4.2.0. WolfSSL officially supports compilation for Intel SGX enclaves. SageMath version 8.8 is used to execute key recovery attacks.

#### 5.1 COPYCAT on BEEA

Computing the modular inverse or greatest common divisor (GCD) using the binary extended euclidean algorithm (BEEA) has previously exposed cryptographic implementations to side-channel attacks [2, 31, 78]. BEEA, as shown in Algorithm 1, is not a constant-time algorithm and can leak various bits of its input. However, previous attacks are limited to recovering only partial and noisy information about the secret input. This limitation stems from low spatial resolution and the presence of noise. For instance, a cache- or page-level attacker who can distinguish which arithmetic subroutines have been invoked may not determine the outcome of the comparison at line 13 since both directions of the branch generate exactly the same sequence of memory access patterns. In addition, the arithmetic functions may fit within the same page and become indistinguishable for a page-level adversary. Alternatively, a cache attacker may try to track the outcome of these branches within the same page by tracking the corresponding instruction cache lines for the BEEA subroutine. However, a compact implementation of this algorithm can fit multiple branches within the same cache line. While some microarchitectural attacks on the instruction stream may leak some of these low-level branch outcomes, they are all prone to various degrees of noise [4, 27, 40, 48, 75].

WolfSSL supports two different implementations of BEEA in subroutines fp\_invmod\_slow and fp\_invmod (cf. Listings 4 and 5 in the Appendix). The former is a straightforward implementation of the algorithm, and the latter is a compact implementation that only supports odd integers for the modulus. We analyze both of these implementations and show how to use COPYCAT to recover the runtime control flow of these implementations without noise, that is, deterministically.

Binary Layout of Modular Inversion. After compilation, the subroutines: fp\_iseven and fp\_isodd are simply inlined within the same page as their caller fp\_invmod\_slow. On the other hand, the arithmetic functions:  $A=fp_add$ ,  $C=fp_cmp$ ,

<sup>&</sup>lt;sup>4</sup>https://github.com/SSGAalto/sgx-branch-shadowing-mitigation

Algorithm 1 Modular inversion using BEEA. In the optimized and compact implementation when modulus is odd, highlighted statements (blue) are removed.

```
1: procedure MODINV(u, modulus v)
 2:
              a_i \leftarrow 1, b_i \leftarrow 0, c_i \leftarrow 0, d_i \leftarrow 1, u_i \leftarrow u, v_i = v
 3:
              while isEven(u_i) do
 4:
                     u_i \leftarrow u_i/2
 5:
                     if isOdd(b_i) then
 6:
                             a_i \leftarrow a_i + v, b_i \leftarrow b_i - u
                     a_i \leftarrow a_i/2, b_i \leftarrow b_i/2
 7:
 8:
              while isEven(v_i) do
 9.
                     v_i \leftarrow v_i/2
                     if isOdd(d_i) then
10:
11:
                             c_i \leftarrow c_i + v, d_i \leftarrow d_i - u
12:
                     c_i \leftarrow c_i/2, d_i \leftarrow d_i/2
13:
              if u_i > v_i then
14:
15:
16:
```

D=fp\_div\_2 and S=fp\_sub are external calls and reside in a new page. Analyzing these arithmetic functions (A, C, D, S), including their internal subroutines, shows that they span over 2,895 bytes. Hence, it is reasonable to assume that they can fit into a single 4 KiB page, thus avoiding a page-level attacker from distinguishing them at runtime altogether. In addition, as already mentioned, even assuming that they will not align within the same page, reconstructing the exact execution flow is still impossible. For example, the transition from S to D can result from multiple different code paths. The instructions for fp\_invmod\_slow can fit into less than 6 cache lines with multiple basic blocks<sup>5</sup> overlapping within the same line.

WolfSSL also supports a modified version of BEEA that is optimized for cases where the modulus is odd, which is the case for  $q^{-1} \mod p$  (Section 5.3) and  $k^{-1} \mod n$  (Section 5.2). The control flow and overall layout for this implementation are similar to the previous implementation but it is more compact, as some of the arithmetic statements have been removed. fp\_invmod can fit into less than 4 cache lines with multiple overlapping basic blocks.

Recovering BEEA Control-flow Transfers. We analyzed the runtime control flow of fp invmod slow at runtime by matching its disassembly with the execution trace we recovered from running COPYCAT. Figure 4 shows the control flow transfers between the arithmetic functions (circles). The weight of each arrow shows the number of instructions that are executed for fp\_invmod\_slow between each transition. The division loop for  $u_i$  (*u-loop*) and  $v_i$  (*v-loop*) have a similar control flow. Also the two blocks of substitutions after the comparison of u > v have similar control flow for both the left S1 and right S2 direction. Only certain transitions are viable from these blocks to division loops during the computation of the modular inverse, i.e. S2 always goes to v-loop



Figure 4: Control flow of the BEEA as implemented by fp\_invmod\_slow. Each circle (D=div, C=cmp, S=sub, A=add) represents a call to a function in the page that holds these arithmetic functions. We count the exact number of instructions between two consecutive invocations that hit this page. The instruction counts reveals branch outcomes.

and S1 always goes to u-loop. Since these instruction counts are distinguishable for transitions that are related to conditional statements, we can use a trace of literals related to these weights in the graph to recover the control flow.

With a list of literals for the sequence of instruction counts collected between two consecutive accesses to the page that holds the arithmetic operations (A, C, D, S), we apply a set of divide-and-conquer rules to reconstruct the control flow for fp\_invmod\_slow. These rules start by translating the recovered instruction counts to corresponding generic blocks. For example, every time the algorithm executes an iteration of a division loop (u/v-loop), we observe either the sequence  $D \rightarrow D \rightarrow D$ , or the sequence  $D \rightarrow A \rightarrow S \rightarrow D \rightarrow D$ . Each of these sequences generates a consistent set of literals. Similarly, S1 or S2 always generate sequence like  $C \rightarrow S \rightarrow S \rightarrow S$ . After translating these generic blocks, we can use the remaining transitions to distinguish the exact blocks, i.e., we can recover whether a S1 or S2 followed by a set of division loops is equal to transition from S1 to u-loop or transition from S2 to *v-loop*. These rules are summarized as follow:

- Rule 1:  $? \xrightarrow{11} ? \xrightarrow{3} ? = D \rightarrow D \rightarrow D$ . Rule 2:  $? \xrightarrow{13} ? \xrightarrow{4} ? \xrightarrow{3} ? \xrightarrow{3} ? = D \rightarrow A \rightarrow S \rightarrow D \rightarrow D$ .
- Rule 3:  $? \xrightarrow{5} ? \xrightarrow{4} ? \xrightarrow{4} ? = C \rightarrow S \rightarrow S \rightarrow S$ .
- **Rule 4:**  $S? \xrightarrow{13} ? = S2 \rightarrow v\text{-loop}.$
- Rule 5:  $S? \xrightarrow{8} ? = S1 \rightarrow u\text{-loop}.$

Therefore, we first replace the literals for Rule 1, 2, and 3, which identify if we are in a division loop (u-loop or v-loop) or a comparison and substitution block (S?). Then based on the other transitions (Rule 4 and 5), we can figure out which state of the comparison and substitution block we have moved from, and which division loop we have moved to within the trace.

For the compact implementation in *fp\_invmod*, we apply the same approach. Figure 5 shows the control flow for this implementation after runtime analysis using COPYCAT. Similarly,

<sup>&</sup>lt;sup>5</sup>Basic block is a code sequence that has no branches in and out.



Figure 5: Control flow of BEEA in fp invmod.

we define a set of rules to translate the trace of instruction counts to control flow transfers of BEEA. Based on Figure 5, we modify the first three rules as the following to support full control-flow recovery based on the same approach<sup>6</sup>:

• Rule 1:  $? \xrightarrow{7} ? = D \rightarrow D$ . • Rule 2:  $? \xrightarrow{8} ? \xrightarrow{3} ? \xrightarrow{3} ? = D \rightarrow S \rightarrow D$ .

• Rule 3:  $? \xrightarrow{5} ? \xrightarrow{4} ? = C \rightarrow S \rightarrow S$ .

#### 5.2 **Single-Trace Attack on DSA Signing**

In contrast to previous attacks on BEEA that leak partial information about the nonce [31], we show the power of COPYCAT in recovering virtually the entire control flow from the execution of this implementation with 100 percent precision. As a result, we can perform a single-trace attack on the DSA signing operation.

DSA Key Recovery. WolfSSL uses fp\_invmod to compute the modular inversion of  $k_{inv} = k^{-1} \mod n$ , where n is an odd prime. Since we can recover the exact control flow of this computation and the modulus n is public, we simply step through the execution trace of Algorithm Algorithm 1, applying each step of the computation according to the recovered trace to compute  $k_{inv}$ . After recovering  $k_{inv}$ , recovering the full nonce and private key is trivial:  $k = k_{inv}^{-1} \mod n$ ,  $x = r^{-1}(sk - h) \bmod n.$ 

**Evaluation.** We executed the described attack on the 160-bit DSA for 100 different signing operations. We use a combination of pages in a page-level controlled-channel attack to first reach to the beginning of the modular inversion operation for DSA. Then we start COPYCAT over the page that holds instructions of fp\_invmod. On average, this attack issues 22,000 IRQs and takes 75 ms to iterate over an average of 6,320 steps. Our attack successfully recovered the key using a single trace for 100 traces, implying that COPYCAT reliably reconstructs the entire execution flow in every single instance of the execution. As a result, a single-trace attack on DSA can be executed reliably without the need for multiple signatures or additional cryptanalysis.

# **Algorithm 2** Recovering p and q from trace of $q^{-1} \mod p$ .

```
procedure RECOVER_PQ(trace t, modulus N)
         h \leftarrow (-test\_t(t, 1, 1), 1, 1, 1)
3:
         while h do
               steps, b, p, q \leftarrow hpop(h)
4:
5:
              if p.q = N then return p,q
6:
               g \leftarrow (p,q), (p+2^b,q), (p,q+2^b), (p+2^b,q+2^b)
              for p_s, q_s in g do
7:
                    if mod(p_s.q_s, 2^{b+1}) = mod(N, 2^{b+1}) then
8:
                          hpush(h, (-test\_t(trace, p_s, q_s), b+1, p_s, q_s))
9:
```

# Single-Trace Attacks on RSA KeyGen

During RSA key generation, WolfSSL checks if a potential prime p is coprime with e by checking if gcd(e, p-1) = 1. This step uses the inefficient, but more secure textbook algorithm for greatest common divisor (GCD), which simply performs a series of divisions. However, in a later stage, Wolf-SSL computes  $d = e^{-1} \mod \lambda(N)$  and the CRT parameter  $q^{-1}$  mod p using the BEEA (cf. Appendix. Listing 6).

**Key Recovery from**  $q^{-1} \mod p$  **Trace.** Compared to  $k^{-1} \mod n$ , this attack is more challenging since in this case, both operands p and q are unknown. We give a novel and efficient attack that recovers the private RSA parameters p and q using COPYCAT. We use the relationship of the public variable N = pq and the execution trace of the BEEA on  $q^{-1} \mod p$  which provides enough information to recover the factorization of N. The main idea is that for a given guess for some bits of p and q, we can check whether the partial execution trace of the algorithm matches that guess, and then implement a branch-and-prune algorithm that efficiently recovers successive bits of the key [37].

We propose Algorithm 2 to recover p and q using only knowledge of N and the execution trace of the BEEA on  $q^{-1}$  mod p. The algorithm starts by initializing a list of hypotheses for values of the least significant bits of q and p. Each hypothesis keeps track of the current step, bit position b, and the hypothesized values of  $p_s = p \mod 2^b$  and  $q_s = q \mod 2^b$ . Among the four possible assignments for the  $(b+1)^{st}$  bits of p and q in Step 7, there will be two choices satisfying the constraint that  $pq \equiv N \mod 2^{b+1}$ . For these new guesses, we evaluate the BEEA algorithm up to the number of bits guessed so far, and check this deterministic algorithm evaluation on the guess against the ground truth execution trace t. We then do a depth-first search prioritized by the number of steps in which the algorithm executed correctly, and terminate when we have found a candidate for which pq = N holds.

Evaluation. We executed a similar attack to Section 5.2 to collect traces from the modular inversion of  $q^{-1}$  mod p, as it is computed by fp invmod slow. We tried this attack on 100 different 2048-bit RSA key generations. On average, we iterate over 39,400 steps by issuing 106490 IRQs in 365 ms. However, the average time to collect a trace can take up to a

<sup>&</sup>lt;sup>6</sup>Rule 4 and Rule 5 remain the same.

<sup>&</sup>lt;sup>7</sup>WolfSSL always generates the CRT parameters.

**Algorithm 3** Recovering p and q from trace of  $e^{-1} \mod \lambda$ .

```
1: procedure RECOVER_PQ(trace t, e, modulus N)
          h \leftarrow (-test\_t(t, 0, e), 1, 1, 1)
          while h do
3:
 4:
                steps, b, p, q \leftarrow hpop(h)
5:
                if p.q = N then return p, q
                g \gets (p,q), (p+2^b,q), (p,q+2^b), (p+2^b,q+2^b)
 6:
                for p_s, q_s in g do
 7:
                      if mod(p_s.q_s, 2^{b+1}) = mod(N, 2^{b+1}) then
 8:
                           \phi = (p_s - 1)(q_s - 1)
9:
10:
                           for i = 1, ..., 2^{\ell} do
                                 if p_s q_s > N or mod(\phi, 2^i) \neq 0 then
11:
                                       continue
12:
                                 \lambda = \phi/2^i
13:
                                 newsteps = test\_t\_lamda(t, \lambda, e)
14:
15:
                                 if newsteps >= b + 1 then:
                                       hpush(h, (-newsteps, b+1, p_s, q_s))
16:
          return fail
```

second depending on how fast the prime numbers are chosen. The attack takes 20 seconds to recover the key from a trace. All 100 trials of the attack could successfully recover the generated RSA keys.

Key Recovery from  $e^{-1} \mod \lambda(N)$  Trace. In contrast to previous attacks on this computation [20], we propose a different algorithmic attack that takes advantage of the fact that COPYCAT can recover the entire control flow of this algorithm. As a result, a single-trace attack can trivially be done for any chosen value of e, both large or small. In addition, we revisit the proposed masking countermeasure in [20] and show that it is insecure against our strong adversary using COPYCAT.

Our goal is to recover the RSA primes p and q using the trace of BEEA for  $d = e^{-1} \mod \lambda(N)$ . The modulus N and the public exponent *e* are known, while  $\lambda(N)$  is secret. We present a modified branch-and-prune technique in Algorithm 3 that recovers the factors p and q for a large fraction of generated RSA keys. The probabilistic nature of this attack is due the fact that the BEEA is not computed directly on the p and q, but instead on e and  $\lambda(N)$ .  $\lambda(N) = (p-1)(q-1)/\gcd(p-1)$ 1, q-1), so with high probability gcd(p-1, q-1) only has small factors and can be brute forced. We specialize to the case of  $gcd(p-1, q-1) = 2^i$  for small integer i below since it is the most straightforward, but the analysis can be extended to other candidate small primes with more effort in brute forcing. Recall that  $p_s$  and  $q_s$  are related to  $\lambda(N)$  through lcm((p-1)(q-1)). We modified the test on the partial guess  $p_s$  and  $q_s$  to consider this relationship. From a set of guesses for  $p_s$  and  $q_s$ , we first compute  $\phi_s = (p_s - 1)(q_s - 1)$  and then  $\lambda$  for a choice of i. We only use the flow leaked by the execution trace t to test a new hypothesis if  $\phi/\lambda = 2^{t}$ . The algorithm either returns p and q or it fails to recover p and qif  $\phi/\lambda(N) \neq 2^i$ .

**Analysis.** The algorithm will succeed whenever  $\phi/\lambda = 2^i$  for small *i*. For non-powers of 2 the test against the BEEA execution trace in Step 15 will likely fail, and cause this branch to be pruned. Since p = 2p' + 1 and q = 2q' + 1 for some  $p', q' \in \mathbb{Z}$ ,

we have  $\lambda(N) = lcm(p-1,q-1) = 2lcm(p',q')$ . From the prime number theorem [66], the probability that two random integers are coprime converges to  $\prod_{p \in primes} (1-1/p^2) = \frac{6}{\pi^2} \approx 61\%$  as the size of the integers increases. In other words, if we run Algorithm 3 for only i=1, it will succeed 61% of the time when p' and q' are actually co-prime. If we allow p' and q' to have even factors we obtain a probability of  $\prod_{p \in primes, p > 2} (1-1/p^2) = \frac{8}{\pi^2} \approx 81\%$ . This means that even for the modest number of iterations, e.g.  $\ell = 8$ , we have nearly 81% success probability. These estimates have also been confirmed by our experiments.

**Evaluation.** We tried this attack on 100 different key generation efforts (2048-bit key). On average, we iterate over 81,090 steps by issuing 230,050 interrupts per attack in 800ms. The average time to collect a trace is about a second and the attack takes about 30 seconds to successfully recover the key for 81% of the keys when  $lcm((p-1)(q-1)) = (p-1)(q-1)/2^i$ .

Revisiting Masking Protection. Earlier attacks have exploited the exhaustibility of e [78]. Our algorithm works for arbitrary (even full length) e. Thus increasing the size of e by choosing a bigger public exponent or masking is not sufficient to mitigate our attack. Aldaya et al. [6] proposed masking e by computing  $b = (er)^{-1} \mod \lambda(N)$  for a random r such that  $gcd(r, \lambda(N)) = 1$ . The private key then can be computed as  $d = rb \mod \lambda(N)$ . In this proposal, it is even suggested that r can be hard coded. We tested our attack for a hard coded (known) choice of r and the key recovery works trivially in this case. Alternatively, if r is not hard coded but we have a trace for the initial  $gcd(r, \lambda(N))$  computation using binary gcd, we can again decode it (with the knowledge of N) to recover r. With r recovered, the attack proceeds as before, i.e. from the execution trace of  $b = (er)^{-1} \mod \lambda(N)$  we recover p and q by running Algorithm 3 with er supplied as input instead of e. Since Algorithm 3 is agnostic with respect to the size of e, it will handle the full size er and recover p and q.

# 5.4 Breaking ECDSA Timing Protection

WolfSSL uses the subroutine wc ecc mulmod ex (Listing 1) to compute the scalar multiplication  $k \times G$  of ECDSA signing. This subroutine has built-in mitigations against side-channel attacks and fundamentally implements a always-add-and-double algorithm by arithmetizing the conditional check for the add. As a result both of the scalar operations add at Line 17/20 and double at Line Line 18/21 will be executed for all scalar bits. This stops an adversary from bit-by-bit leakage of the nonce k. The second countermeasure that is implemented in this implementation aims to protect against attacks exploiting the bit length of the nonce [17, 25]. This has been achieved by executing a series of dummy operations for all the leading zero bits. While these dummy operations mitigate side channels such as data cache- or page-level attacks and timing attacks, we use COPYCAT to distinguish branch outcome at Line 14 and leak the bit length of nonce k.

```
int wc_ecc_mulmod_ex(mp_int* k, ecc_point *G, ecc_point *R, mp_int* a, mp_int
           * modulus, int map, void* heap) { ...
   for (;;) {
   if (--bitcnt == 0) { /* grab next digit as required */
    if (digidx == -1) {
       break;
    buf = get_digit(k, digidx);
    bitcnt = (int)DIGIT_BIT;
        -digidx:
   i = (buf >> (DIGIT_BIT - 1)) & 1; /* grab the next msb from the multiplicand */
13
   mode = i; /* timing resistant - dummy operations */
    err = ecc_projective_add_point(M[1], M[2], M[2], a, modulus, mp);...
    err = ecc_projective_dbl_point(M[2], M[3], a, modulus, mp);...
17
   err = ecc_projective_add_point(M[0], M[1], M[i^1], a, modulus, mp);...
18
   err = ecc_projective_dbl_point(M[2], M[2], a, modulus, mp);...
```

Listing 1: wc\_ecc\_mulmod\_ex implements scalar multiplication using a bit-by-bit always-add-and-double algorithm. The function protects against both timing and cache attacks by executing dummy instructions. For brevity, error checking and code sections that are not relevant to our attack are removed.

Recovering **Dummy** Operations. We analyze we eee mulmod ex using COPYCAT. In this analysis, we count the number of instructions that are executed between consecutive accesses to the page that holds the ecc projective dbl point subroutine. The trace of instruction counts show that for one transition of basic blocks, we can observe 49 steps when the function is processing the dummy operations. As soon as the subroutine switches to the legitimate operations, this step count will change to 46. As a result, we can use this information to determine from a set of traces the number of dummy executions of the always-add-and-double sequence. Since this leakage only gives us the bit length of the scalar, we are only interested in observing the first few bits of the nonce. We configured our step count to only give us a trace that recovers the processing of the first 7 bits of the nonce k.

Lattice Attack using the Nonce Bit Length. For a generated list of signatures, using the bit length of the nonce  $k_i$  recovered using COPYCAT, we filter out the ones with all short nonces[13]. We follow the approach of Howgrave-Graham and Smart [39] and embedding strategy by Benger et al. [9] to formulate the ECDSA key recovery as solving the Closest Vector Problem (CVP) for a lattice. The detailed construction of this lattice is given in Appendix A.1. The desired short vector of this lattice can be found using standard lattice basis reduction algorithms like LLL [49] or BKZ [63], which leads to full recovery of the private key.

**Evaluation.** We executed this attack for 10,000 signing operations. In all cases, we verified that our attack can count the number of leading zero bits with 100 percent accuracy. On average, each attack issues 3244 IRQs to count 2542 steps of

Table 2: We give the minimum number of signature samples collected for each bias class to reach 100% recovery success for the lattice-based key recovery on wc\_ecc\_mulmod\_ex of ECDSA. L-TIME is the lattice reduction time. T-TIME is the trace collection time.

| LZBs | DIM | L-TIME | SAMPLES | IRQs  | T-TIME   |
|------|-----|--------|---------|-------|----------|
| 4    | 75  | 30 sec | 1,200   | 3.9M  | 13.3 sec |
| 5    | 58  | 5 sec  | 1,856   | 6.0M  | 20.4 sec |
| 6    | 46  | 3 sec  | 2,944   | 9.6M  | 33.7 sec |
| 7    | 42  | 2 sec  | 5,376   | 17.5M | 1 min    |

the scalar multiplication operation. Table 2 shows the results for key recovery using various nonce bit lengths. Our empirical results show that our lattice attack performs as if the bit length of the nonces was given to us without noise which is the case in our attack using COPYCAT.

# 6 Discussion and Mitigations

Architectural Changes. COPYCAT critically relies on the ability to single-step enclaved execution, which is within Intel SGX's threat model [42, 74]. While SGX enclaves remain explicitly interrupt-unaware by design, some research proposals [23, 67] retrofit hardware support for transactional memory to detect suspicious interrupt rates as a side-effect of an ongoing attack. However, such features are not commonly available on off-the-shelf SGX hardware, and they would not fundamentally address the attack surface as COPY-CAT adversaries are likely to develop more stealthy attack techniques [76, 77] that remain under the radar of heuristic defenses. Nevertheless, following a long line of microarchitectural attacks [40, 48, 53, 74, 75] abusing interrupts, our study provides strong evidence that interrupts may also amplify deterministic controlled-channel leakage and should be taken into account in the enclaved execution threat model. We advocate architectural changes in the Intel SGX design and further research to rule out interrupt-driven attack surface [19].

Static Code Balancing. We encourage future research in improved compile-time hardening techniques that may automatically rewrite conditional branches to always ensure a constant interrupt counting pattern, regardless of the executed code. The key requirement for such a defense would be to ensure that the adversary not only observes a secret-independent sequence of pages but also always counts the same number of instructions between page transitions. To achieve such a guarantee, the compiler would also have to be explicitly aware of macro fusion, as explained in Section 4.2, when balancing the observed instruction counts. We expect further challenges when dealing with secret-dependent loop bounds, as in our attacks of Section 5. To handle data accesses, control-flow balancing techniques could potentially be combined with existing solutions for data location randomization [14].

Constant-time Implementation. The best practice for cryptographic implementations is to avoid secret-depend branches and memory lookups. WolfSSL applied such a countermeasure to mitigate our attack on ECDSA (CVE-2019-19960). Bernstein and Yang proposed a constant-time GCD algorithm that can be used for applications like modular inversion[11]. However, constant-time implementation is not easy for generic non-cryptographic applications [69]. While there are tools and techniques to test [79], verify [7], and generate [12] constant-time code, the scalability and performance of these approaches is still far from settled.

Cryptographic Countermeasure. While it is preferable to avoid secret-dependent branches altogether, specific countermeasures can be applied to some cryptographic schemes. Masking the input of the modular inversion can mitigate our demonstrated attack if it is applied properly and as long as the random blinding variable itself is not leaked. WolfSSL applies this solution to mitigate our attack on DSA (CVE-2019-19963).

Alternatively, some operations can be redefined using an alternative formulation. In particular, for the demonstrated attack on  $q^{-1} \mod p$  RSA-CRT key generation (CVE-2020-7960), we suggested they use Fermat's Little Theorem:  $q^{p-2} \mod p$ . As a result, the implementation avoids modular inversion for this operation, and it relies instead on a constant-time modular exponentiation implementation.

# 7 Conclusion

Our works show that deterministic controlled-channel adversaries are not restricted to observing enclave memory accesses at the level of coarse-grained 4 KiB pages, but can also precisely reconstruct intra-page control flow at a maximal, instruction-level granularity. We demonstrated the practicality and improved resolution of COPYCAT by discovering highly dangerous single-trace key extraction attacks in several real-world, side-channel hardened cryptographic libraries. Importantly, in contrast to known microarchitectural leakage sources, the more fundamental threat of deterministic controlled-channel leakage cannot be dealt with by merely flushing or partitioning microarchitectural state and instead requires research into more principled solutions.

#### Acknowledgment

This work is partially funded by the Research Fund KU Leuven, a generous gift from Intel, and by the US National Science Foundation under grants no. 1513671, 1651344, 1814406, and 1913210. Jo Van Bulck is supported by a grant of the Research Foundation – Flanders (FWO).

#### References

- 2017 IEEE Real-Time and Embedded Technology and Applications Symposium, RTAS 2017, Pittsburg, PA, USA, April 18-21, 2017. IEEE, 2017.
- [2] Onur Aciiçmez, Çetin Kaya Koç, and Jean-Pierre Seifert. On the power of simple branch prediction analysis. In *Proceedings of the 2Nd ACM Symposium on Information, Computer and Communications Security*, ASIACCS '07, pages 312–320, New York, NY, USA, 2007. ACM.
- [3] Alejandro Cabrera Aldaya and Billy Bob Brumley. When one vulnerable primitive turns viral: Novel single-trace attacks on ecdsa and rsa. Cryptology ePrint Archive, Report 2020/055, 2020. https://eprint.iacr.org/2020/055.
- [4] Alejandro Cabrera Aldaya, Billy Bob Brumley, Sohaib ul Hassan, Cesar Pereida García, and Nicola Tuveri. Port contention for fun and profit. IACR Cryptology ePrint Archive, 2018:1060, 2018.
- [5] Alejandro Cabrera Aldaya, Cesar Pereida García, Luis Manuel Alvarez Tapia, and Billy Bob Brumley. Cache-timing attacks on rsa key generation. IACR Transactions on Cryptographic Hardware and Embedded Systems, pages 213–242, 2019.
- [6] Alejandro Cabrera Aldaya, Alejandro J Cabrera Sarmiento, and Santiago Sánchez-Solano. Spa vulnerabilities of the binary extended euclidean algorithm. *Journal of Cryptographic Engineering*, 7(4):273–285, 2017.
- [7] Jose Bacelar Almeida, Manuel Barbosa, Gilles Barthe, François Dupressoir, and Michael Emmi. Verifying constant-time implementations. In 25th USENIX Security Symposium (USENIX Security 16), pages 53–70, Austin, TX, 2016. USENIX Association.
- [8] Marc Andrysco, David Kohlbrenner, Keaton Mowery, Ranjit Jhala, Sorin Lerner, and Hovav Shacham. On subnormal floating point and abnormal timing. In Security and Privacy (SP), 2015 IEEE Symposium on, pages 623–639. IEEE, 2015.
- [9] Naomi Benger, Joop van de Pol, Nigel P. Smart, and Yuval Yarom. "ooh aah... just a little bit": A small amount of side channel can go a long way. In *Cryptographic Hardware and Embedded Systems – CHES* 2014, pages 75–92, Berlin, Heidelberg, 2014. Springer.
- [10] Daniel J Bernstein, Joachim Breitner, Daniel Genkin, Leon Groot Bruinderink, Nadia Heninger, Tanja Lange, Christine van Vredendaal, and Yuval Yarom. Sliding right into disaster: Left-to-right sliding windows leak. In *International Conference on Cryptographic Hardware and Embedded Systems*, pages 555–576. Springer, 2017.
- [11] Daniel J Bernstein and Bo-Yin Yang. Fast constant-time gcd computation and modular inversion. *IACR Transactions on Cryptographic Hardware and Embedded Systems*, pages 340–398, 2019.
- [12] Barry Bond, Chris Hawblitzel, Manos Kapritsos, K. Rustan M. Leino, Jacob R. Lorch, Bryan Parno, Ashay Rane, Srinath Setty, and Laure Thompson. Vale: Verifying high-performance cryptographic assembly code. In 26th USENIX Security Symposium (USENIX Security 17), pages 917–934, Vancouver, BC, 2017. USENIX Association.
- [13] Dan Boneh and Ramarathnam Venkatesan. Hardness of Computing the Most Significant Bits of Secret Keys in Diffie-Hellman and Related Schemes. In Neal Koblitz, editor, Advances in Cryptology — CRYPTO '96, pages 129–142, Berlin, Heidelberg, 1996. Springer Berlin Heidelberg.
- [14] Ferdinand Brasser, Srdjan Capkun, Alexandra Dmitrienko, Tommaso Frassetto, Kari Kostiainen, and Ahmad-Reza Sadeghi. Dr. sgx: automated and adjustable side-channel protection for sgx using data location randomization. In *Proceedings of the 35th Annual Computer Security Applications Conference*, pages 788–800, 2019.

- [15] Ferdinand Brasser, Urs Müller, Alexandra Dmitrienko, Kari Kostiainen, Srdjan Capkun, and Ahmad-Reza Sadeghi. Software grand exposure: SGX cache attacks are practical. In 11th USENIX Workshop on Offensive Technologies (WOOT 17), Vancouver, BC, 2017. USENIX Association
- [16] Billy Bob Brumley and Risto M Hakala. Cache-timing template attacks. In *Asiacrypt*, volume 5912, pages 667–684. Springer, 2009.
- [17] Billy Bob Brumley and Nicola Tuveri. Remote timing attacks are still practical. In *European Symposium on Research in Computer Security*, pages 355–371. Springer, 2011.
- [18] David Brumley and Dan Boneh. Remote timing attacks are practical. In Proceedings of the 12th Conference on USENIX Security Symposium - Volume 12, SSYM'03, pages 1–1, Berkeley, CA, USA, 2003. USENIX Association.
- [19] Matteo Busi, Job Noorman, Jo Van Bulck, Letterio Galletta, Pierpaolo Degano, Jan Tobias Mühlberg, and Frank Piessens. Provably secure isolation for interruptible enclaved execution on small microprocessors. In 33rd IEEE Computer Security Foundations Symposium (CSF'20), 2020.
- [20] Alejandro Cabrera Aldaya, Raudel Cuiman Márquez, Alejandro J Cabrera Sarmiento, and Santiago Sánchez-Solano. Side-channel analysis of the modular inversion step in the rsa key generation algorithm. *International Journal of Circuit Theory and Applications*, 45(2):199–213, 2017.
- [21] Guoxing Chen, Sanchuan Chen, Yuan Xiao, Yinqian Zhang, Zhiqiang Lin, and Ten H Lai. Sgxpectre attacks: Stealing intel secrets from sgx enclaves via speculative execution. arXiv preprint arXiv:1802.09085, 2018
- [22] Guoxing Chen, Wenhao Wang, Tianyu Chen, Sanchuan Chen, Yinqian Zhang, XiaoFeng Wang, Ten-Hwang Lai, and Dongdai Lin. Racing in hyperspace: Closing hyper-threading side channels on sgx with contrived data races. In *Racing in Hyperspace: Closing Hyper-Threading Side Channels on SGX with Contrived Data Races*, page 0, Washington, DC, USA, 2018. IEEE, IEEE Computer Society.
- [23] Sanchuan Chen, Xiaokuan Zhang, Michael K. Reiter, and Yinqian Zhang. Detecting privileged side-channel attacks in shielded execution with déjà vu. In *Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security*, ASIA CCS '17, pages 7–18, New York, NY, USA, 2017. ACM.
- [24] Fergus Dall, Gabrielle De Micheli, Thomas Eisenbarth, Daniel Genkin, Nadia Heninger, Ahmad Moghimi, and Yuval Yarom. Cachequote: Efficiently recovering long-term secrets of sgx epid via cache attacks. IACR Transactions on Cryptographic Hardware and Embedded Systems, 2018(2):171–191, 2018.
- [25] Thomas Eisenbarth Daniel Moghimi, Berk Sunar and Nadia Heninger. Tpm-fail: TPM meets timing and lattice attacks. In 29th USENIX Security Symposium (USENIX Security 20), Boston, MA, August 2020. USENIX Association.
- [26] Ghada Dessouky, Tommaso Frassetto, and Ahmad-Reza Sadeghi. Hybcache: Hybrid side-channel-resilient caches for trusted execution environments. arXiv preprint arXiv:1909.09599, 2019.
- [27] Dmitry Evtyushkin, Ryan Riley, Nael CSE Abu-Ghazaleh, ECE, and Dmitry Ponomarev. Branchscope: A new side-channel attack on directional branch predictor. In *Proceedings of the Twenty-Third In*ternational Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '18, pages 693–707, New York, NY, USA, 2018. ACM.

- [28] N. J. Al Fardan and K. G. Paterson. Lucky thirteen: Breaking the tls and dtls record protocols. In 2013 IEEE Symposium on Security and Privacy, pages 526–540. IEEE, May 2013.
- [29] Yangchun Fu, Erick Bauman, Raul Quinonez, and Zhiqiang Lin. SGX-LAPD: Thwarting controlled side channel attacks via enclave verifiable page faults. In *International Symposium on Research in Attacks, Intru*sions, and Defenses, pages 357–380. Springer, 2017.
- [30] Patrick Gallagher. Digital signature standard (dss). Federal Information Processing Standards Publications, volume FIPS, pages 186–3, 2013.
- [31] Cesar Pereida García and Billy Bob Brumley. Constant-time callees with variable-time callers. In 26th {USENIX} Security Symposium ({USENIX} Security 17), pages 83–98, 2017.
- [32] Qian Ge, Yuval Yarom, David Cock, and Gernot Heiser. A survey of microarchitectural timing attacks and countermeasures on contemporary hardware. *Journal of Cryptographic Engineering*, 8(1):1–27, Apr 2018.
- [33] Johannes Götzfried, Moritz Eckert, Sebastian Schinzel, and Tilo Müller. Cache attacks on intel sgx. In *Proceedings of the 10th European Workshop on Systems Security*, pages 1–6, 2017.
- [34] Daniel Gruss, Clémentine Maurice, Klaus Wagner, and Stefan Mangard. Flush+Flush: a fast and stealthy cache attack. In *Detection of Intrusions and Malware, and Vulnerability Assessment*. Springer, 2016.
- [35] Jago Gyselinck, Jo Van Bulck, Frank Piessens, and Raoul Strackx. Offlimits: Abusing legacy x86 memory segmentation to spy on enclaved execution. In *International Symposium on Engineering Secure Software* and Systems, pages 44–60. Springer, 2018.
- [36] Marcus Hähnel, Weidong Cui, and Marcus Peinado. High-resolution side channels for untrusted operating systems. In 2017 USENIX Annual Technical Conference (ATC 17), pages 299–312, 2017.
- [37] Nadia Heninger and Hovav Shacham. Reconstructing rsa private keys from random key bits. In Annual International Cryptology Conference, pages 1–17. Springer, 2009.
- [38] Shohreh Hosseinzadeh, Hans Liljestrand, Ville Leppänen, and Andrew Paverd. Mitigating branch-shadowing attacks on intel sgx using control flow randomization. In *Proceedings of the 3rd Workshop on System* Software for Trusted Execution, pages 42–47, 2018.
- [39] Nick A Howgrave-Graham and Nigel P. Smart. Lattice attacks on digital signature schemes. *Designs, Codes and Cryptography*, 23(3):283–290, 2001.
- [40] Tianlin Huo, Xiaoni Meng, Wenhao Wang, Chunliang Hao, Pei Zhao, Jian Zhai, and Mingshu Li. Bluethunder: A 2-level directional predictor based side-channel attack against sgx. *IACR Transactions on Crypto-graphic Hardware and Embedded Systems*, pages 321–347, 2020.
- [41] Intel. Intel® 64 and IA-32 Architectures Optimization Reference Manual.
- [42] Intel. Intel® 64 and IA-32 Architectures Software Developer's Manual, May 2019.
- [43] Intel. Guidelines for mitigating timing side channels against cryptographic implementations, 2020.
- [44] Intel Corporation. Intel software guard extensions developer guide, June 2017. software.intel.com/en-us/documentation/ sgx-developer-guide.

- [45] Gorka Irazoqui, Mehmet Sinan Inci, Thomas Eisenbarth, and Berk Sunar. Lucky 13 strikes back. In Proceedings of the 10th ACM Symposium on Information, Computer and Communications Security, pages 85–96, 2015.
- [46] Mehmet Kayaalp, Khaled N. Khasawneh, Hodjat Asghari Esfeden, Jesse Elwell, Nael Abu-Ghazaleh, Dmitry Ponomarev, and Aamer Jaleel. Ric: Relaxed inclusion caches for mitigating llc side-channel attacks. In Proceedings of the 54th Annual Design Automation Conference 2017, DAC '17, pages 7:1–7:6, New York, NY, USA, 2017. ACM.
- [47] Deokjin Kim, Daehee Jang, Minjoon Park, Yunjong Jeong, Jonghwan Kim, Seokjin Choi, and Brent Byunghoon Kang. Sgx-lego: Fine-grained sgx controlled-channel attack and its countermeasure. computers & security, 82:118–139, 2019.
- [48] Sangho Lee, Ming-Wei Shih, Prasun Gera, Taesoo Kim, Hyesoon Kim, and Marcus Peinado. Inferring fine-grained control flow inside SGX enclaves with branch shadowing. In 26th USENIX Security Symposium (USENIX Security 17), pages 557–574, Vancouver, BC, 2017. USENIX Association.
- [49] A. K. Lenstra, H. W. Lenstra, and L. Lovasz. Factoring polynomials with rational coefficients. MATH. ANN, 261:515–534, 1982.
- [50] Fangfei Liu, Qian Ge, Yuval Yarom, Frank Mckeen, Carlos Rozas, Gernot Heiser, and Ruby B Lee. Catalyst: Defeating last-level cache side channel attacks in cloud computing. In *High Performance Computer Architecture (HPCA)*, 2016 IEEE International Symposium on, pages 406–418. IEEE, 2016.
- [51] Fangfei Liu, Yuval Yarom, Qian Ge, Gernot Heiser, and Ruby B. Lee. Last-level cache side-channel attacks are practical. In *Proceedings* of the 2015 IEEE Symposium on Security and Privacy, SP '15, pages 605–622, Washington, DC, USA, 2015. IEEE Computer Society.
- [52] Ahmad Moghimi, Thomas Eisenbarth, and Berk Sunar. Memjam: A false dependency attack against constant-time crypto implementations in SGX. In Topics in Cryptology - CT-RSA 2018 - The Cryptographers' Track at the RSA Conference 2018, San Francisco, CA, USA, April 16-20, 2018, Proceedings, pages 21–44, 2018.
- [53] Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth. CacheZoom: How SGX Amplifies the Power of Cache Attacks. In Cryptographic Hardware and Embedded Systems – CHES 2017, pages 69–90. Springer, 2017.
- [54] Nguyen and Shparlinski. The Insecurity of the Digital Signature Algorithm with Partially Known Nonces. *Journal of Cryptology*, 15(3):151–176, Jun 2002.
- [55] Phong Q Nguyen and Igor E Shparlinski. The insecurity of the elliptic curve digital signature algorithm with partially known nonces. *Designs*, codes and cryptography, 30(2):201–217, 2003.
- [56] Dag Arne Osvik, Adi Shamir, and Eran Tromer. Cache attacks and countermeasures: The case of aes. In *Topics in Cryptology – CT-RSA* 2006, pages 1–20, Berlin, Heidelberg, 2006. Springer.
- [57] Cesar Pereida García, Billy Bob Brumley, and Yuval Yarom. "make sure dsa signing exponentiations really are constant-time". In *Proceedings of* the 2016 ACM SIGSAC Conference on Computer and Communications Security, CCS '16, pages 1639–1650, New York, NY, USA, 2016. ACM.
- [58] Matthieu Rivain. Fast and regular algorithms for scalar multiplication over elliptic curves. IACR Cryptology ePrint Archive, 2011:338, 2011.
- [59] Ronald L Rivest, Adi Shamir, and Leonard Adleman. A method for obtaining digital signatures and public-key cryptosystems. *Communi*cations of the ACM, 21(2):120–126, 1978.

- [60] Keegan Ryan. Hardware-backed heist: Extracting ecdsa keys from qualcomm's trustzone. In *Proceedings of the 2019 ACM SIGSAC Conference on Computer and Communications Security*, pages 181–194. ACM, 2019.
- [61] Keegan Ryan. Return of the Hidden Number Problem. IACR Transactions on Cryptographic Hardware and Embedded Systems, pages 146–168, 2019.
- [62] Sajin Sasy, Sergey Gorbunov, and Christopher W Fletcher. Zerotrace: Oblivious memory primitives from intel sgx. IACR Cryptology ePrint Archive, 2017:549, 2017.
- [63] C. P. Schnorr. A hierarchy of polynomial time lattice basis reduction algorithms. *Theor. Comput. Sci.*, 53(2-3):201–224, August 1987.
- [64] Michael Schwarz, Moritz Lipp, Daniel Moghimi, Jo Van Bulck, Julian Stecklina, Thomas Prescher, and Daniel Gruss. ZombieLoad: Crossprivilege-boundary data sampling. In CCS, 2019.
- [65] Michael Schwarz, Samuel Weiser, Daniel Gruss, Clémentine Maurice, and Stefan Mangard. Malware guard extension: Using sgx to conceal cache attacks. In *Detection of Intrusions and Malware, and Vulnerabil*ity Assessment, pages 3–24. Springer, 2017.
- [66] Atle Selberg. An elementary proof of the prime-number theorem. Annals of Mathematics, pages 305–313, 1949.
- [67] Ming-Wei Shih, Sangho Lee, Taesoo Kim, and Marcus Peinado. T-SGX: eradicating controlled-channel attacks against enclave programs. In 24th Annual Network and Distributed System Security Symposium, NDSS 2017, San Diego, California, USA, February 26 - March 1, 2017, 2017.
- [68] Shweta Shinde, Zheng Leong Chua, Viswesh Narayanan, and Prateek Saxena. Preventing page faults from telling your secrets. In Proceedings of the 11th ACM on Asia Conference on Computer and Communications Security, pages 317–328. ACM, 2016.
- [69] Emil Stefanov, Marten van Dijk, Elaine Shi, Christopher Fletcher, Ling Ren, Xiangyao Yu, and Srinivas Devadas. Path oram: An extremely simple oblivious ram protocol. In *Proceedings of the 2013 ACM SIGSAC Conference on Computer &#38*; Communications Security, CCS '13, pages 299–310, New York, NY, USA, 2013. ACM.
- [70] Raoul Strackx and Frank Piessens. The heisenberg defense: Proactively defending sgx enclaves against page-table-based side-channel attacks. arXiv preprint arXiv:1712.08519, 2017.
- [71] Daniel Townley and Dmitry Ponomarev. Smt-cop: Defeating sidechannel attacks on execution units in smt processors. In *Proceedings* of the 28th International Conference on Parallel Architectures and Compilation Techniques. ACM, 2019.
- [72] Jo Van Bulck, Marina Minkin, Ofir Weisse, Daniel Genkin, Baris Kasikci, Frank Piessens, Mark Silberstein, Thomas F Wenisch, Yuval Yarom, and Raoul Strackx. Foreshadow: Extracting the keys to the intel sgx kingdom with transient out-of-order execution. In Proceedings of the 27th USENIX Security Symposium. USENIX Association, 2018.
- [73] Jo Van Bulck, David Oswald, Eduard Marin, Abdulla Aldoseri, Flavio D Garcia, and Frank Piessens. A tale of two worlds: Assessing the vulnerability of enclave shielding runtimes. In *Proceedings of the* 2019 ACM SIGSAC Conference on Computer and Communications Security, pages 1741–1758. ACM, 2019.
- [74] Jo Van Bulck, Frank Piessens, and Raoul Strackx. SGX-Step: a practical attack framework for precise enclave execution control. In *Proceedings of the 2nd Workshop on System Software for Trusted Execution*, SysTEX'17, pages 4:1–4:6, New York, NY, USA, 2017. ACM.

- [75] Jo Van Bulck, Frank Piessens, and Raoul Strackx. Nemesis: Studying microarchitectural timing leaks in rudimentary cpu interrupt logic. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and Communications Security, pages 178–195. ACM, 2018.
- [76] Jo Van Bulck, Nico Weichbrodt, Rüdiger Kapitza, Frank Piessens, and Raoul Strackx. Telling your secrets without page faults: Stealthy page table-based attacks on enclaved execution. In 26th USENIX Security Symposium (USENIX Security 17), pages 1041–1056, Vancouver, BC, 2017. USENIX Association.
- [77] Wenhao Wang, Guoxing Chen, Xiaorui Pan, Yinqian Zhang, XiaoFeng Wang, Vincent Bindschaedler, Haixu Tang, and Carl A Gunter. Leaky cauldron on the dark land: Understanding memory side-channel hazards in sgx. In *Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security*, pages 2421–2434, New York, NY, USA, 2017. ACM, ACM.
- [78] Samuel Weiser, Raphael Spreitzer, and Lukas Bodner. Single trace attack against rsa key generation in intel sgx ssl. In *Proceedings of the* 2018 on Asia Conference on Computer and Communications Security, pages 575–586. ACM, 2018.
- [79] Jan Wichelmann, Ahmad Moghimi, Thomas Eisenbarth, and Berk Sunar. Microwalk: A framework for finding side channels in binaries. In *Proceedings of the 34th Annual Computer Security Applications Conference*, ACSAC '18, pages 161–173, New York, NY, USA, 2018. ACM.
- [80] Yuanzhong Xu, Weidong Cui, and Marcus Peinado. Controlled-channel attacks: Deterministic side channels for untrusted operating systems. In *Proceedings of the 2015 IEEE Symposium on Security and Privacy*, SP '15, pages 640–656, Washington, DC, USA, 2015. IEEE Computer Society.
- [81] Yuval Yarom and Naomi Benger. Recovering openssl ecdsa nonces using the flush+reload cache side-channel attack. IACR Cryptology ePrint Archive, 2014:140, 2014.
- [82] Yuval Yarom and Katrina Falkner. Flush+reload: A high resolution, low noise, 13 cache side-channel attack. In 23rd USENIX Security Symposium (USENIX Security 14), pages 719–732, San Diego, CA, 2014. USENIX Association.
- [83] Yuval Yarom, Daniel Genkin, and Nadia Heninger. Cachebleed: a timing attack on openssl constant-time rsa. *Journal of Cryptographic Engineering*, 7(2):99–112, 2017.
- [84] Yinqian Zhang, Ari Juels, Michael K Reiter, and Thomas Ristenpart. Cross-VM side channels and their use to extract private keys. In Proceedings of the 2012 ACM conference on Computer and communications security, pages 305–316, New York, NY, USA, 2012. ACM, ACM.

# A Appendix

### A.1 Lattice Construction for ECDSA Attack

We define a lattice for signature pairs  $r_i$ ,  $s_i$  and the message  $m_i$ , by rearranging the signing equations for a set of signatures as  $k_i - s_i^{-1} r_i d - s_i^{-1} H(m_i) \equiv 0 \mod n$  yielding t linear equations in t+1 unknowns (the secret key d and the nonces  $k_i$ ). We can rewrite these linear equations in the form  $k_i + A_i d + B_i = 0 \pmod{n}$  with  $A_i$  and  $B_i$  known. We denote the  $K > k_i$  as an

upper bound on the  $k_i$ . Then we use the rows of the following basis matrix to generate our target lattice

$$M = \begin{bmatrix} n\mathbf{I_t} \\ \mathbf{A_t} & K/n \\ \mathbf{B_t} & K \end{bmatrix}_{(t+2)\times(t+2)}$$
(1)

where  $\mathbf{I_t}$  denotes the t dimensional identity matrix and t-dimensional vectors  $\mathbf{A_t}$  and  $\mathbf{B_t}$  are obtained from the constant (known) coefficients of the linear equations. The first t columns are generated from each of the  $k_i + A_i d + B_i = 0 \pmod{n}$  relations. The scaled diagonal of each of these columns carries the modulus n reduction. The weights K/n and K in the last two columns ensure that the desired short vector containing the secret key will have coefficients all of approximately the same (small) size. Consdiering the short vector  $v_k = (k_1, k_2, \dots, k_t, K\alpha/n, K)$  as a member of this lattice, recovering the short vector  $v_k$  will allow us to recover the secret key d from the entry  $K\alpha/n$ . Since  $v_k$  should be short, we use a lattice reduction algorithm such as LLL (or BKZ) to find it in polynomial time [13].

#### A.2 Branch Shadow-Resistant Code Attack

Listing 2 provides an elementary example function with secret-dependent branches. We provide the corresponding assembly output in Listing 3, as produced by the LLVM-based, open-source compiler mitigation pass [38] against branch shadowing attacks, described in Section 4.4. We passed the -mllvm -x86-branch-conversion and -mllvm -x86-bc-dummy-instr options to enable both rewriting of conditional branches via the trampoline area and protection against timing attacks via dummy instruction balancing. Note that the randomizer is not integrated in the open-source release, and all code blocks on the trampoline area would still have to be randomly re-shuffled at runtime to protect against branch-shadowing attacks. To achieve sufficient entropy, trampoline areas have to be larger than 4 KiB [38], and hence the trampoline will occupy at least one separate page.

We reveal control flow in the instrumented code of Listing 3 using COPYCAT as follows. In case the secret-dependent if condition is true, the indirect branch at line 29 will execute the single-instruction <code>jmp\_if</code> block on the trampoline page, followed by 4 instructions on the instrumented code page, totalling 5 instructions before reaching the <code>end\_if</code> marker. In contrast, in case the if condition is false, the indirect branch at line 29 will transfer to the <code>skip\_if</code> block on the trampoline page, totaling 4 instructions before eventually reaching the <code>end\_if</code> marker back on the instrumented code page. Similar unbalanced instruction counts follow for the else block.

We experimentally verified that COPYCAT adversaries can deterministically learn the if condition by merely counting instructions and observing page accesses. Moreover, because the dummy instructions do not result in exactly balanced instruction counts, as explained above, merely counting the

```
void my_func(int a) {
   if (a != 0) block1++; else block2++;
   block3++;
}
```

Listing 2: Sample code snippet with conditional branching.

```
my_func
    jmp
  jmp_done:
   jmp
           done
  jmp_done2:
    jmp
           done
  skip_else:
   add
          $0x0, %r13b # compensating dummy
           jmp_done2(%rip),%r15
    lea
    jmp
           end else
11
  jmp_else:
12
    amr
  skip_if:
           $0x0,%r13b # compensating dummy
$0x0,%r13b # compensating dummy
    add
14
15
    add
           jmp_else(%rip),%r15
16
    lea
17
    jmp
           end_if
18
  jmp_if:
19
           if
    jmp
   20
  my_func:
21
           %rbp
22
    push
23
    mov
           %rsp,%rbp
           %edi,-0x4(%rbp)
24
    mov
25
    cmpl
           $0x0,-0x4(%rbp)
           jmp_if(%rip),%r15
26
    lea
           skip_if(%rip),%r13
27
    cmove %r13,%r15
28
           *%r15
29
    jmp
30
  if:
           block1(%rip),%eax
31
   mov
32
    add
           $0x1, %eax
           %eax,block1(%rip)
33
    mov
34
    lea
           skip_else(%rip),%r15
35
  end_if:
    ami
           *%r15
36
37
  else:
           block2(%rip),%eax
38
   mov
39
    add
           $0x1, %eax
           %eax,block2(%rip)
40
    mov
    lea
           jmp_done(%rip),%r15
41
42
  end_else:
43
    jmp
           *%r15
44
  done:
           block3(%rip),%eax
45
    mov
46
    add
           $0x1, %eax
47
    mov
           %eax,block3(%rip)
48
           %rbp
    pop
    ret
```

Listing 3: Hardened assembly output, corresponding to the source code in Listing 2, as produced by the open-source branch shadowing mitigation LLVM compiler pass.

total amount of executed instructions even suffices in itself without having to distinguish accesses to the trampoline page.

#### A.3 WolfSSL Vulnerable Code Path of BEEA

Implementations of BEEA in WolfSSL are given at Listing 4 and Listing 5. Listing 6 shows the places in RSA key generation where BEEA is used by invoking mp\_invmod.

```
static int fp_invmod_slow (fp_int * a, fp_int * b, fp_int * c){...
   top:
     while (fp_iseven (u) == FP_YES) { /* 4. while u is even do */
      fp_div_2 (u, u);
                              /* 4.1 u = u/2 */
      if (fp_isodd (A) == FP_YES \parallel /* 4.2 if A or B is odd then */
      fp_{isodd}(B) == FP_{YES}(B)
                                /* A = (A+v)/2 */
       fp_add (A, y, A);
       fp\_sub(B, x, B);
                               /*B = (B-x)/2 */
                                /*A = A/2 */
10
      fp_div_2 (A, A);
11
      fp_div_2 (B, B);
                               /* B = B/2 */
12
     while (fp_iseven (v) == FP_YES) { /* 5. while v is even do */
13
14
      fp_div_2 (v, v);
      if (fp_isodd (C) == FP_YES || /* 5.2 if C or D is odd then */
      fp_isodd(D) == FP_YES)  {
                               /* C = (C+y)/2 */
17
       fp_add (C, y, C);
18
       fp_sub (D, x, D);
                               /*D = (D-x)/2 */
19
      fp_div_2 (C, C);
                               /* C = C/2 */
20
      fp_div_2 (D, D);
                               /* D = D/2 */
22
    if (fp_cmp (u, v) != FP_LT) { /*6. if u >= v then */
24
      fp_sub (u, v, u);
      fp_sub (A, C, A);
                               /* A = A - C */
                               /* B = B - D */
26
      fp_sub (B, D, B);
28
      fp_sub (v, u, v);
                               /* C = C - A */
      fp_sub (C, A, C);
29
                                /* D = D - B */
30
      fp_sub (D, B, D);
31
    if (fp_iszero (u) == FP_NO) goto top; /* if not zero goto step 4 */
```

Listing 4: fp\_invmod\_slow implements modular inversion using the binary extended euclidean algorithm (BEEA). The subroutines fp\_iseven and fp\_isodd are inlined and simply check the last bit of their operand. The arithmatic subroutines: fp\_sub, fp\_div\_2, fp\_cmp can all fit within the same page.

```
/* c = 1/a \pmod{b} for odd b only */
   int fp_invmod(fp_int *a, fp_int *b, fp_int *c) {...
   top:
    while (fp_iseven (u) == FP_YES){ /* 4. while u is even do */
     fp_div_2 (u, u); /* 4.1 u = u/2 */ if (fp_isodd (B) == FP_YES){ /* 4.2 if B is odd then */
       fp_sub (B, x, B);
     fp_div_2 (B, B);
                               /* B = B/2 */
10
     while (fp_iseven (v) == FP_YES){ /* 5. while v is even do */
11
     fp_div_2 (v, v); /* 5.1 v = v/2 */ if (fp_isodd (D) == FP_YES){ /* 5.2 if D is odd then */
12
13
14
       fp_sub (D, x, D);
                               /*D = (D-x)/2 */
15
     fp_div_2 (D, D);
16
                                /* D = D/2 */
17
    18
19
     fp_sub (B, D, B);
20
21
     } else {
     fp_sub (v, u, v);
                              /*v-v-u*/
22
                                /* D = D - B */
23
     fp_sub (D, B, D);
24
    if (fp_iszero (u) == FP_NO) goto top; /* if not zero goto step 4 */
```

Listing 5: fp\_invmod implements modular inversion using the binary extended euclidean algorithm (BEEA). This function is similar to Listing 4, but only supports odd numbers as the second parameter.

```
if (err == MP_OKAY)
                           /* key -> d = 1/e \ mod \ lcm(p-1, \ q-1) */
  err = mp_invmod(\&key->e, \&tmp3, \&key->d);
if (err == MP_OKAY)
  err = mp_mul(&p, &q, &key->n);
if (err == MP_OKAY)
                           /* key -> dP = d \ mod(p-1) */
  err = mp_mod(&key-
                       >d, &tmp1, &key->dP);
if (err == MP_OKAY)
                            /* key -> dQ = d \mod(q-1) */
  err = mp\_mod(\&key-
                       >d, &tmp2, &key->dQ);
if (err == MP_OKAY)
                           /* key -> u = 1/q \ mod \ p */
  err = mp_invmod(&q, &p, &key->u);
```

Listing 6: wc\_MakeRsaKey.