# Toward Balancing Arbitrary Code

# June 27, 2019

# **Contents**

| 1 | Introduction                                                                                                                                                                                                                         | 1                                |  |  |  |  |  |  |
|---|--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------|--|--|--|--|--|--|
| 2 | Power Analysis       3         2.1 Defenses against Power-Analysis attacks                                                                                                                                                           |                                  |  |  |  |  |  |  |
| 3 | Related Work                                                                                                                                                                                                                         | 5                                |  |  |  |  |  |  |
| 4 | Theory  4.1 Balancing Individual Values                                                                                                                                                                                              | 6<br>9                           |  |  |  |  |  |  |
| 5 | Implementation  5.1 LLVM  5.2 Balancing Pass  5.3 Cloning Functions  5.4 Balancing Allocations  5.5 Balancing Stores  5.6 Balancing Loads  5.7 Balancing Binary Operations  5.8 Balancing Pointer Arithmetic  5.9 Balancing Compares | 13<br>14<br>14<br>15<br>15<br>15 |  |  |  |  |  |  |
| 6 | Evaluation                                                                                                                                                                                                                           | 16                               |  |  |  |  |  |  |
| 7 | Results 7.1 Robustness                                                                                                                                                                                                               |                                  |  |  |  |  |  |  |
| 8 | Conclusion                                                                                                                                                                                                                           | 19                               |  |  |  |  |  |  |

# 1 Introduction

Unintended signal emissions are a major source of information leakage in modern processors. Especially cryptographic secrets are valuable targets for analyzing these so-called Side-Channels. While the physical access required

for Side-Channel attacks is often a hurdle, embedded devices are usually shipped to consumers, making them vulnerable to this type of attack.

One such side-channel that is especially easy to measure is power consumption. Setting a binary value in registers, main memory etc. consumes power directly related to the number of bits to be set to 1. By measuring the power consumption traces during execution an attacker can gain information about the Hamming Weight (number of 1s) of the processed data. If she knows which cryptographic operation is being performed and can control the input (both reasonable assumptions for embedded devices), she can infer the value of the cryptographic secret via statistical anlysis of the power traces.[4] A comparatively low clock rate and power traces that are low in noise due to a lack of parallelism make embedded platforms especially susceptible to such a Power-Analysis attack.

As performing cryptographic operations is *exactly* the use case of many embedded devices (e.g. SmartCards, verifying OTA updates, etc.), defenses against Power-Analysis have been amply explored. However, the most commonly used defenses are either algorithm specific, like masking, or require significant changes to the hardware, like Dual-Rail-Logic[12]. Dual-Rail-Logic is especially notable because it is algorithm independent. By computing the inverse, along with the actual value, Dual-Rail-Logic *balances* the number of 1s in intermediate values, and thus makes the power consumption constant. This makes it in theory impossible for an attacker to gain information via the power consumption.

Unfortunately this strategy suffers from multiple engineering problems, such as minute differences in clock timings between the regular and inverted path[2], or variances in the production of transistors[11]. It also requires a significant increase in circuit size, doubling the required size or more[2].

Even with these caveats, Dual-Rail-Logic still has the benefit that *any* code can be modified circuitry, without any modifications, while still experiencing increased robustness.

In my thesis I explore the possibilities of implementing a similar balancing in software. It works by only using part of the available word size for actual data, leaving the rest for balancing. Specifically, I store 8bit values, along with their balancing counterpart, in a 32bit register. This then means that the data has no influence on the power consumption anymore. I also propose an arithmetic on these balanced values, giving a balanced replacement for all integer operations required for a modern RISC instruction set. With this arithmetic and the balanced values, one can then execute *any* program, while benefiting from a massive increase in robustness against Power-Analysis attacks.

I also provide a plugin for the LLVM compiler that transforms code written for 8bit word-sizes into this balanced form. This proof-of-concept shows that it is possible to execute code using this balanced form. Additionally, it also shows that even such significant changes to the way code is executed can come at no extra cost to the programmer, and the job of generating secure code can at least in part be handed off to the compiler.

Finally I provide an evaluation of my balanced form. By running code compiled with my plugin in the QEMU emulator I can examine the Hamming Weight of values during the execution, and compare them to regular unbal-

anced code. Evaluating in such a manner simulates an extremely powerful attacker that can examine the Hamming Weight of the result of every single operation. An increased robustness in such a scenario then indicates increased robustness for every real-world attacker.

The rest of this thesis is split into these tree parts.

# 2 Power Analysis

In most cases the power consumption during execution is data-dependent. Setting a bit to 1 requires more power than setting it to 0. Power consumption is thus directly linked to the Hamming Weight of processed data. An attacker can then measure the power consumption and make inferences on the data being processed.

Performing Power-Analysis requires some setup: An attacker solders a resistor between the target processor and the ground of its power supply. She then measures the voltage difference between both ends with an oscilloscope (this voltage is directly proportional to the current flowing through the resistor). This gives her easy access to the power traces at a high resolution and for every clock cycle.

With these power traces an attacker then has the choice of multiple attack forms of varying complexity.[7][4] The simplest form is *Simple Power-Analysis* (SPA), and it involves directly examining the power consumption. As large control blocks can be identified, a data-dependent control flow can leak information this way. An example target would be RSA decryption being calculated via the square-and-multiply algorithm. The difference between the multiply and the square operation is directly observable from a single power trace. As the order of these operations is linked to the private key, identifying the control flow leaks the private key. Figure 1 shows a trace for square-and-multiply in RSA decryption, including the leaked private key bits.



Figure 1: Simple Power Analysis on square-and-multiply RSA[3]

The control flow is often not enough to leak the entire secret, and it is very hard to gain information about the actual data from only SPA. For this a more complex variant of Power-Analysis can be used, namely *Differential* 

*Power-Analysis* (DPA). DPA requires a large number of traces, with one factor for the power consumption known. For cryptographic operations this equates a *chosen plaintext* attack scenario.

DPA then attacks the individual bits of the key. The attacker considers two cases, one for each value of the current key bit. First she assumes the value of the current key bit is 0. She then chooses a bitwise operation (e.g. XOR of the plaintext with the key in the first round of AES), and splits the power traces into two sets, based on the value of the target bit in the expected result of this operation. Next she calculates the mean power consumption of both sets. As the value of the other bits, as well as other factors for the power consumption, are randomly distributed, calculating the mean will neutralize them. If her assumption was correct, the difference of both means will exhibit a spike at the time of the chosen operation. Either way, the value of the current key bit is revealed to her.

Figure 2 shows a typical DPA result with the mean power consumptions of both sets, the difference between the two, and the difference with the Y axis magnified by a factor of 15. This analysis was performed on the output of the least significant bit after the first S-box substitution in AES.



Figure 2: Difference in means during DPA[8]

Even more higher order information about the key can be found with *Correlational Power-Analysis* (CPA). CPA is the most complex of these attacks, but also offers the best results. The attacker starts by making a list of candidate values for every byte of the key. Attacking individual bytes at a time, instead of the whole key, still keeps the required effort feasible. As she knows which algorithm she is attacking, she knows which operations will take place, and can calculate expected intermediate results based on her chosen plaintexts and key candidates. She can then calculate the expected power consumptions for these intermediate results.

The attacker can now calculate the correlation between the expected power consumptions for every key byte candidate and the actual power consumption. The candidate with the highest correlation coefficient is then the most probable value for the current key byte.

#### 2.1 Defenses against Power-Analysis attacks

Power-Analysis works because processed data directly influences the power consumption. Defenses against this class of attack usually work by either adding additional factors to the power consumption, thus increasing the computational effort required for analysis, or by reducing variances in power consumption altogether. This reduced variance decreases the information an attacker can gain from the same number of power traces, giving her reduced confidence in her result or requiring her to capture more traces.

Masking[6][5] for example is an algorithm specific defensive measure that adds a third factor to the power consumption by first performing adding a masking value to the plaintext via an invertible operation. The cryptographic algorithm then works on the masked value, and only in the end unmasks the result. As such the attacker has to calculate her correlation for each possible combination of key byte and mask value. This increases the number of traces she needs to capture (to still provide the same confidence in her analysis) and the computation time of her analysis.

Other defensive measures focus on creating a worse signal to noise ratio for the entire power consumption. One technique that has gained a lot of traction is Dual-Rail-Logic[12]. It works by calculating the inverse of every intermediate result along with the actual result, thus balancing the number of 1s. This results in a constant Hamming Weight and therefore a data-independent power consumption.

Unfortunately, Dual-Rail-Logic suffers from multiple engineering problems. The power required to set the value of a bit to 1 is dependent on properties of the underlying transistors, which are subject to variances in manufacturing.[11] Minimal differences in clock timings between both paths can also reduce the security of Dual-Rail-Logic[2]. Storing the inverse also requires significantly larger circuitry, doubling the circuit size or more[2].

Even with these caveats, Dual-Rail-Logic has the major advantage that once it is applied, *any* code can be run without modifications while still benefiting from the increased robustness.

### 3 Related Work

There are multiple approaches to performing a Power-Analysis attack.

Currently there are is no perfect defense against such an attack, and all current defenses do is increase the amount of effort required for successful attack.

Power-Analysis is made possible by processed data directly affecting the power consumption. Defensive measures then either try to insert additional factors for the power consumption, or decrease the signal-to-noise ratio of the power consumption itself (by reducing variance in the power consumption).

One popular class of defense that introduces additional factors is masking

# 4 Theory

By performing Power-Analysis an attacker can gain information about the Hamming Weights during execution. This means that as long as their Hamming Weights are identical, values are indistinguishable by such an attacker. Having only perfectly balanced values then means that an attacker can gain no information via the power consumption, as all values look exactly the same.

Unfortunately, the design of the ALU does not permit such a scenario. The goal then is to come as close to this ideal as possible, i.e. having a minimal amount of unbalanced values. Balancing individual values can be done perfectly, only the balanced versions of binary operators *must* have imbalanced intermediate values. For my thesis I did not care too much about their optimality, focussing instead on their correctness. The derivation of my binary operators is already a proof of their correctness, and additionally I checked all possible combinations of 8bit values for incorrect results.

#### 4.1 Balancing Individual Values

In resemblence of Dual-Rail-Logic itself, I balance the Hamming Weight by storing the inverse in the same register as the actual value. While this could be done while still keeping 16bit for actual data, using only 8bit gives me more space to store temporary values. This gives me more freedom to balance values during binary operations. With that fixed the only remaining decision is where in the register to put x and  $\overline{x}$ . I wanted to have room between x and  $\overline{x}$  for shifts, so this left only 2 candidates (and their inverses) for what I call the *balancing schemes*. Figure 3 shows the two schemes I chose for my project.



Figure 3: Balancing Schemes

I found balanced operations for both schemes, but in the end decided to use Scheme 1 as a default because it exhibits nicer behavior for shifts, especially rotations. Both are worth mentioning however, because many of my operations will result in values formatted in Scheme 2 and require explicit transformation. By finding standardized transformations in both directions I could reuse them in the rest of my arithmetic.

#### 4.2 Balancing Binary Operations

The biggest problem of finding a balanced arithmetic was that  $\overline{x} \circ \overline{y}$  is not  $\overline{x} \circ \overline{y}$  (o here denotes any operator). As the ALU cannot execute two different operations on parts of the same register at the same time, there *must* be imbalanced temporary values during execution. My goal then was to limit the number of these imbalanced values.

For every operation I give the intermediate steps, with a single line denoting an intermediate value. The values are in the form

$$\%i = x_1$$
  $\parallel x_2$   $\parallel x_3$   $\parallel x_4$  | operation

where %i denotes the "name" of the current intermediate, and  $x_1$  through  $x_4$  are the individual bytes of a register, with  $x_1$  having the most significant, and  $x_4$  the least significant bits. The operation denotes how the current value is obtained.

#### **Transforming Scheme 1 to Scheme 2**

The transformation from Scheme 1 to Scheme 2 looks as follows:

LSL here stands for logical shift left.

#### Transforming Scheme 2 to Scheme 1

The other direction works very similar to the first, and is shown below. Note that ROR stands for rotational right shift, i.e. the values shifted out on the right are shifted back in on the left.

$$\%1 = \overline{x} \qquad \qquad \parallel 0 \qquad \parallel x \\ \%2 = \texttt{Oxff} \qquad \parallel \overline{x} \qquad \parallel 0 \qquad \parallel x \qquad \quad \mid \%1 \ \texttt{ORR} \ (\%1 \ \texttt{ROR} \ 24) \\ \%3 = 0 \qquad \qquad \parallel \overline{x} \qquad \parallel 0 \qquad \parallel x \qquad \quad \mid \%2 \ \texttt{AND} \ \texttt{Ox00ff00ff}$$

#### ORR

Before finding a balanced variant of bitwise OR, I needed to find an expression for the inverse of the result. For this I utilized DeMorgan's law:  $\overline{x \vee y} = \overline{x} \wedge \overline{y}$ . With this equality ORR looks as follows:

#### AND

As  $\overline{x \wedge y} = \overline{x} \vee \overline{y}$ , AND works almost the same as ORR, but uses different parts of the intermediate results.

#### **XOR**

XOR is at its base a combination of AND and ORR:  $x \oplus y = (\overline{x} \wedge y) \vee (x \wedge \overline{y})$ . It is better to create a balanced XOR from scratch, instead of compositioning it from ORR and AND, because both ORR and AND have the same imbalanced intermediate values.

The inverse of the result can be found through repeated application of DeMorgan's law and simplification. I will skip the details of this simple transformation, and show only the result:  $\overline{x \oplus y} = (x \wedge y) \vee (\overline{x} \wedge \overline{y})$ .

| %1 = 0                                    | $\parallel \overline{x}$                           | $\parallel 0$                               | $\parallel x$                                 |                                                 |
|-------------------------------------------|----------------------------------------------------|---------------------------------------------|-----------------------------------------------|-------------------------------------------------|
| %2 = 0                                    | $\parallel \overline{y}$                           | $\parallel 0$                               | $\parallel y$                                 |                                                 |
| $\%3 = \overline{x}$                      | $\parallel \overline{x}$                           | $\parallel x$                               | $\parallel x$                                 | $\mid \%1~\mathrm{ORR}~(\%1~\mathrm{LSL}~8)$    |
| %4 = y                                    | $\parallel \overline{y}$                           | $\parallel \overline{y}$                    | $\parallel y$                                 | $\mid \%2~\mathrm{ORR}~(\%2~\mathrm{ROR}~24)$   |
| $\%5=\overline{x}$ and $y$                | $\parallel \overline{x} \text{ and } \overline{y}$ | $\parallel x \; {\tt AND} \; \overline{y}$  | $\parallel x \; \mathtt{AND} \; y$            | $\mid \%3 \text{ and } \%4$                     |
| $\%6 = x \ \mathrm{XOR} \ y$              | $\parallel \overline{x \; \mathtt{XOR} \; y}$      | $\parallel x \ \mathtt{XOR} \ y$            | $\parallel \overline{x \; \mathtt{XOR} \; y}$ | $\mid \%5 \text{ and } (\%5 \text{ ror } 16)$   |
| $\%7 = \overline{x \; \mathrm{XOR} \; y}$ | $\parallel x \ \mathtt{XOR} \ y$                   | $\parallel \overline{x \; \mathtt{XOR}  y}$ | $\parallel x \ \mathtt{XOR} \ y$              | $ \ \%6\ {	t ROR}\ 8$                           |
| $\%8 = \overline{x \; \mathtt{XOR} \; y}$ | 0                                                  | $\parallel 0$                               | $\parallel x \ \mathtt{XOR} \ y$              | $ \ensuremath{\%7}$ AND <code>Oxff0000ff</code> |
| %9 = 0                                    | $\parallel \overline{x \; \mathtt{XOR}  y}$        | 0                                           | $\parallel x$ XOR $y$                         | $  transform_2_1(\%8)$                          |

#### **ADD**

For the inverse of arithmetic operations I utilized the definition of the negation in 2s complement:  $-x = \overline{x}+1$ . This also means that  $\overline{x} = -x-1$  and therefore:

$$\overline{x+y} = -(x+y) - 1 = -x - y - 1 = \overline{x} + 1 + \overline{y} + 1 = \overline{x} + \overline{y} + 1$$

Using associativity of addition the balanced variant of ADD looks like the following:

| %1 = 0 | $\parallel \overline{x}$     | 0              | $\parallel x$     |                                   |
|--------|------------------------------|----------------|-------------------|-----------------------------------|
| %2 = 0 | $\parallel \overline{y}$     | $\parallel 0$  | $\parallel y$     |                                   |
| %3 = 0 | $\parallel \overline{x} + 1$ | $\parallel 0$  | $\parallel x$     | %1 + 0x00010000                   |
| %4 = c | $\parallel \overline{x+y}$   | $\parallel c'$ | $\parallel x + y$ | %3 + %2                           |
| %5 = 0 | $\  \overline{x+y} \ $       | 0              | $\parallel x + y$ | $ \%4 \land \texttt{0x00ff00ff} $ |

Both c and c' denote possible carry values that need to be filtered.

#### **SUB**

For subtraction I again use the definition of 2s complement, giving me the following for the inverse result:

$$\overline{x-y} = -(x-y) - 1 = y - x - 1 = y + (-x-1) = y + \overline{x} = \overline{x} + y$$

Applying the same definition to the regular result yields

$$x - y = x + \overline{y} + 1$$

resulting in a guick and convenient balanced subtraction:

```
%1 = 0
                      \parallel \overline{x}
                                            \parallel 0
                                                           \parallel x
%2 = 0
                                            \parallel 0
                                                           \parallel y
                      \parallel \overline{y}
\%3 = 0
                                            \parallel 0
                                                           \parallel \overline{y}
                                                                                        1\%2 ROR 16
                      \parallel y
%4 = 0
                      \parallel y
                                            \parallel c
                                                           \parallel \overline{y} + 1
                                                                                       1\%3 + 0x00000001
                                           \parallel c''
\%5 = c'
                     \parallel \overline{x} + y
                                                           ||x+\overline{y}+1|
                                                                                       |\%1 + \%4|
\%6 = 0
                     \parallel \overline{x-y}
                                            \parallel 0
                                                           \parallel x-y
                                                                                       |\%5 AND 0x00ff00ff
```

#### MUL

The inverse result of multiplication can be calculated as follows:

$$\overline{x \cdot y} = -(x \cdot y) - 1 = (-x) \cdot y - 1 = (\overline{x} + 1) \cdot y = \overline{x} \cdot y + y - 1$$

Which gives us the following balanced multiplication:

#### **DIV** and **REM**

I used repeated balanced subtraction for DIV and REM operations. The code was written in C and can be found in the git of my thesis[1].

# **Shifting**

While performing logical shifts, I need to ensure that the correct bits are pushed in. When 0s are shifted in for x I have to shift in 1s for  $\overline{x}$ , and vice versa. This is done by ORRing the target value with 0xff000000 or 0x0000ff00, as needed. The shifting is performed normally and the result is then AND filtered with 0x00ff00ff to comply with Scheme 1 again.

#### 4.3 Testing for Correctness

While the individual steps for each binary operator are themselves a theoretical proof for their correctness, I still wanted to validate them. As there are only 256 possible 8bit values, I could easily brute-force every combination of them, and verify the correctness of the result. For this purpose I wrote python code that allows execution of intermediate steps. By specifying the individual steps with lambdas, and then constructing the entire balanced operation from unary and binary operations, I can execute the operation step by step. The intermediate results are then stored in numpy arrays, allowing

me to check if the results are correct, and for which values the results are incorrect, as well as where any errors happen. As an example, Listing 1 shows the intermediate steps for multiplication.

Listing 1: Step-by-step execution of balanced multiplication

The *Unary-* and *BinaryOperation* classes take the indices of the layers to operate on (0 and 1 are the inputs, all others are intermediate values), as well as the operation in form of a lambda. Executing the *MultiStepOperation* will then execute all lambdas in order and store the intermediate results in *numpy* arrays. Correctness is then tested by checking if all final results are equal to the output of a reference operation ( $x \cdot y$  in this case).

#### 4.4 Evaluating the Balancedness

The balancedness of my operations is evaluated using the same python code. As all intermediate results are stored during evaluation I can easily calculate the distribution of their Hamming Weights, as shown in Figure 4. I used these histograms to check if operations needed improvement, and if that was the case, I tried to find a different, more balanced way of performing them.

While Figure 4 shows imbalanced values in the intermediate steps, it performed faster and more robust than multiplication via repeated addition. Figure 5 shows an evaluation of both variants, evaluated over the multiplications of all possible 8bit factors.

# 5 Implementation

The transformation of normal code into my balanced form happens automatically inside the compiler. This requires no additional considerations of the programmer, while still providing increased robustness. My thesis is intended as a proof of concept and as such does not balance all variables in the code. It balances only variables on the stack, including function parameters, return values, literals and local variables. I chose the stack as it provides a clean cut with little ambiguity for the programmer, while still balancing enough to decrease variance in Hamming Weights by a measurable amount. Figure 6 shows all candidates for balancing, as well as their relationship. If one candidate has their value set by another (e.g. result of binary operators being



Figure 4: Histogram of Hamming Weights of direct balanced multiplication

stored in variables), then this introduces a dependency for this balancing. These dependencies are shown as arrows in Figure 6. Memory locations (registers, heap memory, etc.) are balanced iff all values stored in them are balanced.

With the current implementation state, loading from unbalanced memory causes unbalanced values to be stored in registers. This fact can be avoided however, by not using globals.

#### 5.1 LLVM

The LLVM compiler infrastructure project[9] contains a number of subprojects, but it is mostly known for being an extremely versatile compiler. It works by working on an intermediate representation (LLVM IR) specifically designed to be source and target independent, while allowing for easy automatic optimization. As most of the work in a compiler goes into the so-called



Figure 5: Hamming weight histograms for direct and addition-loop multiplication



Figure 6: The balancing status of my thesis project

optimization passes. This design makes the bulk of the work apllicable to all languages. A language can be added to LLVM by writing a frontend, translating source code to LLVM IR. The translation from LLVM IR to machine code for the target architecture is then done by backends, which can also easily be added. Figure 7 sketches this architecture.



Figure 7: The general architecture of the LLVM compiler

As the optimization passes take LLVM IR as input and provide it as output, they can easily be added. For this reason I implemented the transformation of code into my balanced form as an LLVM optimization pass.

## 5.2 Balancing Pass

The idea behind the balancing pass is very simple.

- 1. Change the type of all 8bit integers (int8) to 32bit integers (int32)
- 2. Balance constant initializers
- 3. Balance results of load operations if necessary
- 4. Unbalance values before store if necessary
- 5. Use balanced arithmetic operations instead of regular operators
- 6. Fix comparison directions
- 7. Fix type issues that arise in the instructions that have not been replaced

The LLVM C++ API provides an iterator over all instructions in a function. My pass uses this to go over all instructions, see if something needs to be done, and if so, performs the transformation. This is usually done by generating a new LLVM IR instruction and replacing references to the old instruction with it. LLVM provides some convenient helpers for such a process.

#### 5.3 Cloning Functions

The first types used in a function are its return type and the types of its function parameters. As these types cannot be changed for an existing function in LLVM I need to clone the functions with updated types.

Cloning functions is done in two parts. First the prototype for the new function is created. During this creation the pass goes through all parameter and changes their types from *int8* to *int32*. The same is done for the return type. This gives me a skeleton for the balanced function, which is inserted into the module, making it accessible in the future.

The content of the original function is then copied using a helper in the LLVM API called *CloneFunctionInto*. Without any additional parameters, the copied instructions will still reference function parameters of the original function, will result in broken code in the new function. To avoid this I use a so-called *Value Mapper* to replace the old parameters with the new ones everywhere they are referenced. This change alone would cause type mismatches and generates code that does not compile, but the other steps of my pass fix these problems.

# 5.4 Balancing Allocations

In order to declare and use local variables in LLVM IR the memory for them first has to be allocated using the *alloca* instruction. Even function parameters are not used directly but first copied into memory explicitly allocated for this function. Note that even though the naming is similar to C's *malloc* call, the memory for *alloca* is on the stack in this case.

The *alloca* instruction takes the type to be allocated as parameter, and returns a pointer to that type. This means that for balancing all the pass has to do is replace the *alloca* for *int8* with one for *int3*2. Allocations for local arrays work the same way, the pass just needs to extract the dimensions from the old allocation.

### 5.5 Balancing Stores

It can happen that the target code tries to store a balanced variable (*int3*2) into an unbalanced pointer (*int8*). In this case the pass unbalances the variable in a temporary before storing it.

While this does cause information leakage and a reduction in robustness, such a case can be avoided fairly easily. As only global memory is unbalanced, this does not happen when the program stores all values on the stack.

# 5.6 Balancing Loads

Balancing loads is a mirror case of balancing stores. When loading from an unbalanced pointer into a balanced variable, the pass first loads into an unbalanced temporary and then balances the value before storing it in the local variable.

# 5.7 Balancing Binary Operations

I implemented the balanced operations described in Section 4.2 in C, each as an individual function. In order to balance binary operations they need to be replaced by calls to these new functions. As all binary operations are represented by the same instruction in the LLVM API, the pass needs to examine the *opcode* of the instruction. Based on that it decides which function call to generate.

For most operations the balanced operation is a direct implementation of the respective steps in Section 4. Division, and remainder however are implemented by repeated addition/subtraction. As an example, Listing 2 shows the balanced function for the sdiv (signed division) operation in LLVM IR.

#### Listing 2: Balanced sdiv

```
1 int balanced_sdiv(int lhs, int rhs) {
    uint32_t ret = 0x00ff0000;
    uint8_t negative = 0;
    if(rhs & 0x00000080){
      negative = 1;
      rhs = balanced_negative(rhs);
10
    while (lhs <= rhs) { //~x <= ~y iff x >= y
11
      lhs = balanced_sub(lhs, rhs);
12
      ret = balanced_add(ret, 0x00fe0001);
13
14
15
16
    if(negative)
     return balanced_negative(ret);
17
18
    else
19
      return ret;
20 }
```

The semantics, especially the handling of negative values are made to be consistent with the semantics of LLVM.

### 5.8 Balancing Pointer Arithmetic

Balanced values cannot be used for array indexing directly. Therefore, whenever a balanced variable is used as index for an array access it is unbalanced before use. All array accesses use the *getelementptr* instruction in LLVM IR, so this is easy to catch. This does not handle manual arithmetic operations with pointers, but that is by design.

#### 5.9 Balancing Compares

In my main balancing scheme (Figure 3a) the inverse occupies more significant bits than the value itself. This changes the direction of comparison operations, meaning < becomes >, >= becomes <= etc.

### 6 Evaluation

QEMU does not simply interpret the guest code in a simulated processor. Instead it translates the machine code for the guest platform into machine code for the host platform, and places that "patched" machine code in memory. A second executor thread then runs that code as it becomes available.

This translation backend is called the Tiny Code Generator (TCG), which not only performs the translation but also some optimizations. Instrumenting QEMU for analysis is hard due to the fact that the TCG works through multiple layers of indirection, utilizing both helper functions and preprocessor macros, some of which are defined in different files depending on the host architecture (the specific definition file is chosen while building QEMU). As documentation is also sparse, finding a good place to put my evaluation code required a lot of time and effort.

Even after understanding all the parts of QEMU's way of emulating code, I was left with a problem. The executor thread does not know what code it is executing, it only has a pointer (the simulated program counter) to the next instruction or the next basic block. The TCG on the other hand knows which operations are being executed, but it does not know the values of the operands. It also has no way of accessing these values as they might not even be computed yet. So short of either parsing the memory at the simulated program counter or writing a symbolic execution engine (essentially replacing QEMU) I did not know how to proceed.

Luckily, QEMU offers emulation via the TCG Interpreter (TCI). The TCI does exactly what I was looking for in the first place, i.e. emulating the guest processor in C. I then placed my instrumentation code in the operator functions of the TCI, generating a histogram of Hamming Weights during the execution.

# 7 Results

In this section I will discuss the balancing results for the two main algorithms I tested the pass on: RC4 and AES. Both algorithms have been written/adapted so that they utilize the stack as much as possible, maximizing the benefits of

my balancing pass. For the evaluation of both the performance and the robustness I use histograms of the Hamming Weights over the entire execution of the code. Figures 8 and 9 show a comparison of balanced and unbalanced histograms for RC4 and AES respectively.

The balanced version of both algorithms have been compiled with my balancing pass, while the unbalanced versions were compiled with GNU ARM Cross GCC.



Figure 8: Hamming weight histograms for balanced and unbalanced RC4



Figure 9: Hamming weight histograms for balanced and unbalanced AES

#### 7.1 Robustness

For both algorithm the balancing works very well. The Hamming Weights are concentrated around 8, with other values being much less frequent. Note that for an attacker performing Power Analysis, all values with the same Hamming Weight look identical. Thus, the less evenly distributed the Hamming Weights of intermediate values are, the lower the confidence of her statistical attack. As such, a perfect scenario for the defender would be all Hamming Weights having exactly the same value.

A significant number of operations also exhibit a Hamming Weight of 9 and 10, which is probably due to carry bits in arithmetic operations. This theory is supported by the fact that these Hamming Weights are more prevalent in AES, which utilizes a lot more loops and therefore additions.

The balancing is not perfect, as some intermediate steps of my balanced operators will *always* have unbalanced values. Value unbalancing for array indexing is also a factor for the distribution of Hamming Weights in the balanced code.

#### 7.2 Performance

The number of operations is  $77\,349$  for unbalanced RC4, and  $401\,287$  for balanced RC4. That is an increase in the number of operations by a factor of 5.19. For AES the unbalanced code has  $83\,549$  operations, while the balanced code has  $2\,396\,186$  operations. This is an increase by a factor of 28.68.

For both algorithms the largest part of the performance impact is probably due to MUL, DIV and REM operations being calculated via repeated addition/subtraction. In general it is also important to note that when the full 32bit range is required for the program the performance drops by an additional factor of 4, because then every operation needs to be performed on the individual bytes of a 32bit word. This is less prevalent in cryptographic algorithms, as they mostly work on individual bytes.

However, performance was explicitly not a goal for my thesis, and I decided to focus solely on robustness.

# 8 Conclusion

In my thesis I evaluated the robustness of a pure software implementation of Dual-Rail-Logic. By writing a proof of concept implementation I explored a new perspective on hardening embedded platforms against power analysis attacks. Preliminary evaluation shows a drastic reduction in the signal to noise ratio of the power consumption, due to a decreased variance in Hamming Weights of intermediate values.

The security of balanced code is not perfect, as it is limited by the capabilities of ALUs. However, with this limitation in mind, I believe my balancing pass achieves quite good performance. Histograms of the Hamming Weights show a shift towards values around 8. This causes a relative decrease of other values, reducing the information an attacker gains from power analysis attacks. She thus needs a larger number of traces to reach the same confidence in her attack.

A major disadvantage of the approach in my thesis is the increased number of operations taking place. The number of clock cycles increases by a factor between 5 and 7 for RC4 and AES, respectively. When taking into account the reduction in word size this factor rises up to 28. Future work could reduce this performance impact by removing unnecessary transformations between balancing schemes, but the design of embedded platforms and RISC architectures in general sets a lower bound for the performance impact of my approach. While something like Intel's SIMD extensions[10] could drastically reduce the performance impact of software Dual-Rail-Logic, this is not possible for the intended target platforms.

As currently only stack values are balanced, balancing all types of variables is another avenue for future work. This would balance main memory, and thus the data bus at all times. The logical next step after this would be to balance the address bus as well, completely cutting an attacker of from getting any information via the power consumption. However, this last approach would require making major changes to the way memory is indexed, possibly changing paging controllers and (if present) cache controllers.

A third possibility for future work would be attacking actual hardware running balanced code, providing some real-world evaluation. The difficulty in this evaluation lies in the fact that my approach requires 32bit registers, which are typically only found in more powerful embedded processors running higher clock speeds, which makes power analysis harder much harder by itself, even without additional defenses.

With the way it currently is, my proof of work provides a way for programmers without explicit security knowledge to harden their code against power analysis attacks, without making too large adjustments to their code. As my compiler pass balances all code, as long as it is on the stack, even substitution boxes can be used, they simply need to be passed as function parameters.

This reduces the number of considerations a programmer has to make, handing them off to the compiler. With this I hope to help taking a step towards compilers generating secure code automatically, thus allowing for more secure applications, even when neither the money nor the expertise is present for high-quality security auditing.

#### References

- [1] URL: https://github.com/alxshine/dual-rail.
- [2] Karthik Baddam and Mark Zwolinski. "Path switching: a technique to tolerate dual rail routing imbalances". In: *Design Automation for Embedded Systems* 12.3 (2008), pp. 207–220.
- [3] Rainer Böhme. Information Security II lecture notes. 2017.
- [4] Eric Brier, Christophe Clavier, and Francis Olivier. "Correlation power analysis with a leakage model". In: *International Workshop on Cryptographic Hardware and Embedded Systems*. Springer. 2004, pp. 16–29.

- [5] Jean-Sébastien Coron and Louis Goubin. "On boolean and arithmetic masking against differential power analysis". In: *International Workshop on Cryptographic Hardware and Embedded Systems*. Springer. 2000, pp. 231–237.
- [6] Jovan D Golić and Christophe Tymen. "Multiplicative masking and power analysis of AES". In: *International Workshop on Cryptographic Hardware and Embedded Systems*. Springer. 2002, pp. 198–212.
- [7] Paul Kocher, Joshua Jaffe, Benjamin Jun, et al. *Introduction to differential power analysis and related attacks*. 1998.
- [8] Paul Kocher et al. "Introduction to differential power analysis". In: *Journal of Cryptographic Engineering* 1.1 (2011), pp. 5–27.
- [9] Chris Lattner et al. "The LLVM compiler infrastructure". In: *URL http://llvm.org* (2010).
- [10] Chris Lomont. "Introduction to intel advanced vector extensions". In: *Intel White Paper* (2011), pp. 1–21.
- [11] Alin Razafindraibe, Michel Robert, and Philippe Maurine. "Formal evaluation of the robustness of dual-rail logic against DPA attacks". In: *International Workshop on Power and Timing Modeling, Optimization and Simulation*. Springer. 2006, pp. 634–644.
- [12] Danil Sokolov et al. "Design and analysis of dual-rail circuits for security applications". In: *IEEE Transactions on Computers* 54.4 (2005), pp. 449–460.