A Minimal Trusted Computing Base (TCB) #146

JoshLind · 2021-03-16T18:04:27Z

A Minimal Trusted Computing Base (TCB)

Authors: Joshua Lind (@JoshLind), David Wong (@mimoo)
Status: Rough draft (for discussion)

1. Goals of this Document:

The goals of this document are:
- To reason (informally) about the security of the TCB and analyze the design trade-offs of different approaches.
- To highlight the challenges one needs to address when designing a TCB for Diem.
- To propose a set of improvements for the TCB today and identify potential areas for exploration in the future.
The non-goals of this document are:
- To perform an in-depth survey of TCBs in the context of blockchains. This requires its own document.
- To discuss specific hardware and software implementations of isolated execution environments (e.g., TPMs, TEEs, VMs, cloud isolation mechanisms, etc.). This also requires its own document.

2. Preliminary Reading:

TCB Overview

What is a TCB?
- At a high-level, a trusted computing base (TCB) is a set of components (e.g., hardware and software) that together ensure a set of security properties for a system. For an attacker to violate security properties, one or more components in the TCB need to be subverted.
Do we need a TCB?
- Every system that ensures some security property has a TCB (explicit or not). In the worst case, the TCB comprises the entire system. Well designed systems, however, will make the TCB explicit and as small as possible (e.g., by separating it out as its own sub-system and/or running it in an isolated execution environment). This makes it easier to reason about security. Moreover, it allows the system to better adhere to well known security principles (e.g., the principle of least privilege and defense in depth).

Securing TCBs

What needs to be considered when securing a TCB?
- Size: The size of the TCB (in lines of code, number of components, dependencies and whatever else you deem a good metric). In general, the more that goes into a TCB, the greater the likelihood of bugs, which could lead to compromises. Simplicity/minimality are key.
- Engineering quality: The engineering quality of the code and components in the TCB. This includes: (i) the maturity, readability, simplicity and auditability of the code/components; (ii) the programming languages used and classes of bugs possible; and (iii) any formal or informal efforts to reason about or verify the security of the code/components, e.g., security audits, static/dynamic analysis, verification, proofs, hardening etc. Efforts also need to be made to ensure that quality is maintained going forward (e.g., a high bar for code reviews and third-party dependency audits).
- Interface: The interfaces exposed by the TCB, e.g., the interface with the outside world (i.e., the non-TCB world) and the interfaces between components. These could be function calls, RPC calls, explicit APIs or even hardware interfaces.
- Execution isolation/protection: As the TCB is essentially the “security kernel” of the system, it makes sense for it to execute in an isolated/protected environment. This includes trusted hardware (e.g., TPMs and HSMs), trusted execution environments (e.g., Intel SGX, Arm TrustZone, AMD SEV), containers and on-premises deployments.
- Layering (defense in depth): TCBs themselves provide an additional layer of defense for a system. If an adversary compromises the outside world, but not the TCB, they gain less than if they compromised the TCB. This is defense in depth. Along the same lines then, the TCB itself might be layered, so that partial compromises of the TCB can still provide some security guarantees.

3. Assumptions and Validator Component Abstraction (VCA)

To reason about the Diem TCB, we first make several assumptions about validators and their components in a blockchain.

Assumptions about validators:
Note: We consider it future work to challenge these assumptions (see the bottom of this document).

Failures/compromises are independent across validators. If one validator crashes or is compromised due to a bug, that same bug can’t be used to target all validators.
- This assumption can be naive when the validators share exactly the same code and deployment architectures. For this reason, it makes sense to explore alternate implementations of validators in the future.
Failures/compromises are independent across isolated execution environments. If two isolated execution environments (e.g., containers, HSMs, TEEs) are used by the TCB, their failures should be independent.
- This assumption can also be naive (e.g., if two containers are running on the same machine, or if two HSMs share the same power source). For this reason, it makes sense to explore alternate environments in the future.
Validators are queried by/communicate with different types of clients. These clients include:
- Non-verifying clients: these clients blindly trust whatever is returned by the validator (e.g., JSON RPC).
- Verifying clients: these clients both verify validator signatures and re-execute transactions to ensure that the data returned by the validator is correct (i.e., for both consensus and execution). This is what validator fullnodes do today.
- Verifying but non-executing clients: these clients verify only validator signatures, but do not re-execute transactions themselves (i.e., they trust the validators to execute transactions correctly). This is what fullnodes will likely do in the future.

Assumed components in a validator:
Next, we assume a simple validator component abstraction (VCA):

Consensus: Consensus is responsible for reaching agreement on a sequence of values between a set of validators.
- Consensus is assumed to operate using unique cryptographic keys held by each validator, which integrity-protect and authenticate votes in the protocol (i.e., consensus keys). We assume validators know the public consensus keys of other validators via the blockchain. The specific consensus protocol is immaterial.
Execution: Execution is responsible for calculating the next consensus value to agree upon between validators.
- The consensus value is assumed to be the new state (S’) computed based on a previous state (S) and a transaction to execute (T). Each validator will independently calculate S’ using S and T.
Storage: Storage is used to persist data held by each validator. This includes cryptographic keys (e.g., the consensus key), blockchain states (e.g., S and S’) and transactions (Ts). We assume that storage is:
- Persistent, for ease of validator operation, as computers often crash, reboot, and need to be upgraded and restarted. Solutions for non-persistent storage may exist (e.g., in-memory only), but these are often impractical.
- Readily accessible by the validator (i.e., storage cannot be “cold”, e.g., a physical safe that requires human intervention to access). Cold storage is impractical from a performance/operation perspective.

4. Security Formalization

In order to analyze the security benefits of a TCB, we propose the following (informal) security definitions:

Types of compromise:

Shallow compromise: When an adversary compromises all components in the system, excluding the TCB.
Deep compromise: When an adversary performs a shallow compromise as well as compromises at least one component in the TCB (but not all components in the TCB).
Complete compromise: When an adversary has managed to compromise all components in the system, including the TCB.
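
The three definitions above can be sketched as a small classifier. This is only an illustration: the component names are hypothetical, and the function assumes the worst case in which every component outside the TCB is already compromised, so only the compromised TCB components need to be passed in.

```python
from enum import Enum


class Compromise(Enum):
    SHALLOW = "shallow"
    DEEP = "deep"
    COMPLETE = "complete"


def classify(tcb, compromised_tcb):
    """Classify a compromise of one validator.

    Assumes all non-TCB components are already compromised, so the result
    depends only on which TCB components the adversary also controls.
    """
    tcb = frozenset(tcb)
    hit = frozenset(compromised_tcb)
    if not hit <= tcb:
        raise ValueError("compromised_tcb must be a subset of the TCB")
    if hit == tcb:
        # Every TCB component is subverted (with a single-component TCB,
        # any TCB compromise lands here).
        return Compromise.COMPLETE
    if hit:
        # At least one, but not all, TCB components are subverted.
        return Compromise.DEEP
    return Compromise.SHALLOW
```

Note that with a single-component TCB there is no room between deep and complete compromise, which is why the two coincide when the TCB holds only the consensus key.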

Types of security impact:

Local impact: The impact of an attack on a single validator or fullnode. This generally affects the local clients and other nodes connected to that validator or fullnode (e.g., by deceiving them or performing eclipse attacks).
Global impact: The impact of an attack on the blockchain as a whole, e.g., a global attack could lead to an invalid commit being performed by all validators (for some definition of invalid, see below).

The Adversary model:
Consensus assumes that f validators are byzantine and colluding (i.e., completely compromised). We therefore consider the TCB interesting if it can still provide security properties when h additional compromises occur (shallow or deep). We consider two adversary models:

Non-Quorum. This means: f byzantine validators and h<=f shallow or deep compromises.
Quorum. This means: f byzantine validators and h>f shallow or deep compromises.
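
The threshold between the two models follows from quorum arithmetic. The sketch below assumes the usual BFT setting of n = 3f+1 validators with a quorum of 2f+1 (the total of 3f+1 is an assumption here; the document only states the 2f+1 quorum). With f byzantine validators, the adversary influences a full quorum exactly when f + h >= 2f + 1, i.e., when h > f.

```python
def quorum_size(f: int) -> int:
    """Number of validators needed to certify a value."""
    return 2 * f + 1


def adversary_model(f: int, h: int) -> str:
    """Return which adversary model applies for f byzantine validators
    and h additional (shallow or deep) compromises.

    Assumes n = 3f + 1 validators in total.
    """
    if f < 0 or h < 0:
        raise ValueError("f and h must be non-negative")
    if f + h > 3 * f + 1:
        raise ValueError("more affected validators than validators in the system")
    # The adversary influences a quorum iff f + h >= 2f + 1, i.e. h > f.
    return "quorum" if f + h >= quorum_size(f) else "non-quorum"
```

For example, with f = 1 (four validators), a single additional compromise stays in the Non-Quorum model, while two additional compromises reach the Quorum model.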

Types of Attacks:
We consider three high-level types of attacks:

Safety. Safety attacks mean that a fork can happen (i.e., two contradicting states can be committed across different validators).
Correctness. Correctness attacks extend the blockchain in a way that violates the semantics of the protocol (i.e., the correct execution of transactions and extension of the state). We define semantic extension as: given a committed state S, we get the next committed state via execution(S, transaction), i.e., this is the property we expect from the blockchain. Using this definition, correctness attacks can be divided into two categories:
- Non-semantic extension, where malicious validators extend the chain by committing an arbitrary state. One example of non-semantic extension is: given a committed state S, we can extend it to a state S', where S' is not the result of execution(S, transaction) for any transaction. In this attack, honest verifying clients (verifying full nodes and validators) will become stuck and unable to reach the arbitrary state.
- Non-semantic fork, where malicious validators commit a non-semantic extension but continue to behave honestly from the point of view of honest validators. In this attack, honest verifying clients will not be aware that a fork (committing to an arbitrary state) has been created.
- Note: A correctness attack can only impact verifying but non-executing clients, as well as non-verifying clients.
Liveness. We consider liveness attacks out of scope, e.g., if f validators are completely compromised, any further compromises will violate liveness globally.
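
To make the note on correctness attacks concrete, here is a minimal sketch of how the three client types from section 3 would react to a commit. Everything in it is a toy assumption: `execute` is a stand-in for real transaction execution, and signatures are modelled as a plain set of signer names rather than real cryptography.

```python
from dataclasses import dataclass


def execute(state: int, txn: int) -> int:
    """Toy deterministic execution function (hypothetical stand-in for the VM)."""
    return state + txn


@dataclass(frozen=True)
class Commit:
    prev_state: int
    txn: int
    new_state: int
    signers: frozenset  # validators that signed; real signatures are not modelled


def has_quorum(commit: Commit, f: int) -> bool:
    return len(commit.signers) >= 2 * f + 1


def is_semantic_extension(commit: Commit) -> bool:
    """True if new_state is what execution(prev_state, txn) produces."""
    return execute(commit.prev_state, commit.txn) == commit.new_state


def non_verifying_accepts(commit: Commit, f: int) -> bool:
    return True  # blindly trusts whatever the validator returns


def non_executing_accepts(commit: Commit, f: int) -> bool:
    return has_quorum(commit, f)  # checks signatures only


def verifying_accepts(commit: Commit, f: int) -> bool:
    return has_quorum(commit, f) and is_semantic_extension(commit)
```

A commit that carries a quorum of signatures but a wrong `new_state` is accepted by the non-verifying and the non-executing client, and rejected only by the fully verifying client.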

5. The Incremental TCB:

To begin reasoning about the TCB in Diem, we take a step by step approach to building a TCB based on the VCA above. For each step, we reason about the security guarantees of the design.

Step 1: TCB = { Consensus key }

To begin, we move only the consensus key into the TCB and propose that consensus asks the TCB to sign data (e.g., votes). Reasoning about security, we see:

Shallow compromise: Moving the consensus key into the TCB doesn’t provide much under shallow compromise: it is almost identical to a consensus key compromise as the attacker can ask the TCB to sign anything. In the non-quorum and quorum adversary models, safety can be violated, as there are more than f byzantine nodes. In the quorum adversary model, semantic correctness can also be violated (an attacker can arbitrarily extend the state).
Deep compromise: Deep compromise is similar to shallow compromise, except that the consensus key has also been leaked, which might enable undetected attacks like long-range attacks.
Complete compromise: Identical to deep compromise (as there is only one component in the TCB).
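
The shallow-compromise argument for step 1 can be illustrated with a toy TCB that holds only the key. This is a sketch under stated assumptions: HMAC from the Python standard library stands in for a real digital signature scheme, and the class name and message format are hypothetical.

```python
import hashlib
import hmac
import os


class KeyOnlyTcb:
    """Step 1 TCB: holds the consensus key and signs whatever it is given."""

    def __init__(self) -> None:
        # The key never leaves this object; callers only get signatures.
        self._consensus_key = os.urandom(32)

    def sign(self, message: bytes) -> bytes:
        # No checks at all: any caller can obtain a signature on any message.
        return hmac.new(self._consensus_key, message, hashlib.sha256).digest()

    def verify(self, message: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(self.sign(message), tag)
```

An attacker who controls the components outside the TCB can ask this TCB to sign two conflicting votes for the same round, and both signatures verify. The key itself stays protected, but control over it does not.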

Step 2: TCB = { Consensus key + Safety Rules }

To improve on step 1, we focus on hardening the validator against safety attacks. To do this, we partition consensus and move a subset of the consensus module into the TCB, labelled safety rules. Safety rules contains a set of verification constraints that when enforced by enough validators (>= 2f+1) prevent forks in the consensus protocol (see the Voting Rules in the Consensus specification). Reasoning about security, we now see:

Shallow compromise:
- Non-Quorum model: the attacker cannot violate safety (i.e., cannot fork), by the definition of safety rules. Moreover, the attacker cannot violate correctness (cannot extend the state arbitrarily) as this requires 2f+1 validators to certify and commit a non-semantic extension.
- Quorum model: the attacker cannot violate safety (i.e., cannot fork), by the definition of safety rules. However, the attacker can violate correctness (they can extend the state arbitrarily) as 2f+1 validators can certify and commit a non-semantic extension. This is because safety rules only enforces the voting constraints and does not check execution results, so it will agree to vote on any execution state proposed by the compromised components outside the TCB.
Deep compromise:
- Consensus key: This is equivalent to a complete compromise in step 1.
- Safety rules: This is equivalent to a shallow compromise in step 1.
Complete compromise: This is equivalent to complete compromise in step 1.
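
A minimal sketch of the step 2 idea follows. It enforces a single, simplified voting constraint (never vote twice for the same or an earlier round); this is an assumption for illustration and not the full set of Voting Rules from the Consensus specification. As before, HMAC stands in for a real signature scheme and all names are hypothetical.

```python
import hashlib
import hmac
import os


class SafetyRulesTcb:
    """Step 2 TCB: consensus key plus a (simplified) voting constraint."""

    def __init__(self) -> None:
        self._consensus_key = os.urandom(32)
        # In a real deployment this state would have to be persisted securely.
        self._last_voted_round = 0

    def sign_vote(self, round_: int, block_id: bytes, state_root: bytes) -> bytes:
        if round_ <= self._last_voted_round:
            raise PermissionError("already voted in this round or a later one")
        self._last_voted_round = round_
        # state_root is signed as-is: nothing here checks that it is the
        # result of correctly executing the block.
        message = round_.to_bytes(8, "big") + block_id + state_root
        return hmac.new(self._consensus_key, message, hashlib.sha256).digest()
```

The sketch shows both halves of the analysis: a second, conflicting vote for the same round is refused, but the first vote is signed regardless of what `state_root` contains, which is the correctness gap that remains in the quorum model.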

Step 3: TCB = { Consensus key + Safety Rules + Execution }

To prevent attacks on correctness (as seen in step 2 above), we need to ensure that shallow compromises cannot enable voting on proposals that arbitrarily extend state. To achieve this, we observe that one can simply move the execution logic (including the Move VM) into the TCB. This will enforce correct execution of transactions. However, one still needs to ensure that execution extends the correct state. Here, one could move storage into the TCB. However, this is naive as it bloats the TCB. Instead, we observe that it is more beneficial to treat storage as untrusted and instead have execution keep track of valid state root hashes and update them within the TCB. We call this approach execution correctness.

We now reason about the security of this approach:

Security Analysis
- Shallow compromise:
  - Non-Quorum model: The attacker cannot violate safety (cannot fork) or violate correctness (cannot extend the chain arbitrarily).
  - Quorum model: The attacker cannot violate safety (cannot fork) or violate correctness (cannot extend the chain arbitrarily).
- Deep compromise:
  - Consensus key: This is equivalent to a complete compromise in step 2.
  - Safety rules: This is equivalent to a deep compromise of safety rules in step 2.
  - Execution: This is equivalent to a shallow compromise in step 2.
- Complete compromise: This is equivalent to complete compromise in step 2.
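
The execution correctness approach (untrusted storage, trusted state root) can be sketched as follows. This is a toy model under several assumptions: the state is a small dictionary of balances, the "root" is a SHA-256 hash over the whole serialized state, and `apply_transfer` is a hypothetical stand-in for the Move VM. A real design would more likely check proofs against an authenticated data structure instead of rehashing the full state.

```python
import hashlib
import json


def state_root(state: dict) -> bytes:
    """Toy state commitment: hash of the canonically serialized state."""
    return hashlib.sha256(json.dumps(state, sort_keys=True).encode()).digest()


def apply_transfer(state: dict, txn: dict) -> dict:
    """Toy execution: move `amount` from `src` to `dst` (hypothetical)."""
    if state.get(txn["src"], 0) < txn["amount"]:
        raise ValueError("insufficient balance")
    new_state = dict(state)
    new_state[txn["src"]] -= txn["amount"]
    new_state[txn["dst"]] = new_state.get(txn["dst"], 0) + txn["amount"]
    return new_state


class ExecutionCorrectness:
    """Lives inside the TCB; remembers only the last valid state root."""

    def __init__(self, genesis: dict) -> None:
        self._trusted_root = state_root(genesis)

    def execute(self, claimed_state: dict, txn: dict):
        # Storage is untrusted: reject any state that does not match the root.
        if state_root(claimed_state) != self._trusted_root:
            raise ValueError("state from storage does not match the trusted root")
        new_state = apply_transfer(claimed_state, txn)
        self._trusted_root = state_root(new_state)
        # The new state goes back to untrusted storage; the root stays here.
        return new_state, self._trusted_root
```

Because only the root is kept inside the TCB, storage can remain large and untrusted, while a tampered state handed to `execute` is rejected before any transaction runs on top of it.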

6. The Existing TCB (v1)

Today, execution correctness is still a work in progress and not part of the TCB. As such, under shallow compromise the TCB defends against everything but correctness attacks (see step 2 of the incremental TCB). In this section, we look at various implementation details of the TCB as it stands today:

Storage is partitioned into two subsystems, secure storage and storage:
- Secure storage holds security-sensitive data, e.g., cryptographic keys and state used by Safety Rules. Secure storage is considered part of the TCB.
- Storage is used to store everything else, e.g., transactions and execution state. Storage is outside the TCB and untrusted by the TCB.
Isolated execution environments: The Diem TCB separates Safety Rules, Execution Correctness and secure storage into separate execution environments. This aims to prevent deep compromises from becoming complete compromises.
Automatic consensus key rotations: To prevent long range attacks and recover from consensus key compromises, Diem introduces a Key Manager which supports automatic consensus key rotations.
- Note: To achieve this, Diem introduces another cryptographic key (the operator key), which is held in secure storage and used to endorse (i.e., sign) the consensus key on the blockchain.
Delegated signatures: To avoid exposing keys to components in other isolated execution environments (see the analyses of deep compromises above), Diem uses delegated signatures:
- Safety rules asks secure storage to sign proposals and votes using the consensus key.
- Key manager asks secure storage to sign the rotation transaction using the operator key.
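
Delegated signatures can be sketched as a storage service that signs on behalf of callers but never exports key material. The sketch below is illustrative only: the per-caller permission table mirrors the two bullets above, callers are identified by a plain string (a real system would authenticate them), and HMAC stands in for a real signature scheme.

```python
import hashlib
import hmac
import os


class SecureStorage:
    """Holds keys and signs on request; there is deliberately no export method."""

    def __init__(self) -> None:
        self._keys = {
            "consensus": os.urandom(32),
            "operator": os.urandom(32),
        }
        # Hypothetical permission table: which caller may use which key.
        self._permissions = {
            "safety_rules": {"consensus"},
            "key_manager": {"operator"},
        }

    def sign(self, caller: str, key_name: str, message: bytes) -> bytes:
        if key_name not in self._permissions.get(caller, set()):
            raise PermissionError(f"{caller} may not sign with the {key_name} key")
        return hmac.new(self._keys[key_name], message, hashlib.sha256).digest()
```

A deep compromise of safety rules then yields control of the consensus key (it can still request signatures) but neither the key bytes nor any access to the operator key.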

7. Proposal & Path Forward (TCB v2)

Based on the observations above, we outline the following design and implementation improvements for the TCB (v2):

Design Improvements:

Move execution correctness into the TCB (or otherwise verify execution):
- Reasoning: Diem is moving towards fullnode/client models that verify consensus signatures but do not verify execution. When this happens, validators will be the only line of defense responsible for correctly executing transactions.
- Advantages: Makes it easier to reason about security, i.e., it will prevent quorum shallow compromises from violating semantic correctness. It will also become increasingly important if fullnodes don’t re-execute transactions in the future. It also makes sense if/when we move to TEEs/secure hardware.
- Disadvantages: Performance & operational complexity (this will need to be moved into an existing TCB container, or a new container). Execution correctness is also very large, which raises the question of whether this is even possible/practical without really bloating the TCB.
Always export the consensus key to safety rules:
- Advantages: Performance (no more delegated signatures).
- Disadvantages: Safety rules compromises will leak the consensus key rather than just leak control of the key. However, in practice, there is little difference.
Introduce a smart secure storage layer and remove the key manager:
- Reasoning: Secure storage is closely tied to Vault today. In the future, we’d like to support heterogeneous backends (e.g., as required by other operators). Moreover, if the key manager is compromised today, the consensus and operator keys can be swapped out for adversary-controlled values. This is an undesirable implementation artifact of Vault permissions (e.g., the key manager has permission to sign anything using the operator key, including a transaction that rotates the operator key).
- Advantages: Removes the key manager and replaces it with a simple key rotator thread outside the TCB. Allows us to easily move away from Vault and positions us towards heterogeneous solutions.
- Disadvantages: This would require a lot more investigation as well as changes to both Diem code and deployment code.

Implementation Improvements:

Update the on-chain configs to allow single field updates:
- Reasoning: To rotate the consensus key, the key manager must fetch the on-chain config of the node from the blockchain, copy all fields, override the consensus key and set the entire config again. This opens attack vectors. Instead, the key manager should just be able to set a single field.
- Advantages: Removes the need for the TCB to read/communicate with the blockchain. Now, the TCB can simply create and sign a new transaction that changes the consensus key field.
- Disadvantages: Requires on-chain config modifications.
Remove the execution correctness key and allow safety rules & execution to communicate directly (inside the TCB):
- Advantages: Simplifies the deployment architecture and allows consensus and execution correctness to operate in the same isolated execution environment (possible performance gain?).
- Disadvantages: Requires investigation to see if this is possible.
Implement detection mechanisms for TCB & key compromises:
- Ideas:
  - Set alerts for key rotations so that operators are notified whenever keys are changed on-chain.
  - Monitor consensus behavior to identify when nodes are acting Byzantine...
- Advantages: Prevents an adversary from taking control of validators without being noticed.
- Disadvantages: Requires a lot more investigation to identify and propose practical solutions.
Analyze/Audit the interface between execution correctness & storage (ensure it is untrusted):
- Reasoning: It seems that the interface between execution correctness and storage makes some odd assumptions (e.g., storage doesn’t trust execution correctness, and thus recomputes results?).
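
To illustrate the single-field config update proposed at the start of this list, the sketch below contrasts the two rotation flows. It is a toy model: the config is a dictionary, the field names and transaction shapes are hypothetical, and the tampered read is just one example of the kind of attack vector a full overwrite might open.

```python
def rotate_by_full_overwrite(fetched_config: dict, new_key: str) -> dict:
    """Current flow: copy every fetched field, override one, write all back."""
    updated = dict(fetched_config)
    updated["consensus_key"] = new_key
    return {"op": "set_config", "config": updated}


def rotate_single_field(new_key: str) -> dict:
    """Proposed flow: the transaction names only the field it changes."""
    return {"op": "set_field", "field": "consensus_key", "value": new_key}


def apply(chain_config: dict, txn: dict) -> dict:
    """Toy on-chain handler for the two (hypothetical) transaction types."""
    if txn["op"] == "set_config":
        return dict(txn["config"])
    if txn["op"] == "set_field":
        updated = dict(chain_config)
        updated[txn["field"]] = txn["value"]
        return updated
    raise ValueError("unknown operation")
```

If the config read from the blockchain is tampered with on its way to the key manager, the full overwrite signs and writes the tampered fields back on-chain, whereas the single-field transaction needs no read at all and leaves every other field untouched.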

8. Future Explorations for the TCB

The list below contains future explorations for the TCB. Each of these requires additional thought and analysis.

Multiple/Alternate Diem implementations: This will help to ensure failures are indeed independent between validators. Right now, we can’t really make that claim.
Further reducing the size of each TCB component (e.g., lines of code, dependencies, unsafe code, etc.): Right now, each TCB component contains logic that doesn’t really need to be in the TCB, e.g., the key manager contains logic for serializing JSON, communicating with the blockchain, handling JSON RPC calls, etc. This is likely the same for others.
Exploring isolated execution environments, e.g., TEEs, TPMs, HSMs, on-premises deployments, etc. This seems like an important next step given that we’re currently only using containers.
Reasoning about and hardening the interfaces of each TCB component: e.g., reasoning about the security guarantees of each API/interface and thinking about the possible attacks that can occur through misuse.
Formal code verification and analysis: e.g., dynamic and static analysis tools to ensure the TCB code is indeed doing what we think it’s doing.


aching · 2021-03-16T20:27:51Z

I'd be happy to be DIP manager for this btw.

JoshLind · 2021-03-22T23:03:32Z

Responding to @aching's comments on diem/diem#7930 😄

> I would also add thoughts around considering adding more scrutiny to changes (e.g. more reviewers and/or a small set of highly knowledgeable reviewers)

In my mind, this comes under engineering quality and maintaining that quality going forward. I'll add a sentence to that section as something to call out.

> Some insight to the reader on why h<=f and h>f would be helpful here - this has to do with BFT properties of 2f+1. I find Quorum to be intuitive, but Majority is somewhat strange since it might not be an actual majority. Naming is hard, but the terms are just <Quorum and >=Quorum IIUC.

Indeed, the choice of "Majority" and "Quorum" were really just to distinguish between the < Quorum and >= Quorum cases. I'll add a sentence or two to call this out. Note: Originally, I was in favour of replacing "Majority" with "Non-Quorum", but the text becomes a little verbose at times... Given that @mimoo introduced this nomenclature, I'll see if he has any concerns replacing "Majority" with "Non-Quorum", otherwise I'll just look into replacing it 😉

> Strawperson?

Aah, here I was following the established nomenclature (e.g., https://arxiv.org/pdf/1907.07010.pdf). However, if you feel this may be an issue, I'll update the text to avoid it altogether (e.g., "The Incremental TCB").

> Even if we move toward a world where most validators are not re-executing transactions, I expect there still will be a few that do - possibly run by the Diem Association or its members internally to verify execution. In any case, if it provides a meaningful tradeoff of performance vs security, it is worth considering to just run 1 or more privately to detect these execution issues.

Indeed, this is another approach to consider. However, the trade-offs here are that it moves the TCB to a reactive security model, as opposed to a proactive one, i.e., attacks would be "possible" at the validator layer, but "reacted to" when caught by the re-verifying nodes. Moreover, it increases operational complexity and requires the re-verifying nodes to be considered as secure/important as the TCB (otherwise they could be compromised, too). However, like you say, the advantage here is that verification is taken off the critical path (and done asynchronously), so if the performance win is significant, it might be an interesting trade-off... I think the important next step here is to really understand how much performance will be impacted by moving execution correctness into the TCB (e.g. its own container).

> Given that TEEs/secure hardware are a ways off, is there an intermediate recommendation? For instance, is there much value to the current implementation or should we integrate directly and then rethink upon a secure implementation?

I think this is something we'd need to explore in more detail. I agree that the current deployment model (i.e., using containers to isolate components) is pretty flimsy from a security point of view, but I think: (i) the engineering cost to "integrate directly" and then re-think in the future would probably be too high for any temporary performance/simplicity gains -- especially if we ultimately want to go down the route of TEEs/secure hardware; and (ii) there are probably a few things we could employ today that would be consistent with our future directions (e.g., on-premises solutions, cloud virtualization, sandboxing technology, etc). It seems to me like this is a worthwhile area to spend some time looking at next.

> Is the performance gain significant? Leaking the consensus key seems worse than leaking control of the key.

The performance gain is ~5-10% as reported in cluster test, so in the "real world", I'd expect the gain to be a little less than this. While I agree it seems worse to leak the key vs. control of the key, I'd argue that, in practice, our mitigations make these cases equivalent: (i) we require frequent consensus key rotations, so as to avoid long range attacks; and (ii) if a node were ever compromised, we'd always rotate the key, even if it wasn't leaked, but controlled.

> At first glance, this seems to conflict with "Move execution correctness into the TCB (or otherwise verify execution):". Is this an intermediate proposal as I mentioned earlier?

Aah, this was a typo. It should have read: "Remove the execution correctness key and allow safety rules & execution to communicate directly (inside the TCB)". Will update it above.

> Btw, any thoughts about marking this a markdown file in the repo itself? We could add a rationale folder in diem/documentation perhaps and it would make it easier to review the content here. =)

If you feel this would add value, we can do that. Although, it would be weird if this was the only document in there 😆

mimoo · 2021-03-23T19:40:26Z

> Given that @mimoo introduced this nomenclature, I'll see if he has any concerns replacing "Majority" with "Non-Quorum", otherwise I'll just look into replacing it 😉

Yeah no problem changing these terms. The only term that mattered to me was a hawking compromise haha.

> Btw, any thoughts about marking this a markdown file in the repo itself? We could add a rationale folder in diem/documentation perhaps and it would make it easier to review the content here. =)

I think a DIP is more appropriate.

BTW any private quip discussion that someone could summarize here as well for us external contributors?

mimoo · 2021-03-24T20:34:26Z

Also this part needs to be removed (step 2 of section 5):

> Note that the attacker can only violate correctness for nodes that do not re-execute transactions. Honest validators and full nodes might get stuck, or might not realize that an attack has taken place (if the compromised nodes also continue to act honestly in parallel — the non-semantic fork mentioned above).

there are no non-semantic forks in this model as safety is preserved

JoshLind · 2021-04-16T02:46:01Z

Thanks @aching and @mimoo. I've updated the document based on the feedback! 😄

> BTW any private quip discussion that someone could summarize here as well for us external contributors?

@mimoo, from the internal discussions there weren't any major/unexpected issues that came up. At a high-level, we agree that leaving execution correctness (EC) outside the TCB exposes us to risks around execution violations. However, the concerns with fixing this today are related to the implementation. These are:

(1) The performance implications: Current estimates seem to suggest that moving EC into the TCB will incur roughly a 10-15% performance impact (cc @zekun000). However, when we ultimately move to a more optimized execution model (e.g., decoupling consensus and execution), moving EC into the TCB will likely become the bottleneck. So we need to come to agreement on how to do this without handicapping performance. One idea is to do execution verification "off the critical path" (e.g., leave EC outside of the TCB but run a backup fullnode in a secured environment that will be able to retroactively detect execution violations).
(2) The size implications: If EC is moved into the TCB, we need to understand exactly how much larger the TCB will become (e.g., in terms of lines of code, dependencies, unsafe code, etc). If the increase is too dramatic, it raises the likelihood of security vulnerabilities significantly, which isn't great.
(3) Securing the TCB with more than just containers: As is already obvious, only using containers to protect the TCB is questionable and doesn't provide us with much additional security today. While TEEs/trusted hardware are the ultimate goal, if they aren't available now we need to explore better protection mechanisms that we can actually employ today. Without this, there really isn't much we're getting.

To answer (2), I am currently working on a document that explores and quantifies this impact so we can better understand how the TCB will be affected. For (3), I'm also going to begin exploring the options available to us (today, as well as in the future, e.g., with TEEs) to better protect the TCB. Based on the outcomes of (2) and (3), the decision around (1) should become clearer.

I will update this issue with these findings when we have them! 😄

aching · 2021-05-03T23:38:00Z

We'll let this sit until you're ready to move forward with the proposal. Thanks!

JoshLind mentioned this issue Mar 16, 2021

A Minimal Trusted Computing Base (TCB) diem/diem#7930

Closed

aching self-assigned this Mar 16, 2021
